LVCSR with Transformer Language Models

Eugen Beck, Ralf Schlüter, Hermann Ney

Neural network language models (LMs) based on self-attention have recently outperformed the previous state of the art, LSTM LMs. Today, Transformer LMs are often used as a post-processing step in lattice or n-best-list rescoring. In this work, the main focus is on using them in one-pass recognition. We show that a simple reduction of redundant computations in batched self-attention yields a 15% reduction in overall real-time factor (RTF) on a well-tuned system. We also show that, with proper initialization, the layer normalization inside the residual blocks can be removed, further increasing forwarding speed. This is done under the constraint of staying close to the state of the art in terms of word error rate (5.4% on LibriSpeech test-other) while achieving a real-time factor of around 1. Finally, we present an approach that speeds up classic push-forward rescoring by mixing it with n-best-list rescoring to better exploit the inherent parallelizability of Transformer language models, cutting the time needed for rescoring in half.
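The paper's exact batching scheme is not given in this abstract, but the general idea behind avoiding redundant computation in self-attention during step-wise LM scoring can be sketched as follows: when a hypothesis is extended one token at a time, the keys and values of all earlier positions do not change, so they can be cached rather than recomputed at every step. The names and the single-head simplification below are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CachedSelfAttention:
    """Single-head causal self-attention with a key/value cache.

    For incremental LM scoring, keys and values of earlier positions
    are fixed, so they are stored once instead of being recomputed
    for every newly scored token.
    """

    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)
        self.d = d
        self.K = np.zeros((0, d))  # cached keys of the prefix
        self.V = np.zeros((0, d))  # cached values of the prefix

    def step(self, x):
        """Process one new input vector x of shape (d,)."""
        q = x @ self.Wq
        self.K = np.vstack([self.K, x @ self.Wk])
        self.V = np.vstack([self.V, x @ self.Wv])
        att = softmax(self.K @ q / np.sqrt(self.d))
        return att @ self.V

def full_forward(layer, X):
    """Reference: recompute attention over the whole prefix per position."""
    outs = []
    for t in range(len(X)):
        K = X[: t + 1] @ layer.Wk
        V = X[: t + 1] @ layer.Wv
        q = X[t] @ layer.Wq
        att = softmax(K @ q / np.sqrt(layer.d))
        outs.append(att @ V)
    return np.stack(outs)
```

Incremental scoring with the cache produces the same outputs as recomputing attention over the full prefix at every position, while the per-step cost drops from quadratic to linear in the prefix length.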

DOI: 10.21437/Interspeech.2020-1164

Cite as: Beck, E., Schlüter, R., Ney, H. (2020) LVCSR with Transformer Language Models. Proc. Interspeech 2020, 1798-1802, DOI: 10.21437/Interspeech.2020-1164.

@inproceedings{beck20_interspeech,
  author={Eugen Beck and Ralf Schlüter and Hermann Ney},
  title={{LVCSR with Transformer Language Models}},
  booktitle={Proc. Interspeech 2020},
  year={2020},
  pages={1798--1802},
  doi={10.21437/Interspeech.2020-1164}
}