Improved Hybrid Streaming ASR with Transformer Language Models

Pau Baquero-Arnal, Javier Jorge, Adrià Giménez, Joan Albert Silvestre-Cerdà, Javier Iranzo-Sánchez, Albert Sanchis, Jorge Civera, Alfons Juan

Streaming ASR is gaining momentum due to its wide applicability, though it is still unclear how best to come close to the accuracy of state-of-the-art off-line ASR systems when the output must come within a short delay after the incoming audio stream. Following our previous work on streaming one-pass decoding with hybrid ASR systems and LSTM language models, in this work we report further improvements by replacing LSTMs with Transformer models. First, two key ideas are discussed so as to run these models fast during inference. Then, empirical results on LibriSpeech and TED-LIUM are provided showing that Transformer language models lead to improved recognition rates on both tasks. The ASR systems obtained in this work can be seamlessly transferred to a streaming setup with minimal quality losses. Indeed, to the best of our knowledge, no better results have been reported on these tasks when assessed under a streaming setup.

DOI: 10.21437/Interspeech.2020-2770

Cite as: Baquero-Arnal, P., Jorge, J., Giménez, A., Silvestre-Cerdà, J.A., Iranzo-Sánchez, J., Sanchis, A., Civera, J., Juan, A. (2020) Improved Hybrid Streaming ASR with Transformer Language Models. Proc. Interspeech 2020, 2127-2131, DOI: 10.21437/Interspeech.2020-2770.

  author={Pau Baquero-Arnal and Javier Jorge and Adrià Giménez and Joan Albert Silvestre-Cerdà and Javier Iranzo-Sánchez and Albert Sanchis and Jorge Civera and Alfons Juan},
  title={{Improved Hybrid Streaming ASR with Transformer Language Models}},
  booktitle={Proc. Interspeech 2020},