Low Latency End-to-End Streaming Speech Recognition with a Scout Network

Chengyi Wang, Yu Wu, Liang Lu, Shujie Liu, Jinyu Li, Guoli Ye, Ming Zhou


The attention-based Transformer model has achieved promising results for speech recognition (SR) in the offline mode. In the streaming mode, however, a Transformer that applies a fixed-length look-ahead window in each encoder layer usually incurs significant latency in order to maintain its recognition accuracy. In this paper, we propose a novel low-latency streaming approach for Transformer models, which consists of a scout network and a recognition network. The scout network detects whole-word boundaries without seeing any future frames, while the recognition network predicts the next subword using information from all the frames before the predicted boundary. Our model achieves the best performance (2.7/6.4 WER) with an average latency of only 639 ms on the test-clean and test-other sets of LibriSpeech, respectively.
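
The split between the two networks can be illustrated with a minimal sketch. It is not the authors' implementation: the per-frame boundary scores, the 0.5 threshold, the function names, and the choice to include the boundary frame itself in the visible context are all assumptions made for illustration. The sketch only shows how detected boundaries could limit which frames a recognition step is allowed to use.

```python
from typing import List


def boundaries_from_scores(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return frame indices whose boundary score exceeds the threshold.

    The scores stand in for a scout network's per-frame output
    (hypothetical; the abstract does not specify its form).
    """
    return [t for t, s in enumerate(scores) if s > threshold]


def visible_context_masks(num_frames: int, boundaries: List[int]) -> List[List[bool]]:
    """Build one mask per detected boundary.

    masks[k][t] is True when frame t may be used once boundary k is
    detected. Assumption: the boundary frame itself is included.
    """
    return [[t <= b for t in range(num_frames)] for b in boundaries]


# Toy per-frame scores for a 7-frame utterance (made-up values).
scores = [0.1, 0.2, 0.9, 0.1, 0.3, 0.8, 0.2]
boundaries = boundaries_from_scores(scores)
masks = visible_context_masks(len(scores), boundaries)

for b, mask in zip(boundaries, masks):
    print(b, mask)
```

In this toy setup no frame after a detected boundary is visible, so the context grows only when a new boundary is found rather than by a fixed look-ahead window.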


DOI: 10.21437/Interspeech.2020-1292

Cite as: Wang, C., Wu, Y., Lu, L., Liu, S., Li, J., Ye, G., Zhou, M. (2020) Low Latency End-to-End Streaming Speech Recognition with a Scout Network. Proc. Interspeech 2020, 2112-2116, DOI: 10.21437/Interspeech.2020-1292.


@inproceedings{Wang2020,
  author={Chengyi Wang and Yu Wu and Liang Lu and Shujie Liu and Jinyu Li and Guoli Ye and Ming Zhou},
  title={{Low Latency End-to-End Streaming Speech Recognition with a Scout Network}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2112--2116},
  doi={10.21437/Interspeech.2020-1292},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1292}
}