Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition

Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie

Streaming end-to-end automatic speech recognition (E2E-ASR) has recently attracted growing attention, and much effort has gone into converting non-streaming attention-based E2E-ASR systems into streaming architectures. In this work, we propose a novel online E2E-ASR system that uses Streaming Chunk-Aware Multihead Attention (SCAMA) and a latency-control memory-equipped self-attention network (LC-SAN-M). LC-SAN-M uses chunk-level input to control the latency of the encoder. In SCAMA, a jointly trained predictor controls the encoder output that is fed to the decoder, which enables the decoder to generate output in a streaming manner. Experimental results on the open 170-hour AISHELL-1 task and an industrial-level 20000-hour Mandarin speech recognition task show that our approach significantly outperforms the MoChA-based baseline system under a comparable setup. On the AISHELL-1 task, our proposed method achieves a character error rate (CER) of 7.39%, which, to the best of our knowledge, is the best published performance for online ASR.
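The chunk-level idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not say what the predictor outputs or which frames each decoder step may attend to, so the sketch makes two assumptions: the predictor yields a token count per chunk, and each output step may see every frame received so far. The names `split_into_chunks`, `visible_frames_per_step`, and `tokens_per_chunk` are hypothetical.

```python
from typing import List


def split_into_chunks(num_frames: int, chunk_size: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into consecutive fixed-size chunks.

    The last chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        range(start, min(start + chunk_size, num_frames))
        for start in range(0, num_frames, chunk_size)
    ]


def visible_frames_per_step(
    num_frames: int, chunk_size: int, tokens_per_chunk: List[int]
) -> List[int]:
    """Return, for each output step, how many encoder frames it may attend to.

    tokens_per_chunk is an assumed predictor output: the number of tokens
    to emit once each chunk has arrived. A step emitted after chunk k sees
    only the frames received up to the end of chunk k (an assumption made
    for this sketch), so no future audio is required.
    """
    chunks = split_into_chunks(num_frames, chunk_size)
    if len(tokens_per_chunk) != len(chunks):
        raise ValueError("need exactly one token count per chunk")
    visible = []
    for chunk, n_tokens in zip(chunks, tokens_per_chunk):
        frames_so_far = chunk.stop  # frames [0, chunk.stop) have arrived
        visible.extend([frames_so_far] * n_tokens)
    return visible


if __name__ == "__main__":
    # 10 frames in chunks of 4; the assumed predictor emits 1, 0, and 2 tokens.
    print(visible_frames_per_step(10, 4, [1, 0, 2]))
```

In this toy setting, latency is bounded by the chunk size, because a token is emitted as soon as the chunk that contains it has been received.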

DOI: 10.21437/Interspeech.2020-1972

Cite as: Zhang, S., Gao, Z., Luo, H., Lei, M., Gao, J., Yan, Z., Xie, L. (2020) Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition. Proc. Interspeech 2020, 2142-2146, DOI: 10.21437/Interspeech.2020-1972.

  author={Shiliang Zhang and Zhifu Gao and Haoneng Luo and Ming Lei and Jie Gao and Zhijie Yan and Lei Xie},
  title={{Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition}},
  booktitle={Proc. Interspeech 2020},