Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory

Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, Frank Zhang


Transformer-based acoustic modeling has achieved great success for both hybrid and sequence-to-sequence speech recognition. However, it requires access to the full input sequence, and its computational cost grows quadratically with the sequence length. These factors limit its adoption for streaming applications. In this work, we propose a novel augmented-memory self-attention, which attends over a short segment of the input sequence and a bank of memories. The memory bank stores the embedding information for all previously processed segments. On the LibriSpeech benchmark, our proposed method outperforms all existing streamable Transformer methods by a large margin and achieves over 15% relative error reduction compared with the widely used LC-BLSTM baseline. Our findings are also confirmed on several large internal datasets.
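
To make the attention pattern concrete, below is a minimal sketch (in PyTorch) of one segment attending over the concatenation of a memory bank and the current chunk, with a pooled summary query producing the vector that is appended to the bank. The function name, dimensions, single attention head, and mean-pooling choice are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): single-head augmented-memory
# self-attention over one segment. Dimensions and pooling are assumptions.
import torch
import torch.nn.functional as F


def augmented_memory_attention(segment, memory_bank, w_q, w_k, w_v):
    """Attend from the current segment over [memory bank; segment].

    segment:     (seg_len, d_model) frames of the current chunk
    memory_bank: (num_mem, d_model) embeddings of previously processed chunks
    w_q, w_k, w_v: (d_model, d_model) projection matrices
    Returns the attended segment and a new memory vector to append.
    """
    d_model = segment.size(-1)

    # Keys/values cover both the stored memories and the current segment,
    # so each query frame can look back beyond the segment boundary.
    context = torch.cat([memory_bank, segment], dim=0)

    # A pooled "summary" query yields the embedding stored in the memory
    # bank for future segments (mean pooling is an assumption here).
    summary = segment.mean(dim=0, keepdim=True)
    queries = torch.cat([segment, summary], dim=0)

    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v

    attn = F.softmax(q @ k.transpose(0, 1) / d_model ** 0.5, dim=-1)
    out = attn @ v

    attended_segment, new_memory = out[:-1], out[-1:]
    return attended_segment, new_memory


# Toy usage: stream three 4-frame segments, growing the memory bank by one
# vector per segment.
if __name__ == "__main__":
    torch.manual_seed(0)
    d_model = 8
    w_q, w_k, w_v = (torch.randn(d_model, d_model) * 0.1 for _ in range(3))
    memory_bank = torch.zeros(0, d_model)  # starts empty
    for _ in range(3):
        segment = torch.randn(4, d_model)
        out, new_mem = augmented_memory_attention(
            segment, memory_bank, w_q, w_k, w_v)
        memory_bank = torch.cat([memory_bank, new_mem], dim=0)
    print(memory_bank.shape)  # torch.Size([3, 8])
```

Because each query only sees the fixed-size memory bank plus the current segment, the per-segment cost stays bounded instead of growing quadratically with the full utterance length, which is what makes the approach streamable.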


 DOI: 10.21437/Interspeech.2020-2079

Cite as: Wu, C., Wang, Y., Shi, Y., Yeh, C.-F., Zhang, F. (2020) Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory. Proc. Interspeech 2020, 2132-2136, DOI: 10.21437/Interspeech.2020-2079.


@inproceedings{Wu2020,
  author={Chunyang Wu and Yongqiang Wang and Yangyang Shi and Ching-Feng Yeh and Frank Zhang},
  title={{Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2132--2136},
  doi={10.21437/Interspeech.2020-2079},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2079}
}