Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-Based LVCSR

Xinyuan Zhou, Grandee Lee, Emre Yılmaz, Yanhua Long, Jiaen Liang, Haizhou Li


The Transformer has shown impressive performance in automatic speech recognition. It uses an encoder-decoder structure with self-attention to learn the relationship between the high-level representations of the source inputs and the embeddings of the target outputs. In this paper, we propose a novel decoder structure that features a self-and-mixed attention decoder (SMAD) with a deep acoustic structure (DAS) to improve the acoustic representation of Transformer-based LVCSR. Specifically, we introduce a self-attention mechanism to learn a multi-layer deep acoustic structure for multiple levels of acoustic abstraction. We also design a mixed attention mechanism that simultaneously learns the alignment between the different levels of acoustic abstraction and the corresponding linguistic information in a shared embedding space. The ASR experiments on Aishell-1 show that the proposed structure achieves CERs of 4.8% on the dev set and 5.1% on the test set, which are, to the best of our knowledge, the best reported results on this task.
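The mixed attention idea described above can be illustrated with a minimal sketch: a single-head scaled dot-product attention whose memory is the concatenation of acoustic and linguistic representations, so one attention map jointly attends to both sources in a shared embedding space. This is not the paper's implementation; all names, shapes, and the single-head simplification are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixed_attention(queries, acoustic, linguistic):
    """Toy single-head mixed attention (illustrative, not the paper's code).

    queries:    (Tq, d) decoder states
    acoustic:   (Ta, d) acoustic representations
    linguistic: (Tl, d) linguistic (token) embeddings in the same space
    Returns the attended context (Tq, d) and the joint attention
    weights (Tq, Ta + Tl) over both memories.
    """
    d = queries.shape[-1]
    # Concatenate both memories so a single attention distribution
    # covers acoustic frames and linguistic tokens together.
    memory = np.concatenate([acoustic, linguistic], axis=0)   # (Ta+Tl, d)
    scores = queries @ memory.T / np.sqrt(d)                  # (Tq, Ta+Tl)
    weights = softmax(scores, axis=-1)
    return weights @ memory, weights

# Toy shapes: 2 decoder steps, 5 acoustic frames, 3 tokens, dim 4.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4))
a = rng.standard_normal((5, 4))
t = rng.standard_normal((3, 4))
context, weights = mixed_attention(q, a, t)
print(context.shape, weights.shape)  # (2, 4) (2, 8)
```

Because the softmax is taken over the concatenated memory, each attention row is a single distribution that trades off acoustic against linguistic evidence, which is the intuition behind aligning the two in one shared space.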


DOI: 10.21437/Interspeech.2020-2556

Cite as: Zhou, X., Lee, G., Yılmaz, E., Long, Y., Liang, J., Li, H. (2020) Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-Based LVCSR. Proc. Interspeech 2020, 5016-5020, DOI: 10.21437/Interspeech.2020-2556.


@inproceedings{Zhou2020,
  author={Xinyuan Zhou and Grandee Lee and Emre Yılmaz and Yanhua Long and Jiaen Liang and Haizhou Li},
  title={{Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-Based LVCSR}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={5016--5020},
  doi={10.21437/Interspeech.2020-2556},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2556}
}