Cross Attention with Monotonic Alignment for Speech Transformer

Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma

The Transformer, a state-of-the-art neural network architecture, has been used successfully for a range of sequence-to-sequence tasks. Its attention distribution is spread over the entire input to learn long-term dependencies, which is important for some sequence-to-sequence tasks such as neural machine translation and text summarization. Automatic speech recognition (ASR), however, is characterized by a monotonic alignment between text output and speech input. Techniques such as Connectionist Temporal Classification (CTC), RNN Transducer (RNN-T), and Recurrent Neural Aligner (RNA) build on this monotonic alignment and use locally encoded speech representations to predict the corresponding tokens. In this paper, we present an effective cross-attention biasing technique for the Transformer that takes the monotonic alignment between text output and speech input into account by making use of cross-attention weights. Specifically, a Gaussian mask is applied to the cross-attention weights to restrict the input speech context to a local range given alignment information. We further introduce a regularizer for the alignment. Experiments on the LibriSpeech dataset show that the proposed model obtains improved output-input alignment for ASR and yields 14.5%–25.0% relative word error rate (WER) reductions.
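To make the Gaussian masking idea concrete, the following is a minimal sketch for a single output token, not the paper's exact formulation, which the abstract does not specify. It assumes the alignment information is given as a center frame index, that the mask is multiplied into the softmax-normalized weights, and that the result is renormalized. The names `gaussian_masked_attention`, `center`, and `sigma` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_masked_attention(scores, center, sigma):
    """Bias one row of cross-attention weights toward frames near `center`.

    Assumption: the mask is applied multiplicatively after softmax and the
    row is renormalized; the paper may define the mask differently.
    """
    weights = softmax(scores)
    # Gaussian window: 1.0 at the aligned frame, decaying with distance.
    mask = [
        math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))
        for t in range(len(scores))
    ]
    masked = [w * g for w, g in zip(weights, mask)]
    total = sum(masked)
    return [x / total for x in masked]


# Uniform scores over 9 speech frames, with the token aligned to frame 4.
scores = [0.0] * 9
attn = gaussian_masked_attention(scores, center=4, sigma=1.0)
```

With uniform input scores, the masked weights peak at the aligned frame and fall off symmetrically, so distant frames contribute little to the token prediction. A smaller `sigma` narrows the local context further.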

 DOI: 10.21437/Interspeech.2020-1198

Cite as: Zhao, Y., Ni, C., Leung, C., Joty, S., Chng, E.S., Ma, B. (2020) Cross Attention with Monotonic Alignment for Speech Transformer. Proc. Interspeech 2020, 5031-5035, DOI: 10.21437/Interspeech.2020-1198.

  author={Yingzhu Zhao and Chongjia Ni and Cheung-Chi Leung and Shafiq Joty and Eng Siong Chng and Bin Ma},
  title={{Cross Attention with Monotonic Alignment for Speech Transformer}},
  booktitle={Proc. Interspeech 2020},