Transformer with Bidirectional Decoder for Speech Recognition

Xi Chen, Songyang Zhang, Dandan Song, Peng Ouyang, Shouyi Yin

Attention-based models have recently made tremendous progress on end-to-end automatic speech recognition (ASR). However, conventional transformer-based approaches usually generate the output sequence token by token from left to right, leaving right-to-left context unexploited. In this work, we introduce a bidirectional speech transformer that exploits both directional contexts simultaneously. Specifically, the outputs of our proposed transformer include a left-to-right target and a right-to-left target. At inference, we use a bidirectional beam search method that generates both left-to-right and right-to-left candidates and selects the best hypothesis by score.
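The final selection step of the bidirectional beam search can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Hypothesis` and `select_best` are hypothetical, and it assumes that the scores of both directions are log-probabilities that can be compared directly, which the abstract does not specify.

```python
from typing import List, Tuple

# Hypothetical representation: (tokens in generation order, log-probability score).
Hypothesis = Tuple[List[str], float]


def select_best(l2r: List[Hypothesis], r2l: List[Hypothesis]) -> List[str]:
    """Return the tokens of the highest-scoring hypothesis across both directions.

    Assumption: scores from the two directions are directly comparable.
    """
    # Right-to-left hypotheses are generated last token first,
    # so reverse them to restore reading order before comparing.
    candidates = list(l2r) + [(tokens[::-1], score) for tokens, score in r2l]
    best_tokens, _ = max(candidates, key=lambda hyp: hyp[1])
    return best_tokens


# Toy beams with made-up tokens and scores, for illustration only.
l2r_beam = [(["a", "b", "c"], -1.2), (["a", "b", "d"], -2.0)]
r2l_beam = [(["c", "b", "a"], -1.5), (["e", "b", "a"], -0.9)]
best = select_best(l2r_beam, r2l_beam)
```

In this toy case the right-to-left candidate with score -0.9 wins, and it is returned in reading order. A real system would likely also need length normalization before comparing scores; that detail is left out here.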

To evaluate our proposed speech transformer with a bidirectional decoder (STBD), we conduct extensive experiments on the AISHELL-1 dataset. The experimental results show that STBD achieves a 3.6% relative CER reduction (CERR) over the unidirectional speech transformer baseline. In addition, the strongest model in this paper, STBD-Big, achieves 6.64% CER on the test set without language model rescoring or any extra data augmentation strategies.
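For readers unfamiliar with the metric, relative CER reduction is conventionally the drop in CER divided by the baseline CER. The sketch below shows that arithmetic. The function name `relative_cer_reduction` is hypothetical, and the example values are invented for illustration; they are not the baseline or STBD figures from the paper.

```python
def relative_cer_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


# Hypothetical numbers, NOT taken from the paper:
# a baseline CER of 10.00% and a new CER of 9.64%.
cerr = relative_cer_reduction(10.0, 9.64)
```

With these invented inputs the result is approximately 3.6, which shows that a relative reduction of this size corresponds to a much smaller absolute change in CER.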

DOI: 10.21437/Interspeech.2020-2677

Cite as: Chen, X., Zhang, S., Song, D., Ouyang, P., Yin, S. (2020) Transformer with Bidirectional Decoder for Speech Recognition. Proc. Interspeech 2020, 1773-1777, DOI: 10.21437/Interspeech.2020-2677.

  author={Xi Chen and Songyang Zhang and Dandan Song and Peng Ouyang and Shouyi Yin},
  title={{Transformer with Bidirectional Decoder for Speech Recognition}},
  booktitle={Proc. Interspeech 2020},