Self-Distillation for Improving CTC-Transformer-Based ASR Systems

Takafumi Moriya, Tsubasa Ochiai, Shigeki Karita, Hiroshi Sato, Tomohiro Tanaka, Takanori Ashihara, Ryo Masumura, Yusuke Shinohara, Marc Delcroix

We present a novel training approach for encoder-decoder-based sequence-to-sequence (S2S) models. S2S models have been used successfully by the automatic speech recognition (ASR) community. A key component of S2S models is the attention mechanism, which captures the relationships between input and output sequences. The attention weights indicate which time frames should be attended to when predicting each output label. In previous work, we proposed distilling S2S knowledge into connectionist temporal classification (CTC) based models by using the attention characteristics to create pseudo-targets for an auxiliary cross-entropy loss term. This approach can significantly improve CTC models. However, it remained unclear whether our proposal could also be used to improve S2S models. In this paper, we extend our previous work to create a strong S2S model, namely a Transformer with CTC (CTC-Transformer). We use the Transformer outputs and the source attention weights to create pseudo-targets that contain both the posterior and the timing information of each Transformer output. These pseudo-targets are used to train the shared encoder of the CTC-Transformer, giving it direct feedback from the Transformer decoder so that it learns more informative representations. Experiments on various tasks using public and private datasets demonstrate that our proposal is also effective for enhancing S2S model training. In particular, on a Japanese ASR task, our best system outperforms the previous state-of-the-art alternative.
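The abstract does not give the exact formulation, so the following is only a minimal sketch of one plausible reading of the pseudo-target idea. It assumes that each frame's pseudo-target is the attention-weighted sum of the decoder posteriors, renormalized per frame, and that the auxiliary loss is a frame-level cross-entropy against the encoder's output distribution. The function names `make_pseudo_targets` and `auxiliary_cross_entropy`, the renormalization, and the uniform fallback for unattended frames are all assumptions made for illustration, not details taken from the paper.

```python
import math


def make_pseudo_targets(attention, posteriors):
    """Build frame-level pseudo-targets (assumed formulation, not the paper's).

    attention:  L x T list, attention[l][t] = weight of output token l on frame t.
    posteriors: L x V list, posteriors[l][v] = decoder posterior of label v at token l.
    Returns a T x V list whose rows each sum to 1.
    """
    num_frames = len(attention[0])
    vocab_size = len(posteriors[0])
    targets = []
    for t in range(num_frames):
        row = [0.0] * vocab_size
        # Each token contributes its posterior, scaled by how much it attends to frame t.
        for weights, post in zip(attention, posteriors):
            for v in range(vocab_size):
                row[v] += weights[t] * post[v]
        total = sum(row)
        if total > 0:
            row = [x / total for x in row]
        else:
            # No token attends to this frame: fall back to uniform (arbitrary choice).
            row = [1.0 / vocab_size] * vocab_size
        targets.append(row)
    return targets


def auxiliary_cross_entropy(targets, encoder_log_probs):
    """Mean over frames of -sum_v targets[t][v] * encoder_log_probs[t][v]."""
    total = 0.0
    for tgt, logp in zip(targets, encoder_log_probs):
        total -= sum(p * lp for p, lp in zip(tgt, logp))
    return total / len(targets)


# Toy example: 2 output tokens, 3 frames, vocabulary of 2 labels.
attention = [[0.9, 0.1, 0.0],
             [0.0, 0.2, 0.8]]
posteriors = [[1.0, 0.0],
              [0.0, 1.0]]
targets = make_pseudo_targets(attention, posteriors)

# A stand-in encoder that predicts a uniform distribution on every frame.
uniform_log_probs = [[math.log(0.5)] * 2 for _ in range(3)]
loss = auxiliary_cross_entropy(targets, uniform_log_probs)
```

In this toy case the first and last frames receive targets concentrated on a single label, while the middle frame, which both tokens attend to, receives a mixture. This illustrates how such targets could carry both posterior and timing information. In actual training, a term like this would presumably be added to the CTC and decoder losses with some weighting, which the abstract does not specify.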

DOI: 10.21437/Interspeech.2020-1223

Cite as: Moriya, T., Ochiai, T., Karita, S., Sato, H., Tanaka, T., Ashihara, T., Masumura, R., Shinohara, Y., Delcroix, M. (2020) Self-Distillation for Improving CTC-Transformer-Based ASR Systems. Proc. Interspeech 2020, 546-550, DOI: 10.21437/Interspeech.2020-1223.

  author={Takafumi Moriya and Tsubasa Ochiai and Shigeki Karita and Hiroshi Sato and Tomohiro Tanaka and Takanori Ashihara and Ryo Masumura and Yusuke Shinohara and Marc Delcroix},
  title={{Self-Distillation for Improving CTC-Transformer-Based ASR Systems}},
  booktitle={Proc. Interspeech 2020},