Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, Tetsunori Kobayashi


We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining outputs of connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually autoregressive: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many decoding iterations as the output length. Non-autoregressive models, on the other hand, can generate all tokens simultaneously within a constant number of iterations, which yields a significant reduction in inference time and better suits end-to-end ASR models to real-world scenarios. In this work, the Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC. During inference, the target sequence is initialized with the greedy CTC output, and low-confidence tokens are masked based on the CTC probabilities. Exploiting the conditional dependence between output tokens, these masked low-confidence tokens are then predicted conditioned on the high-confidence tokens. Experimental results on different speech recognition tasks show that Mask CTC outperforms the standard CTC model (e.g., 17.9% → 12.1% WER on WSJ) and approaches the autoregressive model while requiring much less inference time on CPUs (0.07 RTF in a Python implementation). All of our code is publicly available at https://github.com/espnet/espnet
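The inference procedure described above — greedy CTC decoding, confidence-based masking, then filling the masked slots conditioned on the confident tokens — can be sketched as follows. This is a minimal illustration, not the ESPnet implementation: `predictor` is a hypothetical stand-in for the trained masked-prediction decoder, and the threshold `p_thr` and iteration count are illustrative parameters.

```python
import numpy as np

MASK = "<mask>"

def ctc_greedy(probs, vocab, blank=0):
    """Greedy CTC decoding: collapse repeated frame labels and drop blanks.
    `probs` is a (frames, vocab) array of framewise CTC probabilities.
    Returns the decoded tokens and each token's confidence (its frame prob)."""
    ids = probs.argmax(axis=-1)
    tokens, confs = [], []
    prev = blank
    for t, i in enumerate(ids):
        if i != blank and i != prev:
            tokens.append(vocab[i])
            confs.append(probs[t, i])
        prev = i
    return tokens, confs

def mask_ctc_decode(probs, vocab, predictor, p_thr=0.9, iters=2, blank=0):
    """Mask CTC inference sketch: initialize the target sequence with the
    greedy CTC output, mask tokens whose CTC confidence is below `p_thr`,
    then let `predictor` (standing in for the conditional decoder) fill the
    masked positions conditioned on the remaining high-confidence tokens."""
    tokens, confs = ctc_greedy(probs, vocab, blank)
    seq = [tok if c >= p_thr else MASK for tok, c in zip(tokens, confs)]
    for _ in range(iters):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        # Here all masked slots are filled in one pass; the paper's
        # easy-first variant fills them over several iterations.
        for i, tok in zip(masked, predictor(seq, masked)):
            seq[i] = tok
    return seq
```

Because the sequence length is fixed by the CTC output, the number of decoder passes is a constant (`iters`) rather than the output length, which is the source of the inference-time reduction over autoregressive decoding.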


DOI: 10.21437/Interspeech.2020-2404

Cite as: Higuchi, Y., Watanabe, S., Chen, N., Ogawa, T., Kobayashi, T. (2020) Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict. Proc. Interspeech 2020, 3655-3659, DOI: 10.21437/Interspeech.2020-2404.


@inproceedings{Higuchi2020,
  author={Yosuke Higuchi and Shinji Watanabe and Nanxin Chen and Tetsuji Ogawa and Tetsunori Kobayashi},
  title={{Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3655--3659},
  doi={10.21437/Interspeech.2020-2404},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2404}
}