Joint Decoding of CTC Based Systems for Speech Recognition

Jiaqi Guo, Yongbin You, Yanmin Qian, Kai Yu

Connectionist temporal classification (CTC) has been successfully used in speech recognition. It learns the alignments between speech frames and label sequences automatically, without explicit pre-generated frame-level labels. While this property conveniently shortens the training pipeline, it can become a disadvantage for frame-level system combination because the learned alignments can be inaccurate. In this paper, a novel Dynamic Time Warping (DTW) based position calibration algorithm is proposed for joint decoding of two CTC-based acoustic models. Furthermore, joint decoding of CTC and conventional hybrid NN-HMM models is also explored. Experiments on a large-vocabulary Mandarin speech recognition task show that the proposed joint decoding of both CTC-based and CTC-Hybrid-based systems achieves a significant and consistent character error rate reduction.
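For readers unfamiliar with DTW, the following is a minimal sketch of the classic dynamic time warping recurrence that aligns two sequences of different lengths. It is not the position calibration algorithm proposed in the paper, whose details are not given in this abstract. The function name `dtw_path`, the scalar toy inputs, and the absolute-difference distance are all assumptions chosen for illustration.

```python
def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Classic DTW: return (total cost, warping path) between sequences a and b.

    Illustrative only; the names and the distance function are assumptions,
    not taken from the paper.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both sequences
                                 cost[i - 1][j],      # advance a only
                                 cost[i][j - 1])      # advance b only
    # Backtrack from the end to recover which indices were paired.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


total, path = dtw_path([0, 1, 2], [0, 1, 1, 2])
print(total, path)
```

In this toy case the second sequence repeats one element, so the warping path pairs index 1 of the first sequence with both indices 1 and 2 of the second. An alignment of this kind is the general idea behind matching positions across two systems whose outputs are not time-synchronous.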

DOI: 10.21437/Interspeech.2019-2026

Cite as: Guo, J., You, Y., Qian, Y., Yu, K. (2019) Joint Decoding of CTC Based Systems for Speech Recognition. Proc. Interspeech 2019, 2205-2209, DOI: 10.21437/Interspeech.2019-2026.

  author={Jiaqi Guo and Yongbin You and Yanmin Qian and Kai Yu},
  title={{Joint Decoding of CTC Based Systems for Speech Recognition}},
  booktitle={Proc. Interspeech 2019},