Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution

Zi-qiang Zhang, Yan Song, Jian-shu Zhang, Ian McLoughlin, Li-Rong Dai


Encoder-decoder based methods have become popular for automatic speech recognition (ASR), thanks to their simplified processing stages and low reliance on prior knowledge. However, training an effective encoder-decoder model generally requires large amounts of acoustic data with paired transcriptions, which are expensive and time-consuming to collect and not always readily available. In contrast, unpaired speech data is abundant, so several semi-supervised learning methods, such as teacher-student (T/S) learning and pseudo-labeling, have recently been proposed to exploit this potentially valuable resource. In this paper, a novel T/S learning method with conditional posterior distribution for encoder-decoder based ASR is proposed. Specifically, the 1-best hypotheses and the conditional posterior distribution from the teacher are exploited to provide more effective supervision. Combined with model perturbation techniques, the proposed method reduces WER by 19.2% relative on the LibriSpeech benchmark, compared with a system trained using only paired data, outperforming previously reported 1-best hypothesis results on the same task.
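As a rough illustration only (not the authors' implementation), T/S learning of this kind can be understood as minimizing, at each decoding step of the teacher's 1-best hypothesis, the KL divergence between the teacher's and the student's conditional posterior distributions over output tokens. A minimal pure-Python sketch, where `teacher_posts` and `student_posts` are assumed to hold per-step token distributions:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions given as
    equal-length lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ts_loss(teacher_posts, student_posts):
    """Sequence-level T/S loss (illustrative): average KL divergence
    between the teacher's and the student's conditional posteriors,
    taken step by step along the teacher's 1-best hypothesis."""
    steps = list(zip(teacher_posts, student_posts))
    return sum(kl_divergence(p, q) for p, q in steps) / len(steps)
```

The loss is zero when the student exactly matches the teacher's posteriors and grows as the distributions diverge; training the student on unpaired speech then amounts to minimizing this quantity over the teacher-decoded hypotheses.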


 DOI: 10.21437/Interspeech.2020-1574

Cite as: Zhang, Z., Song, Y., Zhang, J., McLoughlin, I., Dai, L. (2020) Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution. Proc. Interspeech 2020, 3580-3584, DOI: 10.21437/Interspeech.2020-1574.


@inproceedings{Zhang2020,
  author={Zi-qiang Zhang and Yan Song and Jian-shu Zhang and Ian McLoughlin and Li-Rong Dai},
  title={{Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3580--3584},
  doi={10.21437/Interspeech.2020-1574},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1574}
}