Knowledge Distillation for End-to-End Monaural Multi-Talker ASR System

Wangyou Zhang, Xuankai Chang, Yanmin Qian

End-to-end models for monaural multi-speaker automatic speech recognition (ASR) have become an important approach to handling multi-talker mixed speech in the cocktail party scenario. However, there is still a large performance gap between multi-speaker and single-speaker speech recognition systems. In this paper, we propose a novel framework that integrates teacher-student training with the attention-based end-to-end ASR model, so that knowledge can be distilled effectively from a single-talker ASR system to a multi-talker one. First, the objective function is revised to combine the knowledge from both single-talker and multi-talker labels. Then, within the teacher-student end-to-end framework, we extend the original single attention module to speaker-parallel attention modules to further boost performance. Moreover, a curriculum learning strategy that orders the training data by signal-to-noise ratio (SNR) is designed to obtain a further improvement. The proposed methods are evaluated on two-speaker mixed speech generated from the WSJ0 corpus, which has recently been widely used for this task. The experimental results show that the proposed knowledge transfer architecture with an end-to-end model significantly improves performance on monaural multi-talker speech recognition, achieving more than 15% relative WER reduction over the traditional end-to-end model.
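As a rough illustration of the two training ideas named in the abstract, the sketch below interpolates a hard-label loss with a teacher-derived soft-label loss and orders training utterances by SNR. The abstract does not give the exact objective or the direction of the ordering, so everything here is an assumption: the interpolation weight `lam`, the `snr_db` field, the ascending default, and all function names are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def distillation_loss(student_logits, teacher_logits, label_index, lam=0.5):
    """Interpolate hard-label and soft-label losses for one output token.

    Assumption: a generic teacher-student objective,
    lam * CE(one_hot, student) + (1 - lam) * CE(teacher, student).
    The teacher stands in for a single-talker model, the student for
    the multi-talker model.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    one_hot = [1.0 if i == label_index else 0.0 for i in range(len(student))]
    hard = cross_entropy(one_hot, student)
    soft = cross_entropy(teacher, student)
    return lam * hard + (1.0 - lam) * soft


def order_by_snr(utterances, descending=False):
    """Sort training utterances by SNR for a curriculum schedule.

    Assumption: each utterance is a dict with a hypothetical "snr_db" key;
    the ordering direction is left configurable because the abstract does
    not state it.
    """
    return sorted(utterances, key=lambda u: u["snr_db"], reverse=descending)
```

Setting `lam=1.0` reduces the objective to ordinary hard-label training, and `lam=0.0` trains the student on the teacher's output distribution alone; intermediate values trade the two off.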

DOI: 10.21437/Interspeech.2019-3192

Cite as: Zhang, W., Chang, X., Qian, Y. (2019) Knowledge Distillation for End-to-End Monaural Multi-Talker ASR System. Proc. Interspeech 2019, 2633-2637, DOI: 10.21437/Interspeech.2019-3192.

  author={Wangyou Zhang and Xuankai Chang and Yanmin Qian},
  title={{Knowledge Distillation for End-to-End Monaural Multi-Talker ASR System}},
  booktitle={Proc. Interspeech 2019},