Learning Contextual Language Embeddings for Monaural Multi-Talker Speech Recognition

Wangyou Zhang, Yanmin Qian


End-to-end multi-speaker speech recognition has been a popular topic in recent years, as more and more research focuses on speech processing in realistic scenarios. Inspired by the human hearing mechanism, which enables us to concentrate on a speaker of interest in multi-speaker mixed speech by utilizing both audio and contextual knowledge, this paper explores contextual information to improve multi-talker speech recognition. In the proposed architecture, a novel embedding learning model is designed to extract the contextual embedding directly from the multi-talker mixed speech. Two advanced training strategies are then proposed to further improve the new model. Experimental results show that the proposed method achieves a substantial improvement on multi-speaker speech recognition, with a ~25% relative WER reduction against the baseline end-to-end multi-talker ASR model.
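For clarity, "relative WER reduction" compares the improvement to the baseline error rate rather than reporting an absolute difference. A minimal sketch of that computation is below; the baseline and proposed WER values are hypothetical placeholders, since the abstract reports only the ~25% relative figure, not the underlying error rates.

```python
def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Relative WER reduction: (baseline - proposed) / baseline."""
    return (baseline_wer - proposed_wer) / baseline_wer

# Hypothetical example: a baseline WER of 40.0% reduced to 30.0%
# corresponds to a 25% relative reduction.
print(relative_wer_reduction(40.0, 30.0))  # → 0.25
```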


DOI: 10.21437/Interspeech.2020-2015

Cite as: Zhang, W., Qian, Y. (2020) Learning Contextual Language Embeddings for Monaural Multi-Talker Speech Recognition. Proc. Interspeech 2020, 304-308, DOI: 10.21437/Interspeech.2020-2015.


@inproceedings{Zhang2020,
  author={Wangyou Zhang and Yanmin Qian},
  title={{Learning Contextual Language Embeddings for Monaural Multi-Talker Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={304--308},
  doi={10.21437/Interspeech.2020-2015},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2015}
}