Combination of End-to-End and Hybrid Models for Speech Recognition

Jeremy H.M. Wong, Yashesh Gaur, Rui Zhao, Liang Lu, Eric Sun, Jinyu Li, Yifan Gong

Recent studies suggest that it may now be possible to construct end-to-end Neural Network (NN) models that perform on par with, or even outperform, hybrid models in speech recognition. These models differ in their designs, and as such, may exhibit diverse and complementary error patterns. Combining the predictions of these models may therefore yield significant gains. This paper studies the feasibility of performing hypothesis-level combination between hybrid and end-to-end NN models. The end-to-end NN models often exhibit a bias in their posteriors toward short hypotheses, and this may adversely affect Minimum Bayes' Risk (MBR) combination methods. MBR training and length normalisation can be used to reduce this bias. Models are trained on Microsoft's 75 thousand hours of anonymised data and evaluated on test sets with 1.8 million words. The results show that significant gains can be obtained by combining the hypotheses of hybrid and end-to-end NN models together.
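As an illustration of the length-normalisation idea mentioned in the abstract, the sketch below divides each hypothesis's log-posterior by its word count before a weighted hypothesis-level combination across systems. This is a minimal, hypothetical sketch, not the paper's implementation; the function names, the n-best list format `(hypothesis, log_posterior)`, and the equal system weights are all assumptions made for the example.

```python
def length_normalise(nbest):
    # Divide each hypothesis's log-posterior by its word count. This is an
    # illustrative normalisation, intended to reduce the bias of end-to-end
    # models toward short hypotheses (shorter word sequences accumulate
    # fewer negative log-probability terms).
    return [(hyp, logp / max(len(hyp.split()), 1)) for hyp, logp in nbest]

def combine_hypotheses(nbest_lists, weights):
    # Hypothesis-level combination: accumulate weighted length-normalised
    # scores for each hypothesis string across the systems' n-best lists,
    # then pick the highest-scoring hypothesis.
    scores = {}
    for nbest, weight in zip(nbest_lists, weights):
        for hyp, score in length_normalise(nbest):
            scores[hyp] = scores.get(hyp, 0.0) + weight * score
    return max(scores, key=scores.get)
```

For example, with a short hypothesis at log-posterior -2.4 and a four-word hypothesis at -4.0, the raw scores favour the short one, but after per-word normalisation (-1.2 vs -1.0) the longer hypothesis wins, which is the bias correction the abstract refers to.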

DOI: 10.21437/Interspeech.2020-2141

Cite as: Wong, J.H., Gaur, Y., Zhao, R., Lu, L., Sun, E., Li, J., Gong, Y. (2020) Combination of End-to-End and Hybrid Models for Speech Recognition. Proc. Interspeech 2020, 1783-1787, DOI: 10.21437/Interspeech.2020-2141.

@inproceedings{wong2020combination,
  author={Jeremy H.M. Wong and Yashesh Gaur and Rui Zhao and Liang Lu and Eric Sun and Jinyu Li and Yifan Gong},
  title={{Combination of End-to-End and Hybrid Models for Speech Recognition}},
  booktitle={Proc. Interspeech 2020},
  year={2020},
  pages={1783--1787},
  doi={10.21437/Interspeech.2020-2141}
}