Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning

Wenxin Hou, Yue Dong, Bairong Zhuang, Longfei Yang, Jiatong Shi, Takahiro Shinozaki


In this paper, we report a large-scale end-to-end language-independent multilingual model for joint automatic speech recognition (ASR) and language identification (LID). The model adopts a hybrid CTC/attention architecture and achieves a word error rate (WER) of 52.8 and LID accuracy of 93.5 on 42 languages with around 5000 hours of training data. We also compare the effects of using subword-level and character-level vocabularies for large-scale multilingual tasks. Furthermore, we transfer the pre-trained model to 14 low-resource languages. Results show that the pre-trained model significantly outperforms non-pretrained baselines on both language-specific and multilingual low-resource ASR tasks, reducing WER by 28.1% and 11.4%, respectively.
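The hybrid CTC/attention training objective referenced above is commonly written as a weighted interpolation of the CTC and attention-decoder log-likelihoods (a sketch of the standard formulation; the weight symbol λ and the exact value used are assumptions, as this abstract does not specify them):

\mathcal{L}_{\mathrm{MTL}} = \lambda \log p_{\mathrm{CTC}}(Y \mid X) + (1 - \lambda) \log p_{\mathrm{att}}(Y \mid X), \qquad 0 \le \lambda \le 1

where X is the acoustic feature sequence and Y the token sequence. In joint ASR/LID setups of this kind, LID is often realized by prepending a language token to Y, though that detail is an assumption here rather than something stated in the abstract.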


 DOI: 10.21437/Interspeech.2020-2164

Cite as: Hou, W., Dong, Y., Zhuang, B., Yang, L., Shi, J., Shinozaki, T. (2020) Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning. Proc. Interspeech 2020, 1037-1041, DOI: 10.21437/Interspeech.2020-2164.


@inproceedings{Hou2020,
  author={Wenxin Hou and Yue Dong and Bairong Zhuang and Longfei Yang and Jiatong Shi and Takahiro Shinozaki},
  title={{Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1037--1041},
  doi={10.21437/Interspeech.2020-2164},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2164}
}