Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-Switching Speech Recognition

Haobo Zhang, Haihua Xu, Van Tung Pham, Hao Huang, Eng Siong Chng


In this paper, we conduct a data selection analysis for building an English-Mandarin code-switching (CS) speech recognition (CSSR) system targeting a real CSSR contest in China. The overall training data consist of three subsets: a code-switching data set, an English data set (LibriSpeech), and a Mandarin data set. The code-switching data are Mandarin-dominated. First, we find that using all of the data yields worse results, and hence a data selection study is necessary. Second, to exploit monolingual data, we find that data matching is crucial: the Mandarin data are closely matched with the Mandarin part of the code-switching data, while the English data are not. However, the Mandarin data only help on utterances that are significantly Mandarin-dominated. Moreover, there is a balance point beyond which adding more monolingual data diverts the CSSR system and degrades results. Finally, we analyze the effectiveness of combining monolingual data to train a CSSR system with the HMM-DNN hybrid framework. The resulting CSSR system can perform within-utterance code-switching recognition, but it still lags behind a system trained on code-switching data.


DOI: 10.21437/Interspeech.2020-1582

Cite as: Zhang, H., Xu, H., Pham, V.T., Huang, H., Chng, E.S. (2020) Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-Switching Speech Recognition. Proc. Interspeech 2020, 2392-2396, DOI: 10.21437/Interspeech.2020-1582.


@inproceedings{Zhang2020,
  author={Haobo Zhang and Haihua Xu and Van Tung Pham and Hao Huang and Eng Siong Chng},
  title={{Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-Switching Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2392--2396},
  doi={10.21437/Interspeech.2020-1582},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1582}
}