Data Augmentation for Code-Switch Language Modeling by Fusing Multiple Text Generation Methods

Xinhui Hu, Qi Zhang, Lei Yang, Binbin Gu, Xinkang Xu


To address the problem of data scarcity in training language models (LMs) for code-switching (CS) speech recognition, we propose an approach that obtains augmentation texts from three different viewpoints. The first enhances monolingual LMs by selecting sentences that match existing conversational corpora. The second applies syntactically constrained word replacement to a monolingual Chinese corpus, with the help of an aligned word list obtained from a pseudo-parallel corpus and the part-of-speech (POS) tags of words. The third generates text with a pointer-generator network equipped with a copy mechanism, trained on real CS text data. Texts produced by all three approaches improve CS LMs, and they are finally fused into a single LM for CS ASR tasks.
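The second augmentation idea, replacing words in monolingual Chinese sentences with their English counterparts under a POS constraint, can be sketched minimally as below. The alignment table, the replaceable POS set, and the function name are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: POS-constrained Zh->En word substitution for
# synthesizing code-switched text. The alignment table and POS set
# are toy examples, not the paper's real resources.

ALIGNED = {"会议": "meeting", "报告": "report"}  # toy Zh->En aligned word list
REPLACEABLE_POS = {"NN"}                         # e.g., only swap common nouns

def augment(tagged_sentence):
    """tagged_sentence: list of (word, pos) pairs for one Chinese sentence.
    Returns a synthetic code-switched sentence as a list of words."""
    out = []
    for word, pos in tagged_sentence:
        if pos in REPLACEABLE_POS and word in ALIGNED:
            out.append(ALIGNED[word])  # substitute the aligned English word
        else:
            out.append(word)           # keep the original Chinese word
    return out

# "明天 的 会议 很 重要" -> ['明天', '的', 'meeting', '很', '重要']
print(augment([("明天", "NT"), ("的", "DEG"), ("会议", "NN"),
               ("很", "AD"), ("重要", "VA")]))
```

In practice the replacement would also be filtered by syntactic constraints and alignment confidence, which this sketch omits.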

LMs built with the above augmented data were evaluated on two Mandarin-English CS speech sets, DTANG and SEAME. Perplexities were greatly reduced with all kinds of augmented texts, and speech recognition performance improved steadily. The mixed word error rates (MER) on the DTANG and SEAME evaluation sets achieved relative reductions of 9.10% and 29.73%, respectively.


DOI: 10.21437/Interspeech.2020-2219

Cite as: Hu, X., Zhang, Q., Yang, L., Gu, B., Xu, X. (2020) Data Augmentation for Code-Switch Language Modeling by Fusing Multiple Text Generation Methods. Proc. Interspeech 2020, 1062-1066, DOI: 10.21437/Interspeech.2020-2219.


@inproceedings{Hu2020,
  author={Xinhui Hu and Qi Zhang and Lei Yang and Binbin Gu and Xinkang Xu},
  title={{Data Augmentation for Code-Switch Language Modeling by Fusing Multiple Text Generation Methods}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1062--1066},
  doi={10.21437/Interspeech.2020-2219},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2219}
}