Tone Learning in Low-Resource Bilingual TTS

Ruolan Liu, Xue Wen, Chunhui Lu, Xiao Chen


We present a system for low-resource multi-speaker cross-lingual text-to-speech synthesis. In particular, we train on monolingual English and Mandarin speakers and synthesize every speaker in both languages. The Mandarin training data is limited to 15 minutes of speech from a single female Mandarin speaker. We identify accent carry-over and mispronunciation in the low-resource language as the two major challenges in this scenario, and address them with tone preservation mechanisms and data augmentation, respectively. Applied to a recent strong multilingual baseline, these techniques achieve higher ratings in intelligibility and target accent, but slightly lower ratings in cross-lingual speaker similarity.


DOI: 10.21437/Interspeech.2020-2180

Cite as: Liu, R., Wen, X., Lu, C., Chen, X. (2020) Tone Learning in Low-Resource Bilingual TTS. Proc. Interspeech 2020, 2952-2956, DOI: 10.21437/Interspeech.2020-2180.


@inproceedings{Liu2020,
  author={Ruolan Liu and Xue Wen and Chunhui Lu and Xiao Chen},
  title={{Tone Learning in Low-Resource Bilingual TTS}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2952--2956},
  doi={10.21437/Interspeech.2020-2180},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2180}
}