Cross-Lingual, Multi-Speaker Text-To-Speech Synthesis Using Neural Speaker Embedding

Mengnan Chen, Minchuan Chen, Shuang Liang, Jun Ma, Lei Chen, Shaojun Wang, Jing Xiao

Neural network-based models for text-to-speech (TTS) synthesis have made significant progress in recent years. In this paper, we present a cross-lingual, multi-speaker, neural end-to-end TTS framework that can model speaker characteristics and synthesize speech in different languages. We implement the model by introducing a separately trained neural speaker embedding network that can represent the latent structure of different speakers and their language pronunciations. We train the speech synthesis network bilingually and demonstrate the feasibility of synthesizing a Chinese speaker's English speech and vice versa. We also explore different methods for adapting the model to a new speaker from only a few speech samples. Experimental results show that, with only a few minutes of audio from a new speaker, the proposed model can synthesize speech bilingually and achieve decent naturalness and speaker similarity in both languages.
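The abstract's core idea is to condition the synthesis network on a vector produced by a separately trained speaker-embedding network. The following is a minimal, hypothetical sketch of that conditioning pattern only; the function names (`speaker_embedding`, `condition`) and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: condition a TTS decoder on a speaker vector
# produced by a separately trained speaker encoder. Names and sizes
# are illustrative, not from the paper.

def speaker_embedding(reference_utterances, dim=4):
    # Stand-in for the pretrained speaker-encoder network: maps a few
    # reference utterances of any length to one fixed-size vector.
    flat = [s for utt in reference_utterances for s in utt]
    mean = sum(flat) / len(flat)
    return [mean] * dim  # fixed length regardless of input length

def condition(encoder_outputs, spk_vec):
    # Concatenate the speaker vector onto every text-encoder frame,
    # so the decoder sees speaker identity at each decoding step.
    return [frame + spk_vec for frame in encoder_outputs]

# A few seconds of "reference audio" from the new speaker:
ref_audio = [[0.1, 0.2], [0.3]]
spk = speaker_embedding(ref_audio)

enc = [[1.0, 2.0], [3.0, 4.0]]   # two text-encoder frames
cond = condition(enc, spk)        # decoder input, speaker-conditioned
```

Because the speaker encoder is trained separately, adapting to a new speaker only requires computing `spk` from a few samples; the synthesis network itself need not be retrained.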

DOI: 10.21437/Interspeech.2019-1632

Cite as: Chen, M., Chen, M., Liang, S., Ma, J., Chen, L., Wang, S., Xiao, J. (2019) Cross-Lingual, Multi-Speaker Text-To-Speech Synthesis Using Neural Speaker Embedding. Proc. Interspeech 2019, 2105-2109, DOI: 10.21437/Interspeech.2019-1632.

@inproceedings{chen19_interspeech,
  author={Mengnan Chen and Minchuan Chen and Shuang Liang and Jun Ma and Lei Chen and Shaojun Wang and Jing Xiao},
  title={{Cross-Lingual, Multi-Speaker Text-To-Speech Synthesis Using Neural Speaker Embedding}},
  booktitle={Proc. Interspeech 2019},
  year={2019},
  pages={2105--2109},
  doi={10.21437/Interspeech.2019-1632}
}