Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment

Zhaoyu Liu, Brian Mak


Recent studies in multi-lingual and multi-speaker text-to-speech synthesis have proposed approaches that rely on proprietary corpora of performing artists and require fine-tuning to enroll new voices. To reduce these costs, we investigate a novel approach for generating high-quality speech in multiple languages for speakers enrolled in their native language. In our proposed system, we introduce tone/stress embeddings that extend the language embedding to represent tone and stress information. By manipulating the tone/stress embedding input, our system can synthesize speech with either a native or a foreign accent. To support online enrollment of new speakers, we condition the Tacotron-based synthesizer on speaker embeddings derived from a pre-trained x-vector speaker encoder via transfer learning. We introduce a shared phoneme set that encourages more phoneme sharing than the IPA. Our MOS results demonstrate that the native speech in all languages is highly intelligible and natural. We also find that L2-norm normalization and ZCA-whitening of the x-vectors help improve system stability and audio quality. Finally, we find that WaveNet performance is seemingly language-independent: a WaveNet model trained on any one of the three languages supported by our system can generate speech in the other two languages very well.
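The x-vector post-processing mentioned above (L2-norm normalization and ZCA-whitening) can be sketched as follows. This is a generic numpy illustration of the two standard operations, not the authors' implementation; the dimensions, epsilon value, and random demo data are assumptions for the example.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten a matrix of embeddings (one x-vector per row).

    Whitening decorrelates the embedding dimensions and equalizes
    their variances while staying as close as possible to the
    original coordinate axes (unlike PCA whitening).
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    # Sample covariance of the centered embeddings.
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    # Eigendecomposition via SVD (cov is symmetric PSD).
    U, S, _ = np.linalg.svd(cov)
    # ZCA transform: rotate, scale by 1/sqrt(eigenvalue), rotate back.
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return Xc @ W

def l2_normalize(X):
    """Scale each row to unit Euclidean norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

# Demo with synthetic "x-vectors" (200 vectors of dimension 16).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16)) @ rng.normal(size=(16, 16))
X_processed = l2_normalize(zca_whiten(X))
```

After whitening, the sample covariance of the embeddings is approximately the identity matrix, and L2 normalization then places every vector on the unit hypersphere, which can make the conditioning signal to the synthesizer better behaved across speakers.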


DOI: 10.21437/Interspeech.2020-1464

Cite as: Liu, Z., Mak, B. (2020) Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment. Proc. Interspeech 2020, 2932-2936, DOI: 10.21437/Interspeech.2020-1464.


@inproceedings{Liu2020,
  author={Zhaoyu Liu and Brian Mak},
  title={{Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2932--2936},
  doi={10.21437/Interspeech.2020-1464},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1464}
}