Efficient Neural Speech Synthesis for Low-Resource Languages Through Multilingual Modeling

Marcel de Korte, Jaebok Kim, Esther Klabbers


Recent advances in neural TTS have led to models that can produce high-quality synthetic speech. However, these models typically require large amounts of training data, which can make it costly to produce a new voice with the desired quality. Although multi-speaker modeling can reduce the data requirements necessary for a new voice, this approach is usually not viable for many low-resource languages, for which abundant multi-speaker data is unavailable. In this paper, we therefore investigated to what extent multilingual multi-speaker modeling can be an alternative to monolingual multi-speaker modeling, and explored how data from foreign languages may best be combined with low-resource language data. We found that multilingual modeling can increase the naturalness of low-resource language speech, showed that multilingual models can produce speech with a naturalness comparable to that of monolingual multi-speaker models, and saw that the target language naturalness was affected by the strategy used to add foreign language data.


DOI: 10.21437/Interspeech.2020-2664

Cite as: de Korte, M., Kim, J., Klabbers, E. (2020) Efficient Neural Speech Synthesis for Low-Resource Languages Through Multilingual Modeling. Proc. Interspeech 2020, 2967-2971, DOI: 10.21437/Interspeech.2020-2664.


@inproceedings{deKorte2020,
  author={Marcel {de Korte} and Jaebok Kim and Esther Klabbers},
  title={{Efficient Neural Speech Synthesis for Low-Resource Languages Through Multilingual Modeling}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={2967--2971},
  doi={10.21437/Interspeech.2020-2664},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2664}
}