An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets

Pilar Oplustil Gallegos, Jennifer Williams, Joanna Rownicka, Simon King


Large multi-speaker datasets for TTS typically contain diverse speakers, recording conditions, speaking styles, and levels of data quality. Although one might generally presume that more data is better, in this paper we show that a model trained on a carefully chosen subset of speakers from LibriTTS produces significantly better-quality synthetic speech than a model trained on a larger set. We propose an unsupervised methodology to find this subset by clustering per-speaker acoustic representations.
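
As a rough illustration of the idea in the abstract, the Python sketch below clusters speaker-level acoustic vectors with K-means. It is a minimal sketch, not the paper's exact recipe: the choice of representation (hypothetical per-utterance embeddings averaged into one vector per speaker), the number of clusters, and the rule for picking which cluster to train on are all illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def speaker_level_vectors(utterance_embeddings):
    # utterance_embeddings: dict mapping speaker_id -> array of shape (n_utts, dim).
    # Average per-utterance acoustic embeddings into one vector per speaker.
    return {spk: embs.mean(axis=0) for spk, embs in utterance_embeddings.items()}

def cluster_speakers(speaker_vectors, n_clusters=5, seed=0):
    # K-means over speaker-level vectors; returns speaker_id -> cluster label.
    speakers = sorted(speaker_vectors)
    X = np.stack([speaker_vectors[s] for s in speakers])
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
    return dict(zip(speakers, labels))

# Usage with synthetic data standing in for real acoustic features
# (the feature dimension and cluster count here are arbitrary):
rng = np.random.default_rng(0)
fake = {f"spk{i}": rng.normal(size=(20, 64)) for i in range(40)}
assignments = cluster_speakers(speaker_level_vectors(fake), n_clusters=5)
subset = [s for s, c in assignments.items() if c == 0]  # keep one cluster for training

Averaging utterance-level features up to the speaker level keeps the clustering cheap even for datasets with thousands of speakers; how the kept cluster is chosen (e.g. by quality of data or a pilot synthesis model) is left open in this sketch.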


DOI: 10.21437/Interspeech.2020-2567

Cite as: Gallegos, P.O., Williams, J., Rownicka, J., King, S. (2020) An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets. Proc. Interspeech 2020, 1758-1762, DOI: 10.21437/Interspeech.2020-2567.


@inproceedings{Gallegos2020,
  author={Pilar Oplustil Gallegos and Jennifer Williams and Joanna Rownicka and Simon King},
  title={{An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={1758--1762},
  doi={10.21437/Interspeech.2020-2567},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2567}
}