Corpus design for expressive speech: impact of the utterance length

Meysam Shamsi, Jonathan Chevelu, Nelly Barbot, Damien Lolive


The voice corpus plays a crucial role in the quality of synthetic speech generation, especially under a length constraint. Creating a new voice is costly, and selecting the recording script for an expressive TTS task is generally treated as an optimization problem whose goal is a rich and parsimonious corpus. In order to vocalize a given book using a TTS system, we investigate two script selection approaches. Based on preliminary observations, we simply propose to select the shortest utterances of the book, and we compare the results of this method with those of state-of-the-art methods for two books, with different utterance lengths and styles, using two kinds of concatenation-based TTS systems. The study of the TTS costs indicates that selecting the shortest utterances could result in better synthetic quality, which is confirmed by a perceptual test. An investigation of the usual corpus design criteria from the literature, such as unit coverage or distribution similarity of units, shows that they are not pertinent metrics in the framework of this study.
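
The shortest-utterance selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_shortest`, the use of character count as the length measure, and the `budget` parameter standing in for the length constraint are all assumptions made for illustration.

```python
def select_shortest(utterances, budget):
    """Pick the shortest utterances until the length budget is used up.

    Assumptions (not from the paper): length is measured in characters,
    and `budget` is the maximum total length of the recording script.
    """
    selected = []
    total = 0
    # sorted() is stable, so utterances of equal length keep book order.
    for utt in sorted(utterances, key=len):
        if total + len(utt) > budget:
            # Every remaining utterance is at least this long, so stop.
            break
        selected.append(utt)
        total += len(utt)
    return selected


# Hypothetical toy "book" made of four utterances.
book = [
    "It was late.",
    "No.",
    "She paused, then walked slowly back toward the house.",
    "Why?",
]
script = select_shortest(book, budget=20)
```

With a real book, the length measure would more plausibly be a phone count or an estimated duration, and the budget would be the target size of the voice corpus.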


DOI: 10.21437/SpeechProsody.2020-195

Cite as: Shamsi, M., Chevelu, J., Barbot, N., Lolive, D. (2020) Corpus design for expressive speech: impact of the utterance length. Proc. 10th International Conference on Speech Prosody 2020, 955-959, DOI: 10.21437/SpeechProsody.2020-195.


@inproceedings{Shamsi2020,
  author={Meysam Shamsi and Jonathan Chevelu and Nelly Barbot and Damien Lolive},
  title={{Corpus design for expressive speech: impact of the utterance length}},
  year=2020,
  booktitle={Proc. 10th International Conference on Speech Prosody 2020},
  pages={955--959},
  doi={10.21437/SpeechProsody.2020-195},
  url={http://dx.doi.org/10.21437/SpeechProsody.2020-195}
}