Characteristics of Text-to-Speech and Other Corpora

Erica Cooper, Emily Li, Julia Hirschberg

Extensive TTS corpora exist for commercial systems created for high-resource languages such as Mandarin, English, and Japanese. Speakers recorded for these corpora are typically instructed to maintain constant f0, energy, and speaking rate and are recorded in ideal acoustic environments, producing clean, consistent audio. We have been developing TTS systems from "found" data collected for other purposes (e.g. training ASR systems) or available on the web (e.g. news broadcasts, audiobooks) to produce TTS systems for low-resource languages (LRLs) which do not currently have expensive, commercial systems. This study investigates whether traditional TTS speakers do exhibit significantly less variation and better speaking characteristics than speakers in "found" genres. By examining characteristics of f0, energy, speaking rate, articulation, NHR, jitter, and shimmer in "found" genres and comparing these to traditional TTS corpora, we have found that TTS recordings are indeed characterized by low mean pitch, standard deviation of energy, speaking rate, and level of articulation, and low mean and standard deviations of shimmer and NHR; in a number of respects these are quite similar to some "found" genres. By identifying similarities and differences, we are able to identify objective methods for selecting "found" data to build TTS systems for LRLs.

DOI: 10.21437/SpeechProsody.2018-140

Cite as: Cooper, E., Li, E., Hirschberg, J. (2018) Characteristics of Text-to-Speech and Other Corpora. Proc. 9th International Conference on Speech Prosody 2018, 690-694, DOI: 10.21437/SpeechProsody.2018-140.

  author={Erica Cooper and Emily Li and Julia Hirschberg},
  title={Characteristics of Text-to-Speech and Other Corpora},
  booktitle={Proc. 9th International Conference on Speech Prosody 2018},