Third ESCA/COCOSDA Workshop on Speech Synthesis

November 26-29, 1998
Jenolan Caves House, Blue Mountains, NSW, Australia

Which is More Important in a Concatenative Text to Speech System - Pitch, Duration, or Spectral Discontinuity?

M. Plumpe, S. Meredith

Microsoft Research, Redmond, WA, USA

This paper focuses on experimental evaluations designed to determine the relative quality of the components of the Whistler TTS engine. Eight different systems were compared pairwise to determine a rank ordering as well as a measure of the quality difference between the systems. The most interesting aspect of the results is that the simple unit duration scheme used in Whistler was found to be very good, both when used in combination with natural acoustics and pitch and when used in combination with synthetic pitch. The synthetic pitch was found to be the aspect of the system that results in the greatest quality degradation.


Full Paper (with 8 sound examples linked from within the paper)

Bibliographic reference. Plumpe, M. / Meredith, S. (1998): "Which is More Important in a Concatenative Text to Speech System - Pitch, Duration, or Spectral Discontinuity?", In SSW3-1998, 231-236.