5th International Conference on Spoken Language Processing

Sydney, Australia
November 30 - December 4, 1998

Exploration of Acoustic Correlates in Speaker Selection for Concatenative Synthesis

Ann K. Syrdal, Alistair Conkie, Yannis Stylianou

AT&T Labs -- Research, USA

It is often difficult to determine whether a speaker is suitable to serve as a model for concatenative text-to-speech (TTS) synthesis. The perceived quality of a speaker's natural voice does not necessarily predict the quality of the synthetic voice built from it. The female and male speakers on whom to base two synthetic voices for the new AT&T TTS system were therefore selected empirically. Brief readings of identical text materials were recorded from professional speakers. Small-scale TTS systems were constructed with a minimal diphone inventory, sufficient to synthesize a limited number of test sentences. Synthesized sentences and their naturally spoken references were presented to listeners in a formal listening evaluation. In addition, a variety of acoustic measurements of the speakers were made to determine which acoustic characteristics correlated with subjective synthesis quality. The results have implications both for speaker selection and for improving concatenative synthesis methods.

