Third ESCA/COCOSDA Workshop on Speech Synthesis
November 26-29, 1998
Concatenative synthesis is widely used in TTS to generate high-quality synthetic speech with relatively natural-sounding prosody. Whatever type of synthesis unit is used (diphone, phoneme, etc.), a large speech database is usually needed to ensure phonetic and phonemic variation of the units across a rich variety of contexts. In the CHATR synthesis system, unit selection finds the most appropriate phoneme sequence for an input text by minimizing a) joint discontinuity and b) mismatch with the target prosody. However, the current unit selection module uses only an objective distance function, and pitch and duration are not modified to match the target prosody.
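The selection criterion described above can be illustrated with a minimal sketch. It is not CHATR's implementation: the `Unit` fields, the cost definitions, and the weights `w_f0` and `w_dur` are all hypothetical, chosen only to show how a dynamic-programming search can minimize the sum of a join cost and a target-prosody cost over candidate units.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    """Hypothetical database unit; real systems store far richer features."""
    name: str
    f0_start: float   # pitch at the left edge, Hz
    f0_end: float     # pitch at the right edge, Hz
    duration: float   # seconds

def target_cost(unit, target_f0, target_dur, w_f0=1.0, w_dur=100.0):
    """Mismatch between a unit and the target prosody (weights are assumptions)."""
    mean_f0 = 0.5 * (unit.f0_start + unit.f0_end)
    return w_f0 * abs(mean_f0 - target_f0) + w_dur * abs(unit.duration - target_dur)

def join_cost(left, right):
    """Objective discontinuity at the join; here only the pitch jump is used."""
    return abs(left.f0_end - right.f0_start)

def select_units(candidates, targets):
    """Viterbi-style search over one candidate list per target position.

    candidates: list of lists of Unit
    targets:    list of (target_f0, target_dur) tuples, same length
    Returns (total_cost, best_path).
    """
    # best holds, for each candidate at the current position,
    # the cheapest (cost, path) ending in that candidate.
    best = [(target_cost(u, *targets[0]), [u]) for u in candidates[0]]
    for cands, tgt in zip(candidates[1:], targets[1:]):
        new_best = []
        for u in cands:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(u, *tgt), path + [u]))
        best = new_best
    return min(best, key=lambda item: item[0])
```

Calling `select_units` with one candidate list per phoneme and one `(f0, duration)` target per phoneme returns the cheapest sequence and its total cost. Note that this sketch never modifies the selected units, which mirrors the limitation stated in the abstract.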
We address two issues in this paper: (1) how to derive a perceptual discontinuity function that determines the perceptually significant amount of discontinuity between two candidate units, and (2) how to take into account the constraints of possible prosodic modification (pitch/duration scaling using signal processing). Both techniques are tested with the unit selection and synthesis modules, and the resulting changes in voice quality and prosody are evaluated.
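One way to picture the two ideas is sketched below. The abstract does not give the form of the perceptual discontinuity function or the modification limits, so everything here is an assumption made for illustration: a discontinuity below a hypothetical audibility `threshold` costs nothing, and a prosodic mismatch costs nothing as long as the required scaling factor stays within a hypothetical range `[min_factor, max_factor]` that signal processing is assumed to handle without audible degradation.

```python
def perceptual_join_cost(raw_distance, threshold=1.0):
    """Map an objective distance to a perceptual cost.

    Assumption: discontinuities below `threshold` are inaudible and cost 0;
    above it, the cost grows linearly with the excess.
    """
    return max(0.0, raw_distance - threshold)

def constrained_target_cost(unit_value, target_value,
                            min_factor=0.8, max_factor=1.25):
    """Cost of matching a target pitch or duration when scaling is allowed.

    The required scaling factor is target / unit. If it lies inside the
    assumed safe range, the mismatch can be repaired by modification and
    costs 0; otherwise the cost is the distance to the nearest limit.
    """
    factor = target_value / unit_value
    if min_factor <= factor <= max_factor:
        return 0.0
    nearest = min(max(factor, min_factor), max_factor)
    return abs(factor - nearest)
```

With these assumed defaults, a unit at 100 Hz asked to reach 110 Hz needs a factor of 1.1 and is treated as free to modify, while a target of 150 Hz needs a factor of 1.5 and is penalized for exceeding the limit.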
Bibliographic reference. Ding, Wen / Fujisawa, Ken / Campbell, Nick (1998): "Improving Speech Synthesis of CHATR Using a Perceptual Discontinuity Function and Constraints of Prosodic Modification", In SSW3-1998, 191-194.