Second ESCA/IEEE Workshop on Speech Synthesis

September 12-15, 1994
Mohonk Mountain House, New Paltz, NY, USA

Prosody and the Selection of Units for Concatenation Synthesis

Nick Campbell

ATR Interpreting Telecommunications Research Laboratories, Soraku-gun, Kyoto, Japan

The ATR μ-talk non-uniform unit system of concatenative synthesis has been shown to produce very high quality synthetic speech, but is slow and expensive in memory. Furthermore, it was designed for Japanese and is not directly applicable to other languages. This paper shows how the μ-talk principle can be generalised for multi-lingual synthesis, and describes methods for database pruning and faster unit selection that overcome the main criticisms levelled against the Japanese version. To reduce selection time, we substitute prosodic selection criteria for the acoustic measures, and show that these result in faster unit selection that minimises post-processing of the speech waveform and thus reduces distortion in the output speech. To reduce database size, we generate a rectangular array of non-uniform segments to a predetermined depth. This preserves sparse units and maximises tokens of the common sounds of the language.
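The two ideas in the abstract, pruning the database to a fixed depth and selecting units by prosodic fit, can be illustrated with a minimal sketch. It is not the paper's implementation. The `Unit` fields, the relative-distance cost, the weights, and the "keep the first `depth` tokens" pruning rule are all assumptions made for illustration; the abstract does not specify which prosodic features or pruning criteria are used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical database token; the field names are illustrative only."""
    label: str     # segment label (a phone or a longer non-uniform unit)
    f0_hz: float   # mean fundamental frequency of the token
    dur_ms: float  # duration of the token


def prune_to_depth(units, depth):
    """Keep at most `depth` tokens per label.

    Labels with fewer than `depth` tokens keep all of them, so sparse
    units survive while common ones are capped. Keeping the first
    `depth` tokens encountered is an assumed, deliberately naive rule.
    """
    table = defaultdict(list)
    for unit in units:
        if len(table[unit.label]) < depth:
            table[unit.label].append(unit)
    return dict(table)


def prosodic_cost(unit, target_f0_hz, target_dur_ms, w_f0=1.0, w_dur=1.0):
    """Weighted relative distance between a token and the target prosody (assumed form)."""
    f0_term = abs(unit.f0_hz - target_f0_hz) / target_f0_hz
    dur_term = abs(unit.dur_ms - target_dur_ms) / target_dur_ms
    return w_f0 * f0_term + w_dur * dur_term


def select_unit(table, label, target_f0_hz, target_dur_ms):
    """Return the candidate with the lowest prosodic cost, or None if the label is absent."""
    candidates = table.get(label, [])
    if not candidates:
        return None
    return min(candidates, key=lambda u: prosodic_cost(u, target_f0_hz, target_dur_ms))


units = [
    Unit("a", 120.0, 80.0),
    Unit("a", 200.0, 60.0),
    Unit("a", 150.0, 70.0),  # dropped when depth is 2
    Unit("k", 110.0, 40.0),  # sparse label, always kept
]
table = prune_to_depth(units, depth=2)
best = select_unit(table, "a", target_f0_hz=190.0, target_dur_ms=65.0)
```

Choosing the token whose pitch and duration already lie closest to the target is one way to reduce the amount of waveform modification needed afterwards, which is the motivation the abstract gives for prosodic selection.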


Bibliographic reference. Campbell, Nick (1994): "Prosody and the selection of units for concatenation synthesis", in SSW2-1994, 61-64.