Third ESCA/COCOSDA Workshop on Speech Synthesis

November 26-29, 1998
Jenolan Caves House, Blue Mountains, NSW, Australia

Modeling Segmental Durations for Japanese Text-To-Speech Synthesis

Jennifer J. Venditti (1,2), Jan P. H. van Santen (1)

(1) Bell Labs - Lucent Technologies, Murray Hill, NJ, USA
(2) Ohio State University, USA

Accurate estimation of segmental durations is crucial for natural-sounding text-to-speech (TTS) synthesis. This paper presents a model of segmental duration used in the Bell Labs Japanese TTS system. We describe the constraints on vowel devoicing, and the effects of factors such as phone identity, surrounding phone identities, accentuation, syllabic structure, and phrasal position on the duration of both consonants and vowels. A Sum-of-Products approach is used to model key interactions observed in the data, and to predict values of factor combinations not found in the speech database. We report overall observed-predicted correlations of 0.88 for vowels (RMSdev: 16.8ms) and 0.94 for consonants (RMSdev: 12.5ms).
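
As a rough illustration of the ideas named in the abstract, the sketch below shows the general shape of a sum-of-products duration predictor and the two evaluation figures quoted above (observed-predicted correlation and RMS deviation). It is a minimal sketch under stated assumptions, not the model described in the paper: the factor names, the particular choice of product terms, and every numeric value are hypothetical and chosen only for illustration. Because each parameter is tied to a single factor level rather than to a full combination of levels, a model of this form can produce a prediction for a combination that never occurred in the training data.

```python
import math

# Hypothetical per-level parameters. These are NOT values from the paper;
# they only illustrate the form "sum of products of single-factor scales".
PHONE_SCALE = {"a": 80.0, "i": 60.0}                   # intrinsic term, ms
ACCENT_SCALE = {"accented": 1.10, "unaccented": 1.00}  # unitless multiplier
POSITION_SCALE = {"final": 1.25, "medial": 1.00}       # unitless multiplier
POSITION_OFFSET = {"final": 10.0, "medial": 0.0}       # additive term, ms


def predict_duration(phone, accent, position):
    """Predict a duration in ms as a sum of two product terms (assumed form)."""
    # First product term: phone scale * accent scale * position scale.
    product_term = (
        PHONE_SCALE[phone] * ACCENT_SCALE[accent] * POSITION_SCALE[position]
    )
    # Second product term: a single-factor offset for phrasal position.
    return product_term + POSITION_OFFSET[position]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rms_deviation(observed, predicted):
    """Root-mean-square deviation between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

Given lists of observed and predicted durations for a set of segments, `pearson` and `rms_deviation` yield figures of the kind reported in the abstract; the inputs here would come from a labeled speech database, which this sketch does not include.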

Bibliographic reference. Venditti, Jennifer J. / Santen, Jan P. H. van (1998): "Modeling Segmental Durations for Japanese Text-To-Speech Synthesis", In SSW3-1998, 31-36.