Third ESCA/COCOSDA Workshop on Speech Synthesis
November 26-29, 1998
Accurate estimation of segmental durations is crucial for natural-sounding text-to-speech (TTS) synthesis. This paper presents a model of segmental duration used in the Bell Labs Japanese TTS system. We describe the constraints on vowel devoicing, and the effects of factors such as phone identity, surrounding phone identities, accentuation, syllabic structure, and phrasal position on the duration of both consonants and vowels. A Sum-of-Products approach is used to model key interactions observed in the data, and to predict durations for factor combinations not found in the speech database. We report overall observed-predicted correlations of 0.88 for vowels (RMS deviation: 16.8 ms) and 0.94 for consonants (RMS deviation: 12.5 ms).
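As a rough illustration of the Sum-of-Products idea and of the two evaluation measures quoted above, the following Python sketch predicts a duration as a sum of product terms, where each term multiplies per-factor scale values. The abstract does not give the model's terms or parameters, so the factor names, levels, scale values, and the function names `predict_duration`, `pearson`, and `rms_deviation` are all hypothetical, and the evaluation formulas are the standard textbook definitions rather than the authors' exact procedure.

```python
import math

# Hypothetical sum-of-products structure (NOT the paper's fitted model).
# Each term maps factor name -> {factor level -> scale value}.
# Term 1: intrinsic phone duration (ms) scaled by an accent multiplier.
# Term 2: an additive phrasal-position offset (ms).
TERMS = [
    {
        "phone": {"a": 80.0, "i": 60.0},
        "accent": {"accented": 1.1, "unaccented": 1.0},
    },
    {
        "position": {"phrase_final": 20.0, "phrase_medial": 0.0},
    },
]


def predict_duration(features, terms=TERMS):
    """Return the sum over terms of the product of per-factor scales."""
    total = 0.0
    for term in terms:
        product = 1.0
        for factor, scale in term.items():
            product *= scale[features[factor]]
        total += product
    return total


def pearson(xs, ys):
    """Pearson correlation between observed and predicted durations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rms_deviation(observed, predicted):
    """Root-mean-square deviation between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

Because each factor contributes its own scale value, a combination that never occurs in the training data (for example an accented, phrase-final /i/ in this toy setup) still receives a prediction, provided each of its factor levels has been seen individually.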
Bibliographic reference. Venditti, Jennifer J. / Santen, Jan P. H. van (1998): "Modeling Segmental Durations for Japanese Text-To-Speech Synthesis", In SSW3-1998, 31-36.