First European Conference on Speech Communication and Technology

Paris, France
September 27-29, 1989

Syllable-Level Duration Determination

W. Nick Campbell

Speech & Knowledge Based Systems Research, IBM UK Scientific Centre, Winchester, England

Accurate prediction of duration in a text-to-speech system is essential to natural-sounding intonation. Klatt [1] proposed a set of phoneme-based rules for this task, but an adaptation of the rule set to British English [2] accounted for only 68% of the variance in the durations observed in a 4000-syllable test text. Modifying these rules to incorporate foot-level effects [3,4] improved the prediction slightly, to 71% of the variance. A similar degree of prediction can be attained, with minimal reference to segment specifics, by modelling duration at the level of the syllable, with sensitivity to stress, position in phrase and foot, and the number of segments in onset, peak and coda. This supposes that micro-durational features, such as the shortening of segments in clusters and the lengthening of vowels to cue voicing, operate at a phonetic level within the constraints of a syllable frame, and that higher-level features determine the factors of lengthening or compression for the framework into which they are to fit. In support of this view, a connectionist implementation is described that has eight input features, one layer of hidden units and one analog output unit, and that accounts for a comparable 70% of the variance in duration.
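
As an illustration only, the network shape named in the abstract (eight input features, one hidden layer, one analog output) can be sketched as a small feed-forward pass. The abstract does not give the hidden-layer size, the activation functions, the feature encoding, or the trained weights, so everything below other than the eight inputs and the single output is an assumption: four sigmoid hidden units, a linear output unit, random untrained weights, and variance accounted for computed as 1 - SS_res/SS_tot. The function names are hypothetical.

```python
import math
import random

N_INPUTS = 8   # stated in the abstract
N_HIDDEN = 4   # assumption: the hidden-layer size is not given


def init_network(n_inputs=N_INPUTS, n_hidden=N_HIDDEN, seed=0):
    """Build an untrained network with small random weights (placeholders)."""
    rng = random.Random(seed)

    def small():
        return rng.uniform(-0.5, 0.5)

    return {
        "w_hidden": [[small() for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b_hidden": [small() for _ in range(n_hidden)],
        "w_out": [small() for _ in range(n_hidden)],
        "b_out": small(),
    }


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(net, features):
    """Forward pass: eight features -> sigmoid hidden layer -> one analog value."""
    if len(features) != len(net["w_hidden"][0]):
        raise ValueError("expected %d features" % len(net["w_hidden"][0]))
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(net["w_hidden"], net["b_hidden"])
    ]
    # Assumption: the analog output unit is linear (no squashing function).
    return sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"]


def variance_accounted_for(observed, predicted):
    """Proportion of variance in observed durations explained by predictions."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((o - mean) ** 2 for o in observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot


# Hypothetical encoding of one syllable. The abstract names stress, position
# in phrase and foot, and segment counts in onset, peak and coda; the order,
# scaling and the remaining two slots are placeholders.
syllable = [1.0, 0.0, 0.5, 2.0, 1.0, 1.0, 0.0, 0.0]
duration_estimate = predict(init_network(), syllable)
```

With untrained weights the output is meaningless as a duration; the sketch only shows how syllable-level features, rather than per-phoneme rules, would feed a single duration estimate, and how a figure such as the 70% quoted above could be computed once predictions exist.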

Bibliographic reference.  Campbell, W. Nick (1989): "Syllable-level duration determination", In EUROSPEECH-1989, 2698-2701.