Accurate prediction of duration in a text-to-speech system is essential to natural-sounding intonation. Klatt proposed a set of phoneme-based rules to perform this task, but an adaptation of the rule set to British English accounted for only 68% of the variance in the durations observed in a 4000-syllable test text. Modifying these rules to incorporate foot-level effects [3,4] improved the prediction slightly, to 71% of the variance. A similar degree of prediction can be attained, with minimal reference to segment specifics, by modelling duration at the level of the syllable, with sensitivity to stress, position in phrase and foot, and number of segments in onset, peak and coda. This approach supposes that micro-durational features, such as the shortening of segments in clusters and the lengthening of vowels to cue voicing, operate at a phonetic level within the constraints of a syllable frame, and that higher-level features determine the lengthening or compression factors for the frame into which those segments must fit. In support of this view, a connectionist implementation is described that has eight input features, one layer of hidden units and one analog output unit, and that accounts for an equivalent 70% of the variance in duration.
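As a rough illustration of the kind of network described, the following sketch shows a forward pass through a net with eight inputs, one hidden layer and a single continuous output, together with one common way of computing the proportion of variance accounted for. It is not the implementation from the paper: the hidden-layer size, the sigmoid hidden units, the linear output unit, the random initial weights and the numeric encoding of the eight features are all assumptions made here for the example, and the paper may define its variance measure differently.

```python
import math
import random

N_INPUTS = 8   # from the abstract: eight input features
N_HIDDEN = 4   # assumption: the abstract does not state the hidden-layer size


def init_network(n_hidden=N_HIDDEN, seed=0):
    """Return small random weights; each unit's last weight is its bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(N_INPUTS + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def predict(network, features):
    """Forward pass: sigmoid hidden units (assumed), one linear output unit."""
    hidden, output = network
    if len(features) != N_INPUTS:
        raise ValueError("expected %d input features" % N_INPUTS)
    activations = []
    for weights in hidden:
        # zip stops after the eight feature weights; weights[-1] is the bias.
        z = weights[-1] + sum(w * x for w, x in zip(weights, features))
        activations.append(1.0 / (1.0 + math.exp(-z)))
    # Weighted sum of hidden activations plus the output bias.
    return output[-1] + sum(w * a for w, a in zip(output, activations))


def variance_accounted_for(observed, predicted):
    """Proportion of variance explained: 1 - SS_residual / SS_total."""
    mean = sum(observed) / len(observed)
    ss_total = sum((y - mean) ** 2 for y in observed)
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_resid / ss_total
```

In use, each syllable would be encoded as a vector of eight numbers (for example stress, position in phrase and foot, and segment counts for onset, peak and coda), the weights would be fitted to observed syllable durations by a training procedure not shown here, and `variance_accounted_for` would then be applied to the observed and predicted durations of a test text.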
Bibliographic reference. Campbell, W. Nick (1989): "Syllable-level duration determination", In EUROSPEECH-1989, 2698-2701.