5th International Conference on Spoken Language Processing

Sydney, Australia
November 30 - December 4, 1998

Unsupervised Training of Phone Duration and Energy Models for Text-to-Speech Synthesis

Paul C. Bagshaw

France Telecom, CNET, France

A new model of phone duration and energy is presented. These parameters are modelled in two stages. The first stage builds a statistics tree that holds the mean and standard deviation of phone duration and energy at each node. The branches of the tree are characterised by a set of factors related to phonetic context. The second stage treats phone duration and energy as being modified by two syllable-level prosodic coefficients. These coefficients influence the duration and energy of the phones of a syllable to differing degrees, so weights are associated with the different phone positions in a syllable. A simulated annealing technique is used to find the set of weights that allows the prosodic coefficients to be calculated for all syllables and, in turn, minimises the error in predicting phone duration and energy during synthesis. Duration and energy are predicted with a mean squared error of 15.4ms and 6.8dB respectively. During synthesis, the syllable-level prosodic coefficients are predicted by regression trees from linguistic information. Manual prosodic labelling is not required at any stage.
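The weight search outlined in the abstract can be illustrated with a small sketch. The abstract gives no formulae, so everything below is an assumption made for illustration: the toy data, the least-squares definition of the syllable coefficient, the reconstruction rule (mean + standard deviation × weight × coefficient), the linear cooling schedule, and all function names. Only duration is handled here; energy would follow the same pattern with its own coefficient.

```python
import math
import random

# Hypothetical toy data (not from the paper). Each syllable is a list of phones:
# (position_in_syllable, context_mean_ms, context_std_ms, observed_ms).
# The mean and std stand in for values read from a statistics-tree node.
SYLLABLES = [
    [(0, 80.0, 20.0, 100.0), (1, 120.0, 30.0, 165.0), (2, 70.0, 15.0, 77.5)],
    [(0, 60.0, 10.0, 55.0), (1, 110.0, 25.0, 85.0)],
    [(0, 90.0, 20.0, 110.0), (1, 100.0, 20.0, 130.0), (2, 75.0, 15.0, 82.5)],
]
N_POSITIONS = 3


def syllable_coefficient(syllable, weights):
    """Assumed definition: least-squares c such that z_i ~= weights[pos_i] * c,
    where z_i is the phone's deviation from its context mean in std units."""
    num = sum(weights[p] * (obs - mu) / sd for p, mu, sd, obs in syllable)
    den = sum(weights[p] ** 2 for p, _, _, _ in syllable)
    return num / den if den > 0 else 0.0


def prediction_error(weights):
    """Root mean squared error (ms) when each phone is rebuilt as mu + sd * w * c."""
    sq, n = 0.0, 0
    for syl in SYLLABLES:
        c = syllable_coefficient(syl, weights)
        for p, mu, sd, obs in syl:
            pred = mu + sd * weights[p] * c
            sq += (pred - obs) ** 2
            n += 1
    return math.sqrt(sq / n)


def anneal(steps=2000, t0=1.0, seed=0):
    """Simulated annealing over position weights with a linear cooling schedule."""
    rng = random.Random(seed)
    current = [1.0] * N_POSITIONS
    cur_err = prediction_error(current)
    best, best_err = list(current), cur_err
    for k in range(steps):
        temp = t0 * (1.0 - k / steps) + 1e-9
        # Perturb one randomly chosen weight, keeping it non-negative.
        cand = list(current)
        i = rng.randrange(N_POSITIONS)
        cand[i] = max(0.0, cand[i] + rng.gauss(0.0, 0.1))
        err = prediction_error(cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if err < cur_err or rng.random() < math.exp((cur_err - err) / temp):
            current, cur_err = cand, err
            if err < best_err:
                best, best_err = list(cand), err
    return best, best_err
```

Calling `anneal()` returns the best weight vector found and its error on the toy data. Because the best solution is tracked separately from the current one, the returned error is never worse than that of the uniform starting weights.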


Bibliographic reference. Bagshaw, Paul C. (1998): "Unsupervised training of phone duration and energy models for text-to-speech synthesis", in ICSLP-1998, paper 0132.