Speech Prosody 2010

Chicago, IL, USA
May 10-14, 2010

Analysis of Duration Prediction Accuracy in HMM-Based Speech Synthesis

Hanna Silén (1), Elina Helander (1), Jani Nurminen (2), Moncef Gabbouj (1)

(1) Department of Signal Processing, Tampere University of Technology, Tampere, Finland
(2) Nokia Devices R&D, Tampere, Finland

Appropriate phoneme durations are essential for high-quality speech synthesis. In hidden Markov model-based text-to-speech (HMM-TTS), durations are typically modeled statistically using state duration probability distributions, with durations for unseen contexts obtained by prediction. The use of rich context features enables synthesis without high-level linguistic knowledge. In this paper, we compare the accuracy of state duration modeling with that of phone duration modeling using simple prediction techniques. In addition to decision tree-based techniques, regression techniques suited to rich context features with high collinearity are discussed and evaluated.
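To make the distinction between state-level and phone-level durations concrete, the following minimal sketch shows how a phone duration can be derived from per-state Gaussian duration distributions. It assumes the formulation commonly used in HMM-based synthesis, where each state duration is set to its mean plus a speaking-rate factor times its variance; this is not necessarily the exact setup evaluated in the paper. The names `StateDuration`, `state_durations`, `phone_duration`, and the parameter `rho` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StateDuration:
    """Gaussian duration model of one HMM state (hypothetical container)."""
    mean: float      # mean duration in frames
    variance: float  # duration variance in frames^2


def state_durations(states: List[StateDuration], rho: float = 0.0) -> List[int]:
    """Pick one duration per state as mean + rho * variance.

    rho = 0 yields the mean durations; rho > 0 slows speech down and
    rho < 0 speeds it up. Each state is kept at least one frame long.
    """
    return [max(1, round(s.mean + rho * s.variance)) for s in states]


def phone_duration(states: List[StateDuration], rho: float = 0.0) -> int:
    """Under state-level modeling, a phone duration is the sum of its state durations."""
    return sum(state_durations(states, rho))


# Illustrative values only, not data from the paper: a five-state phone model.
phone = [
    StateDuration(2.0, 2.0),
    StateDuration(4.0, 2.0),
    StateDuration(6.0, 2.0),
    StateDuration(4.0, 2.0),
    StateDuration(2.0, 2.0),
]
print(state_durations(phone), phone_duration(phone))
```

Phone-level modeling would instead predict the total duration directly from the context features, so a comparison of the two approaches amounts to checking whether the summed state predictions or the direct phone prediction lies closer to the observed phone duration.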

Bibliographic reference. Silén, Hanna / Helander, Elina / Nurminen, Jani / Gabbouj, Moncef (2010): "Analysis of duration prediction accuracy in HMM-based speech synthesis", in SP-2010, paper 510.