Speech Prosody 2010

Chicago, IL, USA
May 10-14, 2010

Usages of an External Duration Model for HMM-based Speech Synthesis

Javier Latorre (1,2), Sabine Buchholz (1), Masami Akamine (2)

(1) Toshiba Research Europe, UK; (2) Toshiba Corporate Research & Development Center, Japan

In this paper we analyze three approaches to improving the quality of an HMM-based speech synthesizer by means of an external duration model. The first approach uses the external duration model in the standard way, to define the phone durations during synthesis. The second is a novel approach that uses the phone durations to create additional context features for the decision-tree clustering. The third combines the previous two. A subjective evaluation showed a quality improvement over the baseline for all three approaches, although for different reasons. The standard approach improves the duration estimation. The second approach degrades the duration estimation but improves the logF0 and aperiodicity by better modeling their dependence on duration. Finally, the combined approach benefits from the improvements of the other two and yields the best result, a preference approximately 16% higher than the baseline among native English speakers.

Index Terms: speech synthesis, prosody, duration, HMM-based, external duration model

Bibliographic reference. Latorre, Javier / Buchholz, Sabine / Akamine, Masami (2010): "Usages of an external duration model for HMM-based speech synthesis", In SP-2010, paper 073.