EUROSPEECH 2001 Scandinavia
7th European Conference on Speech Communication and Technology

Aalborg, Denmark
September 3-7, 2001


Text-To-Speech Synthesis with Arbitrary Speaker's Voice from Average Voice

Masatsune Tamura (1), Takashi Masuko (1), Keiichi Tokuda (2), Takao Kobayashi (1)

(1) Tokyo Institute of Technology, Japan
(2) Nagoya Institute of Technology, Japan

This paper describes a technique for synthesizing speech with any desired voice. The technique is based on an HMM-based text-to-speech (TTS) system and the maximum likelihood linear regression (MLLR) adaptation algorithm. To generate speech of an arbitrary target speaker, speaker-independent speech units, i.e., average voice models, are adapted to the target speaker using the MLLR framework. In addition to spectrum and pitch adaptation, we derive an algorithm for adapting state durations. We demonstrate that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features. Synthetic speech generated from models adapted using only four sentences is very close to that generated from speaker-dependent models trained on a large amount of speech data.
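
As a rough illustration of the adaptation step described in the abstract, the sketch below applies one shared affine transform to every Gaussian mean of an average voice model, which is the general form of an MLLR mean update (adapted mean = A * mean + b). It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the two-dimensional toy means, and the values of A and b are hypothetical, and the estimation of the transform from the target speaker's sentences is not shown.

```python
def apply_mllr_mean(A, b, mu):
    """Return the adapted mean A * mu + b for one Gaussian (plain lists)."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)]


def adapt_models(A, b, means):
    """Apply one shared transform (A, b) to every state mean."""
    return {state: apply_mllr_mean(A, b, mu) for state, mu in means.items()}


# Hypothetical average-voice state means (toy 2-dimensional features).
average_voice = {"s1": [1.0, 2.0], "s2": [0.0, -1.0]}

# Hypothetical transform; in a real system it would be estimated from the
# target speaker's adaptation data by maximizing likelihood.
A = [[2.0, 0.0], [0.0, 0.5]]
b = [1.0, -1.0]

adapted = adapt_models(A, b, average_voice)
```

Because a single transform is shared across many states, only a small number of parameters has to be estimated, which is consistent with the abstract's claim that a few sentences can suffice.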

Bibliographic reference. Tamura, Masatsune / Masuko, Takashi / Tokuda, Keiichi / Kobayashi, Takao (2001): "Text-to-speech synthesis with arbitrary speaker's voice from average voice", in EUROSPEECH-2001, 345-348.