Sixth ISCA Workshop on Speech Synthesis

Bonn, Germany
August 22-24, 2007

Unit-Selection Text-to-Speech Synthesis using an Asynchronous Interpolation Model

Alexander B. Kain, Jan P. H. van Santen

Center for Spoken Language Understanding (CSLU), OGI School of Science & Engineering at OHSU, Beaverton, OR, USA; and
BioSpeech, Inc., Lake Oswego, OR, USA

We describe the Asynchronous Interpolation Model, which represents speech as a composition of several different types of feature streams that are computed using asynchronous interpolation of neighboring basis vectors, according to transition weights. When applied to the acoustic inventory of a concatenative Text-to-Speech synthesizer, the model eliminates concatenation errors and affords opportunities for high rates of compression and voice transformation. We propose a particular instance of the model that uses formant frequency values and formant-normalized complex spectra as two types of streams, in conjunction with a unit-selection synthesizer. During analysis, basis vectors and transition weights were estimated automatically, using three different labeling schemes and dynamic programming methods. An evaluation of the intelligibility and quality of the synthesized speech showed significant improvements over a standard, size-matched compression scheme. The proposed method was also able to convincingly transform speaker characteristics through replacement of basis vectors.
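As a rough illustration of the central idea, the sketch below blends a left and a right basis vector for each of two streams, with each stream following its own transition-weight trajectory, so the streams need not change at the same time. It is only a minimal sketch under stated assumptions: the linear blend, the function name interpolate_stream, the real-valued stand-ins for the complex spectra, and all numeric values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def interpolate_stream(
    left: Sequence[float],
    right: Sequence[float],
    weights: Sequence[float],
) -> List[List[float]]:
    """Blend two neighboring basis vectors, one output vector per frame.

    A weight of 0.0 yields the left basis vector and 1.0 yields the right one.
    The linear blend is an assumption made for this sketch.
    """
    if len(left) != len(right):
        raise ValueError("basis vectors must have the same length")
    frames = []
    for w in weights:
        if not 0.0 <= w <= 1.0:
            raise ValueError("transition weights must lie in [0, 1]")
        frames.append([(1.0 - w) * a + w * b for a, b in zip(left, right)])
    return frames


# Hypothetical basis vectors for two stream types.
formants_left = [700.0, 1200.0]   # illustrative F1, F2 values in Hz
formants_right = [300.0, 2200.0]
spectrum_left = [1.0, 0.0]        # real-valued stand-in for a spectral vector
spectrum_right = [0.0, 1.0]

# Each stream has its own weight trajectory over the same four frames.
# Here the formant stream completes its transition before the spectral one.
formant_weights = [0.0, 0.5, 1.0, 1.0]
spectrum_weights = [0.0, 0.0, 0.5, 1.0]

formant_track = interpolate_stream(formants_left, formants_right, formant_weights)
spectrum_track = interpolate_stream(spectrum_left, spectrum_right, spectrum_weights)

for frame, (f, s) in enumerate(zip(formant_track, spectrum_track)):
    print(frame, f, s)
```

In this toy setting, the formant stream is already halfway through its transition at the second frame while the spectral stream has not yet moved, which is what "asynchronous" is taken to mean here. Replacing the basis vectors while keeping the weight trajectories would be one way to picture the voice transformation mentioned above.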

Bibliographic reference. Kain, Alexander B. / Santen, Jan P. H. van (2007): "Unit-selection text-to-speech synthesis using an asynchronous interpolation model", in SSW6-2007, 172-177.