Sixth ISCA Workshop on Speech Synthesis

Bonn, Germany
August 22-24, 2007

Joint Analysis of Speech Frames for Synthesis Based on Lossy Tube Models

Karl Schnell, Arild Lacroix

Institute of Applied Physics, Goethe-University Frankfurt, Germany

This paper discusses a model-based synthesis approach, focusing on the estimation of the model parameters. Tube models are used for both the analysis and the synthesis of speech units. In contrast to the standard lossless tube model, an extended tube model is used that includes the frequency-dependent losses of the vocal tract. The parameters of the tube models are estimated by minimizing the spectral error between the tube model and a speech segment. The analysis of speech units takes the time evolution of the parameters into account: the speech segments are analyzed jointly, which ensures smooth parameter trajectories. The investigations show that, especially for extended tube models, the joint analysis of frames improves the quality of the synthesized speech signals. Additionally, the differences between the results obtained with the standard and the extended tube model are discussed.
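To make the idea of a spectral error between a tube model and a target concrete, the following minimal sketch covers only the standard lossless tube model, not the extended lossy model or the joint optimization of the paper. The function names, the sign convention of the reflection coefficients, and the choice of a mean squared log-magnitude difference as the error are assumptions made for illustration; the abstract does not specify them.

```python
import cmath
import math


def reflection_coefficients(areas):
    """Reflection coefficient at each junction of adjacent tube sections.

    Assumed sign convention: r = (A_next - A) / (A_next + A).
    Positive areas guarantee |r| < 1.
    """
    return [(a_next - a) / (a_next + a) for a, a_next in zip(areas, areas[1:])]


def step_up(refl):
    """Step-up recursion: reflection coefficients -> coefficients of A(z).

    The lossless tube model then corresponds to the all-pole filter 1 / A(z).
    """
    a = [1.0]
    for k in refl:
        prev = a + [0.0]
        last = len(prev) - 1
        a = [prev[i] + k * prev[last - i] for i in range(len(prev))]
    return a


def log_magnitude(a, n_freq=64):
    """Log magnitude of 1 / A(e^{jw}) on a uniform grid over [0, pi)."""
    values = []
    for n in range(n_freq):
        w = math.pi * n / n_freq
        denom = sum(c * cmath.exp(-1j * w * i) for i, c in enumerate(a))
        values.append(-math.log(abs(denom)))
    return values


def spectral_error(areas, target_log_mag):
    """Mean squared log-magnitude difference (assumed error measure)."""
    a = step_up(reflection_coefficients(areas))
    model = log_magnitude(a, len(target_log_mag))
    return sum((m - t) ** 2 for m, t in zip(model, target_log_mag)) / len(model)


# Hypothetical area values, used here only to produce a target spectrum.
target = log_magnitude(step_up(reflection_coefficients([1.0, 2.0, 0.5, 3.0])))
error_uniform_tube = spectral_error([1.0, 1.0, 1.0, 1.0], target)
```

An estimation procedure would adjust the areas to reduce this error; in the paper this is done with a gradient-based optimization on the lossy model, which is not reproduced here.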


Sound Examples  

Parametric synthesis example of the German word "Langeweile" [laN@waIl@] (boredom). Model-based diphone synthesis using the lossy tube model; the excitation is independent of the analyzed diphones. Sampling rate: 16 kHz.
Example 1   Synthesis: impulse train excitation; analysis of diphones: with averaging of tube areas during gradient-based optimization.
Example 2   Synthesis: residual-based excitation obtained from a schwa sound; analysis of diphones: with averaging of tube areas during gradient-based optimization.
Example 3   Synthesis: residual-based excitation obtained from a schwa sound; analysis of diphones: without averaging of tube areas during gradient-based optimization.
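
As a rough illustration of the impulse train excitation used in Example 1, the sketch below drives a generic all-pole filter with a periodic impulse train at the 16 kHz sampling rate stated above. The function names, the fundamental frequency of 100 Hz, and the filter coefficients are hypothetical; the lossy tube model and the diphone parameters from the paper are not reproduced.

```python
def impulse_train(n_samples, f0_hz, fs_hz=16000):
    """Unit impulses spaced one pitch period apart, zeros elsewhere."""
    period = round(fs_hz / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def all_pole_filter(a, excitation):
    """Direct-form recursion y[n] = x[n] - sum_{i>=1} a[i] * y[n-i], a[0] == 1."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for i in range(1, len(a)):
            if n - i >= 0:
                acc -= a[i] * y[n - i]
        y.append(acc)
    return y


# Hypothetical values: 100 Hz pitch and a stable one-pole filter.
excitation = impulse_train(400, 100.0)
signal = all_pole_filter([1.0, -0.5], excitation)
```

A residual-based excitation, as in Examples 2 and 3, would replace the impulse train with a signal derived from inverse filtering a recorded sound; that step is not shown here.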

Bibliographic reference. Schnell, Karl / Lacroix, Arild (2007): "Joint analysis of speech frames for synthesis based on lossy tube models", in SSW6-2007, 52-57.