Speech Prosody 2006

Dresden, Germany
May 2-5, 2006

Expressive Speech Synthesis: Evaluation of a Voice Quality Centered Coder on the Different Acoustic Dimensions

Nicolas Audibert (1), Damien Vincent (2), Véronique Aubergé (1), Olivier Rosec (2)

(1) Institut de la Communication Parlée, CNRS UMR 5009, Grenoble, France
(2) France Telecom, R&D Division, Lannion, France

Expressive speech is intrinsically multi-dimensional: each acoustic dimension carries a specific weight that depends on the nature of the expressed affect. The quantity of expressive information carried by each dimension separately (using Praat algorithms), as well as the kind of processing required to carry it (global value vs. contour), has been measured perceptually for a set of natural monosyllabic utterances (Audibert et al., 2005). That study showed that no single parameter is able to carry the whole emotional information: F0 contours or global values carried more information for positive expressions, voice quality and duration conveyed more information for negative expressions, and intensity contours did not carry any significant information when used alone. These selected stimuli, expressing anxiety, disappointment, disgust, disquiet, joy, resignation and sadness, were resynthesized with an LF-ARX algorithm and evaluated with the same perceptual protocol, extended to the three voice quality parameters (source, filter and residue). The comparison of results between natural, TD-PSOLA resynthesized and LF-ARX resynthesized stimuli (1) globally confirms the relative weights of each dimension, (2) diagnoses local minor artifacts of resynthesis, (3) validates the efficiency of the LF-ARX algorithm, and (4) measures the relative importance of each of the three LF-ARX parameters.

