Ninth International Conference on Spoken Language Processing

Pittsburgh, PA, USA
September 17-21, 2006

Modeling the Acoustic Correlates of Expressive Elements in Text Genres for Expressive Text-to-Speech Synthesis

Hongwu Yang (1,2), Helen M. Meng (2), Lianhong Cai (1)

(1) Tsinghua University, China; (2) Chinese University of Hong Kong, China

This paper proposes a novel approach for describing the expressive elements in text genres and modeling their acoustic correlates for expressive text-to-speech synthesis (TTS). We apply the three-dimensional PAD (pleasure-displeasure, arousal-nonarousal and dominance-submissiveness) model to describe expressivity. In particular, we define a set of principles for annotating the P and A values of prosodic words found in texts from the tourist information domain. These text passages may be categorized into the descriptive genre (e.g. describing a beautiful scenic spot), the informative genre (e.g. presenting the opening hours of a museum) and the procedural genre (e.g. offering bus routes to a landmark). We choose the prosodic word as the basic unit of analysis since it bridges textual input with (synthetic) speech output. Analysis of contrastive (neutral versus expressive) recordings uncovers the acoustic correlates of the annotated P and A values. This enables us to develop a non-linear model that transforms neutral speech to resemble expressive speech, according to the P and A values of the input text. Perceptual evaluation of the speech outputs shows that over 70% of the prosodic words carry appropriate expressivity.


