Ninth International Conference on Spoken Language Processing

Pittsburgh, PA, USA
September 17-21, 2006

Constructing Stylistic Synthesis Databases from Audio Books

Yong Zhao (1), Di Peng (2), Lijuan Wang (3), Min Chu (1), Yining Chen (1), Peng Yu (1), Jun Guo (2)

(1) Microsoft Research Asia, China; (2) Beijing University of Posts & Telecommunications, China; (3) Tsinghua University, China

In this paper, we explore how to construct stylistic TTS databases from audio books in which a storyteller performs multiple roles. The goal is to identify and build a set of speech corpora, each of which not only portrays a representative voice style performed by the speaker but also contains enough sentences to synthesize natural speech using a unit selection approach. We solve the problem in two steps. First, each role is represented by a Gaussian Mixture Model (GMM), and all speech data are partitioned into a number of voice style clusters under a criterion that maximizes the likelihood of all utterances with respect to the roles’ speaker models. Second, the clusters are pruned using both acoustic and prosodic measures to purify them. The resulting four voice styles are subjectively interpreted as Neutral, Young, Elder, and Adult. Perceptual experiments show that the proposed approach can synthesize speech with recognizable voice styles, with an average identification rate of 72.5%, and that the synthesized speech sounds better than speech synthesized with utterances from a single role.
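The likelihood-based assignment described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes diagonal-covariance GMMs whose parameters are already trained, scores an utterance by its average per-frame log-likelihood, and assigns it to the best-scoring role. The names `role_models`, `assign_utterance`, and the toy 2-D features are hypothetical, and the subsequent merging of roles into voice style clusters and the acoustic and prosodic pruning are not shown.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_frame_loglik(x, gmm):
    """Log-likelihood of one frame under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def utterance_loglik(frames, gmm):
    """Average per-frame log-likelihood of an utterance (a list of frames)."""
    return sum(gmm_frame_loglik(f, gmm) for f in frames) / len(frames)

def assign_utterance(frames, role_models):
    """Return the role whose GMM gives the utterance the highest likelihood."""
    return max(role_models, key=lambda role: utterance_loglik(frames, role_models[role]))

# Hypothetical, hand-set role models over toy 2-D features (assumption:
# real models would be trained on acoustic features of each role's speech).
role_models = {
    "role_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "role_b": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}

utterance = [[4.8, 5.1], [5.2, 4.9], [5.0, 5.3]]
best_role = assign_utterance(utterance, role_models)
```

Averaging over frames keeps scores comparable across utterances of different lengths; whether the original system normalizes this way is not stated in the abstract.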


Bibliographic reference. Zhao, Yong / Peng, Di / Wang, Lijuan / Chu, Min / Chen, Yining / Yu, Peng / Guo, Jun (2006): "Constructing stylistic synthesis databases from audio books", in INTERSPEECH-2006, paper 1559-Wed2A3O.3.