Fourth ISCA ITRW on Speech Synthesis

August 29 - September 1, 2001
Perthshire, Scotland

Definition of a Training Set for Unit Selection-Based Speech Synthesis

Karlheinz Stöber (1), Petra Wagner (1), Esther Klabbers (2), and Wolfgang Hess (1)

(1) Institute for Communication Research and Phonetics (IKP), University of Bonn, Germany
(2) IPO, Center for User-System Interaction, Eindhoven, The Netherlands

The definition of cost terms in unit selection-based synthesis is a difficult task. Usually, cost terms are based on the developers' common phonetic knowledge and on subsequent perceptual experiments. Using a training set for supervised learning, an approach well known from pattern recognition, could be a useful way to arrive at a more formal analysis of the different factors influencing the selection of units.

As a first step toward this aim, we present an objective distance measure that is used to sort the units contained in the corpus in relation to a given natural unit, and we demonstrate its relevance to human perception. To keep listeners from paying too much attention to discontinuities caused by concatenation, we also present a waveform-based smoothing algorithm.
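The sorting step described above can be illustrated with a minimal sketch. The abstract does not specify the distance measure or the features it operates on, so everything here is an assumption made for illustration: units are represented as fixed-length feature vectors, the distance is plain Euclidean distance, and the names `euclidean_distance`, `rank_units`, `natural_unit`, and `corpus_units` are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors.

    Assumption: a stand-in for the paper's objective distance measure,
    which the abstract does not define.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_units(reference, candidates, distance=euclidean_distance):
    """Return (unit_id, distance) pairs sorted from closest to farthest."""
    scored = [
        (unit_id, distance(reference, features))
        for unit_id, features in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy data: one natural reference unit and three corpus units.
natural_unit = [1.0, 2.0, 3.0]
corpus_units = {
    "unit_a": [1.0, 2.0, 3.5],
    "unit_b": [4.0, 6.0, 3.0],
    "unit_c": [1.0, 2.0, 3.0],
}

ranking = rank_units(natural_unit, corpus_units)
for unit_id, dist in ranking:
    print(f"{unit_id}: {dist:.3f}")
```

Because the distance function is passed in as a parameter, a different objective measure could be swapped in without changing the sorting logic.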

It is shown experimentally that the sorting criterion and human perception match in most cases. Furthermore, we find that the similarity between natural and synthetic speech is higher when phoneme-based units are used, but that naturalness increases with the concatenation of larger units.

Full Paper

Bibliographic reference.  Stöber, Karlheinz / Wagner, Petra / Klabbers, Esther / Hess, Wolfgang (2001): "Definition of a training set for unit selection-based speech synthesis", In SSW4-2001, paper 118.

Acoustic Examples (WAV format)

There are two pairs of sentences. All examples were generated by the presented procedure, without prosodic manipulation.
Sentence 1
Original unmanipulated speech   Raw concatenation of phonemes   Raw concatenation of phonemes and smoothed concatenation boundaries   Raw concatenation of word or subword units
Sentence 2
Original unmanipulated speech   Raw concatenation of phonemes   Raw concatenation of phonemes and smoothed concatenation boundaries   Raw concatenation of word or subword units