Third ESCA/COCOSDA Workshop on Speech Synthesis
November 26-29, 1998
This paper describes an experimental AT&T concatenative synthesis system using unit selection, for which the basic synthesis units are diphones. The synthesizer may use any of the data from a large database of utterances. Since the database generally contains multiple instances of each concatenative unit, candidates are selected dynamically at synthesis time, in a manner that is based on and extends the unit selection implemented in the CHATR synthesis system. Selected units may be either phones or diphones, and they can be synthesized by a variety of methods, including PSOLA, HNM, and simple unit concatenation. The AT&T system, with CHATR unit selection, was implemented within the framework of the Festival Speech Synthesis System. The voice database amounted to approximately one and one-half hours of speech and was constructed from read text taken from three sources. The first source was a portion of the 1989 Wall Street Journal material from the Penn Treebank Project, chosen so that the most frequent diphones were well represented. The second text, which was designed for diphone databases, assured complete diphone coverage. The third source consisted of recorded prompts for telephone service applications. Formal subjective listening tests were conducted to compare speech quality across several options available in the AT&T synthesizer, including synthesis methods and choices of fundamental units. These tests showed that unit selection techniques can be successfully applied to diphone synthesis.
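The abstract does not spell out the selection procedure. As a rough illustration only, CHATR-style unit selection is commonly described as a dynamic-programming (Viterbi) search that minimizes the sum of a target cost (how well a candidate matches the desired specification) and a join cost (how smoothly adjacent candidates concatenate). The sketch below assumes that formulation; the `Unit` fields, the two cost functions, and all numeric values are hypothetical and are not taken from the AT&T system.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Unit:
    """Hypothetical database entry for one diphone instance."""
    name: str    # diphone label, e.g. "h-e"
    utt: int     # id of the source utterance
    index: int   # position of the unit within that utterance
    f0: float    # mean pitch in Hz (illustrative feature)


def select_units(
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[int, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
) -> Tuple[List[Unit], float]:
    """Viterbi search for the candidate sequence with the lowest total cost."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every target position needs at least one candidate")

    # best[i][j]: lowest cumulative cost of a path ending in candidates[i][j]
    best = [[target_cost(0, u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]

    for i in range(1, len(candidates)):
        row, ptr = [], []
        for u in candidates[i]:
            costs = [
                best[i - 1][k] + join_cost(prev, u)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(i, u))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path: List[Unit] = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


# Toy example: two target diphones, two database candidates each.
desired_f0 = [100.0, 100.0]  # assumed prosodic targets
cands = [
    [Unit("h-e", 1, 0, 120.0), Unit("h-e", 2, 5, 100.0)],
    [Unit("e-l", 1, 1, 118.0), Unit("e-l", 3, 2, 100.0)],
]


def toy_target_cost(i: int, u: Unit) -> float:
    # Assumed cost: normalized pitch mismatch against the target.
    return abs(u.f0 - desired_f0[i]) / 100.0


def toy_join_cost(a: Unit, b: Unit) -> float:
    # Assumed cost: units adjacent in the same recording join for free.
    return 0.0 if (a.utt == b.utt and b.index == a.index + 1) else 1.0


path, total = select_units(cands, toy_target_cost, toy_join_cost)
```

In this toy setup the two units that were recorded contiguously in utterance 1 are chosen even though their pitch matches the target less closely, because the zero join cost outweighs the small target mismatch. Real systems weight several acoustic and contextual features in both costs.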
Bibliographic reference. Beutnagel, Mark / Conkie, Alistair / Syrdal, Ann K. (1998): "Diphone Synthesis Using Unit Selection", In SSW3-1998, 185-190.