Third ESCA/COCOSDA Workshop on Speech Synthesis

November 26-29, 1998
Jenolan Caves House, Blue Mountains, NSW, Australia

Comparative Evaluation of Letter-to-Sound Conversion Techniques for English Text-to-Speech Synthesis

Robert I. Damper (1), Y. Marchand (1), M. J. Adamson (1), Kjell Gustafson (2)

(1) Department of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK
(2) Department of Speech, Music and Hearing, KTH, S-100 44 Stockholm, Sweden

Dictionary look-up is the primary strategy for deriving pronunciations for input words in a text-to-speech (TTS) system. This strategy is accurate for dictionary words, but it is not complete: it is impossible to list exhaustively all input words. The proper treatment of 'unknown' words is currently an unsolved problem in TTS synthesis. There are many competing techniques for letter-to-sound conversion and the system developer must make a rational selection among them. However, it is unclear how different techniques should be properly compared. In this paper, we report a comparative assessment of the competing methods of letter-to-sound rules, pronunciation by analogy, feedforward neural networks and a k-nearest neighbour method, with respect to their success at automatic phonemisation. This is achieved by using standardised scoring methods, test lexicon and phoneme inventories. The problem of standardising the phoneme set ('harmonisation') is deceptive: it is much harder than it first appears. The principal finding is that (contrary to the weight of opinion expressed in the literature) data-driven techniques outperform knowledge-based methods by a very significant margin.
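
As an illustration of what a standardised comparison might involve, the following is a minimal sketch in Python, not the scoring procedure used in the paper. It assumes that each method can be wrapped as a function from a word to a phoneme sequence, that 'harmonisation' can be approximated by a simple symbol-to-symbol mapping, and that the score is the fraction of words whose whole pronunciation matches the reference. The names `harmonise`, `word_accuracy`, the toy lexicon and the phoneme symbols are all hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Pronunciation = Tuple[str, ...]


def harmonise(pron: Sequence[str], mapping: Dict[str, str]) -> Pronunciation:
    """Map each phoneme symbol onto a shared inventory (assumed 1-to-1 mapping)."""
    return tuple(mapping.get(p, p) for p in pron)


def word_accuracy(
    predict: Callable[[str], Sequence[str]],
    lexicon: Dict[str, Pronunciation],
    mapping: Optional[Dict[str, str]] = None,
) -> float:
    """Fraction of lexicon words whose predicted pronunciation matches exactly."""
    mapping = mapping or {}
    if not lexicon:
        return 0.0
    correct = 0
    for word, reference in lexicon.items():
        hypothesis = harmonise(predict(word), mapping)
        if hypothesis == harmonise(reference, mapping):
            correct += 1
    return correct / len(lexicon)


# Toy test lexicon and a toy "method" (both hypothetical, for illustration only).
toy_lexicon = {"cat": ("k", "ae", "t"), "dog": ("d", "ao", "g")}
toy_outputs = {"cat": ("k", "ae", "t"), "dog": ("d", "o", "g")}


def toy_method(word: str) -> Sequence[str]:
    return toy_outputs[word]


raw_score = word_accuracy(toy_method, toy_lexicon)
harmonised_score = word_accuracy(toy_method, toy_lexicon, {"o": "ao"})
```

In this toy case the second word counts as an error only because the method and the lexicon use different symbols for the same vowel; once the symbols are mapped onto a common inventory the score changes. Real inventories rarely align one-to-one, which is one reason a simple mapping of this kind is only a rough approximation of harmonisation.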
