4th International Conference on Spoken Language Processing

Philadelphia, PA, USA
October 3-6, 1996

A Mandarin Text-to-Speech System

Shaw-Hwa Hwang, Sin-Horng Chen, Yih-Ru Wang

Department of Communication Engineering and Center For Telecommunications Research, National Chiao Tung University, Hsinchu, Taiwan

This paper presents the implementation of a high-performance Mandarin TTS system. The system is composed of four main parts: text analysis (TA), prosodic information generation (PIG), a waveform table (WT) of 411 base-syllables, and PSOLA-based waveform synthesis (PSOLA). In TA, a statistical model-based method is first employed to automatically tag the input text, yielding the word sequence and the associated part-of-speech (POS) sequence. A lexicon containing about 80,000 words is used in the tagging process. The corresponding base-syllable sequence is then determined and used to retrieve the basic waveform sequence from the WT. Some linguistic features used in PIG are also extracted in TA. In PIG, a four-layer recurrent neural network (RNN) is employed to generate prosodic information, including the pitch contour, energy level, and initial and final durations of each syllable, as well as the inter-syllable pause duration. Finally, in PSOLA, the basic waveform sequence is modified using the prosodic information to generate the output synthetic speech. The whole system is implemented in software on a PC/AT 486 with a 16-bit Sound Blaster add-on card and requires only 3.2 Mbytes of memory. It can synthesize speech in real time for any input Chinese text. In informal listening tests, many native Chinese speakers living in Taiwan confirmed that the synthetic speech sounded very fluent and natural.
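The data flow between the four parts can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy lexicon, its POS tags, the waveform table contents, the sample rate, and all function names are hypothetical. A greedy longest-match lookup stands in for the statistical tagger, a fixed rule stands in for the RNN, and a simple gain-and-silence step stands in for PSOLA.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

SAMPLE_RATE = 8000  # Hz; assumed for illustration only


@dataclass
class Prosody:
    """Per-syllable prosodic targets listed in the abstract."""
    pitch_contour: List[float]
    energy_level: float
    initial_duration: float  # seconds (unit is an assumption)
    final_duration: float    # seconds (unit is an assumption)
    pause_duration: float    # pause after the syllable, in seconds


# Hypothetical toy lexicon: word -> (POS tag, base syllables).
LEXICON: Dict[str, Tuple[str, List[str]]] = {
    "你好": ("V", ["ni", "hao"]),
    "世界": ("N", ["shi", "jie"]),
}

# Hypothetical waveform table: base syllable -> stored samples.
WAVEFORM_TABLE: Dict[str, List[float]] = {
    s: [0.5] * 4 for s in ("ni", "hao", "shi", "jie")
}


def text_analysis(text: str) -> List[Tuple[str, str, List[str]]]:
    """TA stand-in: greedy longest match, not the statistical tagger."""
    words, i = [], 0
    max_len = max(len(w) for w in LEXICON)
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if cand in LEXICON:
                pos, syllables = LEXICON[cand]
                words.append((cand, pos, syllables))
                i += n
                break
        else:
            i += 1  # skip characters missing from the toy lexicon
    return words


def prosody_generation(words) -> List[Prosody]:
    """PIG stand-in: fixed values, with a pause only at word ends (no RNN)."""
    out = []
    for _, _, syllables in words:
        for k, _ in enumerate(syllables):
            last = k == len(syllables) - 1
            out.append(Prosody([1.0], 1.0, 0.05, 0.15, 0.001 if last else 0.0))
    return out


def modify_waveform(unit: List[float], p: Prosody) -> List[float]:
    """PSOLA stand-in: applies gain and appends the pause as silence.

    Real PSOLA would also change pitch and duration; this does not.
    """
    silence = [0.0] * round(p.pause_duration * SAMPLE_RATE)
    return [x * p.energy_level for x in unit] + silence


def synthesize(text: str) -> List[float]:
    """Chain TA -> WT lookup -> PIG -> waveform modification."""
    words = text_analysis(text)
    units = [WAVEFORM_TABLE[s] for _, _, syls in words for s in syls]
    prosody = prosody_generation(words)
    speech: List[float] = []
    for unit, p in zip(units, prosody):
        speech.extend(modify_waveform(unit, p))
    return speech
```

Calling `synthesize("你好世界")` segments the text into two words, looks up four base-syllable units, and concatenates the modified units. Only the stage boundaries mirror the abstract; each stage body is a placeholder.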


Bibliographic reference. Hwang, Shaw-Hwa / Chen, Sin-Horng / Wang, Yih-Ru (1996): "A Mandarin text-to-speech system", in ICSLP-1996, 1421-1424.