This paper presents a statistical-model approach to implementing a Chinese text-to-speech system. Because Chinese is a syllabic and tonal language, Chinese speech can be synthesized by concatenating syllables in sequence and embedding prosodic rules. In this study, a database of read speech recorded from a male college student is analyzed to obtain statistical models of prosodic features, such as the pitch patterns of the syllables, the durations of the syllables, and the declination of the pitch level within a sentence. During synthesis, the optimal pitch pattern sequence and duration sequence for a given sentence are determined using the Viterbi algorithm. Sentence intonation is implemented by shifting the pitch level. Silent pauses are inserted according to several simple pause rules. An experimental system is implemented on an IBM PC-compatible computer equipped with a TMS-32010-based DSP board. A perceptual test is conducted to evaluate the intelligibility and naturalness of speech produced by the proposed method.
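As an illustration of the Viterbi step mentioned above, the following is a minimal sketch of choosing one pitch pattern per syllable by dynamic programming. The abstract does not give the paper's actual state space, scores, or model parameters, so everything here is an assumption made for illustration: the pattern labels `P1` and `P2`, the per-syllable scores in `EMIT`, the pattern-to-pattern scores in `TRANS`, and the function name `viterbi` are all hypothetical.

```python
import math

def viterbi(n_steps, candidates, emit_logp, trans_logp):
    """Return (best_path, best_log_score) over candidate labels.

    emit_logp(t, c): log score of label c at position t (hypothetical model).
    trans_logp(p, c): log score of moving from label p to label c.
    """
    # best[t][c] = best log score of any path that ends in label c at step t
    best = [{c: emit_logp(0, c) for c in candidates}]
    back = []  # back[t-1][c] = best predecessor of label c at step t
    for t in range(1, n_steps):
        scores, pointers = {}, {}
        for c in candidates:
            prev = max(candidates,
                       key=lambda p: best[t - 1][p] + trans_logp(p, c))
            scores[c] = best[t - 1][prev] + trans_logp(prev, c) + emit_logp(t, c)
            pointers[c] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label to recover the full sequence.
    last = max(candidates, key=lambda c: best[-1][c])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

# Toy numbers, NOT taken from the paper: two pitch patterns, three syllables.
PATTERNS = ["P1", "P2"]
EMIT = [
    {"P1": 0.7, "P2": 0.3},
    {"P1": 0.4, "P2": 0.6},
    {"P1": 0.5, "P2": 0.5},
]
TRANS = {
    ("P1", "P1"): 0.8, ("P1", "P2"): 0.2,
    ("P2", "P1"): 0.3, ("P2", "P2"): 0.7,
}

path, score = viterbi(
    len(EMIT),
    PATTERNS,
    lambda t, c: math.log(EMIT[t][c]),
    lambda p, c: math.log(TRANS[(p, c)]),
)
print(path, math.exp(score))
```

The same routine could in principle be run a second time over duration classes to pick a duration sequence; whether the paper does this jointly or separately is not stated in the abstract.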
Bibliographic reference. Chang, Yueh-Chin / Lee, Yi-Fan / Shia, Bang-Er / Wang, Hsiao-Chuan (1991): "Statistical models for the Chinese text-to-speech system", In EUROSPEECH-1991, 337-340.