Ninth International Conference on Spoken Language Processing

Pittsburgh, PA, USA
September 17-21, 2006

A Multi-Space Distribution (MSD) Approach to Speech Recognition of Tonal Languages

Huanliang Wang (1), Yao Qian (2), Frank K. Soong (2), Jian-Lai Zhou (2), Jiqing Han (1)

(1) Harbin Institute of Technology, China; (2) Microsoft Research Asia, China

Tone plays an important role in recognizing spoken tonal languages such as Chinese. However, the discontinuity of the F0 contour between voiced and unvoiced segments has traditionally been a bottleneck in modeling tone contours for automatic speech recognition and synthesis, and various heuristic approaches have been proposed to work around the problem. The Multi-Space Distribution (MSD), proposed by Tokuda and applied to HMM-based speech synthesis, models two probability spaces, a discrete one for unvoiced regions and a continuous one for the voiced F0 contour, in a linearly weighted mixture. We extend the MSD to tone modeling for speech recognition. Specifically, tone modeling for speaker-independent spoken Chinese is formulated and tested on a Mandarin speech database. The tone features and spectral features are further separated into two streams, and stream-dependent models are built so that the two feature types are clustered in separate decision trees. The recognition results show that the tonal syllable error rate improves from the toneless baseline system to the MSD-based stream-dependent system, from 50.5% to 36.1% and from 46.3% to 35.1%, for the two systems built with two different phone sets. Compared with conventional tone modeling, the new approach reduces the tonal syllable error rate by 5.5% and 6.1% absolute.
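
As an illustration of the linearly weighted two-space mixture described in the abstract, the following minimal sketch computes the log-likelihood of a single F0 observation under an MSD. It is not taken from the paper: the function name, the parameter names, the use of `None` to mark an unvoiced frame, and the choice of a single one-dimensional Gaussian for the voiced space are all assumptions made for the example.

```python
import math


def msd_log_likelihood(f0, w_voiced, mean, var):
    """Log-likelihood of one F0 observation under a two-space MSD.

    Hypothetical sketch: `f0` is None for an unvoiced frame, otherwise a
    float (for example log F0). `w_voiced` is the weight of the continuous
    (voiced) space; the discrete (unvoiced) space gets 1 - w_voiced.
    The voiced space is assumed to be a single 1-D Gaussian.
    """
    if not 0.0 < w_voiced < 1.0:
        raise ValueError("w_voiced must lie strictly between 0 and 1")
    if var <= 0.0:
        raise ValueError("var must be positive")

    if f0 is None:
        # Unvoiced frame: only the zero-dimensional discrete space applies,
        # so the likelihood is just that space's weight.
        return math.log(1.0 - w_voiced)

    # Voiced frame: space weight times the Gaussian density of the F0 value.
    log_gauss = -0.5 * (math.log(2.0 * math.pi * var) + (f0 - mean) ** 2 / var)
    return math.log(w_voiced) + log_gauss


def msd_sequence_log_likelihood(f0_track, w_voiced, mean, var):
    """Sum of per-frame MSD log-likelihoods over a mixed voiced/unvoiced track."""
    return sum(msd_log_likelihood(f0, w_voiced, mean, var) for f0 in f0_track)
```

Because voiced and unvoiced frames are scored by the same distribution, a track such as `[None, 5.1, 5.2, None]` can be evaluated directly, without interpolating F0 across the unvoiced frames.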


Bibliographic reference. Wang, Huanliang / Qian, Yao / Soong, Frank K. / Zhou, Jian-Lai / Han, Jiqing (2006): "A multi-space distribution (MSD) approach to speech recognition of tonal languages", in INTERSPEECH-2006, paper 1473-Mon1BuP.6.