INTERSPEECH 2006 - ICSLP
Tone plays an important role in recognizing spoken tonal languages such as Chinese. However, the discontinuity of the F0 contour between voiced and unvoiced segments has traditionally been a bottleneck in modeling tone contours for automatic speech recognition and synthesis, and various heuristic approaches have been proposed to get around the problem. The Multi-Space Distribution (MSD), proposed by Tokuda et al. and applied to HMM-based speech synthesis, models two probability spaces, a discrete one for unvoiced regions and a continuous one for the voiced F0 contour, in a linearly weighted mixture. We extend the MSD to tone modeling for speech recognition. Specifically, tone modeling for speaker-independent spoken Chinese is formulated and tested on a Mandarin speech database. The tone features and spectral features are further separated into two streams, and stream-dependent models are built so that the two feature types are clustered in separate decision trees. The recognition results show that the tonal syllable error rate improves from the toneless baseline system to the MSD-based stream-dependent system, from 50.5% to 36.1% and from 46.3% to 35.1% for the two systems built with two different phone sets. Compared with conventional tone modeling, the new approach reduces the tonal syllable error rate by 5.5% and 6.1% absolute, respectively.
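The linearly weighted mixture of a discrete unvoiced space and a continuous voiced space can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `msd_likelihood`, its parameters, the use of a single one-dimensional Gaussian for the voiced space, and the choice of log F0 as the feature are all assumptions made for illustration.

```python
import math


def msd_likelihood(f0, w_voiced, mean, var):
    """Per-frame likelihood under a two-space MSD (illustrative sketch).

    f0       -- None for an unvoiced frame, otherwise a float such as log F0
    w_voiced -- weight of the continuous (voiced) space, in [0, 1];
                the discrete (unvoiced) space gets weight 1 - w_voiced
    mean/var -- parameters of the assumed 1-D Gaussian over voiced F0
    """
    if not 0.0 <= w_voiced <= 1.0:
        raise ValueError("w_voiced must lie in [0, 1]")
    if var <= 0.0:
        raise ValueError("var must be positive")

    if f0 is None:
        # Unvoiced frame: the zero-dimensional space has no density,
        # so only its space weight contributes.
        return 1.0 - w_voiced

    # Voiced frame: space weight times the Gaussian density at f0.
    density = math.exp(-0.5 * (f0 - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return w_voiced * density


# Hypothetical parameter values, chosen only to exercise both branches.
unvoiced_score = msd_likelihood(None, w_voiced=0.7, mean=5.0, var=0.04)
voiced_score = msd_likelihood(5.1, w_voiced=0.7, mean=5.0, var=0.04)
```

Because voiced and unvoiced frames are scored by the same function, no interpolation of F0 across unvoiced segments is needed, which is the property that makes the MSD attractive for tone modeling.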
Bibliographic reference. Wang, Huanliang / Qian, Yao / Soong, Frank K. / Zhou, Jian-Lai / Han, Jiqing (2006): "A multi-space distribution (MSD) approach to speech recognition of tonal languages", In INTERSPEECH-2006, paper 1473-Mon1BuP.6.