Speech Prosody 2010
Chicago, IL, USA
HMM-based speech synthesis can produce high-quality synthetic speech with flexible modeling of spectral and prosodic parameters. In this approach, short-term spectra, fundamental frequency (F0), and duration are generated separately by multi-stream HMMs. However, the quality of synthetic speech degrades when the feature vectors used in training are noisy. Among the sources of noise, pitch tracking errors and the resulting flawed voiced/unvoiced (VU) decisions are the two key factors in voice quality problems. Pitch tracking errors occur more often in Mandarin vowels of Tone 3 and Tone 4, because the pitch of these vowels can be very low and is sometimes treated as an aperiodic signal. In addition, F0 values in unvoiced regions, such as consonants, are normally defined as unavailable, which makes it impossible to use standard HMMs for F0 modeling. The currently preferred solution is the multi-space distribution HMM (MSDHMM), in which discrete distributions model the VU decision and continuous Gaussian distributions model F0 within the voiced regions. Because of the assumption of undefined F0 values in unvoiced regions and the special structure of the MSDHMM, the generated F0 values are limited in accuracy. In this paper, an F0 generation process model is used to estimate F0 values in regions of pitch tracking errors as well as in unvoiced regions. Prior knowledge of VU is imposed on each Mandarin phoneme and then used for the VU decision. Thus F0 can be modeled within the standard HMM framework.

Index Terms: Mandarin speech synthesis, generation process model, F0 contour, HMM-based speech synthesis
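
To illustrate how a generation process model can supply F0 values everywhere, including unvoiced regions, the following is a minimal sketch. It assumes the model meant here is the command-response form commonly given in the literature, in which ln F0 is a baseline plus phrase and tone components. The abstract itself does not state the formulation, so the constants (alpha = 3.0, beta = 20.0, gamma = 0.9), the function names, and all command values below are illustrative assumptions, not the authors' settings.

```python
import math

def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism (zero for t < 0)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def tone_response(t, beta=20.0, gamma=0.9):
    """Step response of the tone control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def log_f0(t, fb, phrase_cmds, tone_cmds):
    """ln F0(t) = ln Fb + sum of phrase components + sum of tone components."""
    value = math.log(fb)
    for onset, magnitude in phrase_cmds:
        value += magnitude * phrase_response(t - onset)
    for start, end, amplitude in tone_cmds:
        # A tone command is a pedestal: step on at `start`, step off at `end`.
        value += amplitude * (tone_response(t - start) - tone_response(t - end))
    return value

def f0_contour(times, fb, phrase_cmds, tone_cmds):
    """Continuous F0 contour in Hz, defined at every time point."""
    return [math.exp(log_f0(t, fb, phrase_cmds, tone_cmds)) for t in times]

# Hypothetical commands: one phrase command, one positive and one
# negative tone command (negative amplitudes are an assumption here).
times = [i * 0.005 for i in range(200)]            # 5 ms frames, 1 s total
contour = f0_contour(
    times,
    fb=100.0,                                      # assumed baseline in Hz
    phrase_cmds=[(0.0, 0.5)],                      # (onset s, magnitude)
    tone_cmds=[(0.2, 0.4, 0.4), (0.5, 0.8, -0.3)], # (start s, end s, amplitude)
)
```

Because the contour is a smooth function of time, every frame receives an F0 value, which is what allows a standard continuous-density HMM to be trained on it; the VU decision then has to come from a separate source, such as the per-phoneme prior described in the abstract.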
Bibliographic reference. Wang, Miaomiao / Hirose, Keikichi / Minematsu, Nobuaki (2010): "Generation of fundamental frequency contours of Mandarin in HMM-based speech synthesis using generation process model", In SP-2010, paper 098.