Second International Conference on Spoken Language Processing (ICSLP'92)
Banff, Alberta, Canada
A speech synthesis system has been developed that directly compiles phoneme wavelet segments, selected from a wavelet dictionary containing over 45,000 entries, to yield high-quality synthesized voice. At ICSLP'90, we proposed the wavelet selection and wavelet concatenation methods used in our system. To realize the system, we establish rule-based prosody pattern setting and a wavelet modification procedure to achieve the design goals. Phoneme duration is set according to phoneme environment, and phoneme power is controlled by both pitch frequency and phoneme environment. Tests show that the average errors in vowel duration and consonant duration are 28.8 ms and 16.8 ms respectively, and that the average error in vowel power is 2.93 dB. Wavelet pitch frequency is controlled by an approach based on the pitch-synchronous overlap-add method. To avoid abrupt changes in voice spectrum and wavelet shape, an interpolation operation is carried out between voiced wavelets. The synthesized speech has high intelligibility and naturalness, while the original speaker quality is retained.
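The interpolation between voiced wavelets mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the interpolation rule, so a linear blend between two equal-length wavelets is assumed here; the function name `interpolate_wavelets` and its parameters are hypothetical and are not taken from the paper.

```python
def interpolate_wavelets(w_a, w_b, n_steps):
    """Return n_steps intermediate wavelets blending linearly from w_a to w_b.

    Assumption (not from the paper): both wavelets are lists of samples of
    equal length, and the blend weight changes linearly from one wavelet
    to the next.
    """
    if len(w_a) != len(w_b):
        raise ValueError("wavelets must have equal length")
    blended = []
    for k in range(1, n_steps + 1):
        # Weight of w_b, strictly between 0 and 1, so the end points
        # themselves are never duplicated in the output.
        alpha = k / (n_steps + 1)
        blended.append([(1 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)])
    return blended


# Three intermediate shapes between two toy two-sample "wavelets".
steps = interpolate_wavelets([0.0, 4.0], [4.0, 0.0], 3)
```

Inserting such intermediate shapes between neighbouring voiced wavelets is one simple way to make the waveform, and hence the spectrum, change gradually rather than abruptly at a concatenation point.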
Bibliographic reference. Hirokawa, Tomohisa / Itoh, Kenzo / Sato, Hirokazu (1992): "High quality speech synthesis based on wavelet compilation of phoneme segments", In ICSLP-1992, 567-570.