Second International Conference on Spoken Language Processing (ICSLP'92)

Banff, Alberta, Canada
October 13-16, 1992

Segmental Power Control for Japanese Speech Synthesis

Kenzo Itoh, Tomohisa Hirokawa, Hirokazu Sato

Speech and Acoustics Laboratory, NTT Human Interface Laboratories, Kanagawa, Japan

This paper proposes a segmental power control method for speech synthesis by rule. The method is novel in exploiting the characteristics of the phoneme environment and the relationship between speech power and pitch frequency. First, the Permissible Threshold (PT) for power modification is measured in subjective experiments using speech material in which phoneme power has been manipulated. The experiments show that the PT for phoneme power modification is 4.1 dB. This result is significant for any discussion of power control because it gives a criterion for power control accuracy. Next, the relationship between speech power and pitch frequency is analyzed using a very large speech database. The results show that the relationship between phoneme segmental power and pitch frequency is affected by the kind of phoneme, the adjoining phonemes, whether the pitch is rising or falling, and whether the position is at the beginning or the end of the sentence. Finally, we propose that segmental speech power be controlled according to the pitch and the phoneme environment. The new method yields an average root mean square error of 2.17 dB between real and estimated speech power. This value indicates that 94% of the estimated power values are within the Permissible Threshold of human perception.
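
The link between the 2.17 dB error, the 4.1 dB threshold, and the 94% figure can be illustrated with a short sketch. The abstract does not say how the 94% was obtained; the code below assumes, purely for illustration, that the estimation errors are zero-mean and Gaussian with a standard deviation equal to the reported root mean square error. The function names `rmse_db` and `fraction_within` are hypothetical and are not taken from the paper.

```python
import math


def rmse_db(real_db, estimated_db):
    """Root mean square error between two equal-length sequences of power values in dB."""
    if len(real_db) != len(estimated_db) or len(real_db) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - e) ** 2 for r, e in zip(real_db, estimated_db)]
    return math.sqrt(sum(squared) / len(squared))


def fraction_within(threshold_db, rmse):
    """Share of errors with |error| <= threshold_db.

    ASSUMPTION (not stated in the abstract): errors are zero-mean Gaussian
    with standard deviation equal to the RMS error.
    """
    return math.erf(threshold_db / (rmse * math.sqrt(2.0)))


PT_DB = 4.1     # Permissible Threshold reported in the abstract
RMSE_DB = 2.17  # RMS error reported in the abstract

coverage = fraction_within(PT_DB, RMSE_DB)
print(f"Estimated share within PT: {coverage:.1%}")
```

Under this assumption the computed share comes out at roughly 0.94, which is consistent with the 94% quoted above. With measured data, `rmse_db` would be applied to the real and estimated segmental power values first, and the share within the threshold could also be counted directly instead of relying on the Gaussian assumption.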

