HMM-based synthesis of creaky voice

Tuomo Raitio, John Kane, Thomas Drugman, Christer Gobl

Creaky voice, also referred to as vocal fry, is a voice quality frequently produced in many languages, in both read and conversational speech. To enhance naturalness, speech synthesizers should be able to generate speech in all its expressive diversity, including creaky voice. The present study integrates our recent developments, namely creaky voice detection, prediction of creaky voice from context, and rendering of creaky excitation, into a fully functioning and automatic HMM-based synthesis system. HMM-based synthetic creaky voices are built and evaluated in subjective listening tests, which show that the best synthetic creaky voices are rated as more natural and more creaky than a conventional voice. A non-creaky voice is also successfully transformed to use creak by modifying the F0 contour and excitation of the predicted creaky parts. The transformed voice is rated as equally natural and clearly more creaky than the original voice.

doi: 10.21437/Interspeech.2013-542

Cite as: Raitio, T., Kane, J., Drugman, T., Gobl, C. (2013) HMM-based synthesis of creaky voice. Proc. Interspeech 2013, 2316-2320, doi: 10.21437/Interspeech.2013-542

