Word emphasis prediction is an important part of expressive prosody generation in modern Text-To-Speech (TTS) systems. We present a method for predicting emphasized words for expressive TTS, based on a Deep Neural Network (DNN). We show that the presented method outperforms machine learning methods based on hand-crafted features in terms of objective metrics such as precision and recall. Using a listening test, we further demonstrate that the contribution of the predicted emphasized words to the expressiveness of the synthesized speech is subjectively perceivable.
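As an illustration of the objective metrics named in the abstract, the following is a minimal sketch of precision and recall computed over per-word binary emphasis labels. The function name, the 0/1 label encoding, and the example sentence are assumptions made for illustration only; the abstract does not describe the paper's evaluation protocol or data.

```python
from typing import Sequence, Tuple


def emphasis_precision_recall(
    gold: Sequence[int], pred: Sequence[int]
) -> Tuple[float, float]:
    """Precision and recall for the 'emphasized' class (label 1).

    Hypothetical helper: each position is one word, 1 = emphasized, 0 = not.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must label the same words")

    # Count true positives, false positives and false negatives per word.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

    # Guard against empty denominators (no predicted or no gold emphasis).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example sentence and labels, not taken from the paper.
words = ["I", "never", "said", "she", "stole", "it"]
gold = [0, 1, 0, 0, 1, 0]  # annotated emphasis: "never", "stole"
pred = [0, 1, 0, 1, 0, 0]  # predicted emphasis: "never", "she"

precision, recall = emphasis_precision_recall(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision here penalizes words that are predicted as emphasized but are not annotated as such, while recall penalizes annotated emphasized words that the predictor misses.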
DOI: 10.21437/Interspeech.2018-1159
Cite as: Mass, Y., Shechtman, S., Mordechay, M., Hoory, R., Sar Shalom, O., Lev, G., Konopnicki, D. (2018) Word Emphasis Prediction for Expressive Text to Speech. Proc. Interspeech 2018, 2868-2872, DOI: 10.21437/Interspeech.2018-1159.
@inproceedings{Mass2018,
  author={Yosi Mass and Slava Shechtman and Moran Mordechay and Ron Hoory and Oren {Sar Shalom} and Guy Lev and David Konopnicki},
  title={Word Emphasis Prediction for Expressive Text to Speech},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2868--2872},
  doi={10.21437/Interspeech.2018-1159},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1159}
}