Dynamic Prosody Generation for Speech Synthesis Using Linguistics-Driven Acoustic Embedding Selection

Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, Jaime Lorenzo-Trueba


Recent advances in Text-to-Speech (TTS) have brought quality and naturalness close to human levels. What is still lacking for human-like communication, however, is the dynamic variation and adaptability of human speech in more complex scenarios. This work aims to achieve more dynamic and natural intonation in TTS systems, particularly for stylistic speech such as the newscaster speaking style. We propose a novel way of exploiting linguistic information in variational autoencoder (VAE) systems to drive dynamic prosody generation, and we analyze the contribution of both semantic and syntactic features. Our results show that the approach improves prosody and naturalness for complex utterances as well as in Long Form Reading (LFR).


DOI: 10.21437/Interspeech.2020-1411

Cite as: Tyagi, S., Nicolis, M., Rohnke, J., Drugman, T., Lorenzo-Trueba, J. (2020) Dynamic Prosody Generation for Speech Synthesis Using Linguistics-Driven Acoustic Embedding Selection. Proc. Interspeech 2020, 4407-4411, DOI: 10.21437/Interspeech.2020-1411.


@inproceedings{Tyagi2020,
  author={Shubhi Tyagi and Marco Nicolis and Jonas Rohnke and Thomas Drugman and Jaime Lorenzo-Trueba},
  title={{Dynamic Prosody Generation for Speech Synthesis Using Linguistics-Driven Acoustic Embedding Selection}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4407--4411},
  doi={10.21437/Interspeech.2020-1411},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1411}
}