Expressive Speech Synthesis Using Sentiment Embeddings

Igor Jauk, Jaime Lorenzo-Trueba, Junichi Yamagishi, Antonio Bonafonte

In this paper we present a DNN-based speech synthesis system trained on an audiobook, using sentiment features predicted by the Stanford sentiment parser. The baseline system uses a DNN to predict acoustic parameters from conventional linguistic features of the kind used in statistical parametric speech synthesis. The predicted parameters are transformed into speech using a conventional high-quality vocoder. In the proposed system, the conventional linguistic features are enriched with sentiment features. Different sentiment representations are considered, combining sentiment probabilities with hierarchical distance and context. After a preliminary analysis, a listening experiment is conducted in which participants evaluate the different systems. The results show the usefulness of the proposed features and reveal differences between expert and non-expert TTS users.
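
As a rough illustration of what enriching linguistic features with sentiment probabilities could look like, the sketch below appends one sentence-level sentiment vector to every frame of a linguistic feature matrix. This is an assumption about the general idea, not the authors' implementation: the function name `enrich_with_sentiment`, the feature dimensions, and the five-class probability vector are hypothetical choices for the example (the abstract does not state the number of classes), and the hierarchical distance and context representations mentioned above are not modeled here.

```python
import numpy as np

def enrich_with_sentiment(linguistic, sentiment_probs):
    """Append a sentence-level sentiment vector to each frame's linguistic features.

    linguistic:      array of shape (n_frames, n_linguistic)
    sentiment_probs: array of shape (n_classes,), e.g. class probabilities
                     from a sentiment classifier (hypothetical input here)
    """
    linguistic = np.asarray(linguistic, dtype=np.float32)
    sentiment_probs = np.asarray(sentiment_probs, dtype=np.float32)
    # Repeat the same sentence-level vector once per frame.
    tiled = np.tile(sentiment_probs, (linguistic.shape[0], 1))
    # Concatenate along the feature axis to form the enriched DNN input.
    return np.concatenate([linguistic, tiled], axis=1)

# Toy example: 3 frames with 4 linguistic features each (made-up sizes).
frames = np.zeros((3, 4), dtype=np.float32)
# Made-up probabilities over five sentiment classes.
probs = [0.05, 0.10, 0.60, 0.20, 0.05]
enriched = enrich_with_sentiment(frames, probs)
```

In this sketch, each row of `enriched` holds the original linguistic features followed by the sentiment probabilities, so a network trained on it would see the same sentiment vector for every frame of a sentence.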

DOI: 10.21437/Interspeech.2018-2467

Cite as: Jauk, I., Lorenzo-Trueba, J., Yamagishi, J., Bonafonte, A. (2018) Expressive Speech Synthesis Using Sentiment Embeddings. Proc. Interspeech 2018, 3062-3066, DOI: 10.21437/Interspeech.2018-2467.

  author={Igor Jauk and Jaime Lorenzo-Trueba and Junichi Yamagishi and Antonio Bonafonte},
  title={Expressive Speech Synthesis Using Sentiment Embeddings},
  booktitle={Proc. Interspeech 2018},