Introducing Prosodic Speaker Identity for a Better Expressive Speech Synthesis Control

Aghilas Sini, Sébastien Le Maguer, Damien Lolive, Elisabeth Delais-Roussarie


To gain more control over Text-to-Speech (TTS) synthesis and to improve expressivity, it is necessary to disentangle the prosodic information carried by the speaker’s voice identity from the prosodic information determined by linguistic properties. In this paper, we analyze how information related to speaker voice identity affects a Deep Neural Network (DNN) based multi-speaker speech synthesis model. To do so, we feed the network with a vector encoding speaker information in addition to a set of basic linguistic features. We then compare three main speaker coding configurations: a) a simple one-hot vector describing the speaker’s gender and identifier; b) an embedding vector extracted from a pre-trained speaker recognition model; c) a prosodic vector that summarizes information such as melody, intensity, and duration. To measure the impact of the input feature vector, we investigate the representation of the latent space at the output of the first layer of the network. The aim is to obtain an overview of our data representation and model behavior. Furthermore, we conducted a subjective assessment to validate the results. Results show that the prosodic identity of the speaker is captured by the model, which therefore allows the user to control synthesis more precisely.
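As an illustration of the input construction described in the abstract, the following minimal sketch shows how a speaker code might be appended to frame-level linguistic features for configurations (a) and (c). It is not the authors' implementation: the function names, the feature dimensions, and the choice of mean and standard deviation as the prosodic summary are all assumptions made for this example. Configuration (b) would simply substitute an embedding vector obtained from an external speaker recognition model.

```python
import numpy as np

def one_hot_speaker_code(speaker_idx, n_speakers, gender_idx, n_genders=2):
    """Configuration (a): gender one-hot followed by speaker-identifier one-hot."""
    code = np.zeros(n_genders + n_speakers, dtype=np.float32)
    code[gender_idx] = 1.0
    code[n_genders + speaker_idx] = 1.0
    return code

def prosodic_summary(f0_hz, energy_db, durations_s):
    """Configuration (c), hypothetical form: mean and standard deviation of
    F0 (voiced frames only), intensity, and segment duration.
    Assumes at least one voiced frame (f0 > 0)."""
    voiced = f0_hz[f0_hz > 0]
    stats = []
    for values in (voiced, energy_db, durations_s):
        stats.extend([values.mean(), values.std()])
    return np.asarray(stats, dtype=np.float32)

def append_speaker_code(linguistic, speaker_vec):
    """Repeat the speaker vector for every frame and concatenate it
    to the linguistic features along the feature axis."""
    tiled = np.tile(speaker_vec, (linguistic.shape[0], 1))
    return np.concatenate([linguistic, tiled], axis=1)

# Toy data: 5 frames with 10 linguistic features each (assumed sizes).
linguistic = np.zeros((5, 10), dtype=np.float32)

# (a) speaker 2 of 4, gender index 1
code_a = one_hot_speaker_code(speaker_idx=2, n_speakers=4, gender_idx=1)
inputs_a = append_speaker_code(linguistic, code_a)

# (c) summary statistics computed from toy prosodic measurements
code_c = prosodic_summary(
    f0_hz=np.array([0.0, 100.0, 200.0]),
    energy_db=np.array([60.0, 70.0]),
    durations_s=np.array([0.1, 0.3]),
)
inputs_c = append_speaker_code(linguistic, code_c)
```

In this sketch the network input width depends on the speaker code (16 columns for the one-hot variant, 16 for the six-statistic prosodic variant plus the ten linguistic features), so each configuration would need its own input layer size.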


DOI: 10.21437/SpeechProsody.2020-191

Cite as: Sini, A., Le Maguer, S., Lolive, D., Delais-Roussarie, E. (2020) Introducing Prosodic Speaker Identity for a Better Expressive Speech Synthesis Control. Proc. 10th International Conference on Speech Prosody 2020, 935-939, DOI: 10.21437/SpeechProsody.2020-191.


@inproceedings{Sini2020,
  author={Aghilas Sini and Sébastien Le Maguer and Damien Lolive and Elisabeth Delais-Roussarie},
  title={{Introducing Prosodic Speaker Identity for a Better Expressive Speech Synthesis Control}},
  year=2020,
  booktitle={Proc. 10th International Conference on Speech Prosody 2020},
  pages={935--939},
  doi={10.21437/SpeechProsody.2020-191},
  url={http://dx.doi.org/10.21437/SpeechProsody.2020-191}
}