Naturalness Enhancement with Linguistic Information in End-to-End TTS Using Unsupervised Parallel Encoding

Alex Peiró-Lilja, Mireia Farrús


State-of-the-art end-to-end speech synthesis models have reached levels of quality close to human capabilities. However, there is still room for improvement in naturalness, particularly in prosody, which is essential for human-machine interaction. Part of current research has therefore shifted its focus to improving this aspect, with many solutions that mainly involve prosody adaptability or control. In this work, we explored a way to include linguistic features in the sequence-to-sequence Tacotron2 system to improve the naturalness of the generated voice, that is, to make the prosody of the synthesis sound more like that of a real human speaker. Specifically, we used an additional encoder to embed part-of-speech tags and punctuation mark locations of the input text in order to condition Tacotron2 generation. We propose two different architectures for this parallel encoder: one based on a stack of convolutional plus recurrent layers, and another formed by a stack of bidirectional recurrent plus linear layers. To evaluate the similarity between real read speech and synthesis, we carried out an objective test using signal processing metrics and a perceptual test. The presented results show that we achieved an improvement in naturalness.
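
As a rough illustration of the kind of input the parallel encoder could consume, the sketch below turns already-tagged text into two aligned sequences: part-of-speech IDs and punctuation-location flags. Everything in it is an assumption made for illustration, not the authors' implementation: the tag inventory `POS_VOCAB`, the function name `encode_linguistic_features`, the token-level granularity, and the hand-tagged example sentence are all hypothetical, since the abstract does not specify the tag set, the tagger, or the alignment with the text input.

```python
import string

# Hypothetical tag inventory; the abstract does not specify the tag set used.
POS_VOCAB = {"<pad>": 0, "DET": 1, "NOUN": 2, "VERB": 3, "ADJ": 4, "ADV": 5, "PUNCT": 6}


def encode_linguistic_features(tagged_tokens):
    """Map (token, pos_tag) pairs to two parallel integer sequences.

    Returns (pos_ids, punct_flags): one POS ID per token, and a 0/1 flag
    marking the tokens that consist only of punctuation characters.
    """
    pos_ids, punct_flags = [], []
    for token, tag in tagged_tokens:
        pos_ids.append(POS_VOCAB[tag])
        # A token counts as punctuation only if it is non-empty and every
        # character belongs to the ASCII punctuation set.
        is_punct = bool(token) and all(ch in string.punctuation for ch in token)
        punct_flags.append(1 if is_punct else 0)
    return pos_ids, punct_flags


# Hand-tagged example sentence (assumed tags, no tagger is run here).
tagged = [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), (".", "PUNCT")]
pos_ids, punct_flags = encode_linguistic_features(tagged)
print(pos_ids)      # [1, 2, 3, 6]
print(punct_flags)  # [0, 0, 0, 1]
```

In a full system, sequences like these would be embedded and passed through the additional encoder alongside the regular text encoder; how the two encodings are combined is not described in the abstract and is left out of this sketch.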


DOI: 10.21437/Interspeech.2020-1788

Cite as: Peiró-Lilja, A., Farrús, M. (2020) Naturalness Enhancement with Linguistic Information in End-to-End TTS Using Unsupervised Parallel Encoding. Proc. Interspeech 2020, 3994-3998, DOI: 10.21437/Interspeech.2020-1788.


@inproceedings{Peiró-Lilja2020,
  author={Alex Peiró-Lilja and Mireia Farrús},
  title={{Naturalness Enhancement with Linguistic Information in End-to-End TTS Using Unsupervised Parallel Encoding}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3994--3998},
  doi={10.21437/Interspeech.2020-1788},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1788}
}