Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model

Tom Kenter, Manish Sharma, Rob Clark


The prosody of currently available speech synthesis systems can be unnatural because the systems only have access to the text, possibly enriched by linguistic information such as part-of-speech tags and parse trees. We show that incorporating a BERT model, pretrained on large amounts of unlabeled data and fine-tuned to the speech domain, into an RNN-based speech synthesis model improves prosody. Additionally, we propose a way of handling arbitrarily long sequences with BERT. Our findings indicate that small BERT models work better than large ones, and that fine-tuning the BERT part of the model is pivotal for getting good results.
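The abstract does not detail the proposed mechanism for handling arbitrarily long sequences, so the following is only a minimal sketch of one common approach: splitting the token sequence into overlapping windows and dropping the duplicated overlap. The use of the HuggingFace transformers library, the bert-base-uncased checkpoint, the encode_long_text helper, and the window/stride sizes are all illustrative assumptions, not the authors' implementation.

import torch
from transformers import BertModel, BertTokenizer

# NOTE: model choice, window/stride sizes, and the windowing policy are
# illustrative assumptions; the paper's own scheme may differ.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def encode_long_text(text, window=128, stride=64):
    """Return one BERT embedding per token, for input of arbitrary length.

    The token sequence is split into windows of `window` tokens that
    overlap by `window - stride` tokens; for every window after the
    first, the overlapping prefix is dropped so that each token is
    emitted exactly once.
    """
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for start in range(0, max(len(ids), 1), stride):
        piece = ids[start:start + window]
        # Re-add [CLS]/[SEP] so each window looks like a normal BERT input.
        inputs = torch.tensor([tokenizer.build_inputs_with_special_tokens(piece)])
        with torch.no_grad():
            hidden = model(inputs).last_hidden_state[0, 1:-1]  # strip [CLS]/[SEP]
        keep_from = 0 if start == 0 else window - stride
        chunks.append(hidden[keep_from:])
        if start + window >= len(ids):
            break
    return torch.cat(chunks, dim=0)  # shape: (num_tokens, hidden_size)

In a synthesis pipeline, such token-level embeddings would be consumed alongside the RNN model's phoneme-level features. Note that this sketch keeps the BERT weights frozen, whereas the paper reports that fine-tuning the BERT part of the model is pivotal for good results.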


DOI: 10.21437/Interspeech.2020-1430

Cite as: Kenter, T., Sharma, M., Clark, R. (2020) Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model. Proc. Interspeech 2020, 4412-4416, DOI: 10.21437/Interspeech.2020-1430.


@inproceedings{Kenter2020,
  author={Tom Kenter and Manish Sharma and Rob Clark},
  title={{Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={4412--4416},
  doi={10.21437/Interspeech.2020-1430},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1430}
}