Improving Speech Synthesis with Discourse Relations

Adèle Aubin, Alessandra Cervone, Oliver Watts, Simon King

This paper explores whether adding Discourse Relation (DR) features improves the naturalness of neural statistical parametric speech synthesis (SPSS) in English. We hypothesize first, in light of several previous studies, that DRs have a dedicated prosodic encoding. Second, we hypothesize that encoding DRs in a speech synthesizer's input will improve the naturalness of its output. To test these hypotheses, we prepare a dataset of DR-annotated transcriptions of English audiobooks. We then perform an acoustic analysis of the corpus, which supports our first hypothesis that DRs are acoustically encoded in speech prosody: the analysis reveals significant correlations between specific DR categories and acoustic features such as F0 and intensity. Next, we use the corpus to train a neural SPSS system in two configurations: a baseline that uses only conventional linguistic features, and an experimental configuration in which these are supplemented with DR features. Augmenting the inputs with DR features improves objective acoustic scores on a test set and leads to a significant listener preference in a forced-choice AB test for naturalness.
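As a rough illustration of what "supplementing" conventional linguistic features with DRs could look like, the sketch below appends a one-hot DR encoding to a feature vector. It is not taken from the paper: the label inventory `DR_LABELS`, the helper names `one_hot` and `augment_with_dr`, and the toy feature values are all hypothetical, and the actual system may encode DRs differently.

```python
from typing import List, Sequence

# Hypothetical DR label inventory (assumption); the paper's actual
# label set is not stated in this abstract.
DR_LABELS = ["none", "contrast", "cause", "elaboration"]


def one_hot(label: str, inventory: Sequence[str] = DR_LABELS) -> List[float]:
    """Return a one-hot vector for `label` over `inventory`."""
    if label not in inventory:
        raise ValueError(f"unknown DR label: {label!r}")
    return [1.0 if label == name else 0.0 for name in inventory]


def augment_with_dr(linguistic: Sequence[float], dr_label: str) -> List[float]:
    """Append a one-hot DR encoding to a conventional linguistic feature vector."""
    return list(linguistic) + one_hot(dr_label)


# Toy stand-in for a baseline input vector (assumption, not real features).
baseline_input = [0.0, 1.0, 0.5]

# Experimental input: the same features plus the DR encoding.
experimental_input = augment_with_dr(baseline_input, "contrast")
```

In this sketch the baseline and experimental inputs differ only in the appended DR dimensions, which mirrors the two-configuration comparison described above.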

DOI: 10.21437/Interspeech.2019-1945

Cite as: Aubin, A., Cervone, A., Watts, O., King, S. (2019) Improving Speech Synthesis with Discourse Relations. Proc. Interspeech 2019, 4470-4474, DOI: 10.21437/Interspeech.2019-1945.

  author={Adèle Aubin and Alessandra Cervone and Oliver Watts and Simon King},
  title={{Improving Speech Synthesis with Discourse Relations}},
  booktitle={Proc. Interspeech 2019},