Modelling Intonation in Spectrograms for Neural Vocoder based Text-to-Speech

Vincent Wan, Jonathan Shen, Hanna Silén, Rob Clark


Intonation is characterized by rises and falls in pitch and energy. In previous work, we explicitly modelled these prosodic features using a Clockwork Hierarchical Variational autoEncoder (CHiVE), showing that we can generate multiple intonation contours for any text. However, recent advances in text-to-speech synthesis produce spectrograms that are inverted by neural vocoders to produce waveforms. Spectrograms encode intonation in a complex way; there is no simple, explicit representation analogous to pitch (fundamental frequency) and energy. In this paper, we extend CHiVE to model intonation within a spectrogram. Compared to the original model, the spectrogram extension gives better mean opinion scores in subjective listening tests. We show that the intonation in the generated spectrograms matches the intonation represented by the generated pitch curves.
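
As a rough illustration of the abstract's final claim (this is not code from the paper), the sketch below shows one way such a consistency check could be run: track F0 from a vocoded waveform and correlate it with the model's explicitly generated pitch curve. The sample rate, hop length, and the arrays waveform and generated_f0 are assumptions for illustration; only librosa's pYIN tracker is a real API.

import numpy as np
import librosa

def f0_correlation(waveform, generated_f0, sr=24000, hop_length=300):
    """Correlate F0 tracked from audio with a model-predicted pitch curve."""
    # Track F0 from the vocoded waveform; pYIN marks unvoiced frames as NaN.
    tracked_f0, _, _ = librosa.pyin(
        waveform,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
        hop_length=hop_length,
    )
    # Align lengths: the tracker and the model may emit different frame counts.
    n = min(len(tracked_f0), len(generated_f0))
    tracked = np.asarray(tracked_f0[:n], dtype=float)
    predicted = np.asarray(generated_f0[:n], dtype=float)
    # Compare only frames that both sources treat as voiced.
    voiced = ~np.isnan(tracked) & (predicted > 0)
    return np.corrcoef(tracked[voiced], predicted[voiced])[0, 1]

A correlation near 1 over the voiced frames would indicate that the spectrogram and the explicit pitch curve encode the same intonation contour.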


DOI: 10.21437/SpeechProsody.2020-193

Cite as: Wan, V., Shen, J., Silén, H., Clark, R. (2020) Modelling Intonation in Spectrograms for Neural Vocoder based Text-to-Speech. Proc. 10th International Conference on Speech Prosody 2020, 945-949, DOI: 10.21437/SpeechProsody.2020-193.


@inproceedings{Wan2020,
  author={Vincent Wan and Jonathan Shen and Hanna Silén and Rob Clark},
  title={{Modelling Intonation in Spectrograms for Neural Vocoder based Text-to-Speech}},
  year=2020,
  booktitle={Proc. 10th International Conference on Speech Prosody 2020},
  pages={945--949},
  doi={10.21437/SpeechProsody.2020-193},
  url={http://dx.doi.org/10.21437/SpeechProsody.2020-193}
}