Learning Syllable-Level Discrete Prosodic Representation for Expressive Speech Generation

Guangyan Zhang, Ying Qin, Tan Lee


This paper presents an extension of the Tacotron 2 end-to-end speech synthesis architecture that learns syllable-level discrete prosodic representations from speech data. The learned representations can be used for transferring or controlling prosody in expressive speech generation. The proposed design starts with a syllable-level text encoder that encodes input text at the syllable level instead of the phoneme level. A continuous prosodic representation is then extracted for each syllable. A Vector-Quantised Variational Auto-Encoder (VQ-VAE) is used to discretize the learned continuous prosodic representations. The discrete representations are finally concatenated with the text encoder output to achieve prosody transfer or control. Subjective evaluation is carried out on the syllable-level TTS system and on the effectiveness of prosody transfer. The results show that the proposed syllable-level neural TTS system produces more natural speech than a conventional phoneme-level TTS system. It is also shown that prosody transfer can be achieved and that the latent prosody codes are interpretable in relation to specific prosody variations.
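The VQ-VAE discretization step described above amounts to replacing each continuous per-syllable prosodic vector with its nearest entry in a learned codebook. A minimal NumPy sketch of that lookup is shown below; the codebook size (64) and embedding dimension (16) are illustrative assumptions, not values taken from the paper, and the codebook here is random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook: 64 discrete prosody codes, 16-dim embeddings.
codebook = rng.normal(size=(64, 16))

def quantize(z, codebook):
    """Map each continuous vector to its nearest codebook entry (L2 distance)."""
    # Pairwise distances: (num_syllables, num_codes)
    d = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)          # discrete prosody code per syllable
    return codebook[idx], idx       # quantized vectors and their indices

# Continuous prosodic vectors for a 5-syllable utterance (random stand-ins).
z = rng.normal(size=(5, 16))
z_q, codes = quantize(z, codebook)
```

In the full system, `z_q` (or an embedding of `codes`) would be concatenated with the text encoder output, so that at synthesis time prosody can be transferred by copying codes from a reference utterance or controlled by setting them directly.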


 DOI: 10.21437/Interspeech.2020-2228

Cite as: Zhang, G., Qin, Y., Lee, T. (2020) Learning Syllable-Level Discrete Prosodic Representation for Expressive Speech Generation. Proc. Interspeech 2020, 3426-3430, DOI: 10.21437/Interspeech.2020-2228.


@inproceedings{Zhang2020,
  author={Guangyan Zhang and Ying Qin and Tan Lee},
  title={{Learning Syllable-Level Discrete Prosodic Representation for Expressive Speech Generation}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3426--3430},
  doi={10.21437/Interspeech.2020-2228},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2228}
}