Semi-Supervised Prosody Modeling Using Deep Gaussian Process Latent Variable Model

Tomoki Koriyama, Takao Kobayashi

This paper proposes a semi-supervised speech synthesis framework in which the prosodic labels of the training data are only partially annotated. Appropriately annotated prosodic labels are crucial when constructing a text-to-speech (TTS) system. Manual annotation gives good results, but it generally requires considerable time and effort. Although recent studies report that end-to-end TTS frameworks can generate natural-sounding prosody without prosodic labels, this does not necessarily hold for every language, for example pitch-accent languages. As an alternative, we propose an approach that uses a latent variable representation of prosodic information. To model the latent variables, we employ a deep Gaussian process (DGP), a deep Bayesian generative model. In the proposed semi-supervised learning framework, the posterior distributions of the latent variables are inferred from linguistic and acoustic features, and the inferred latent variables are then used to train a DGP-based regression model of acoustic features. Experimental results show that, in subjective evaluation, the proposed framework achieves performance comparable to that obtained with fully annotated speech data, even when the pitch-accent information is limited.

DOI: 10.21437/Interspeech.2019-2497

Cite as: Koriyama, T., Kobayashi, T. (2019) Semi-Supervised Prosody Modeling Using Deep Gaussian Process Latent Variable Model. Proc. Interspeech 2019, 4450-4454, DOI: 10.21437/Interspeech.2019-2497.

  author={Tomoki Koriyama and Takao Kobayashi},
  title={{Semi-Supervised Prosody Modeling Using Deep Gaussian Process Latent Variable Model}},
  booktitle={Proc. Interspeech 2019},