Nonparallel Emotional Speech Conversion Using VAE-GAN

Yuexin Cao, Zhengchen Liu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao


This paper proposes a nonparallel emotional speech conversion (ESC) method based on a Variational AutoEncoder-Generative Adversarial Network (VAE-GAN). Emotional speech conversion aims to transform speech from a source emotion to a target emotion without changing the speaker's identity or linguistic content. In this work, an encoder is trained to extract content-related representations from acoustic features, while emotion-related representations are extracted in a supervised manner. The transformation between emotion-related representations from different domains is then learned with an improved cycle-consistent Generative Adversarial Network (CycleGAN). Finally, emotion conversion is performed by extracting the content-related representations of the source speech and recombining them with the emotion-related representations of the target emotion. Subjective evaluation experiments show that the proposed method outperforms the baseline in both voice quality and emotion conversion ability.
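The conversion pipeline described in the abstract can be sketched as toy code. Everything below is a hypothetical illustration, not the paper's implementation: the dimensions are arbitrary and random linear maps stand in for the trained encoder, emotion classifier, CycleGAN generator, and decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dimensions (assumptions, not from the paper).
FEAT_DIM, CONTENT_DIM, EMO_DIM = 24, 8, 4

# Random linear maps standing in for the trained networks.
W_enc = rng.standard_normal((CONTENT_DIM, FEAT_DIM))            # content encoder
W_emo = rng.standard_normal((EMO_DIM, FEAT_DIM))                # supervised emotion encoder
W_cyc = rng.standard_normal((EMO_DIM, EMO_DIM))                 # CycleGAN generator: source -> target emotion domain
W_dec = rng.standard_normal((FEAT_DIM, CONTENT_DIM + EMO_DIM))  # decoder

def convert(src_feats):
    """Convert one utterance (T x FEAT_DIM frames) to the target emotion."""
    # 1. Extract frame-level content-related representations.
    content = src_feats @ W_enc.T
    # 2. Extract an utterance-level emotion-related representation.
    src_emo = src_feats.mean(axis=0) @ W_emo.T
    # 3. Map it to the target-emotion domain with the CycleGAN generator.
    tgt_emo = W_cyc @ src_emo
    # 4. Recombine content with the target-emotion code and decode.
    z = np.concatenate([content, np.tile(tgt_emo, (len(content), 1))], axis=1)
    return z @ W_dec.T

converted = convert(rng.standard_normal((50, FEAT_DIM)))
print(converted.shape)  # (50, 24): same frame count, same feature dimension
```

The key structural point the sketch captures is that the CycleGAN operates only on the emotion-related code, leaving the content-related representation of the source speech untouched.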


DOI: 10.21437/Interspeech.2020-1647

Cite as: Cao, Y., Liu, Z., Chen, M., Ma, J., Wang, S., Xiao, J. (2020) Nonparallel Emotional Speech Conversion Using VAE-GAN. Proc. Interspeech 2020, 3406-3410, DOI: 10.21437/Interspeech.2020-1647.


@inproceedings{Cao2020,
  author={Yuexin Cao and Zhengchen Liu and Minchuan Chen and Jun Ma and Shaojun Wang and Jing Xiao},
  title={{Nonparallel Emotional Speech Conversion Using VAE-GAN}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3406--3410},
  doi={10.21437/Interspeech.2020-1647},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1647}
}