Non-Parallel Many-to-Many Voice Conversion with PSR-StarGAN

Yanping Li, Dongxiang Xu, Yan Zhang, Yang Wang, Binbin Chen

Voice Conversion (VC) aims to modify a source speaker’s speech so that it sounds like that of a target speaker while preserving the linguistic information of the given speech. StarGAN-VC, which was recently proposed, uses a variant of Generative Adversarial Networks (GAN) to perform non-parallel many-to-many VC. However, the quality of the generated speech is not yet satisfactory. This paper proposes an improved method, named “PSR-StarGAN-VC”, that incorporates three improvements. First, perceptual loss functions are introduced to optimize the generator in StarGAN-VC, with the aim of learning high-level spectral features. Second, because Switchable Normalization (SN) can learn different operations in different normalization layers of a model, it is introduced to replace Batch Normalization (BN) in StarGAN-VC. Finally, a Residual Network (ResNet) is applied to establish mappings between different layers of the encoder and decoder of the generator, with the aim of retaining more semantic features when converting speech and of reducing the difficulty of training. Experimental results on the VCC 2018 datasets demonstrate the superiority of the proposed method in terms of naturalness and speaker similarity.
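
To make the second improvement more concrete, the following is a minimal pure-Python sketch of the general idea behind Switchable Normalization: the mean and variance used for normalization are softmax-weighted blends of instance-, layer-, and batch-level statistics. It is not the authors' implementation. The function names, the (batch, channels, frames) tensor layout, and the omission of the learnable scale/shift and of running averages for inference are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Turn three importance logits into weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_var(values):
    """Population mean and variance of a flat list of floats."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var


def switchable_norm(x, mean_logits, var_logits, eps=1e-5):
    """Normalize x (nested lists, shape N x C x T) with blended statistics.

    mean_logits / var_logits are hypothetical learnable parameters, ordered
    as (instance, layer, batch). In a real network they would be trained
    separately for each normalization layer.
    """
    n_batch, n_chan = len(x), len(x[0])
    # Instance statistics: one (mean, var) per sample and channel.
    in_stats = [[mean_var(x[n][c]) for c in range(n_chan)]
                for n in range(n_batch)]
    # Layer statistics: one per sample, over all channels and frames.
    ln_stats = [mean_var([v for c in range(n_chan) for v in x[n][c]])
                for n in range(n_batch)]
    # Batch statistics: one per channel, over all samples and frames.
    bn_stats = [mean_var([v for n in range(n_batch) for v in x[n][c]])
                for c in range(n_chan)]

    w_mean = softmax(mean_logits)
    w_var = softmax(var_logits)

    out = []
    for n in range(n_batch):
        sample = []
        for c in range(n_chan):
            mu = (w_mean[0] * in_stats[n][c][0]
                  + w_mean[1] * ln_stats[n][0]
                  + w_mean[2] * bn_stats[c][0])
            var = (w_var[0] * in_stats[n][c][1]
                   + w_var[1] * ln_stats[n][1]
                   + w_var[2] * bn_stats[c][1])
            scale = math.sqrt(var + eps)
            sample.append([(v - mu) / scale for v in x[n][c]])
        out.append(sample)
    return out
```

With equal logits the three normalizers contribute equally. Pushing one logit far above the others makes the layer behave almost like that single normalizer, which is how each layer can settle on its own operation.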

DOI: 10.21437/Interspeech.2020-1310

Cite as: Li, Y., Xu, D., Zhang, Y., Wang, Y., Chen, B. (2020) Non-Parallel Many-to-Many Voice Conversion with PSR-StarGAN. Proc. Interspeech 2020, 781-785, DOI: 10.21437/Interspeech.2020-1310.

  author={Yanping Li and Dongxiang Xu and Yan Zhang and Yang Wang and Binbin Chen},
  title={{Non-Parallel Many-to-Many Voice Conversion with PSR-StarGAN}},
  booktitle={Proc. Interspeech 2020},