Non-Parallel Voice Conversion with Fewer Labeled Data by Conditional Generative Adversarial Networks

Minchuan Chen, Weijian Hou, Jun Ma, Shaojun Wang, Jing Xiao


Recent studies have shown remarkable success in voice conversion (VC) based on generative adversarial networks (GANs) without parallel data. In this paper, building on conditional generative adversarial networks (CGANs), we propose a self- and semi-supervised method, combined with mixup and data augmentation, that enables non-parallel many-to-many voice conversion with fewer labeled data. In this method, the discriminator of the CGAN learns not only to distinguish real from fake samples but also to classify attribute domains. We augment the discriminator with an auxiliary task to improve representation learning and introduce a training task that predicts labels for unlabeled samples. The proposed approach reduces the appetite for labeled data in voice conversion, enabling a single generative network to implement many-to-many mappings between different voice domains. Experimental results show that the proposed method achieves comparable voice quality and speaker similarity with only 10% of the labeled data.
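The mixup and pseudo-labeling ingredients mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the Beta-distribution parameter `alpha` and the confidence threshold `tau` are assumed values that the abstract does not specify.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Convexly combine two (feature, one-hot label) pairs, as in mixup
    (Zhang et al., 2018). `alpha` is an assumed hyperparameter; the
    paper's actual value is not stated in the abstract."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def pseudo_labels(probs, tau=0.9):
    """Keep only the unlabeled samples whose maximum predicted domain
    probability exceeds the (assumed) confidence threshold `tau`, and
    return their indices together with the predicted domain labels."""
    conf = probs.max(axis=1)
    keep = np.where(conf >= tau)[0]
    return keep, probs[keep].argmax(axis=1)
```

During semi-supervised training, `pseudo_labels` would run on the discriminator's domain-classification head over unlabeled utterances, and the resulting (sample, label) pairs, optionally mixed with labeled pairs via `mixup`, would be fed back as classification targets.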


DOI: 10.21437/Interspeech.2020-2162

Cite as: Chen, M., Hou, W., Ma, J., Wang, S., Xiao, J. (2020) Non-Parallel Voice Conversion with Fewer Labeled Data by Conditional Generative Adversarial Networks. Proc. Interspeech 2020, 4716-4720, DOI: 10.21437/Interspeech.2020-2162.


@inproceedings{Chen2020,
  author={Minchuan Chen and Weijian Hou and Jun Ma and Shaojun Wang and Jing Xiao},
  title={{Non-Parallel Voice Conversion with Fewer Labeled Data by Conditional Generative Adversarial Networks}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4716--4720},
  doi={10.21437/Interspeech.2020-2162},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2162}
}