Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition

Shaojin Ding, Guanlong Zhao, Ricardo Gutierrez-Osuna


Phonetic Posteriorgrams (PPGs) have received much attention for non-parallel many-to-many Voice Conversion (VC), and have been shown to achieve state-of-the-art performance. These methods implicitly assume that PPGs are speaker-independent and contain only the linguistic information in an utterance. In practice, however, PPGs carry speaker individuality cues, such as accent, intonation, and speaking rate. As a result, these cues can leak into the voice conversion, making the converted speech sound similar to the source speaker. To address this issue, we propose an adversarial learning approach that removes speaker-dependent information in VC models based on a PPG2speech synthesizer. During training, the encoder output of the PPG2speech synthesizer is fed to a classifier trained to identify the corresponding speaker, while the encoder is trained to fool the classifier. As a result, a more speaker-independent representation is learned. The proposed method is advantageous in that it does not require pre-training the speaker classifier: the adversarial speaker classifier is trained jointly with the PPG2speech synthesizer, end-to-end. We conduct objective and subjective experiments on the CSTR VCTK Corpus under standard and one-shot VC conditions. Results show that the proposed method significantly improves the speaker identity of VC syntheses when compared with a baseline system trained without adversarial learning.
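The adversarial setup described above (an encoder trained to fool a jointly-trained speaker classifier) is commonly implemented with a gradient reversal layer: the classifier's loss updates the classifier normally, but its gradient is negated before flowing into the encoder. The sketch below illustrates this standard pattern in PyTorch; the toy `encoder` and `classifier` modules, layer sizes, and the reversal-strength parameter `lam` are illustrative assumptions, not the authors' exact implementation.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales by `lam`)
    the gradient in the backward pass, so minimizing the speaker
    classification loss pushes the encoder to *confuse* the classifier."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient for x; no gradient for the scalar lam.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Toy stand-ins for the PPG2speech encoder and the adversarial
# speaker classifier (hypothetical sizes: 8-dim input, 10 speakers).
encoder = nn.Linear(8, 4)
classifier = nn.Linear(4, 10)

x = torch.randn(2, 8)                       # batch of 2 frames
h = encoder(x)                              # encoder representation
logits = classifier(grad_reverse(h))        # classifier sees reversed grads
loss = nn.functional.cross_entropy(logits, torch.tensor([3, 7]))
loss.backward()                             # one joint, end-to-end update
```

With this layer, a single optimizer step trains both networks at once: the classifier still descends its own loss, while the reversed gradient drives the encoder toward a speaker-independent representation, matching the end-to-end joint training described in the abstract.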


DOI: 10.21437/Interspeech.2020-1033

Cite as: Ding, S., Zhao, G., Gutierrez-Osuna, R. (2020) Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition. Proc. Interspeech 2020, 776-780, DOI: 10.21437/Interspeech.2020-1033.


@inproceedings{Ding2020,
  author={Shaojin Ding and Guanlong Zhao and Ricardo Gutierrez-Osuna},
  title={{Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={776--780},
  doi={10.21437/Interspeech.2020-1033},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1033}
}