Improved Zero-Shot Voice Conversion Using Explicit Conditioning Signals

Shahan Nercessian


In this paper, we propose a zero-shot voice conversion algorithm that adds several conditioning signals to explicitly transfer prosody, linguistic content, and dynamics to the conversion results. We show that the proposed approach improves overall conversion quality and generalization to out-of-domain samples relative to a baseline implementation of AutoVC, since including the conditioning signals can help reduce the burden on the model’s encoder to implicitly learn all of the different aspects of speech production. An ablation analysis illustrates the effectiveness of the proposed method.


DOI: 10.21437/Interspeech.2020-1889

Cite as: Nercessian, S. (2020) Improved Zero-Shot Voice Conversion Using Explicit Conditioning Signals. Proc. Interspeech 2020, 4711-4715, DOI: 10.21437/Interspeech.2020-1889.


@inproceedings{Nercessian2020,
  author={Shahan Nercessian},
  title={{Improved Zero-Shot Voice Conversion Using Explicit Conditioning Signals}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4711--4715},
  doi={10.21437/Interspeech.2020-1889},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1889}
}