Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

Ehab A. AlBadawy, Siwei Lyu


An impressionist is someone who mimics other people’s voices and their style of speech. Humans have mastered this task over the years. In this work, we introduce a deep learning-based approach to voice conversion with speech style transfer across different speakers. Our proposed model combines a Variational Auto-Encoder (VAE) and a Generative Adversarial Network (GAN) as its main components, followed by a WaveNet-based vocoder. We evaluate the model with three objective metrics: ASVspoof 2019, to measure how difficult it is to differentiate human from synthesized samples; content verification, to measure transcription accuracy; and speaker encoding, to verify identity. Our results show the efficacy of the proposed model in producing high-quality synthesized speech on the Flickr8k audio corpus.
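
The abstract names a VAE combined with a GAN but does not state the training objective. As a rough illustration only, the sketch below shows how a generator-side VAE-GAN loss is commonly assembled from a reconstruction term, a KL term, and an adversarial term. Everything specific here is an assumption and is not taken from the paper: the L1 reconstruction loss, the standard normal prior, the non-saturating adversarial loss, the function names, and the weights `w_rec`, `w_kl`, and `w_adv`.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def l1_reconstruction(target, output):
    """Mean absolute error between two flattened spectrogram frames."""
    return sum(abs(t - o) for t, o in zip(target, output)) / len(target)


def generator_adversarial_loss(d_fake):
    """Non-saturating GAN loss -log D(G(x)); d_fake is a score in (0, 1]."""
    return -math.log(d_fake)


def vae_gan_generator_loss(target, output, mu, log_var, d_fake,
                           w_rec=1.0, w_kl=1.0, w_adv=1.0):
    """Weighted sum of the three terms (weights are illustrative assumptions)."""
    return (w_rec * l1_reconstruction(target, output)
            + w_kl * kl_to_standard_normal(mu, log_var)
            + w_adv * generator_adversarial_loss(d_fake))
```

In a real system the encoder, decoder, and discriminator would be neural networks operating on spectrograms, and `d_fake` would be the discriminator's score for the converted output; here they are plain lists and floats so the arithmetic of each term stays visible.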


DOI: 10.21437/Interspeech.2020-3056

Cite as: AlBadawy, E.A., Lyu, S. (2020) Voice Conversion Using Speech-to-Speech Neuro-Style Transfer. Proc. Interspeech 2020, 4726-4730, DOI: 10.21437/Interspeech.2020-3056.


@inproceedings{AlBadawy2020,
  author={Ehab A. AlBadawy and Siwei Lyu},
  title={{Voice Conversion Using Speech-to-Speech Neuro-Style Transfer}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4726--4730},
  doi={10.21437/Interspeech.2020-3056},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3056}
}