ARVC: An Auto-Regressive Voice Conversion System Without Parallel Training Data

Zheng Lian, Zhengqi Wen, Xinyong Zhou, Songbai Pu, Shengkai Zhang, Jianhua Tao


Voice conversion (VC) converts a source speaker’s voice so that it sounds like that of a target speaker without changing the linguistic content. Recent work shows that VC frameworks based on phonetic posteriorgrams (PPGs) achieve promising results in speaker similarity and speech quality. In practice, however, we find that the trajectories of some generated waveforms are not smooth, which causes voice errors and degrades the sound quality of the converted speech. In this paper, we propose to advance existing PPG-based voice conversion methods to achieve better performance. Specifically, we propose a new auto-regressive model for any-to-one VC, called Auto-Regressive Voice Conversion (ARVC). Unlike conventional PPG-based VC, ARVC takes the acoustic features of the previous step as inputs when producing the outputs of the next step, via its auto-regressive structure. Experimental results on the CMU-ARCTIC dataset show that our method can improve the speech quality and speaker similarity of the converted speech.
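
The auto-regressive idea described above can be illustrated with a minimal sketch. This is not the ARVC model itself: `toy_decoder_step`, `autoregressive_convert`, the blending weight, the all-zero initial frame, and the assumption that PPG frames and acoustic frames share one dimension are all hypothetical simplifications chosen only to show how the previous output is fed back into the next step.

```python
def toy_decoder_step(ppg_frame, prev_acoustic, weight=0.5):
    """Hypothetical stand-in for a learned decoder step.

    Blends the current PPG frame with the previously generated acoustic
    frame. A real system would use a trained neural network here.
    """
    return [weight * p + (1.0 - weight) * a
            for p, a in zip(ppg_frame, prev_acoustic)]


def autoregressive_convert(ppg_frames, step_fn=toy_decoder_step):
    """Generate acoustic frames one step at a time.

    Each step receives the current PPG frame plus the acoustic frame
    produced at the previous step (the auto-regressive feedback).
    """
    dim = len(ppg_frames[0])
    prev = [0.0] * dim  # assumed all-zero initial frame
    outputs = []
    for frame in ppg_frames:
        prev = step_fn(frame, prev)  # feed the previous output back in
        outputs.append(prev)
    return outputs


# Toy input: three 2-dimensional "PPG" frames with an abrupt change.
ppg = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
converted = autoregressive_convert(ppg)
```

In this toy setting the feedback makes each output depend on the whole history of inputs, so the abrupt change in the third frame is softened rather than reproduced directly. This is only meant to convey intuition about why conditioning on previous outputs can yield smoother trajectories.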


DOI: 10.21437/Interspeech.2020-1715

Cite as: Lian, Z., Wen, Z., Zhou, X., Pu, S., Zhang, S., Tao, J. (2020) ARVC: An Auto-Regressive Voice Conversion System Without Parallel Training Data. Proc. Interspeech 2020, 4706-4710, DOI: 10.21437/Interspeech.2020-1715.


@inproceedings{Lian2020,
  author={Zheng Lian and Zhengqi Wen and Xinyong Zhou and Songbai Pu and Shengkai Zhang and Jianhua Tao},
  title={{ARVC: An Auto-Regressive Voice Conversion System Without Parallel Training Data}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4706--4710},
  doi={10.21437/Interspeech.2020-1715},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1715}
}