A Voice Conversion Mapping Function Based on a Stacked Joint-Autoencoder

Seyed Hamidreza Mohammadi, Alexander Kain

In this study, we propose a novel method for training a regression function and apply it to a voice conversion task. The regression function is constructed using a Stacked Joint-Autoencoder (SJAE). Previously, we used a more primitive version of this architecture for pre-training a Deep Neural Network (DNN). Using objective evaluation criteria, we show that the lower levels of the SJAE perform best with a low degree of jointness, and the higher levels with a higher degree of jointness. We demonstrate that our proposed approach generates features that do not suffer from the averaging effect inherent in back-propagation training. We also carried out subjective listening experiments to evaluate speech quality and speaker similarity. Our results show that the SJAE approach yields both higher quality and higher similarity than an SJAE+DNN approach, in which the SJAE is used to pre-train a DNN and the fine-tuned DNN is then used for mapping. We also present the system description and results of our submission to the Voice Conversion Challenge 2016.
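To make the idea concrete, the following is a minimal, single-layer sketch of a joint-autoencoder-style mapping in the spirit described above, not the authors' actual SJAE: paired source and target feature frames are encoded into a shared latent space, a "jointness" weight penalizes the distance between the two codes, and conversion is performed by encoding source features and decoding with the target decoder. All shapes, weights, and hyperparameters are illustrative assumptions.

```python
# Illustrative linear joint autoencoder (assumed toy setup, not the paper's SJAE).
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 8, 4          # frames, feature dim, latent dim (assumed)
lam = 0.5                    # "degree of jointness" weight (assumed value)
lr, steps = 0.05, 300

# Synthetic paired source/target frames (stand-ins for aligned spectral features).
X = rng.standard_normal((n, d))
R = rng.standard_normal((d, d)) * 0.3
Y = X @ R + 0.05 * rng.standard_normal((n, d))

A = rng.standard_normal((d, k)) * 0.1   # source encoder
B = rng.standard_normal((k, d)) * 0.1   # source decoder
C = rng.standard_normal((d, k)) * 0.1   # target encoder
D = rng.standard_normal((k, d)) * 0.1   # target decoder

losses = []
for _ in range(steps):
    Zx, Zy = X @ A, Y @ C               # latent codes for each speaker
    Xr, Yr = Zx @ B, Zy @ D             # reconstructions
    # reconstruction losses plus the jointness penalty tying the codes together
    loss = (np.mean((Xr - X) ** 2) + np.mean((Yr - Y) ** 2)
            + lam * np.mean((Zx - Zy) ** 2))
    losses.append(loss)
    # gradients of the mean-squared objectives
    dXr = 2 * (Xr - X) / X.size
    dYr = 2 * (Yr - Y) / Y.size
    dJ = 2 * lam * (Zx - Zy) / Zx.size
    dZx = dXr @ B.T + dJ
    dZy = dYr @ D.T - dJ
    A -= lr * (X.T @ dZx); B -= lr * (Zx.T @ dXr)
    C -= lr * (Y.T @ dZy); D -= lr * (Zy.T @ dYr)

# Conversion: encode source features, decode with the target decoder.
Y_hat = (X @ A) @ D
```

Because the mapping is assembled from the two autoencoders rather than fine-tuned end-to-end with back-propagation on source-to-target pairs, this style of construction is what the abstract contrasts with the SJAE+DNN approach.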

DOI: 10.21437/Interspeech.2016-1437

Cite as

Mohammadi, S.H., Kain, A. (2016) A Voice Conversion Mapping Function Based on a Stacked Joint-Autoencoder. Proc. Interspeech 2016, 1647-1651.

@inproceedings{mohammadi2016voice,
  author={Seyed Hamidreza Mohammadi and Alexander Kain},
  title={A Voice Conversion Mapping Function Based on a Stacked Joint-Autoencoder},
  booktitle={Interspeech 2016},
  year={2016},
  pages={1647--1651},
  doi={10.21437/Interspeech.2016-1437}
}