INTERSPEECH 2006 - ICSLP
The performance of voice conversion has been considerably improved through statistical modeling of spectral sequences. However, the converted speech still contains traces of artificial sounds. To alleviate this, it is necessary to statistically model a source sequence as well as a spectral sequence. In this paper, we introduce STRAIGHT mixed excitation to a framework of the voice conversion based on a Gaussian Mixture Model (GMM) on joint probability density of source and target features. We convert both spectral and source feature sequences based on Maximum Likelihood Estimation (MLE). Objective and subjective evaluation results demonstrate that the proposed source conversion produces strong improvements in both the converted speech quality and the conversion accuracy for speaker individuality.
Bibliographic reference. Ohtani, Yamato / Toda, Tomoki / Saruwatari, Hiroshi / Shikano, Kiyohiro (2006): "Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation", In INTERSPEECH-2006, paper 1681-Thu1BuP.5.