Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform

Zhaojie Luo, Tetsuya Takiguchi, Yasuo Ariki


Artificial neural networks are among the most important models for learning feature mappings in voice conversion (VC) tasks. Typically, neural networks (NNs) are very effective at processing nonlinear features, such as mel-cepstral coefficients (MCCs), which represent the spectral features. However, a simple representation of the fundamental frequency (F0) is not enough for neural networks to deal with an emotional voice, because the F0 time sequence of an emotional voice changes drastically. Therefore, in this paper, we propose an effective method that uses the continuous wavelet transform (CWT) to decompose F0 into different temporal scales that can be well trained by NNs for prosody modeling in emotional voice conversion. Meanwhile, the proposed method uses deep belief networks (DBNs) to pretrain the NNs that convert spectral features. By combining these approaches, the proposed method can change the spectrum and the prosody of an emotional voice at the same time, and it outperformed other state-of-the-art methods for emotional voice conversion.
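
The idea of decomposing F0 into temporal scales with the CWT can be illustrated with a minimal sketch. The abstract does not give the wavelet, the number of scales, or the frame settings, so everything below is an assumption made for illustration: a Mexican hat mother wavelet, five dyadic scales measured in frames, and a synthetic log-F0 contour. The names `mexican_hat`, `cwt_decompose`, `scales`, and `components` are hypothetical and are not taken from the paper.

```python
import numpy as np


def mexican_hat(t):
    # Unnormalized Mexican hat (Ricker) mother wavelet; an assumed choice.
    return (1.0 - t**2) * np.exp(-0.5 * t**2)


def cwt_decompose(signal, scales):
    """Convolve `signal` with a scaled wavelet for each entry in `scales`.

    Returns an array of shape (len(scales), len(signal)); row i holds the
    component of the signal at temporal scale scales[i] (in frames).
    """
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Truncate the wavelet at +/- 5 scale units, where it is near zero.
        half = int(np.ceil(5 * s))
        t = np.arange(-half, half + 1) / s
        kernel = mexican_hat(t) / np.sqrt(s)
        # Edge-pad so the "valid" convolution returns exactly n samples.
        padded = np.pad(signal, half, mode="edge")
        out[i] = np.convolve(padded, kernel, mode="valid")
    return out


# Synthetic log-F0 contour (illustrative only): a slow phrase-level
# movement plus a faster syllable-level movement, 400 frames long.
frames = np.arange(400)
log_f0 = (
    np.log(200.0)
    + 0.20 * np.sin(2 * np.pi * frames / 200.0)  # phrase-level trend
    + 0.05 * np.sin(2 * np.pi * frames / 20.0)   # syllable-level detail
)

# Normalize to zero mean and unit variance before the transform.
log_f0_norm = (log_f0 - log_f0.mean()) / log_f0.std()

# Five dyadic scales in frames (assumed, not the paper's configuration).
scales = [2, 4, 8, 16, 32]
components = cwt_decompose(log_f0_norm, scales)
```

Each row of `components` could then serve as one prosody feature stream for a network to model, with small scales capturing fast F0 movements and large scales capturing slow ones. Real F0 contours also contain unvoiced frames, which would need to be interpolated before a transform like this is applied.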


DOI: 10.21437/SSW.2016-23

Cite as

Luo, Z., Takiguchi, T., Ariki, Y. (2016) Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform. Proc. 9th ISCA Speech Synthesis Workshop, 140-145.

Bibtex
@inproceedings{Luo+2016,
author={Zhaojie Luo and Tetsuya Takiguchi and Yasuo Ariki},
title={Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform},
year=2016,
booktitle={9th ISCA Speech Synthesis Workshop},
doi={10.21437/SSW.2016-23},
url={http://dx.doi.org/10.21437/SSW.2016-23},
pages={140--145}
}