ISCA Archive Interspeech 2013

Exemplar-based unit selection for voice conversion utilizing temporal information

Zhizheng Wu, Tuomas Virtanen, Tomi Kinnunen, Eng Siong Chng, Haizhou Li

Although the temporal information of speech has been shown to play an important role in perception, most voice conversion approaches assume that speech frames are independent of each other, thereby ignoring temporal information. In this study, we improve the conventional unit selection approach by using exemplars that span multiple frames as base units, and we also impose a temporal constraint on voice conversion by using overlapping frames to generate speech parameters. This approach provides a more stable concatenation cost and avoids the discontinuity problem of the conventional unit selection approach. The proposed method also avoids the over-smoothing problem of the mainstream joint density Gaussian mixture model (JD-GMM) based conversion method by directly using the target speaker's training data to synthesize the converted speech. Both objective and subjective evaluations indicate that the proposed method outperforms the JD-GMM and conventional unit selection methods.
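The two ideas named in the abstract, multi-frame exemplars as base units and overlapping frames for parameter generation, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes time-aligned parallel training features of equal length, selects each target exemplar by plain nearest-neighbour distance on the source side (no concatenation cost or search), and averages overlapping frames. The function names `make_exemplars` and `convert`, and the default `width=3`, are hypothetical choices for illustration.

```python
import numpy as np


def make_exemplars(frames, width):
    """Stack `width` consecutive frames into one exemplar per start index.

    frames: array of shape (num_frames, dim); returns (n, width, dim).
    """
    n = frames.shape[0] - width + 1
    return np.stack([frames[i:i + width] for i in range(n)])


def convert(source, src_train, tgt_train, width=3):
    """Convert `source` frames using paired multi-frame exemplars.

    Assumes src_train and tgt_train are already time-aligned (same length)
    and that `source` has at least `width` frames.
    """
    src_ex = make_exemplars(src_train, width)
    tgt_ex = make_exemplars(tgt_train, width)
    queries = make_exemplars(source, width)

    out = np.zeros_like(source, dtype=float)
    count = np.zeros(source.shape[0])
    for t, q in enumerate(queries):
        # Squared Euclidean distance over the whole multi-frame window.
        dist = ((src_ex - q) ** 2).sum(axis=(1, 2))
        best = int(np.argmin(dist))
        # Overlap-add: each output frame receives up to `width` candidates.
        out[t:t + width] += tgt_ex[best]
        count[t:t + width] += 1
    # Averaging the overlapping candidates smooths the joins between units.
    return out / count[:, None]
```

Because every output frame is an average of frames taken from the target speaker's training data, the sketch reflects the abstract's point about drawing directly on target data, while the overlap between neighbouring exemplars ties adjacent output frames together.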

doi: 10.21437/Interspeech.2013-667

Cite as: Wu, Z., Virtanen, T., Kinnunen, T., Chng, E.S., Li, H. (2013) Exemplar-based unit selection for voice conversion utilizing temporal information. Proc. Interspeech 2013, 3057-3061, doi: 10.21437/Interspeech.2013-667

  author={Zhizheng Wu and Tuomas Virtanen and Tomi Kinnunen and Eng Siong Chng and Haizhou Li},
  title={{Exemplar-based unit selection for voice conversion utilizing temporal information}},
  booktitle={Proc. Interspeech 2013},