Parallel vs. Non-Parallel Voice Conversion for Esophageal Speech

Luis Serrano, Sneha Raman, David Tavarez, Eva Navas, Inma Hernaez

State-of-the-art voice conversion systems have been shown to generate highly natural-sounding converted speech. Voice conversion techniques have also been applied to alaryngeal speech, with the aim of improving its quality or intelligibility. In this paper, we apply a voice conversion strategy based on phonetic posteriorgrams (PPGs), which produces very high-quality converted speech, to improve the characteristics of esophageal speech. The main advantage of this PPG-based architecture is that it can convert speech from any source without first training the system on a parallel corpus. However, our results show that the PPG approach considerably degrades the intelligibility of the converted speech, especially when the input speech is already poorly intelligible. Two systems are compared: an LSTM-based one-to-one conversion system, referred to as the baseline, and the new system using phonetic posteriorgrams. Both spectral parameters and f0 are converted using DNN (deep neural network) based architectures. Results from both objective and subjective evaluations are presented, showing that although ASR (automatic speech recognition) errors are reduced, subjects still prefer the original esophageal speech.

DOI: 10.21437/Interspeech.2019-2194

Cite as: Serrano, L., Raman, S., Tavarez, D., Navas, E., Hernaez, I. (2019) Parallel vs. Non-Parallel Voice Conversion for Esophageal Speech. Proc. Interspeech 2019, 4549-4553, DOI: 10.21437/Interspeech.2019-2194.

  author={Luis Serrano and Sneha Raman and David Tavarez and Eva Navas and Inma Hernaez},
  title={{Parallel vs. Non-Parallel Voice Conversion for Esophageal Speech}},
  booktitle={Proc. Interspeech 2019},