Expediting TTS Synthesis with Adversarial Vocoding

Paarth Neekhara, Chris Donahue, Miller Puckette, Shlomo Dubnov, Julian McAuley

Recent approaches in text-to-speech (TTS) synthesis employ neural network strategies to vocode perceptually-informed spectrogram representations directly into listenable waveforms. Such vocoding procedures create a computational bottleneck in modern TTS pipelines. We propose an alternative approach which utilizes generative adversarial networks (GANs) to learn mappings from perceptually-informed spectrograms to simple magnitude spectrograms which can be heuristically vocoded. Through a user study, we show that our approach significantly outperforms naïve vocoding strategies while being hundreds of times faster than neural network vocoders used in state-of-the-art TTS systems. We also show that our method can be used to achieve state-of-the-art results in unsupervised synthesis of individual words of speech.
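The "heuristically vocoded" step refers to recovering a waveform from a magnitude spectrogram without a neural vocoder; the classic heuristic is the Griffin-Lim algorithm. Below is a minimal Griffin-Lim sketch using SciPy. This is not the paper's implementation: the GAN that maps mel spectrograms to magnitude spectrograms is omitted, and the demo signal, STFT parameters, and function names are illustrative assumptions only.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n, n_iter=32, nperseg=512, noverlap=384, seed=0):
    """Estimate a length-n waveform whose STFT magnitude approximates `mag`."""
    rng = np.random.default_rng(seed)
    # Start from random phase and refine by alternating iSTFT/STFT projections.
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    x = np.zeros(n)
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        x = x[:n]  # guard against padding-induced length drift
        _, _, S = stft(x, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(S))  # keep phase, discard magnitude
    return x

# Illustrative input: a 440 Hz sine at 16 kHz stands in for the magnitude
# spectrogram that the paper's GAN would predict.
sr, n = 16000, 8192
signal = np.sin(2 * np.pi * 440 * np.arange(n) / sr)
_, _, S = stft(signal, nperseg=512, noverlap=384)
mag = np.abs(S)
recon = griffin_lim(mag, n)
```

Griffin-Lim alone sounds noticeably artificial when applied directly to mel spectrograms; the paper's point is that first predicting a good magnitude spectrogram with a GAN makes this cheap heuristic inversion perceptually competitive.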

DOI: 10.21437/Interspeech.2019-3099

Cite as: Neekhara, P., Donahue, C., Puckette, M., Dubnov, S., McAuley, J. (2019) Expediting TTS Synthesis with Adversarial Vocoding. Proc. Interspeech 2019, 186-190, DOI: 10.21437/Interspeech.2019-3099.

@inproceedings{neekhara19_interspeech,
  author={Paarth Neekhara and Chris Donahue and Miller Puckette and Shlomo Dubnov and Julian McAuley},
  title={{Expediting TTS Synthesis with Adversarial Vocoding}},
  booktitle={Proc. Interspeech 2019},
  year={2019},
  pages={186--190},
  doi={10.21437/Interspeech.2019-3099}
}