WG-WaveNet: Real-Time High-Fidelity Speech Synthesis Without GPU

Po-chun Hsu, Hung-yi Lee


In this paper, we propose WG-WaveNet, a fast, lightweight, and high-quality waveform generation model. WG-WaveNet is composed of a compact flow-based model and a post-filter. The two components are jointly trained by maximizing the likelihood of the training data and optimizing loss functions in the frequency domain. Because the flow-based model is heavily compressed, the proposed model requires far fewer computational resources than other waveform generation models during both training and inference; although the model is highly compressed, the post-filter maintains the quality of the generated waveform. Our PyTorch implementation can be trained using less than 8 GB of GPU memory and generates audio samples at a rate of more than 960 kHz on an NVIDIA 1080Ti GPU. Furthermore, even when synthesizing on a CPU, we show that the proposed method is capable of generating 44.1 kHz speech waveforms 1.2 times faster than real time. Experiments also show that the quality of the generated audio is comparable to that of other methods. Audio samples are publicly available online.
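The abstract describes a joint objective: the negative log-likelihood of the compact flow-based model plus frequency-domain losses on the post-filter output. The sketch below (PyTorch, matching the paper's stated implementation framework) illustrates one plausible form of such an objective; the specific loss terms, STFT resolutions, weighting, and function names here are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def stft_magnitude(x, fft_size, hop_size, win_size):
    """Magnitude spectrogram used by the frequency-domain loss terms."""
    window = torch.hann_window(win_size, device=x.device)
    spec = torch.stft(x, fft_size, hop_size, win_size,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)


def joint_loss(flow_log_likelihood, y_post, y_target,
               resolutions=((1024, 256, 1024),
                            (2048, 512, 2048),
                            (512, 128, 512)),
               freq_weight=1.0):
    """Hypothetical joint objective: flow NLL plus a multi-resolution
    spectral loss between the post-filtered waveform and ground truth.
    The exact terms and weights in the paper may differ."""
    nll = -flow_log_likelihood.mean()
    freq = 0.0
    for fft_size, hop_size, win_size in resolutions:
        mag_hat = stft_magnitude(y_post, fft_size, hop_size, win_size)
        mag_ref = stft_magnitude(y_target, fft_size, hop_size, win_size)
        # Spectral convergence and log-magnitude distances, a common
        # choice for frequency-domain waveform losses (assumed here).
        sc = torch.norm(mag_ref - mag_hat, p="fro") / torch.norm(mag_ref, p="fro")
        log_mag = F.l1_loss(torch.log(mag_ref), torch.log(mag_hat))
        freq = freq + sc + log_mag
    return nll + freq_weight * freq
```

In this reading, the flow is trained for likelihood while the post-filter is supervised in the frequency domain, so both components can be optimized with a single backward pass per batch.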


 DOI: 10.21437/Interspeech.2020-1736

Cite as: Hsu, P., Lee, H. (2020) WG-WaveNet: Real-Time High-Fidelity Speech Synthesis Without GPU. Proc. Interspeech 2020, 210-214, DOI: 10.21437/Interspeech.2020-1736.


@inproceedings{Hsu2020,
  author={Po-chun Hsu and Hung-yi Lee},
  title={{WG-WaveNet: Real-Time High-Fidelity Speech Synthesis Without GPU}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={210--214},
  doi={10.21437/Interspeech.2020-1736},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1736}
}