Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed

Wei Song, Guanghui Xu, Zhengchen Zhang, Chao Zhang, Xiaodong He, Bowen Zhou

Neural vocoders, such as WaveGlow, have become an important component of recent high-quality text-to-speech (TTS) systems. In this paper, we propose Efficient WaveGlow (EWG), a flow-based generative model that serves as an efficient neural vocoder. Like WaveGlow, EWG has a normalizing-flow backbone in which each flow step consists of an affine coupling layer and an invertible 1×1 convolution. To reduce the number of model parameters and increase speed without sacrificing the quality of the synthesized speech, EWG improves on WaveGlow in three respects. First, the WaveNet-style transform network in WaveGlow is replaced with an FFTNet-style dilated convolution network. Second, to reduce computation cost, group convolution is applied to both the audio and the local condition features. Finally, the local condition is shared among the transform network layers in each coupling layer. As a result, EWG reduces both the number of floating-point operations (FLOPs) required to generate one second of audio and the number of model parameters by more than a factor of 12. Experimental results show that EWG reduces real-world inference time by more than a factor of two, without any obvious reduction in speech quality.
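To make two of the building blocks named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It shows (a) why an affine coupling layer is exactly invertible regardless of the transform network, and (b) how group convolution shrinks the weight count of a 1-D convolution. The function `toy_transform` is a hypothetical stand-in for the real transform network, and the channel sizes in the usage lines are illustrative assumptions, not values from the paper.

```python
import math

def toy_transform(x_a, cond):
    """Hypothetical stand-in for the transform network (assumption, not EWG's).
    Maps the untouched half x_a and a conditioning vector to (log_s, t)."""
    log_s = [0.1 * a + 0.05 * c for a, c in zip(x_a, cond)]
    t = [0.5 * a - 0.2 * c for a, c in zip(x_a, cond)]
    return log_s, t

def coupling_forward(x, cond):
    """Affine coupling: keep the first half, scale and shift the second half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = toy_transform(x_a, cond)
    y_b = [b * math.exp(s) + ti for b, s, ti in zip(x_b, log_s, t)]
    log_det = sum(log_s)  # log-determinant of the Jacobian of this layer
    return x_a + y_b, log_det

def coupling_inverse(y, cond):
    """Inverse pass: the first half is unchanged, so (log_s, t) can be recomputed."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = toy_transform(y_a, cond)
    x_b = [(b - ti) * math.exp(-s) for b, s, ti in zip(y_b, log_s, t)]
    return y_a + x_b

def conv1d_weight_count(c_in, c_out, kernel_size, groups=1):
    """Weights (without bias) of a 1-D convolution with the given group count."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kernel_size

# Round trip through the coupling layer with illustrative values.
x = [0.3, -1.2, 0.8, 2.0]
cond = [1.0, -0.5]
y, log_det = coupling_forward(x, cond)
x_rec = coupling_inverse(y, cond)
print(all(abs(a - b) < 1e-9 for a, b in zip(x, x_rec)))

# Illustrative channel sizes: grouping divides the weight count by `groups`.
print(conv1d_weight_count(256, 256, 3), conv1d_weight_count(256, 256, 3, groups=4))
```

Because the first half of the input passes through unchanged, the inverse can recompute the same scale and shift, which is why the transform network itself never needs to be inverted and can be swapped for a cheaper design. The weight-count helper only illustrates the general effect of grouping; the overall reduction reported in the abstract comes from the combination of all three changes.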

DOI: 10.21437/Interspeech.2020-2172

Cite as: Song, W., Xu, G., Zhang, Z., Zhang, C., He, X., Zhou, B. (2020) Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed. Proc. Interspeech 2020, 225-229, DOI: 10.21437/Interspeech.2020-2172.

  author={Wei Song and Guanghui Xu and Zhengchen Zhang and Chao Zhang and Xiaodong He and Bowen Zhou},
  title={{Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed}},
  booktitle={Proc. Interspeech 2020},