StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes

Manish Sharma, Tom Kenter, Rob Clark

WaveNet has recently become a popular neural network for synthesizing speech audio. Autoregressive WaveNet can produce high-fidelity audio, but it is too slow for real-time synthesis. As a remedy, Parallel WaveNet was proposed, which produces audio faster than real time by distilling an autoregressive teacher into a feedforward student network. A shortcoming of this approach, however, is that producing high-quality student models requires a large amount of recorded speech data, which is not always available. In this paper, we propose StrawNet: a self-training approach to training a Parallel WaveNet. Self-training is performed using synthetic examples generated by the autoregressive WaveNet teacher. We show that, in low-data regimes, training the student model on high-fidelity synthetic data from an autoregressive teacher is superior to training it on the (far fewer) available examples of recorded speech. We compare StrawNet to a baseline Parallel WaveNet using both side-by-side tests and Mean Opinion Score evaluations. To our knowledge, synthetic speech has not previously been used to train neural text-to-speech models.
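
The data flow described in the abstract can be illustrated with a minimal sketch: a teacher synthesizes audio for texts that have no recording, and the resulting pairs form the training set for the student. Everything named here is hypothetical and not taken from the paper: `toy_teacher` is a trivial stand-in for an autoregressive WaveNet teacher, `SyntheticExample` and `synthesize_training_set` are illustrative names, and the distillation step that would train the student on this set is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class SyntheticExample:
    """A (text, audio) pair whose audio was generated rather than recorded."""
    text: str
    audio: List[float]


def toy_teacher(text: str) -> List[float]:
    # Hypothetical stand-in for the autoregressive teacher: it maps each
    # character to one value in [0, 1]. A real teacher would generate a
    # waveform conditioned on linguistic features.
    return [(ord(ch) % 256) / 255.0 for ch in text]


def synthesize_training_set(
    texts: Sequence[str],
    teacher: Callable[[str], List[float]],
) -> List[SyntheticExample]:
    # Run the teacher over texts with no recordings to obtain synthetic
    # examples that a student model could then be trained on.
    return [SyntheticExample(text=t, audio=teacher(t)) for t in texts]


# Assumption: a pool of unrecorded text is available in the target language.
unrecorded_texts = ["hello world", "low data regime"]
synthetic_set = synthesize_training_set(unrecorded_texts, toy_teacher)
print(len(synthetic_set))
```

The point of the sketch is only that the size of the student's training set is bounded by the amount of available text, not by the amount of recorded speech.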

DOI: 10.21437/Interspeech.2020-1437

Cite as: Sharma, M., Kenter, T., Clark, R. (2020) StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes. Proc. Interspeech 2020, 3550-3554, DOI: 10.21437/Interspeech.2020-1437.

  author={Manish Sharma and Tom Kenter and Rob Clark},
  title={{StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes}},
  booktitle={Proc. Interspeech 2020},