Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection

Zhehuai Chen, Andrew Rosenberg, Yu Zhang, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno


Text-to-speech (TTS) based data augmentation is a relatively new mechanism for utilizing text-only data to improve automatic speech recognition (ASR) training without changing model parameters or the inference architecture. However, efforts to train speech recognition systems on synthesized utterances suffer from the limited acoustic diversity of TTS outputs. Additionally, the text-only corpus is typically several orders of magnitude larger than the transcribed speech corpus, which makes speech synthesis of all the text data impractical. In this work, we propose to combine generative adversarial networks (GANs) and multi-style training (MTR) to increase acoustic diversity in the synthesized data. We also present a contrastive language model-based data selection technique to improve the efficiency of learning from unspoken text. We demonstrate that our proposed method allows ASR models to learn from synthesis of large-scale unspoken text sources and achieves a 35% relative word error rate (WER) reduction on a voice-search task.
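The abstract does not spell out the contrastive selection criterion. A common instantiation of contrastive language-model-based data selection is Moore–Lewis-style scoring: rank each candidate sentence by the difference between its log-probability under an in-domain LM and under a background LM, then keep the top-scoring sentences. The sketch below is an illustrative assumption, not the paper's implementation; the unigram LMs, add-one smoothing, and length normalization are simplifications chosen to keep the example self-contained.

```python
import math
from collections import Counter

def train_unigram_lm(corpus):
    """Build an add-one-smoothed unigram LM from a list of sentences.

    Returns a function mapping a sentence to its log-probability.
    (A real system would use a stronger LM; unigrams keep this sketch small.)
    """
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    def logprob(sentence):
        return sum(math.log((counts[tok] + 1) / (total + vocab))
                   for tok in sentence.split())
    return logprob

def contrastive_select(candidates, in_domain, background, top_k):
    """Keep the top_k candidates by length-normalized contrastive score
    log P_in(x) - log P_bg(x), i.e. sentences that look in-domain but
    not like generic background text."""
    lp_in = train_unigram_lm(in_domain)
    lp_bg = train_unigram_lm(background)
    def score(sent):
        n = max(len(sent.split()), 1)
        return (lp_in(sent) - lp_bg(sent)) / n
    return sorted(candidates, key=score, reverse=True)[:top_k]
```

With a small in-domain voice-search-like corpus and a generic background corpus, `contrastive_select` prefers candidates whose token distribution resembles the in-domain data, so only the most in-domain-like unspoken text needs to be synthesized.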


DOI: 10.21437/Interspeech.2020-1475

Cite as: Chen, Z., Rosenberg, A., Zhang, Y., Wang, G., Ramabhadran, B., Moreno, P.J. (2020) Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection. Proc. Interspeech 2020, 556-560, DOI: 10.21437/Interspeech.2020-1475.


@inproceedings{Chen2020,
  author={Zhehuai Chen and Andrew Rosenberg and Yu Zhang and Gary Wang and Bhuvana Ramabhadran and Pedro J. Moreno},
  title={{Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={556--560},
  doi={10.21437/Interspeech.2020-1475},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1475}
}