SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition

Xingchen Song, Zhiyong Wu, Yiheng Huang, Dan Su, Helen Meng


Recently, end-to-end (E2E) models have achieved state-of-the-art performance in automatic speech recognition (ASR). In these large, deep models, overfitting remains an important problem that heavily influences model performance. One way to address overfitting is to increase the quantity and variety of the training data through data augmentation. In this paper, we present SpecSwap, a simple data augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances. The augmentation policy consists of swapping blocks of frequency channels and swapping blocks of time steps. We apply SpecSwap to Transformer-based networks for the end-to-end speech recognition task. Our experiments on Aishell-1 show state-of-the-art performance for E2E models that are trained solely on the speech training data. Further, when the depth of the model is increased, the Transformers trained with augmentations can outperform certain hybrid systems, even without the aid of a language model.
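
The augmentation policy described above (swapping blocks of frequency channels and blocks of time steps) could be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names `swap_blocks` and `spec_swap`, the parameters `max_freq_width` and `max_time_width`, the uniform sampling of block widths and positions, and the (time, frequency) array layout are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def swap_blocks(spec, axis, max_width, rng):
    """Swap two non-overlapping, equal-width blocks of `spec` along `axis`.

    Assumption: the width is drawn uniformly from [0, max_width] and the block
    positions are drawn uniformly; the abstract does not state the policy.
    """
    out = spec.copy()
    size = spec.shape[axis]
    # Cap the width so that two disjoint blocks always fit on this axis.
    width = int(rng.integers(0, min(max_width, size // 2) + 1))
    if width == 0:
        return out
    # First block start, then a second start at or after the end of the first.
    first = int(rng.integers(0, size - 2 * width + 1))
    second = int(rng.integers(first + width, size - width + 1))
    a = [slice(None)] * spec.ndim
    b = [slice(None)] * spec.ndim
    a[axis] = slice(first, first + width)
    b[axis] = slice(second, second + width)
    # Read from the untouched input so the two assignments do not interfere.
    out[tuple(a)] = spec[tuple(b)]
    out[tuple(b)] = spec[tuple(a)]
    return out


def spec_swap(spec, max_freq_width=7, max_time_width=40, seed=None):
    """Apply one frequency swap and one time swap to a (time, freq) spectrogram.

    The default widths are placeholders, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    out = swap_blocks(spec, axis=1, max_width=max_freq_width, rng=rng)
    out = swap_blocks(out, axis=0, max_width=max_time_width, rng=rng)
    return out


if __name__ == "__main__":
    # Toy input: 200 frames of 80-dimensional features (random stand-in data).
    features = np.random.default_rng(0).normal(size=(200, 80))
    augmented = spec_swap(features, seed=1)
    print(augmented.shape)
```

Because the sketch only rearranges existing values, the augmented spectrogram keeps the shape and the set of values of the input; only the order of some frequency channels and time steps changes.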


DOI: 10.21437/Interspeech.2020-2275

Cite as: Song, X., Wu, Z., Huang, Y., Su, D., Meng, H. (2020) SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition. Proc. Interspeech 2020, 581-585, DOI: 10.21437/Interspeech.2020-2275.


@inproceedings{Song2020,
  author={Xingchen Song and Zhiyong Wu and Yiheng Huang and Dan Su and Helen Meng},
  title={{SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={581--585},
  doi={10.21437/Interspeech.2020-2275},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2275}
}