SpeechMix — Augmenting Deep Sound Recognition Using Hidden Space Interpolations

Amit Jindal, Narayanan Elavathur Ranganatha, Aniket Didolkar, Arijit Ghosh Chowdhury, Di Jin, Ramit Sawhney, Rajiv Ratn Shah


This paper presents SpeechMix, a regularization and data augmentation technique for deep sound recognition. Our strategy is to create virtual training samples by interpolating speech samples in hidden space. Because the interpolation between speech samples is continuous, SpeechMix can in principle generate an unlimited number of new augmented speech samples, which helps downstream models substantially reduce overfitting. Unlike other mixing strategies that operate only on the input space, we apply our method at intermediate layers to capture a broader representation of the feature space. Through an extensive quantitative evaluation, we demonstrate the effectiveness of SpeechMix in comparison to standard learning regimes and previously applied mixing strategies. Furthermore, we use an ablation study to highlight how different hidden layers contribute to the improvements in classification.
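The hidden-space interpolation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the convex combination with a coefficient `lam`, the matching interpolation of one-hot labels, and the Beta(alpha, alpha) sampling of `lam` are assumptions borrowed from common mixup-style methods. The names `mix_hidden` and `sample_lambda` are hypothetical.

```python
import random


def mix_hidden(h_a, h_b, y_a, y_b, lam):
    """Convexly combine two hidden vectors and their one-hot labels.

    Assumption: labels are mixed with the same coefficient as the
    hidden representations, as in mixup-style methods.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(h_a) != len(h_b) or len(y_a) != len(y_b):
        raise ValueError("paired inputs must have the same length")
    h_mix = [lam * a + (1.0 - lam) * b for a, b in zip(h_a, h_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return h_mix, y_mix


def sample_lambda(alpha=1.0, rng=random):
    """Draw a mixing coefficient from Beta(alpha, alpha) (assumed prior)."""
    return rng.betavariate(alpha, alpha)


# Hidden activations of two samples taken at some intermediate layer
# (toy values), with one-hot labels for a two-class problem.
h_mix, y_mix = mix_hidden([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], 0.25)
```

In a real network, the mixed activation would be passed through the remaining layers and the loss computed against the mixed label. Since `lam` is drawn from a continuous distribution, each pair of samples can yield arbitrarily many distinct virtual samples.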


DOI: 10.21437/Interspeech.2020-3147

Cite as: Jindal, A., Ranganatha, N.E., Didolkar, A., Chowdhury, A.G., Jin, D., Sawhney, R., Shah, R.R. (2020) SpeechMix — Augmenting Deep Sound Recognition Using Hidden Space Interpolations. Proc. Interspeech 2020, 861-865, DOI: 10.21437/Interspeech.2020-3147.


@inproceedings{Jindal2020,
  author={Amit Jindal and Narayanan Elavathur Ranganatha and Aniket Didolkar and Arijit Ghosh Chowdhury and Di Jin and Ramit Sawhney and Rajiv Ratn Shah},
  title={{SpeechMix — Augmenting Deep Sound Recognition Using Hidden Space Interpolations}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={861--865},
  doi={10.21437/Interspeech.2020-3147},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3147}
}