Towards Speech Robustness for Acoustic Scene Classification

Shuo Liu, Andreas Triantafyllopoulos, Zhao Ren, Björn W. Schuller

This work examines the impact of human voice on acoustic scene classification (ASC) systems. Typically, such systems are trained and evaluated on data sets that lack human speech. We show experimentally that the addition of speech can be detrimental to system performance. Furthermore, we propose two alternative solutions to mitigate that effect in the context of deep neural networks (DNNs). First, we utilise data augmentation to make the classifier robust against the presence of human speech in the data. Second, we introduce a voice-suppression algorithm that removes human speech from audio recordings, and test the DNN classifier on the resulting denoised samples. Experimental results show that both approaches reduce the negative effects of human voice in ASC systems. Compared to data augmentation, voice suppression achieved better classification accuracy and performed more stably across different speech intensities.
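The abstract does not state how speech is added to the scene recordings. As a rough illustration only, one common way to build such augmented samples is to overlay a speech clip on a scene clip at a chosen scene-to-speech power ratio. The sketch below assumes that approach; the function name `mix_speech_into_scene`, the `ratio_db` parameter, and the choice to loop short speech clips are hypothetical and not taken from the paper.

```python
import numpy as np


def mix_speech_into_scene(scene, speech, ratio_db, eps=1e-12):
    """Overlay speech on a scene at a given scene-to-speech power ratio (dB).

    Hypothetical helper for illustration; not the authors' procedure.
    Both inputs are assumed to be non-empty 1-D mono signals at the same
    sampling rate.
    """
    scene = np.asarray(scene, dtype=np.float64)
    speech = np.asarray(speech, dtype=np.float64)

    # Loop the speech clip if it is shorter than the scene, then trim.
    reps = int(np.ceil(len(scene) / len(speech)))
    speech = np.tile(speech, reps)[: len(scene)]

    # Mean power of each signal (eps guards against all-zero speech).
    scene_power = np.mean(scene ** 2)
    speech_power = np.mean(speech ** 2) + eps

    # Scale speech so that 10 * log10(scene_power / speech_power) == ratio_db.
    target_speech_power = scene_power / (10.0 ** (ratio_db / 10.0))
    gain = np.sqrt(target_speech_power / speech_power)

    return scene + gain * speech
```

Under this assumption, a lower `ratio_db` means louder speech relative to the scene, so sweeping it over several values would give test conditions with different speech intensities.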

DOI: 10.21437/Interspeech.2020-2365

Cite as: Liu, S., Triantafyllopoulos, A., Ren, Z., Schuller, B.W. (2020) Towards Speech Robustness for Acoustic Scene Classification. Proc. Interspeech 2020, 3087-3091, DOI: 10.21437/Interspeech.2020-2365.

  author={Shuo Liu and Andreas Triantafyllopoulos and Zhao Ren and Björn W. Schuller},
  title={{Towards Speech Robustness for Acoustic Scene Classification}},
  booktitle={Proc. Interspeech 2020},