Using Speech Enhancement Preprocessing for Speech Emotion Recognition in Realistic Noisy Conditions

Hengshun Zhou, Jun Du, Yan-Hui Tu, Chin-Hui Lee


In this study, we investigate the effects of deep learning (DL)-based speech enhancement (SE) on speech emotion recognition (SER) in realistic environments. First, we use emotional speech data to train regression-based speech enhancement models, which are shown to be beneficial to noisy speech emotion recognition. Next, to improve the generalization capability of the regression model, an LSTM architecture whose hidden layers are designed via a simple densely connected progressive learning scheme is adopted for the enhancement model. Finally, a post-processor that uses an improved speech presence probability to estimate masks from the proposed LSTM structure is shown to further improve recognition accuracy. Experimental results on the IEMOCAP and CHEAVD 2.0 corpora demonstrate that the proposed framework yields consistent and significant improvements over systems using unprocessed noisy speech.
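To make the mask-based post-processing step concrete, here is a minimal sketch of how a time-frequency mask could be derived from an enhancement model's output and applied to noisy features before emotion recognition. It is not the authors' method: the abstract does not specify the mask formula, so a plain clipped power-ratio mask stands in for the improved speech presence probability estimate. The function names `ratio_mask` and `enhance`, and the use of log-power spectra as features, are assumptions made for illustration.

```python
import numpy as np

def ratio_mask(est_clean_lps, noisy_lps):
    """Hypothetical mask: estimated clean power over noisy power, clipped to [0, 1].

    Both inputs are log-power spectra shaped (frames, frequency_bins).
    This stands in for the speech-presence-probability mask in the abstract.
    """
    mask = np.exp(est_clean_lps - noisy_lps)
    return np.clip(mask, 0.0, 1.0)

def enhance(noisy_lps, est_clean_lps, floor=1e-8):
    """Apply the mask to the noisy spectrum in the power domain.

    log(mask * noisy_power) = log(mask) + noisy_lps; the floor avoids log(0).
    """
    mask = ratio_mask(est_clean_lps, noisy_lps)
    return noisy_lps + np.log(np.maximum(mask, floor))

# Toy example: one frame, two frequency bins.
noisy_lps = np.log(np.array([[4.0, 1.0]]))
est_clean_lps = np.log(np.array([[1.0, 2.0]]))  # assumed output of an enhancement model
mask = ratio_mask(est_clean_lps, noisy_lps)
enhanced_lps = enhance(noisy_lps, est_clean_lps)
```

In a pipeline of the kind the abstract describes, features such as `enhanced_lps` would then be passed to the emotion recognizer in place of the unprocessed noisy features.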


DOI: 10.21437/Interspeech.2020-2472

Cite as: Zhou, H., Du, J., Tu, Y., Lee, C. (2020) Using Speech Enhancement Preprocessing for Speech Emotion Recognition in Realistic Noisy Conditions. Proc. Interspeech 2020, 4098-4102, DOI: 10.21437/Interspeech.2020-2472.


@inproceedings{Zhou2020,
  author={Hengshun Zhou and Jun Du and Yan-Hui Tu and Chin-Hui Lee},
  title={{Using Speech Enhancement Preprocessing for Speech Emotion Recognition in Realistic Noisy Conditions}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4098--4102},
  doi={10.21437/Interspeech.2020-2472},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2472}
}