Speech Emotion Recognition ‘in the Wild’ Using an Autoencoder

Vipula Dissanayake, Haimo Zhang, Mark Billinghurst, Suranga Nanayakkara

Speech Emotion Recognition (SER) is a challenging task that researchers have worked on for decades. Recently, Deep Learning (DL) based approaches have been shown to perform well in SER tasks; however, their superior performance has been observed to be limited to the distribution of the data used to train the model. In this paper, we present an analysis of using autoencoders to improve the generalisability of DL based SER solutions. We train a sparse autoencoder on a large speech corpus extracted from social media. The trained encoder part of the autoencoder is then reused as the input to a long short-term memory (LSTM) network, and the encoder-LSTM model is re-trained on an aggregation of five commonly used speech emotion corpora. Our evaluation uses a corpus that is unseen during the training and validation stages to simulate an ‘in the wild’ condition and to analyse the generalisability of our solution. A performance comparison is carried out between the encoder based model and a model trained without an encoder. Our results show that the autoencoder based model improves the unweighted accuracy on the unseen corpus by 8%, indicating that autoencoder based pre-training can improve the generalisability of DL based SER solutions.
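The abstract reports its result as unweighted accuracy but does not define the metric. As a minimal sketch, assuming the convention common in SER work (unweighted accuracy as the mean of per-class recalls, so each emotion class counts equally regardless of how many samples it has), the metric could be computed as follows. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every emotion class counts equally.

    Assumption: this follows the usual SER convention; the abstract
    itself does not spell out the definition.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy: fraction of all samples predicted correctly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, imbalanced toy labels (not data from the paper):
# a model that always predicts "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

ua = unweighted_accuracy(y_true, y_pred)  # (1.0 + 0.0) / 2
wa = weighted_accuracy(y_true, y_pred)    # 8 / 10
```

On this toy input the always-"neutral" predictor reaches a plain accuracy of 0.8 but an unweighted accuracy of only 0.5, which illustrates why a class-balanced metric is a sensible choice when the unseen corpus has an uneven emotion distribution.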

DOI: 10.21437/Interspeech.2020-1356

Cite as: Dissanayake, V., Zhang, H., Billinghurst, M., Nanayakkara, S. (2020) Speech Emotion Recognition ‘in the Wild’ Using an Autoencoder. Proc. Interspeech 2020, 526-530, DOI: 10.21437/Interspeech.2020-1356.

  author={Vipula Dissanayake and Haimo Zhang and Mark Billinghurst and Suranga Nanayakkara},
  title={{Speech Emotion Recognition ‘in the Wild’ Using an Autoencoder}},
  booktitle={Proc. Interspeech 2020},