Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition

Jack Parry, Dimitri Palaz, Georgia Clarke, Pauline Lecomte, Rebecca Mead, Michael Berger, Gregor Hofer

Speech Emotion Recognition (SER) is an important and challenging task for human-computer interaction. In the literature, deep learning architectures have been shown to yield state-of-the-art performance on this task when the model is trained and evaluated on the same corpus. However, prior work has indicated that such systems often yield poor performance on unseen data. One possible approach to improving the generalisation capabilities of emotion recognition systems is cross-corpus training, which consists of training the model on an aggregation of different corpora. In this paper, we present an analysis of the generalisation capability of deep learning models using cross-corpus training with six different speech emotion corpora. We evaluate the models on an unseen corpus and analyse the learned representations using the t-SNE algorithm, showing that architectures based on recurrent neural networks are prone to overfitting the corpora present in the training set, while architectures based on convolutional neural networks (CNNs) show better generalisation capabilities. These findings indicate that (1) cross-corpus training is a promising approach for improving generalisation and (2) CNNs should be the architecture of choice for this approach.
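
The evaluation protocol described in the abstract (train on an aggregation of corpora, test on a corpus the model has never seen) can be illustrated with a minimal sketch. This is not the authors' code: the function name `cross_corpus_split`, the `(utterance_id, emotion_label)` tuple format, and the corpus names `corpus_a`, `corpus_b`, `corpus_c` are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical utterance record: (utterance_id, emotion_label).
Utterance = Tuple[str, str]
# Pooled record that also remembers which corpus the utterance came from.
Tagged = Tuple[str, str, str]


def cross_corpus_split(
    corpora: Dict[str, List[Utterance]], held_out: str
) -> Tuple[List[Tagged], List[Tagged]]:
    """Pool every corpus except `held_out` for training; keep `held_out` for testing."""
    if held_out not in corpora:
        raise KeyError(f"unknown corpus: {held_out}")
    # Training set: aggregation of all remaining corpora.
    train = [
        (name, utt_id, label)
        for name, utterances in corpora.items()
        if name != held_out
        for utt_id, label in utterances
    ]
    # Test set: the unseen corpus only.
    test = [(held_out, utt_id, label) for utt_id, label in corpora[held_out]]
    return train, test


# Toy data (assumed for illustration; not taken from the paper).
toy_corpora = {
    "corpus_a": [("a1", "happy"), ("a2", "angry")],
    "corpus_b": [("b1", "neutral")],
    "corpus_c": [("c1", "sad")],
}

train_set, test_set = cross_corpus_split(toy_corpora, held_out="corpus_c")
print(len(train_set), "training utterances;", len(test_set), "test utterances")
```

Keeping the corpus name on each pooled record is a design choice of this sketch: it makes it possible to colour a later embedding plot (for example, a t-SNE projection) by corpus as well as by emotion, which is the kind of inspection the abstract uses to detect overfitting to the training corpora.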

DOI: 10.21437/Interspeech.2019-2753

Cite as: Parry, J., Palaz, D., Clarke, G., Lecomte, P., Mead, R., Berger, M., Hofer, G. (2019) Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition. Proc. Interspeech 2019, 1656-1660, DOI: 10.21437/Interspeech.2019-2753.

  author={Jack Parry and Dimitri Palaz and Georgia Clarke and Pauline Lecomte and Rebecca Mead and Michael Berger and Gregor Hofer},
  title={{Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition}},
  booktitle={Proc. Interspeech 2019},