Cross-lingual Speech Emotion Recognition through Factor Analysis

Brecht Desplanques, Kris Demuynck

Conventional speech emotion recognition, based on the extraction of high-level descriptors derived from low-level descriptors, seldom delivers promising results in cross-corpus experiments and might therefore not perform well in real-life applications. Factor analysis, proven in the fields of language identification and speaker verification, could clear a path towards more robust emotion recognition. This paper proposes an iVector-based approach that operates on acoustic MFCC features and models the speaker and emotion variabilities separately. The speech analysis extracts two fixed-length, low-dimensional feature vectors corresponding to these two sources of variation. To model the speaker-related nuisance variability, speaker factors are extracted using an eigenvoice matrix. After compensating for this speaker variability in the supervector space, the emotion factors (one per targeted emotion) are extracted using an emotion variability matrix. The emotion factors are then fed to a basic emotion classifier. Leave-one-speaker-out cross-validation on the Berlin Database of Emotional Speech EMO-DB (German) and the IEMOCAP (English) datasets leads to results that are competitive with the current state of the art. Cross-lingual experiments demonstrate the excellent robustness of the method: the classification accuracies degrade by less than 15% relative when emotion models are trained on one corpus and tested on the other.
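
The order of operations described in the abstract (speaker factors first, compensation in the supervector space, then emotion factors) can be illustrated with a toy sketch. This is not the paper's implementation: real iVector extraction works on Baum-Welch statistics collected against a universal background model and estimates the factors as posterior means. The sketch below instead assumes plain linear projections onto orthonormal basis columns, and every name and number in it (`extract_factors`, `speaker_basis`, `emotion_basis`, the 4-dimensional supervector) is hypothetical.

```python
# Toy illustration only: all names, dimensions and values are hypothetical.
# Assumption: basis columns are orthonormal, so factor extraction reduces to
# a dot product. The paper's actual estimator is not reproduced here.

def project(basis, vec):
    """Return one factor per basis column (basis is a list of columns)."""
    return [sum(b * v for b, v in zip(col, vec)) for col in basis]

def reconstruct(basis, factors):
    """Map factors back to supervector space: sum_k factors[k] * basis[k]."""
    dim = len(basis[0])
    return [sum(f * col[i] for f, col in zip(factors, basis)) for i in range(dim)]

def extract_factors(supervector, mean, speaker_basis, emotion_basis):
    # Center the utterance supervector on the global mean.
    centered = [s - m for s, m in zip(supervector, mean)]
    # Step 1: speaker factors from the (toy) eigenvoice basis.
    speaker_factors = project(speaker_basis, centered)
    # Step 2: remove the speaker contribution in supervector space.
    speaker_part = reconstruct(speaker_basis, speaker_factors)
    compensated = [c - p for c, p in zip(centered, speaker_part)]
    # Step 3: emotion factors from the (toy) emotion variability basis.
    emotion_factors = project(emotion_basis, compensated)
    return speaker_factors, emotion_factors

# Hypothetical 4-dimensional example: one speaker direction, two emotion directions.
mean = [0.0, 0.0, 0.0, 0.0]
speaker_basis = [[1.0, 0.0, 0.0, 0.0]]
emotion_basis = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
supervector = [2.0, 0.5, -1.0, 0.0]

speaker_factors, emotion_factors = extract_factors(
    supervector, mean, speaker_basis, emotion_basis
)
print(speaker_factors)  # [2.0]
print(emotion_factors)  # [0.5, -1.0]
```

In this toy setting the emotion factors are unaffected by the speaker component because it is subtracted before the second projection, which is the intuition behind compensating for speaker variability before extracting emotion factors.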

DOI: 10.21437/Interspeech.2018-1778

Cite as: Desplanques, B., Demuynck, K. (2018) Cross-lingual Speech Emotion Recognition through Factor Analysis. Proc. Interspeech 2018, 3648-3652, DOI: 10.21437/Interspeech.2018-1778.

  author={Brecht Desplanques and Kris Demuynck},
  title={Cross-lingual Speech Emotion Recognition through Factor Analysis},
  booktitle={Proc. Interspeech 2018},