Exploring Text and Audio Embeddings for Multi-Dimension Elderly Emotion Recognition

Mariana Julião, Alberto Abad, Helena Moniz

This paper investigates the use of audio and text embeddings for the classification of emotion dimensions within the scope of the Elderly Emotion Sub-Challenge of the INTERSPEECH 2020 Computational Paralinguistics Challenge. We explore speaker and time dependencies in the expression of emotions through the combination of well-known acoustic-prosodic features and speaker embeddings extracted at different time scales. We consider text information input through transformer language embeddings, both in isolation and in combination with acoustic features. The combination of acoustic and text information is explored in early and late fusion schemes. Overall, early fusion of systems trained on top of hand-crafted acoustic-prosodic features (eGeMAPS and ComParE), acoustic model feature embeddings (x-vectors), and text feature embeddings provides the best classification results in development for both Arousal and Valence. The combination of modalities allows us to reach a multi-dimension emotion classification performance on the challenge development set of up to 48.8% Unweighted Average Recall (UAR) for Arousal and 61.0% UAR for Valence. These results correspond to a 16.2% and an 8.7% relative UAR improvement, respectively.
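The challenge metric, Unweighted Average Recall (UAR), is the mean of the per-class recalls, so each class contributes equally regardless of how many samples it has. As a minimal sketch (the class labels below are illustrative, not taken from the challenge data), it can be computed as:

```python
# Unweighted Average Recall (UAR): the mean of per-class recalls,
# so minority classes weigh as much as majority classes.
from collections import defaultdict

def uar(y_true, y_pred):
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1          # count samples per true class
        if t == p:
            hits[t] += 1        # count correct predictions per class
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)

# Toy three-class example (e.g. Low/Medium/High arousal).
y_true = ["L", "L", "L", "M", "M", "H"]
y_pred = ["L", "L", "M", "M", "H", "H"]
print(round(uar(y_true, y_pred), 3))  # → 0.722 (mean of 2/3, 1/2, 1)
```

This is equivalent to macro-averaged recall (e.g. scikit-learn's `recall_score` with `average='macro'`), and it explains why UAR is preferred over plain accuracy on the class-imbalanced emotion labels of this task.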

DOI: 10.21437/Interspeech.2020-2290

Cite as: Julião, M., Abad, A., Moniz, H. (2020) Exploring Text and Audio Embeddings for Multi-Dimension Elderly Emotion Recognition. Proc. Interspeech 2020, 2067-2071, DOI: 10.21437/Interspeech.2020-2290.

@inproceedings{juliao20_interspeech,
  author={Mariana Julião and Alberto Abad and Helena Moniz},
  title={{Exploring Text and Audio Embeddings for Multi-Dimension Elderly Emotion Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2067--2071},
  doi={10.21437/Interspeech.2020-2290}
}