Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts

Jilt Sebastian, Piero Pierucci

Human perception and understanding draw on different, complementary cues from several modalities, and the emotional states expressed in human communication reflect this variety of cues. Recent work in multi-modal emotion recognition uses deep-learning techniques to achieve remarkable performance, with models built on features suited to text, audio and vision. This work focuses on cross-modal fusion techniques over deep learning models for emotion detection from spoken audio and the corresponding transcripts.

We investigate the use of a long short-term memory (LSTM) recurrent neural network (RNN) with pre-trained word embeddings for text-based emotion recognition, and a convolutional neural network (CNN) with utterance-level descriptors for emotion recognition from speech. Various fusion strategies are applied to these models to yield an overall score for each emotional category. The intra-modality dynamics of each emotion are captured by the neural network designed for that modality, while the fusion techniques capture the inter-modality dynamics. Speaker- and session-independent experiments on the IEMOCAP multi-modal emotion detection dataset show the effectiveness of the proposed approaches. This method yields state-of-the-art results for utterance-level emotion recognition based on speech and text.
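The abstract does not say which fusion strategies are used, so the following is only a minimal sketch of one common option, score-level (late) fusion by weighted averaging, and not the authors' method. The label set `EMOTIONS`, the function names, and the `text_weight` parameter are all hypothetical and chosen for illustration; the per-modality logits stand in for the outputs of a text model and a speech model.

```python
import math

# Hypothetical label set for illustration; the abstract does not list the categories.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def softmax(logits):
    """Turn raw per-class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(text_probs, speech_probs, text_weight=0.5):
    """Weighted average of two per-class probability vectors.

    text_weight is an assumed tuning parameter in [0, 1]; the speech
    modality receives the remaining weight (1 - text_weight).
    """
    if len(text_probs) != len(speech_probs):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    return [
        text_weight * t + (1.0 - text_weight) * s
        for t, s in zip(text_probs, speech_probs)
    ]


def predict(text_logits, speech_logits, text_weight=0.5):
    """Fuse the two modalities and return (predicted label, fused scores)."""
    fused = late_fusion(softmax(text_logits), softmax(speech_logits), text_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best], fused
```

With this sketch, an utterance whose text logits favor one class and whose speech logits favor another is resolved by `text_weight`: a value near 1 follows the text model, a value near 0 follows the speech model. In practice such a weight would be tuned on held-out data.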

DOI: 10.21437/Interspeech.2019-3201

Cite as: Sebastian, J., Pierucci, P. (2019) Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts. Proc. Interspeech 2019, 51-55, DOI: 10.21437/Interspeech.2019-3201.

  author={Jilt Sebastian and Piero Pierucci},
  title={{Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts}},
  booktitle={Proc. Interspeech 2019},