Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks

Krishna D. N., Ankita Patil


In this work, we propose a new approach to multimodal emotion recognition using cross-modal attention and raw-waveform-based convolutional neural networks. Our approach uses audio and text information to predict the emotion label. We use an audio encoder to process the raw audio waveform and extract high-level acoustic features, and a text encoder to extract high-level semantic information from the text. We use cross-modal attention, where the features from the audio encoder attend to the features from the text encoder and vice versa. This helps develop interactions between the speech and text sequences and extract the most relevant features for emotion recognition. Our experiments show that the proposed approach obtains state-of-the-art results on the IEMOCAP dataset [1]. We obtain a 1.9% absolute improvement in accuracy over the previous state-of-the-art method [2]. Our proposed approach uses a 1D convolutional neural network to process the raw waveform instead of spectrogram features. Our experiments also show that processing the raw waveform gives a 0.54% improvement over a spectrogram-based model.
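The abstract describes the architecture only at a high level; the following is a minimal PyTorch sketch of that design, not the authors' implementation. The use of nn.MultiheadAttention, the BiLSTM text encoder, and all layer sizes and names (CrossModalEmotionNet, d_model, kernel widths) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalEmotionNet(nn.Module):
    """Sketch: 1D-CNN audio encoder over the raw waveform, a text encoder
    over token embeddings, and cross-modal attention in both directions.
    All hyperparameters are assumed, not taken from the paper."""

    def __init__(self, vocab_size, d_model=128, n_classes=4):
        super().__init__()
        # Audio encoder: stacked 1D convolutions applied to raw samples.
        self.audio_enc = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=80, stride=16), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2), nn.ReLU(),
        )
        # Text encoder: embedding + BiLSTM (an assumed choice).
        self.embed = nn.Embedding(vocab_size, d_model)
        self.text_enc = nn.LSTM(d_model, d_model // 2, batch_first=True,
                                bidirectional=True)
        # Cross-modal attention: audio queries attend to text, and vice versa.
        self.a2t = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.t2a = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, wave, tokens):
        # wave: (B, num_samples) raw waveform; tokens: (B, T) word ids.
        a = self.audio_enc(wave.unsqueeze(1)).transpose(1, 2)  # (B, Ta, d)
        t, _ = self.text_enc(self.embed(tokens))               # (B, Tt, d)
        a_att, _ = self.a2t(a, t, t)   # audio features attend to text
        t_att, _ = self.t2a(t, a, a)   # text features attend to audio
        # Mean-pool each attended sequence and classify the concatenation.
        pooled = torch.cat([a_att.mean(1), t_att.mean(1)], dim=-1)
        return self.classifier(pooled)

model = CrossModalEmotionNet(vocab_size=10000)
logits = model(torch.randn(2, 16000), torch.randint(0, 10000, (2, 20)))
print(logits.shape)  # torch.Size([2, 4])
```

The bidirectional attention calls are the core of the idea: each modality's sequence serves as queries against the other modality's keys and values, so the pooled representation reflects which speech frames and which words are mutually relevant before classification.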


 DOI: 10.21437/Interspeech.2020-1190

Cite as: N., K.D., Patil, A. (2020) Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks. Proc. Interspeech 2020, 4243-4247, DOI: 10.21437/Interspeech.2020-1190.


@inproceedings{N.2020,
  author={Krishna D. N. and Ankita Patil},
  title={{Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4243--4247},
  doi={10.21437/Interspeech.2020-1190},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1190}
}