Multi-Modal Attention for Speech Emotion Recognition

Zexu Pan, Zhaojie Luo, Jichen Yang, Haizhou Li


Emotion is an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as the multi-modal attention network (MMAN), which makes use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates attention across the three modalities and selectively fuses their information. cLSTM-MMA is then combined with uni-modal sub-networks in a late-fusion scheme. Experiments show that speech emotion recognition benefits significantly from visual and textual cues, and that the proposed cLSTM-MMA alone is as competitive as other fusion methods in terms of accuracy, while having a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on the IEMOCAP database for emotion recognition.
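The abstract describes attention across three modalities (speech, visual, text) that selectively fuses their information. The paper's actual cLSTM-MMA architecture is not specified here, but the core idea of cross-modal attention can be sketched as follows; all function names, dimensions, and the dot-product scoring are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_attention(feats):
    """Hypothetical sketch of cross-modal attention fusion.

    feats: dict mapping modality name -> (d,) feature vector.
    Each modality attends over all modalities (including itself),
    and the attended representations are concatenated.
    """
    X = np.stack(list(feats.values()))              # (3, d)
    # Scaled dot-product scores between every pair of modalities
    scores = X @ X.T / np.sqrt(X.shape[1])          # (3, 3)
    weights = softmax(scores, axis=-1)              # rows sum to 1
    attended = weights @ X                          # (3, d)
    # Selective fusion: concatenate attended modality vectors
    return attended.reshape(-1)                     # (3*d,)

rng = np.random.default_rng(0)
feats = {m: rng.standard_normal(8) for m in ("speech", "visual", "text")}
fused = multimodal_attention(feats)
print(fused.shape)  # (24,)
```

In the paper's hybrid (late-fusion) design, a fused vector like this would be combined with the predictions of per-modality sub-networks before the final emotion classification.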


DOI: 10.21437/Interspeech.2020-1653

Cite as: Pan, Z., Luo, Z., Yang, J., Li, H. (2020) Multi-Modal Attention for Speech Emotion Recognition. Proc. Interspeech 2020, 364-368, DOI: 10.21437/Interspeech.2020-1653.


@inproceedings{Pan2020,
  author={Zexu Pan and Zhaojie Luo and Jichen Yang and Haizhou Li},
  title={{Multi-Modal Attention for Speech Emotion Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={364--368},
  doi={10.21437/Interspeech.2020-1653},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1653}
}