Group Gated Fusion on Attention-Based Bidirectional Alignment for Multimodal Emotion Recognition

Pengfei Liu, Kun Li, Helen Meng


Emotion recognition is a challenging and actively studied research area that plays a critical role in emotion-aware human-computer interaction systems. In a multimodal setting, temporal alignment between different modalities has not yet been well investigated. This paper presents a new model named the Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states to explicitly capture the alignment relationship between speech and text, and a novel group gated fusion (GGF) layer to integrate the representations of different modalities. We empirically show that the attention-aligned representations significantly outperform the last hidden states of the LSTM, and that the proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
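To make the two key ideas in the abstract concrete, the following is a minimal NumPy sketch of (a) attention-based bidirectional alignment between speech and text hidden-state sequences and (b) a gated fusion of the resulting modality representations. This is an illustrative toy, not the paper's exact architecture: the dot-product attention, the mean pooling, and the sigmoid gating parameters (`W`, `b`) are all assumptions for the sake of the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align(query_states, key_states):
    """Attend from one modality to the other: each query step gets a
    convex combination of the other modality's hidden states."""
    scores = query_states @ key_states.T          # (Tq, Tk) similarity scores
    weights = softmax(scores, axis=-1)            # rows sum to 1
    return weights @ key_states                   # (Tq, d) aligned states

def gated_fusion(reps, W, b):
    """Weight each modality representation with a sigmoid gate computed
    from the concatenation of all modality representations."""
    concat = np.concatenate(reps)
    gates = 1.0 / (1.0 + np.exp(-(W @ concat + b)))   # one scalar gate per modality
    return sum(g * r for g, r in zip(gates, reps))

rng = np.random.default_rng(0)
d = 8
speech = rng.standard_normal((5, d))   # stand-in for speech LSTM hidden states
text = rng.standard_normal((7, d))     # stand-in for text LSTM hidden states

# Bidirectional alignment: speech attends to text, and text to speech,
# then each aligned sequence is mean-pooled into a fixed-size vector.
speech_aligned = align(speech, text).mean(axis=0)
text_aligned = align(text, speech).mean(axis=0)

# Hypothetical gating parameters: 2 gates from the 2*d concatenated vector.
W = rng.standard_normal((2, 2 * d))
b = np.zeros(2)
fused = gated_fusion([speech_aligned, text_aligned], W, b)
print(fused.shape)  # fused multimodal representation of dimension d
```

In the actual model the attention scores and gates are learned jointly with the LSTM encoders; this sketch only shows the data flow, where gating lets the model weight each modality's contribution per utterance.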


DOI: 10.21437/Interspeech.2020-2067

Cite as: Liu, P., Li, K., Meng, H. (2020) Group Gated Fusion on Attention-Based Bidirectional Alignment for Multimodal Emotion Recognition. Proc. Interspeech 2020, 379-383, DOI: 10.21437/Interspeech.2020-2067.


@inproceedings{Liu2020,
  author={Pengfei Liu and Kun Li and Helen Meng},
  title={{Group Gated Fusion on Attention-Based Bidirectional Alignment for Multimodal Emotion Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={379--383},
  doi={10.21437/Interspeech.2020-2067},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2067}
}