Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network

Jeng-Lin Li, Chi-Chun Lee


Integrating multimodal emotion sensing modules into human-centered technologies is a rapidly growing effort. Despite recent advances in deep architectures that improve recognition performance, the inability to handle individual differences in expressive cues remains a major hurdle for real-world applications. In this work, we propose a Speaker-aligned Graph Memory Network (SaGMN) that leverages speaker embeddings learned from a large speaker verification network to characterize such individualized differences across speakers. Specifically, the learning of the gated memory block is jointly optimized with a speaker graph encoder, which aligns samples with similar vocal characteristics while effectively enlarging the discrimination across emotion classes. We evaluate our multimodal emotion recognition network on the CMU-MOSEI database and achieve state-of-the-art performance of 65.1% UAR and a 74.7% F1 score. Further visualization experiments demonstrate the effect of speaker space alignment with the use of graph memory blocks.
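
To make the high-level description above more concrete, the following is a minimal, heavily simplified sketch of how a speaker-aligned graph encoder and a gated memory block could be wired together in PyTorch. It is not the authors' implementation: the module names (SpeakerGraphEncoder, GatedMemoryBlock, ToySaGMN), the feature dimensions, the cosine-similarity graph construction, and the gating scheme are all assumptions made for illustration only.

# Illustrative sketch only -- module names, dimensions, and the
# cosine-similarity graph are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerGraphEncoder(nn.Module):
    """One graph-convolution step over a batch graph whose edge weights
    reflect speaker-embedding similarity (hypothetical design)."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats, spk_emb):
        # Non-negative cosine-similarity adjacency between speaker embeddings.
        spk = F.normalize(spk_emb, dim=-1)
        adj = torch.relu(spk @ spk.t())                      # (B, B)
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp_min(1e-8)
        # Aggregate multimodal features from speaker-similar neighbours.
        return torch.relu(self.proj(adj @ feats))


class GatedMemoryBlock(nn.Module):
    """Gate that blends the original multimodal feature with its
    speaker-aligned counterpart (a common gating pattern, assumed here)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, feats, aligned):
        g = torch.sigmoid(self.gate(torch.cat([feats, aligned], dim=-1)))
        return g * aligned + (1.0 - g) * feats


class ToySaGMN(nn.Module):
    """Speaker-aligned fusion followed by an emotion classifier (sketch)."""

    def __init__(self, feat_dim=128, spk_dim=256, n_classes=6):
        super().__init__()
        self.spk_proj = nn.Linear(spk_dim, feat_dim)
        self.graph = SpeakerGraphEncoder(feat_dim)
        self.memory = GatedMemoryBlock(feat_dim)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, feats, spk_emb):
        aligned = self.graph(feats, self.spk_proj(spk_emb))
        fused = self.memory(feats, aligned)
        return self.classifier(fused)


if __name__ == "__main__":
    # Dummy batch: 8 utterances with fused multimodal features and
    # speaker embeddings taken from an external speaker-verification model.
    model = ToySaGMN()
    logits = model(torch.randn(8, 128), torch.randn(8, 256))
    print(logits.shape)  # torch.Size([8, 6])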


 DOI: 10.21437/Interspeech.2020-1688

Cite as: Li, J.-L., Lee, C.-C. (2020) Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network. Proc. Interspeech 2020, 389-393, DOI: 10.21437/Interspeech.2020-1688.


@inproceedings{Li2020,
  author={Jeng-Lin Li and Chi-Chun Lee},
  title={{Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={389--393},
  doi={10.21437/Interspeech.2020-1688},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1688}
}