WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition

Guang Shen, Riwei Lai, Rui Chen, Yu Zhang, Kejia Zhang, Qilong Han, Hongtao Song


Despite its numerous real-world applications, speech emotion recognition remains a technically challenging problem. Effectively leveraging the multiple modalities inherent in speech data (e.g., audio and text) is key to accurate classification. Existing studies normally fuse multimodal features at the utterance level and largely neglect the dynamic interplay of features from different modalities at a fine-grained level over time. In this paper, we explicitly model dynamic interactions between audio and text at the word level via interaction units between two long short-term memory networks representing audio and text. We also devise a hierarchical representation of audio information at the frame, phoneme and word levels, which largely improves the expressiveness of the resulting audio features. We finally propose WISE, a novel word-level interaction-based multimodal fusion framework for speech emotion recognition, to accommodate the aforementioned components. We evaluate WISE on the public benchmark IEMOCAP corpus and demonstrate that it outperforms state-of-the-art methods.
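The abstract's central idea — two LSTMs, one per modality, exchanging information through interaction units at each word step — can be sketched as follows. This is not the paper's exact architecture; the dimensions, the class `WordLevelFusion`, and the concrete form of the interaction unit (a shared linear layer mixing the two hidden states) are illustrative assumptions, using PyTorch.

```python
import torch
import torch.nn as nn

class WordLevelFusion(nn.Module):
    """Sketch of word-level audio/text fusion with per-word interaction
    units between two LSTMs. Dimensions and the gating scheme are
    assumptions for illustration, not the paper's specification."""

    def __init__(self, audio_dim=64, text_dim=300, hidden=128, n_classes=4):
        super().__init__()
        self.audio_cell = nn.LSTMCell(audio_dim, hidden)
        self.text_cell = nn.LSTMCell(text_dim, hidden)
        # Interaction unit: mixes the two hidden states at every word step,
        # so each modality's recurrence sees the other's current state.
        self.interact = nn.Linear(2 * hidden, 2 * hidden)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio_words, text_words):
        # audio_words: (batch, n_words, audio_dim), word-aligned audio features
        # text_words:  (batch, n_words, text_dim), word embeddings
        B, T, _ = audio_words.shape
        H = self.audio_cell.hidden_size
        ha, ca = torch.zeros(B, H), torch.zeros(B, H)
        ht, ct = torch.zeros(B, H), torch.zeros(B, H)
        for t in range(T):
            ha, ca = self.audio_cell(audio_words[:, t], (ha, ca))
            ht, ct = self.text_cell(text_words[:, t], (ht, ct))
            # Word-level interaction: cross-modal exchange before the next step.
            mixed = torch.tanh(self.interact(torch.cat([ha, ht], dim=-1)))
            ha, ht = mixed.chunk(2, dim=-1)
        # Utterance-level emotion prediction from the final fused states.
        return self.classifier(torch.cat([ha, ht], dim=-1))

model = WordLevelFusion()
logits = model(torch.randn(2, 5, 64), torch.randn(2, 5, 300))
print(logits.shape)  # torch.Size([2, 4])
```

The key contrast with utterance-level fusion is the placement of the interaction inside the recurrence: the modalities influence each other at every word rather than only after each sequence has been fully encoded.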


DOI: 10.21437/Interspeech.2020-3131

Cite as: Shen, G., Lai, R., Chen, R., Zhang, Y., Zhang, K., Han, Q., Song, H. (2020) WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition. Proc. Interspeech 2020, 369-373, DOI: 10.21437/Interspeech.2020-3131.


@inproceedings{Shen2020,
  author={Guang Shen and Riwei Lai and Rui Chen and Yu Zhang and Kejia Zhang and Qilong Han and Hongtao Song},
  title={{WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={369--373},
  doi={10.21437/Interspeech.2020-3131},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3131}
}