Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition

Shuiyang Mao, P.C. Ching, C.-C. Jay Kuo, Tan Lee


Categorical speech emotion recognition is typically performed as a sequence-to-label problem, i.e., the task is to determine the discrete emotion label of the input utterance as a whole. One of the main challenges in practice is that most existing emotion corpora do not provide ground-truth labels for each segment; labels are available only for whole utterances. To extract segment-level emotional information from such weakly labeled corpora, we propose using multiple instance learning (MIL) to learn segment embeddings in a weakly supervised manner. Moreover, in a sufficiently long utterance, not all segments contain relevant emotional information. We therefore apply three attention-based neural network models to the learned segment embeddings to attend to the most salient parts of a speech utterance. Experiments on the CASIA corpus and the IEMOCAP database show results that are better than or highly competitive with other state-of-the-art approaches.
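
As a rough illustration of attending over segment embeddings, the sketch below pools a bag of segment vectors into one utterance-level vector using softmax attention weights. It is a generic attention-pooling example, not one of the three models evaluated in the paper, whose architectures the abstract does not specify. The names `attention_pool`, `segments`, and `query` are hypothetical, and the query vector is fixed by hand here, whereas in a trained model it would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(segments, query):
    """Pool segment embeddings into one utterance embedding.

    segments: list of equal-length float lists (one per segment).
    query:    hypothetical scoring vector (assumed learned in practice).
    Returns (pooled_embedding, attention_weights).
    """
    # Score each segment: dot product of the query with tanh(embedding).
    scores = [
        sum(q * math.tanh(x) for q, x in zip(query, seg))
        for seg in segments
    ]
    weights = softmax(scores)
    dim = len(segments[0])
    # Weighted sum of segment embeddings, one dimension at a time.
    pooled = [
        sum(w * seg[d] for w, seg in zip(weights, segments))
        for d in range(dim)
    ]
    return pooled, weights


# Toy bag of three 2-dimensional segment embeddings (made-up values).
segments = [[0.1, 0.0], [2.0, 1.5], [0.2, -0.1]]
query = [1.0, 1.0]
pooled, weights = attention_pool(segments, query)
```

With these toy values the second segment receives the largest weight, so it dominates the pooled utterance embedding; a classifier trained only on utterance-level labels would then operate on `pooled`.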


DOI: 10.21437/Interspeech.2020-1779

Cite as: Mao, S., Ching, P., Kuo, C.J., Lee, T. (2020) Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition. Proc. Interspeech 2020, 2357-2361, DOI: 10.21437/Interspeech.2020-1779.


@inproceedings{Mao2020,
  author={Shuiyang Mao and P.C. Ching and C.-C. Jay Kuo and Tan Lee},
  title={{Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2357--2361},
  doi={10.21437/Interspeech.2020-1779},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1779}
}