Learning Utterance-Level Representations with Label Smoothing for Speech Emotion Recognition

Jian Huang, Jianhua Tao, Bin Liu, Zheng Lian


Emotion is a form of high-level paralinguistic information in speech. The most essential part of speech emotion recognition is generating robust utterance-level emotional feature representations. The commonly used approaches are pooling methods built on various models, which may lose detailed information needed for emotion classification. In this paper, we utilize NetVLAD as a trainable discriminative clustering layer to aggregate frame-level descriptors into a single utterance-level vector. In addition, to relieve the influence of imbalanced emotional classes, we apply unigram label smoothing, which incorporates the prior emotional class distribution, to regularize the model. Our experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database reveal that the proposed methods improve performance, outperforming comparison models by 3%.
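The two techniques named in the abstract can be sketched briefly. NetVLAD softly assigns each frame-level descriptor to K learnable clusters and aggregates the residuals against the cluster centers into one K×D utterance vector; unigram label smoothing mixes the one-hot target with the prior class distribution. The sketch below is a minimal NumPy illustration under assumed shapes and parameter names (`frames`, `centers`, `weights`, `biases`, `eps` are all illustrative), not the paper's actual implementation:

```python
import numpy as np

def netvlad(frames, centers, weights, biases):
    """Aggregate frame-level descriptors (T, D) into a single
    utterance-level vector of size K*D via soft assignment to
    K clusters (NetVLAD-style aggregation)."""
    # Soft cluster assignment: softmax over K clusters per frame.
    logits = frames @ weights.T + biases             # (T, K)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                # (T, K)
    # Assignment-weighted residuals between frames and cluster centers.
    resid = frames[:, None, :] - centers[None, :, :]  # (T, K, D)
    V = (a[:, :, None] * resid).sum(axis=0)           # (K, D)
    # Intra-normalize each cluster's residual, then L2-normalize the flat vector.
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def unigram_smooth(one_hot, prior, eps=0.1):
    """Unigram label smoothing: mix the one-hot target with the
    prior emotional class distribution (weight eps is assumed)."""
    return (1.0 - eps) * one_hot + eps * prior
```

For example, with four emotion classes and prior `[0.4, 0.3, 0.2, 0.1]`, a one-hot target `[1, 0, 0, 0]` becomes `[0.94, 0.03, 0.02, 0.01]` at `eps=0.1`, still a valid distribution but with mass spread according to class frequency.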


 DOI: 10.21437/Interspeech.2020-1391

Cite as: Huang, J., Tao, J., Liu, B., Lian, Z. (2020) Learning Utterance-Level Representations with Label Smoothing for Speech Emotion Recognition. Proc. Interspeech 2020, 4079-4083, DOI: 10.21437/Interspeech.2020-1391.


@inproceedings{Huang2020,
  author={Jian Huang and Jianhua Tao and Bin Liu and Zheng Lian},
  title={{Learning Utterance-Level Representations with Label Smoothing for Speech Emotion Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4079--4083},
  doi={10.21437/Interspeech.2020-1391},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1391}
}