A Lightweight Model Based on Separable Convolution for Speech Emotion Recognition

Ying Zhong, Ying Hu, Hao Huang, Wushour Silamu


One of the major challenges in Speech Emotion Recognition (SER) is building a lightweight model with limited training data. In this paper, we propose a lightweight architecture, based on separable convolution and inverted residuals, with far fewer parameters. Speech samples are often annotated by multiple raters: sentences with clear emotional content are annotated consistently (easy samples), while sentences with ambiguous emotional content show substantial disagreement between individual evaluations (hard samples). We assume that samples which are hard for humans are also hard for computers, and address this with focal loss, which focuses learning on hard samples and down-weights easy ones. By incorporating an attention mechanism, our network further emphasizes emotion-salient information. The proposed model achieves 71.72% and 90.1% unweighted accuracy (UA) on the well-known IEMOCAP and Emo-DB corpora, respectively. The smallest previously reported model we are aware of is almost 5 times the size of our proposed model.
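The parameter savings of separable convolution, which the abstract credits for the small model size, come from factoring a standard convolution into a depthwise and a pointwise step. A minimal sketch of the parameter-count arithmetic (the layer sizes below are illustrative assumptions, not values from the paper):

```python
def conv_params(k, c_in, c_out):
    # Standard convolution: one k x k filter per (input, output) channel pair.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depthwise step: one k x k filter per input channel,
    # followed by a 1 x 1 pointwise convolution that mixes channels.
    return k * k * c_in + c_in * c_out

# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = conv_params(3, 64, 128)            # 73728 parameters
separable = separable_conv_params(3, 64, 128) # 8768 parameters
print(f"reduction: {standard / separable:.1f}x")
```

The ratio works out to roughly 1/c_out + 1/k², so the saving grows with both kernel size and output width, which is why stacks of separable convolutions (as in inverted-residual blocks) stay so compact.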
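The focal loss mentioned above scales the cross-entropy term by (1 - p_t)^γ, so confidently classified (easy) samples contribute little and training concentrates on hard, ambiguously labelled samples. A minimal per-sample sketch; γ = 2 is a commonly used default and is an assumption here, not a value stated in the abstract:

```python
import math

def focal_loss(p_t, gamma=2.0, alpha=1.0):
    # p_t: the model's predicted probability for the true class.
    # (1 - p_t)**gamma down-weights easy samples (p_t near 1);
    # alpha is an optional class-balancing weight.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9)  # confident prediction -> tiny loss
hard = focal_loss(0.3)  # ambiguous sample -> much larger loss
print(easy, hard)
```

With γ = 0 the expression reduces to ordinary cross-entropy, which makes the down-weighting effect easy to verify in isolation.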


 DOI: 10.21437/Interspeech.2020-2408

Cite as: Zhong, Y., Hu, Y., Huang, H., Silamu, W. (2020) A Lightweight Model Based on Separable Convolution for Speech Emotion Recognition. Proc. Interspeech 2020, 3331-3335, DOI: 10.21437/Interspeech.2020-2408.


@inproceedings{Zhong2020,
  author={Ying Zhong and Ying Hu and Hao Huang and Wushour Silamu},
  title={{A Lightweight Model Based on Separable Convolution for Speech Emotion Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3331--3335},
  doi={10.21437/Interspeech.2020-2408},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2408}
}