Towards Temporal Modelling of Categorical Speech Emotion Recognition

Wenjing Han, Huabin Ruan, Xiaomin Chen, Zhixiang Wang, Haifeng Li, Björn Schuller

To model categorical speech emotion recognition in a temporal manner, the first challenge is how to convert the single categorical label of each utterance into a label sequence. To address this, we hypothesise that an utterance consists of emotional and non-emotional segments, and that the non-emotional segments correspond to silent regions, short pauses, transitions between phonemes, unvoiced phonemes, etc. Under this hypothesis, we propose to treat an utterance's label sequence as a chain of two states: the emotional state, denoting an emotional frame, and Null, denoting a non-emotional frame. We then use a recurrent neural network based connectionist temporal classification (CTC) model to automatically label and align an utterance's emotional segments with emotional labels and its non-emotional segments with Nulls. Experimental results on the IEMOCAP corpus support our hypothesis and demonstrate the effectiveness of the proposed method compared with state-of-the-art algorithms.
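
The two-state label sequence can be illustrated with a minimal Python sketch. It assumes that Null plays the role of the blank symbol in connectionist temporal classification, so a frame-level path is reduced by the standard collapse rule (merge consecutive repeats, then drop Null). The function names, the example labels, and the majority-vote decision rule are hypothetical choices for illustration and are not taken from the paper.

```python
from collections import Counter
from itertools import groupby

NULL = "Null"  # assumed to act as the CTC blank, i.e. a non-emotional frame


def collapse(frame_labels):
    """Standard CTC collapse: merge consecutive repeats, then drop Null."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != NULL]


def utterance_emotion(frame_labels):
    """Hypothetical decision rule: most frequent label after collapsing.

    Returns None when the path contains no emotional frame at all.
    """
    collapsed = collapse(frame_labels)
    if not collapsed:
        return None
    return Counter(collapsed).most_common(1)[0][0]


# Illustrative frame-level path for one utterance (made-up data).
frames = ["Null", "Null", "angry", "angry", "Null", "angry", "Null"]
print(collapse(frames))
print(utterance_emotion(frames))
```

In this toy path, the two runs of emotional frames are separated by a Null frame, so they survive the collapse as two separate emotional labels, while the Null frames absorb the non-emotional regions.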

DOI: 10.21437/Interspeech.2018-1858

Cite as: Han, W., Ruan, H., Chen, X., Wang, Z., Li, H., Schuller, B. (2018) Towards Temporal Modelling of Categorical Speech Emotion Recognition. Proc. Interspeech 2018, 932-936, DOI: 10.21437/Interspeech.2018-1858.

  author={Wenjing Han and Huabin Ruan and Xiaomin Chen and Zhixiang Wang and Haifeng Li and Björn Schuller},
  title={Towards Temporal Modelling of Categorical Speech Emotion Recognition},
  booktitle={Proc. Interspeech 2018},