Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Ziping Zhao, Zhongtian Bao, Zixing Zhang, Nicholas Cummins, Haishuai Wang, Björn W. Schuller

Discrete speech emotion recognition (SER), the assignment of a single emotion label to an entire speech utterance, is typically performed as a sequence-to-label task. This approach is limited, however, in that it can result in models that do not capture temporal changes in the speech signal, including those indicative of a particular emotion. One potential solution to this limitation is to model SER as a sequence-to-sequence task instead. In this regard, we have developed an attention-based bidirectional long short-term memory (BLSTM) neural network in combination with a connectionist temporal classification (CTC) objective function (Attention-BLSTM-CTC) for SER. We also assess the benefits of incorporating two contemporary attention mechanisms, namely component attention and quantum attention, into the CTC framework. To the best of the authors’ knowledge, this is the first time that such a hybrid architecture has been employed for SER. We demonstrate the effectiveness of our approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and FAU-Aibo Emotion corpora. The experimental results show that our proposed model outperforms current state-of-the-art approaches.
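To make the sequence-to-sequence framing concrete, the following is a minimal sketch of the standard CTC forward recursion in plain Python. It is not the authors' implementation: the BLSTM and both attention mechanisms are omitted, and the function name `ctc_neg_log_likelihood`, the blank index of 0, and the toy per-frame posteriors are assumptions made purely for illustration. It shows how CTC scores a short label sequence against a longer sequence of frame-level class probabilities by summing over all valid alignments.

```python
import math


def ctc_neg_log_likelihood(probs, labels, blank=0):
    """Negative log-likelihood of `labels` under CTC (hypothetical helper).

    probs  : list of T per-frame distributions over classes (each a list of floats).
    labels : target label sequence without blanks (list of ints).
    blank  : index of the CTC blank symbol (assumed to be 0 here).
    """
    # Extended target: a blank before, between, and after every label.
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S, T = len(ext), len(probs)

    # alpha[s] = total probability of all alignments of the frames seen so far
    # that end at position s of the extended target.
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank.
    p = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return math.inf if p == 0.0 else -math.log(p)


# Toy example: two frames, two classes (blank=0, one emotion class=1),
# with uniform per-frame posteriors.
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_neg_log_likelihood(uniform, [1])
```

In this toy case, three of the four possible frame-level paths (1-1, blank-1, 1-blank) collapse to the single label `1`, so the recursion sums their probabilities. In a full model, the per-frame distributions would come from the network's softmax outputs rather than fixed values.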

DOI: 10.21437/Interspeech.2019-1649

Cite as: Zhao, Z., Bao, Z., Zhang, Z., Cummins, N., Wang, H., Schuller, B.W. (2019) Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition. Proc. Interspeech 2019, 206-210, DOI: 10.21437/Interspeech.2019-1649.

  author={Ziping Zhao and Zhongtian Bao and Zixing Zhang and Nicholas Cummins and Haishuai Wang and Björn W. Schuller},
  title={{Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition}},
  booktitle={Proc. Interspeech 2019},