End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model

Han Feng, Sei Ueno, Tatsuya Kawahara


In this paper, we propose speech emotion recognition (SER) combined with an acoustic-to-word automatic speech recognition (ASR) model. While acoustic prosodic features are the primary input for SER, textual features are also useful; however, ASR transcripts are error-prone, especially for emotional speech. To address this problem, we integrate the ASR and SER models in an end-to-end manner by using an acoustic-to-word model. Specifically, we combine the decoder states of the ASR model with the acoustic features and feed them into the SER model. On top of a recurrent network that learns features from this input, we adopt a self-attention mechanism to focus on important feature frames. Finally, we fine-tune the ASR model on the target dataset using multi-task learning to jointly optimize it with the SER task. Our model achieves 68.63% weighted accuracy (WA) and 69.67% unweighted accuracy (UA) on the IEMOCAP database, which is state-of-the-art performance.
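A minimal sketch of the SER branch described above, assuming a PyTorch implementation: ASR decoder states (assumed here to already be aligned to the acoustic frame rate) are concatenated with the acoustic features, encoded by a bidirectional LSTM, and pooled with a single-head self-attention over frames. All module names, dimensions, and the single-head attention design are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SERHead(nn.Module):
    # Hypothetical SER branch: concatenated acoustic + ASR-decoder features
    # -> BLSTM -> self-attention pooling -> emotion classifier.
    def __init__(self, acoustic_dim=80, decoder_dim=320, hidden_dim=256,
                 num_emotions=4):
        super().__init__()
        self.blstm = nn.LSTM(acoustic_dim + decoder_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        # One attention score per frame; the weighted sum focuses the
        # utterance representation on emotionally salient frames.
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, acoustic_feats, decoder_states):
        # acoustic_feats: (batch, frames, acoustic_dim)
        # decoder_states: (batch, frames, decoder_dim), frame-aligned here
        x = torch.cat([acoustic_feats, decoder_states], dim=-1)
        h, _ = self.blstm(x)                      # (B, T, 2H)
        weights = F.softmax(self.attn(h), dim=1)  # (B, T, 1), over frames
        utterance = (weights * h).sum(dim=1)      # (B, 2H)
        return self.classifier(utterance)         # emotion logits

The joint fine-tuning step can likewise be sketched as an interpolation of the emotion-classification loss and the ASR loss; the weight lambda_asr and the padding index below are hypothetical values, not those reported in the paper.

def multitask_loss(ser_logits, emotion_labels, asr_log_probs, token_labels,
                   lambda_asr=0.5):
    # ser_logits: (B, num_emotions); asr_log_probs: (B, T, vocab)
    ser_loss = F.cross_entropy(ser_logits, emotion_labels)
    # nll_loss over token sequences expects (B, vocab, T)
    asr_loss = F.nll_loss(asr_log_probs.transpose(1, 2), token_labels,
                          ignore_index=0)  # pad index assumed to be 0
    return ser_loss + lambda_asr * asr_loss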


DOI: 10.21437/Interspeech.2020-1180

Cite as: Feng, H., Ueno, S., Kawahara, T. (2020) End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model. Proc. Interspeech 2020, 501-505, DOI: 10.21437/Interspeech.2020-1180.


@inproceedings{Feng2020,
  author={Han Feng and Sei Ueno and Tatsuya Kawahara},
  title={{End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={501--505},
  doi={10.21437/Interspeech.2020-1180},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1180}
}