Multi-Stream Attention-Based BLSTM with Feature Segmentation for Speech Emotion Recognition

Yuya Chiba, Takashi Nose, Akinori Ito


This paper proposes a speech emotion recognition technique that considers the suprasegmental characteristics and temporal change of individual speech parameters. In recent years, speech emotion recognition using the Bidirectional LSTM (BLSTM) has been actively studied because the model can focus on particular temporal regions that contain strong emotional characteristics. One weakness of this model is that it cannot consider the statistics of speech features, which are known to be effective for speech emotion recognition. Moreover, it cannot train individual attention parameters for different descriptors because it handles the input sequence with a single BLSTM. In this paper, we introduce feature segmentation and multi-stream processing into the attention-based BLSTM to solve these problems. In addition, we employed data augmentation based on emotional speech synthesis in the training step. Classification experiments among four emotions (anger, joy, neutral, and sadness) on the Japanese Twitter-based Emotional Speech corpus (JTES) showed that the proposed method achieved a recognition accuracy of 73.4%, which is comparable to human evaluation (75.5%).
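The multi-stream idea in the abstract can be sketched as follows: each feature stream gets its own attention parameters, attention pooling summarizes each stream's frame sequence into an utterance-level vector, and the pooled vectors are concatenated for classification. This is a minimal numpy illustration of attention pooling per stream, not the paper's implementation; the stream names, dimensions, and random weights are hypothetical, and the random sequences stand in for per-stream BLSTM outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(h, w):
    """Attention pooling over time: alpha_t = softmax(w . h_t), out = sum_t alpha_t h_t."""
    scores = h @ w           # (T,) one relevance score per frame
    alpha = softmax(scores)  # (T,) attention weights summing to 1
    return alpha @ h         # (D,) weighted average of frame vectors

# Hypothetical feature streams (names/dimensions are illustrative, not from the paper):
# each entry maps a stream name to its per-frame dimensionality.
T = 50  # number of frames in the utterance
streams = {"mfcc": 26, "f0": 4, "power": 2}
utterance = {n: rng.standard_normal((T, d)) for n, d in streams.items()}
attn_w = {n: rng.standard_normal(d) for n, d in streams.items()}  # per-stream attention params

# Pool each stream with its own attention parameters, then concatenate
# the utterance-level vectors for a downstream emotion classifier.
pooled = np.concatenate([attention_pool(utterance[n], attn_w[n]) for n in streams])
print(pooled.shape)  # (32,) = 26 + 4 + 2
```

Because each stream has its own attention vector, the model can attend to different temporal regions for, say, spectral versus prosodic descriptors, which is the limitation of a single-BLSTM pipeline that the paper addresses.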


 DOI: 10.21437/Interspeech.2020-1199

Cite as: Chiba, Y., Nose, T., Ito, A. (2020) Multi-Stream Attention-Based BLSTM with Feature Segmentation for Speech Emotion Recognition. Proc. Interspeech 2020, 3301-3305, DOI: 10.21437/Interspeech.2020-1199.


@inproceedings{Chiba2020,
  author={Yuya Chiba and Takashi Nose and Akinori Ito},
  title={{Multi-Stream Attention-Based BLSTM with Feature Segmentation for Speech Emotion Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3301--3305},
  doi={10.21437/Interspeech.2020-1199},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1199}
}