Pyramid Memory Block and Timestep Attention for Speech Emotion Recognition

Miao Cao, Chun Yang, Fang Zhou, Xu-cheng Yin

As a sequence model, the Deep Feedforward Sequential Memory Network (DFSMN) has shown superior performance on many tasks, such as language modeling and speech recognition. Building on this architecture, we propose an improved end-to-end speech emotion recognition (SER) system. Our model comprises both CNN layers and pyramid FSMN layers, where the CNN layers are placed at the front of the network to extract more sophisticated features. A timestep attention mechanism is also integrated into our SER system, which allows the system to learn to focus on the more robust or informative segments of the input signal. Furthermore, unlike traditional SER systems, the proposed model is applied directly to spectrograms, which retain more raw speech information, rather than to well-established hand-crafted speech features such as spectral, cepstral, and pitch features. Finally, we evaluate our system on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. The experimental results show that our system achieves a 2.67% improvement over the commonly used CNN-BiLSTM model, which requires far more computing resources.
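To make the pooling idea concrete, the following is a minimal NumPy sketch of a timestep attention mechanism of the kind the abstract describes: frame-level features are scored, the scores are normalized with a softmax, and the frames are combined into a single utterance-level embedding. All names, dimensions, and the scoring function are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def timestep_attention(frames, w, b, v):
    """Weighted pooling over timesteps.

    frames: (T, D) frame-level features (e.g. from CNN/FSMN layers).
    w, b, v: learned projection parameters (assumed shapes below).
    Returns the (D,) pooled utterance vector and the (T,) weights.
    """
    # Score each timestep: e_t = v . tanh(W h_t + b)
    scores = np.tanh(frames @ w + b) @ v   # (T,)
    alpha = softmax(scores)                # attention weights, sum to 1
    return alpha @ frames, alpha           # (D,) pooled vector, (T,) weights

# Toy usage with random features and parameters.
rng = np.random.default_rng(0)
T, D, A = 50, 16, 8                        # timesteps, feature dim, attention dim
frames = rng.normal(size=(T, D))
w = rng.normal(size=(D, A))
b = np.zeros(A)
v = rng.normal(size=A)
pooled, alpha = timestep_attention(frames, w, b, v)
```

In an end-to-end system these parameters would be trained jointly with the rest of the network, so that timesteps carrying more emotional information receive larger weights.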

DOI: 10.21437/Interspeech.2019-3140

Cite as: Cao, M., Yang, C., Zhou, F., Yin, X. (2019) Pyramid Memory Block and Timestep Attention for Speech Emotion Recognition. Proc. Interspeech 2019, 3930-3934, DOI: 10.21437/Interspeech.2019-3140.

@inproceedings{cao19_interspeech,
  author={Miao Cao and Chun Yang and Fang Zhou and Xu-cheng Yin},
  title={{Pyramid Memory Block and Timestep Attention for Speech Emotion Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3930--3934},
  doi={10.21437/Interspeech.2019-3140}
}