Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation

Sung-Lin Yeh, Yun-Shao Lin, Chi-Chun Lee


Developing robust speech emotion recognition (SER) systems is challenging due to the small scale of existing emotional speech datasets. Moreover, previous works have mostly relied on handcrafted acoustic features to build SER models, which makes it difficult to handle a wide range of acoustic variations. One way to alleviate this problem is to use speech representations learned by deep end-to-end models trained on large-scale speech databases. Specifically, in this paper, we leverage an end-to-end ASR model to extract ASR-based representations for speech emotion recognition. We further devise a factorized domain adaptation approach on the pre-trained ASR model to improve both the speech recognition rate and the emotion recognition accuracy on the target emotion corpus, and we also provide an analysis of the effectiveness of representations extracted from different ASR layers. Our experiments demonstrate the importance of ASR adaptation and layer depth for emotion recognition.
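To make the pipeline concrete, below is a minimal sketch (not the authors' released code) of the two ideas in the abstract: extracting frame-level representations from different layers of a pre-trained end-to-end ASR model, and adapting them with a low-rank residual adapter, one common realization of factorized adaptation, before an emotion classifier. A public wav2vec 2.0 ASR checkpoint from torchaudio stands in for the paper's ASR; the layer index, adapter rank, input file name, and four-class emotion head are illustrative assumptions.

import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
asr = bundle.get_model().eval()  # pre-trained end-to-end ASR (stand-in)

class FactorizedAdapter(torch.nn.Module):
    """Low-rank residual adapter h + B(relu(A(h))): a factorized adaptation layer."""
    def __init__(self, dim: int, rank: int = 32):
        super().__init__()
        self.down = torch.nn.Linear(dim, rank)  # factor A: dim -> rank
        self.up = torch.nn.Linear(rank, dim)    # factor B: rank -> dim
        torch.nn.init.zeros_(self.up.weight)    # start as an identity mapping
        torch.nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

adapter = FactorizedAdapter(dim=768)        # 768 = width of this base model
emotion_head = torch.nn.Linear(768, 4)      # e.g., 4 emotion classes (assumption)

waveform, sr = torchaudio.load("clip.wav")  # hypothetical input utterance
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    # extract_features returns one tensor per transformer layer,
    # which lets us probe how layer depth affects emotion recognition.
    layer_feats, _ = asr.extract_features(waveform)

h = layer_feats[6]              # pick a middle layer (illustrative choice)
h = adapter(h)                  # adapt frames toward the emotion domain
utterance = h.mean(dim=1)       # mean-pool frames into an utterance vector
logits = emotion_head(utterance)  # emotion logits for classification

In a training loop, one would freeze the ASR backbone and update only the adapter and emotion head on the target emotion corpus, then compare logits from different layer_feats indices to study layer depth.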


DOI: 10.21437/Interspeech.2020-2524

Cite as: Yeh, S., Lin, Y., Lee, C. (2020) Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation. Proc. Interspeech 2020, 536-540, DOI: 10.21437/Interspeech.2020-2524.


@inproceedings{Yeh2020,
  author={Sung-Lin Yeh and Yun-Shao Lin and Chi-Chun Lee},
  title={{Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={536--540},
  doi={10.21437/Interspeech.2020-2524},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2524}
}