Temporal Attention Convolutional Network for Speech Emotion Recognition with Latent Representation

Jiaxing Liu, Zhilei Liu, Longbiao Wang, Yuan Gao, Lili Guo, Jianwu Dang


As a fundamental research topic in affective computing, speech emotion recognition (SER) has gained a lot of attention. Unlike common deep learning tasks, SER is restricted by the scarcity of emotional speech datasets. In this paper, the vector quantized variational autoencoder (VQ-VAE) was introduced and trained on massive unlabeled data in an unsupervised manner. Benefiting from the invariant distribution encoding capability and discrete embedding space of VQ-VAE, the pre-trained VQ-VAE could learn latent representations from labeled data. The extracted latent representations could serve as an additional data source, alleviating the data shortage. While addressing the data scarcity issue, sequence information modeling, which is considered useful for SER, was also taken into account. The proposed sequence model, the temporal attention convolutional network (TACN), was simple yet good at learning contextual information from limited data, a setting that does not favor the complicated structures of recurrent neural network (RNN) based sequence models. To validate the effectiveness of the latent representation, t-distributed stochastic neighbor embedding (t-SNE) was used to visualize it. To verify the performance of the proposed TACN, quantitative classification results for commonly used sequence models were provided. Our proposed model achieved state-of-the-art performance on IEMOCAP.
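
The discrete embedding space mentioned in the abstract rests on a nearest-codebook lookup: each continuous encoder output is replaced by the closest entry of a learned codebook. The following is a minimal sketch of that lookup only, not the authors' implementation. The function names (`squared_distance`, `quantize`), the two-dimensional toy codebook, and the frame values are all assumptions made for illustration; the encoder, decoder, and training losses of a VQ-VAE are omitted.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each frame, the index of its nearest codebook vector."""
    indices = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        # Pick the codebook entry with the smallest distance to this frame.
        indices.append(distances.index(min(distances)))
    return indices


# Hypothetical toy values: three codebook vectors and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices = quantize(frames, codebook)
# The quantized latent sequence is the codebook entries at those indices.
quantized = [codebook[i] for i in indices]
```

In this sketch, `indices` is the discrete code sequence and `quantized` is the corresponding latent representation that a downstream sequence model could consume.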


DOI: 10.21437/Interspeech.2020-1520

Cite as: Liu, J., Liu, Z., Wang, L., Gao, Y., Guo, L., Dang, J. (2020) Temporal Attention Convolutional Network for Speech Emotion Recognition with Latent Representation. Proc. Interspeech 2020, 2337-2341, DOI: 10.21437/Interspeech.2020-1520.


@inproceedings{Liu2020,
  author={Jiaxing Liu and Zhilei Liu and Longbiao Wang and Yuan Gao and Lili Guo and Jianwu Dang},
  title={{Temporal Attention Convolutional Network for Speech Emotion Recognition with Latent Representation}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2337--2341},
  doi={10.21437/Interspeech.2020-1520},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1520}
}