Speech emotion recognition

Jianhua Tao


Speech emotion recognition supports natural and efficient human-computer interaction, with wide applications in website customization, education and gaming. Typical methods are based on short-time frame-level feature extraction, followed by utterance-level information extraction and classification or regression as required. However, selecting a common, global emotional feature subspace is challenging. We explore the influence of different emotional features (voice quality features, spectral features and prosodic features) on different types of corpora. A denoising auto-encoder is utilized to extract high-level discriminative representations. On the other hand, various machine learning algorithms have been applied to speech emotion recognition, such as Gaussian Mixture Models, Deep Neural Networks and Support Vector Machines. Emotion is a temporally evolving event, so we favor methods that can model larger sets of contextual information well, such as Hidden Markov Models and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs). In this talk, I present our multi-scale emotional dynamic temporal modeling using deep belief networks and LSTM-RNNs. We also propose temporal pooling to alleviate the problems of redundant information and label noise in dimensional emotion recognition. To resolve the ambiguity of emotion description, we combine dimensional and discrete emotion information to improve the performance of emotion recognition.
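The temporal pooling idea mentioned in the abstract can be sketched as follows. This is an illustrative example only (simple mean pooling over non-overlapping windows of frame-level feature vectors), not the specific pooling scheme presented in the talk: consecutive short-time frames carry highly redundant information, so averaging them into fewer vectors before sequence modeling reduces redundancy and smooths frame-level label noise.

```python
# Hypothetical sketch of temporal mean pooling over frame-level features.
# Each frame is a feature vector (e.g. prosodic/spectral descriptors);
# non-overlapping windows of `window` frames are averaged into one vector.

def temporal_mean_pool(frames, window):
    """Average non-overlapping windows of `window` consecutive frames.

    frames: list of equal-length feature vectors (lists of floats)
    window: number of consecutive frames pooled into one vector
    A trailing partial window is averaged over its actual length.
    """
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk)
                       for d in range(dim)])
    return pooled

# Example: five 2-dimensional frame vectors pooled with window=2
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]
print(temporal_mean_pool(frames, 2))
# → [[1.0, 2.0], [5.0, 6.0], [8.0, 9.0]]
```

The pooled sequence is shorter than the original frame sequence, so a downstream sequence model such as an LSTM-RNN operates on fewer, less redundant steps.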


Cite as: Tao, J. (2018) Speech emotion recognition. Proc. 9th International Conference on Speech Prosody 2018.


@inproceedings{Tao2018,
  author={Jianhua Tao},
  title={Speech emotion recognition},
  year=2018,
  booktitle={Proc. 9th International Conference on Speech Prosody 2018}
}