Speech Emotion Recognition with a Reject Option

Kusha Sridhar, Carlos Busso

Speech emotion recognition (SER) for categorical descriptors is a difficult task when the recordings come from everyday spontaneous interactions. The boundaries between emotional classes are less clear, resulting in complex, mixed emotions. Since the performance of a SER system varies across speech recordings, it is important to understand the reliability associated with its predictions. An intriguing formulation in machine learning related to this problem is the reject option, where a classifier only provides predictions for samples with reliability above a given threshold. This paper proposes a classification technique with a reject option using deep neural networks (DNNs) that increases performance by selectively trading off coverage of the test set. We use two different criteria to develop a SER system with a reject option, which can accept or reject a sample as needed. Using the MSP-Podcast corpus, we evaluate this idea by comparing classification performance as a function of coverage. With coverage set to 75% of the samples, we obtain relative gains in F1-score of up to 25.71% for a five-class problem and 20.63% for an eight-class problem. An analysis of the rejected sentences confirms that they have lower inter-evaluator agreement.
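
The reject option described above can be illustrated with a minimal sketch. It assumes the maximum class posterior as the reliability score, which is a common generic choice and not necessarily either of the two criteria used in the paper. The function name `predict_with_reject` and the toy posteriors are hypothetical.

```python
import numpy as np


def predict_with_reject(probs, coverage):
    """Accept the most confident `coverage` fraction of samples, reject the rest.

    probs: array of shape (n_samples, n_classes) with class posteriors.
    coverage: fraction of samples to accept, between 0 and 1.
    Returns (labels, accepted); labels is -1 for rejected samples.
    """
    probs = np.asarray(probs, dtype=float)
    # Assumption: reliability is the highest posterior of each sample.
    confidence = probs.max(axis=1)
    n_keep = int(round(coverage * len(probs)))
    # Rank samples from most to least confident and keep the top n_keep.
    order = np.argsort(-confidence, kind="stable")
    accepted = np.zeros(len(probs), dtype=bool)
    accepted[order[:n_keep]] = True
    labels = np.where(accepted, probs.argmax(axis=1), -1)
    return labels, accepted


# Toy posteriors for four utterances and three classes (made-up values).
toy_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],  # least confident, rejected at 75% coverage
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],
]
labels, accepted = predict_with_reject(toy_probs, coverage=0.75)
print(labels)    # [ 0 -1  1  2]
print(accepted)  # [ True False  True  True]
```

Metrics such as the F1-score would then be computed only over the accepted samples, and sweeping `coverage` from 1.0 downward gives a performance-versus-coverage curve of the kind the abstract refers to.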

DOI: 10.21437/Interspeech.2019-1842

Cite as: Sridhar, K., Busso, C. (2019) Speech Emotion Recognition with a Reject Option. Proc. Interspeech 2019, 3272-3276, DOI: 10.21437/Interspeech.2019-1842.

  author={Kusha Sridhar and Carlos Busso},
  title={{Speech Emotion Recognition with a Reject Option}},
  booktitle={Proc. Interspeech 2019},