13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Speaker Clustering in Emotion Recognition

Ni Ding (1), Julien Epps (1,2)

(1) The School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney NSW, Australia
(2) ATP Laboratory, National ICT Australia, Sydney NSW, Australia

Speaker variability is a known challenge for emotion recognition, yet little work has examined speaker similarity and its contribution to performance in the emotion classification task. In this paper, we investigate this topic and find a clear link between speaker proximity and recognition accuracy. Motivated by this result, emotion-based speaker clustering is proposed as a new strategy for speaker adaptation. It uses speaker proximity to cluster individual speakers' emotion models in the training set on a per-emotion basis, and adapts the test speaker's emotion model from the closest cluster. A series of tests was conducted to explore how system performance varies with the clustering method, the number of clusters, and the amount of adaptation data. Results on the LDC Emotional Prosody and FAU Aibo corpora show that this method outperforms speaker bootstrapping, both in reducing computational load and in producing higher accuracy.

Index Terms: speaker clustering, emotion recognition, acoustic adaptation
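The per-emotion clustering and nearest-cluster selection described in the abstract can be sketched as follows. This is a minimal illustration only, assuming each speaker's emotion model is summarized as a fixed-length feature vector and using plain k-means with Euclidean distance as the proximity measure; the function names, the k-means choice, and the vector representation are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def cluster_speaker_models(models, n_clusters, n_iter=50, seed=0):
    """Cluster per-speaker emotion model vectors (one emotion at a time)
    with a basic k-means loop. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(models, dtype=float)
    # Initialize centroids from distinct speaker models.
    centroids = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each speaker model to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned models.
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

def closest_cluster(centroids, test_model):
    """Index of the cluster nearest to the test speaker's model vector;
    in the adaptation scheme, that cluster's model would be adapted."""
    dists = np.linalg.norm(centroids - np.asarray(test_model, dtype=float), axis=1)
    return int(dists.argmin())
```

In the strategy the abstract describes, this clustering would be run separately for each emotion class, and the test speaker's emotion model would then be adapted from the closest cluster rather than from all training speakers, which is what reduces the computational load relative to speaker bootstrapping.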


Bibliographic reference.  Ding, Ni / Epps, Julien (2012): "Speaker clustering in emotion recognition", In INTERSPEECH-2012, 1163-1166.