Second Workshop on Child, Computer and Interaction (WOCCI 2009)
Cambridge, MA, USA
The need for automatic recognition of a speaker's emotion within a spoken dialog system framework has received increased attention with demand for computer interfaces that provide natural and user-adaptive spoken interaction. This paper addresses the problem of automatically recognizing a child's emotional state using information obtained from audio and video signals. The study is based on a multimodal data corpus consisting of spontaneous conversations between a child and a computer agent. Four different techniques [k-nearest neighborhood (k-NN) classifier, decision tree, linear discriminant classifier (LDC), and support vector machine classifier (SVC)] were employed for classifying utterances into 2 emotion classes, negative and non-negative, for both acoustic and visual information. Experimental results show that, overall, combining visual information with acoustic information leads to performance improvements in emotion recognition. We obtained the best results when information sources were combined at feature level. Specifically, results showed that the addition of visual information to acoustic information yields relative improvements in emotion recognition of 3.8% with both LDC and SVC classifiers for information fusion at the feature level over that of using only acoustic information.
Bibliographic reference. Yildirim, Serdar / Narayanan, Shrikanth (2009): "Recognizing childs emotional state in problem-solving child-machine interactions", In WOCCI-2009, 61-64.