Most emotion recognition systems do not operate in real time because of latencies introduced by phrase segmentation and resource-intensive feature extraction. To address this issue, we present an emotion recognition approach that estimates speaker emotion with much lower latency. Rather than relying on phrase-level features, the proposed approach estimates the speaker's emotional state incrementally over the course of the utterance, using a shifting n-word window of easily computable features. These features are obtained at the word level from three information streams, i.e., cepstral, prosodic, and textual, and are combined at the decision level using a statistical framework. Our work shows that combining the three information streams yields higher emotion recognition accuracy than any single stream alone. Extracting features from n-word sequences rather than whole phrases gives the proposed system its low-latency capability, without any loss in utterance-level emotion recognition accuracy. On a binary utterance-level emotion recognition task using an in-house database, the proposed system shows a relative improvement of 41% over chance, compared to 31.82% for the baseline phrase-level emotion recognition approach.
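The incremental scheme described above — a shifting n-word window scored by three stream classifiers whose outputs are fused at the decision level — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the window size, the log-linear fusion rule, the stream weights, and the toy classifiers are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of incremental emotion recognition with decision-level
# fusion over a shifting n-word window. All concrete choices (window size,
# fusion rule, toy classifiers) are illustrative assumptions, not the
# authors' actual system.
import math
from collections import deque

WINDOW = 3  # n-word window size (illustrative choice)


def fuse(posteriors_per_stream, weights):
    """Decision-level fusion: weighted sum of per-stream log-posteriors over
    the two emotion classes (negative, positive), renormalized to sum to 1."""
    log_scores = [0.0, 0.0]
    for stream, (p_neg, p_pos) in posteriors_per_stream.items():
        w = weights[stream]
        log_scores[0] += w * math.log(p_neg)
        log_scores[1] += w * math.log(p_pos)
    m = max(log_scores)  # stabilize the exponentiation
    exps = [math.exp(s - m) for s in log_scores]
    z = sum(exps)
    return exps[0] / z, exps[1] / z


def incremental_recognizer(word_stream, classifiers, weights, n=WINDOW):
    """Emit an emotion estimate after every word, scoring only the last n
    words, so latency is bounded by the window rather than the phrase."""
    window = deque(maxlen=n)
    for word in word_stream:
        window.append(word)
        posteriors = {s: clf(list(window)) for s, clf in classifiers.items()}
        yield word, fuse(posteriors, weights)


# Toy per-stream classifiers returning fixed (P(neg), P(pos)) posteriors;
# stand-ins for real cepstral, prosodic, and textual models.
classifiers = {
    "cepstral": lambda w: (0.4, 0.6),
    "prosodic": lambda w: (0.3, 0.7),
    "textual":  lambda w: (0.45, 0.55),
}
weights = {"cepstral": 1.0, "prosodic": 1.0, "textual": 1.0}

for word, (p_neg, p_pos) in incremental_recognizer(
        "i am really happy today".split(), classifiers, weights):
    print(f"after '{word}': P(positive) = {p_pos:.3f}")
```

With equal weights, the fused positive posterior here works out to roughly 0.81 after every word (the toy classifiers ignore the window contents); a real system would update the per-stream posteriors as the window slides.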
Bibliographic reference. Mishra, Taniya / Dimitriadis, Dimitrios (2013): "Incremental emotion recognition", In INTERSPEECH-2013, 2876-2880.