Speech Prosody 2010

Chicago, IL, USA
May 10-14, 2010

Classification of Affective Speech using Normalized Time-Frequency Cepstra

D. Neiberg (1), P. Laukka (2), G. Ananthakrishnan (1)

(1) Centre for Speech Technology (CTT), TMH, CSC, KTH, Stockholm, Sweden
(2) Department of Psychology, Stockholm University, Stockholm, Sweden

Subtle temporal and spectral differences between categorical realizations of para-linguistic phenomena (e.g., affective vocal expressions) are hard to capture and describe. In this paper we present a signal representation based on Time Varying Constant-Q Cepstral Coefficients (TVCQCC) derived for this purpose. A method which utilizes the special properties of the constant-Q transform for mean F0 estimation and normalization is described. The coefficients are invariant to segment length, and as a special case, a representation for prosody is considered. We report speaker-independent classification results using ν-SVM on the Berlin EMO-DB and on two closed sets of basic (anger, disgust, fear, happiness, sadness, neutral) and social/interpersonal (affection, pride, shame) emotions recorded by forty professional actors from two English dialect areas. The accuracy for the Berlin EMO-DB is 71.2%; for the first set, comprising the basic emotions, the accuracy is 44.6%, and for the second set, comprising both basic and social emotions, it is 31.7%. We found that F0 normalization boosts performance and that a combined feature set performs best.
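The segment-length invariance described above can be illustrated with a minimal sketch: take the 2D-DCT of a log constant-Q spectrogram and keep a fixed grid of low-order coefficients, so segments of different durations map to feature matrices of the same size. This is an illustrative assumption-laden toy, not the authors' exact TVCQCC pipeline; the function name, truncation sizes, and the random "spectrograms" are all hypothetical.

```python
import numpy as np
from scipy.fft import dct

def tvcqcc_sketch(log_cq_spectrogram, n_freq=12, n_time=8):
    """Hypothetical sketch of a TVCQCC-style feature:
    2D-DCT of a log constant-Q spectrogram (frequency bins x time frames),
    truncated to a fixed (n_freq x n_time) block of low-order coefficients,
    making the representation invariant to segment length."""
    # DCT along the frequency axis, then along the time axis
    c = dct(dct(log_cq_spectrogram, axis=0, norm='ortho'), axis=1, norm='ortho')
    return c[:n_freq, :n_time]

# Two toy "segments" of different durations (48 constant-Q bins each)
rng = np.random.default_rng(0)
short_seg = np.log(np.abs(rng.standard_normal((48, 20))) + 1e-6)
long_seg = np.log(np.abs(rng.standard_normal((48, 90))) + 1e-6)

# Both yield equally sized feature matrices despite different lengths
print(tvcqcc_sketch(short_seg).shape, tvcqcc_sketch(long_seg).shape)
```

The truncation to a fixed number of time-axis DCT coefficients is what absorbs the varying frame count, analogous to how low-order cepstral coefficients summarize spectral envelope regardless of FFT size.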

Index Terms: Emotion Classification, Constant-Q, 2D-DCT, supra-segmental, mean pitch estimation, prosody


Bibliographic reference.  Neiberg, D. / Laukka, P. / Ananthakrishnan, G. (2010): "Classification of affective speech using normalized time-frequency cepstra", In SP-2010, paper 071.