Speech Prosody 2010
Chicago, IL, USA
Subtle temporal and spectral differences between categorical realizations of paralinguistic phenomena (e.g., affective vocal expressions) are hard to capture and describe. In this paper we present a signal representation based on Time-Varying Constant-Q Cepstral Coefficients (TVCQCC) derived for this purpose. A method which exploits the special properties of the constant-Q transform for mean F0 estimation and normalization is described. The coefficients are invariant to segment length, and as a special case a representation for prosody is considered. Speaker-independent classification results are reported using ν-SVM with the Berlin EMO-DB and with two closed sets of basic (anger, disgust, fear, happiness, sadness, neutral) and social/interpersonal (affection, pride, shame) emotions recorded by forty professional actors from two English dialect areas. The accuracy for the Berlin EMO-DB is 71.2%; for the first set, comprising the basic emotions, it is 44.6%, and for the second set, comprising both basic and social emotions, it is 31.7%. F0 normalization was found to boost performance, and a combined feature set showed the best performance.
Index Terms: Emotion Classification, Constant-Q, 2D-DCT, supra-segmental, mean pitch estimation, prosody
Bibliographic reference. Neiberg, D. / Laukka, P. / Ananthakrishnan, G. (2010): "Classification of affective speech using normalized time-frequency cepstra", In SP-2010, paper 071.
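The core idea behind a TVCQCC-style representation can be illustrated with a minimal sketch: compute a constant-Q log-magnitude spectrogram, then apply a 2D-DCT and keep a fixed low-order block of coefficients, which yields a representation of constant size regardless of segment length. This is only an illustrative approximation under assumed parameters (`fmin`, `bins_per_octave`, hop size, and the number of retained coefficients are arbitrary choices here); it reproduces neither the paper's exact TVCQCC derivation nor its constant-Q-based F0 estimation and normalization step. The function names `cqt_magnitude` and `tvcqcc` are hypothetical.

```python
import numpy as np
from scipy.fft import dctn

def cqt_magnitude(x, sr, fmin=55.0, bins_per_octave=12, n_bins=48, hop=256):
    """Naive constant-Q magnitude spectrogram: each bin uses a window whose
    length is inversely proportional to its center frequency (constant Q)."""
    Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)          # quality factor f/df
    freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    n_frames = 1 + (len(x) - 1) // hop
    C = np.zeros((n_bins, n_frames))
    for k, f in enumerate(freqs):
        N = int(round(Q * sr / f))                        # window length for this bin
        kernel = np.hanning(N) * np.exp(-2j * np.pi * f / sr * np.arange(N)) / N
        for t in range(n_frames):
            seg = x[t * hop : t * hop + N]
            C[k, t] = np.abs(np.dot(kernel[: len(seg)], seg))
    return C

def tvcqcc(x, sr, n_cep_freq=12, n_cep_time=8):
    """Sketch of time-varying constant-Q cepstra: 2D-DCT of the log constant-Q
    spectrogram, truncated to a fixed block -> size independent of duration."""
    log_spec = np.log(cqt_magnitude(x, sr) + 1e-10)
    D = dctn(log_spec, type=2, norm="ortho")              # 2D-DCT over (freq, time)
    return D[:n_cep_freq, :n_cep_time]
```

Truncating the 2D-DCT along the time axis is what gives segment-length invariance: the retained coefficients describe the coarse temporal trajectory of each cepstral dimension on a normalized time scale, so a one-second and a two-second utterance both map to the same fixed-size matrix.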