13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Consumer-Level Multimedia Event Detection through Unsupervised Audio Signal Modeling

Byungki Byun (1), Ilseo Kim (1), Sabato Marco Siniscalchi (2), Chin-Hui Lee (1)

(1) School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
(2) Telematics Engineering, Kore University of Enna, Enna, Sicily, Italy

This work proposes a novel acoustic characterization approach to the multimedia event detection (MED) task for unconstrained and unstructured consumer-level videos, based on audio signal modeling. The key idea is to characterize the acoustic space of interest with a set of fundamental acoustic units, around which a set of acoustic segment models (ASMs) is built. A vector space modeling technique is adopted to address MED: an incoming audio signal is first decoded into a sequence of acoustic segments; a feature vector is then generated from co-occurrence statistics of the acoustic units, and the final MED decision is made with a vector space language classifier. Experimental evidence on the TRECVID 2011 MED task demonstrates the viability of the proposed approach. Furthermore, the approach accounts for temporal dependencies better than previously proposed MFCC bag-of-words approaches.
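
The feature-generation step described in the abstract can be illustrated with a minimal sketch. It assumes that the co-occurrence statistics are unigram and bigram relative frequencies of the decoded acoustic units; the abstract does not specify the exact statistics, so this choice, the function name `cooccurrence_features`, and the unit labels `a1` to `a3` are all hypothetical.

```python
from collections import Counter
from itertools import product


def cooccurrence_features(units, inventory):
    """Map a decoded unit sequence to a fixed-length vector.

    The vector holds unigram relative frequencies followed by bigram
    relative frequencies (an assumed choice of co-occurrence statistics,
    not necessarily the one used in the paper).
    """
    unigrams = Counter(units)
    bigrams = Counter(zip(units, units[1:]))  # adjacent unit pairs
    n_uni = max(len(units), 1)       # guard against empty input
    n_bi = max(len(units) - 1, 1)
    vec = [unigrams[u] / n_uni for u in inventory]
    vec += [bigrams[(a, b)] / n_bi for a, b in product(inventory, repeat=2)]
    return vec


# Hypothetical inventory of three acoustic units and one decoded clip.
inventory = ["a1", "a2", "a3"]
decoded = ["a1", "a2", "a2", "a3", "a1", "a2"]
features = cooccurrence_features(decoded, inventory)
```

With an inventory of N units the vector has N + N^2 dimensions. The bigram terms are what let such a representation capture some temporal ordering that a plain bag-of-words histogram discards. The resulting vector could then be passed to any vector-based classifier.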

Index Terms: multimedia event detection, unsupervised audio modeling, acoustic segment models

Bibliographic reference. Byun, Byungki / Kim, Ilseo / Siniscalchi, Sabato Marco / Lee, Chin-Hui (2012): "Consumer-level multimedia event detection through unsupervised audio signal modeling", In INTERSPEECH-2012, 2081-2084.