13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Exploiting Temporal Sequence Structure for Semantic Analysis of Multimedia

Sourish Chaudhuri, Rita Singh, Bhiksha Raj

Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

In this paper, we explore the hypothesis that accurately associating semantics with scenes requires processing sequences of such scenes rather than individual snapshots in time. We build on work that represents audio as a sequence of descriptors, each spanning multiple frames, by exploring and comparing different ways of obtaining such a lexicon of descriptors. We then extend this unsupervised learning scheme to video and report results of experiments on the Multimedia Event Detection 2011 dataset. We find that learning the set of descriptors automatically from data significantly outperforms both vector quantization-based systems and systems using library-based descriptors.
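
As a rough illustration of the snapshot-versus-sequence contrast in the abstract, the sketch below quantizes frames against a fixed codebook and compares an order-free histogram with bigram counts over the same labels. It is not the authors' method: the paper learns its descriptors from data, whereas the hand-picked two-entry codebook, the toy clips, and the function names (`quantize`, `bag_of_codewords`, `ngram_counts`) here are all assumptions made for the example.

```python
from collections import Counter


def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codeword
    (squared Euclidean distance)."""
    def nearest(frame):
        return min(
            range(len(codebook)),
            key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, codebook[k])),
        )
    return [nearest(frame) for frame in frames]


def bag_of_codewords(labels, codebook_size):
    """Order-free histogram of codeword counts (the per-snapshot view)."""
    counts = Counter(labels)
    return [counts.get(k, 0) for k in range(codebook_size)]


def ngram_counts(labels, n=2):
    """Counts of length-n runs of codewords (a simple sequence-aware view)."""
    return Counter(tuple(labels[i:i + n]) for i in range(len(labels) - n + 1))


# Hypothetical 2-D codebook and two toy clips that contain the same
# frames in a different temporal order.
codebook = [(0.0, 0.0), (1.0, 1.0)]
clip_a = [(0.1, 0.0), (0.9, 1.0), (0.0, 0.1), (1.0, 0.9)]
clip_b = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.0), (1.0, 0.9)]

labels_a = quantize(clip_a, codebook)
labels_b = quantize(clip_b, codebook)

hist_a = bag_of_codewords(labels_a, len(codebook))
hist_b = bag_of_codewords(labels_b, len(codebook))
bigrams_a = ngram_counts(labels_a)
bigrams_b = ngram_counts(labels_b)
```

In this toy case the two histograms are identical while the bigram counts differ, so only the sequence-aware representation distinguishes the clips.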

Index Terms: multimedia analysis, semantic labels, unsupervised lexicon learning, audiovisual data retrieval


Bibliographic reference. Chaudhuri, Sourish / Singh, Rita / Raj, Bhiksha (2012): "Exploiting temporal sequence structure for semantic analysis of multimedia", in INTERSPEECH-2012, 1728-1731.