13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Goal-Oriented Auditory Scene Recognition

Kailash Patil, Mounya Elhilali

Center for Speech and Language Processing, Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA

How do we understand and interpret complex auditory environments in a way that may depend on stated goals or intentions? Here, we propose a framework that provides a detailed analysis of the spectrotemporal modulations in the acoustic signal, augmented with a discriminative classifier based on multilayer perceptrons. We show that this representation captures the non-trivial commonalities within a sound class as well as the differences between classes. The system not only surpasses the performance of current systems in the literature by about 21%, but also proves robust when processing multi-source scenes. In addition, we test the role of feature re-weighting in improving feature selectivity and signal-to-noise ratio in the direction of a sound class of interest.
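The abstract does not specify how features are re-weighted toward a sound class of interest. The sketch below is purely illustrative and assumes a simple scheme that is not taken from the paper: each feature dimension receives a gain proportional to its Fisher ratio (target class vs. all other classes), with gains normalized to average 1. The toy two-dimensional vectors stand in for spectrotemporal modulation features, which are not computed here. The function names `target_weights` and `reweight` are hypothetical.

```python
from statistics import mean, pvariance


def target_weights(features, labels, target, eps=1e-9):
    """Hypothetical goal-directed gains: one weight per feature dimension.

    Dimensions that better separate `target` from the remaining classes
    (higher Fisher ratio) get larger weights. Weights are scaled so their
    mean is 1, i.e. they sum to the number of dimensions.
    """
    tgt = [f for f, y in zip(features, labels) if y == target]
    rest = [f for f, y in zip(features, labels) if y != target]
    if not tgt or not rest:
        raise ValueError("need samples from both the target and other classes")

    dims = len(features[0])
    raw = []
    for d in range(dims):
        t = [f[d] for f in tgt]
        r = [f[d] for f in rest]
        # Fisher ratio: squared mean difference over summed within-class variance.
        raw.append((mean(t) - mean(r)) ** 2 / (pvariance(t) + pvariance(r) + eps))

    total = sum(raw)
    if total == 0:
        return [1.0] * dims  # no discriminative dimension: leave features unchanged
    return [dims * w / total for w in raw]


def reweight(vec, weights):
    """Apply per-dimension gains to a single feature vector."""
    return [x * w for x, w in zip(vec, weights)]


if __name__ == "__main__":
    # Toy data: dimension 0 separates "speech" from "music"; dimension 1 barely does.
    feats = [[1.0, 0.0], [1.2, 0.1], [0.0, 0.05], [0.1, 0.0]]
    labs = ["speech", "speech", "music", "music"]
    w = target_weights(feats, labs, "speech")
    print(w)
    print(reweight([0.5, 0.5], w))
```

With a "speech" goal, nearly all of the gain moves onto the first dimension, which is the one that distinguishes the target class. A real system would estimate such gains on training data and apply them before the classifier.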

Index Terms: scene understanding, acoustic event recognition, attention, bottom-up, top-down


Bibliographic reference. Patil, Kailash / Elhilali, Mounya (2012): "Goal-oriented auditory scene recognition", In INTERSPEECH-2012, 2510-2513.