13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Combining Frame and Segment Based Models for Environmental Sound Classification

Pengfei Hu, Wenju Liu, Wei Jiang

National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China

This paper considers the task of recognizing environmental sounds, which plays a critical role in human perception of the auditory context in audiovisual materials. A variety of features have been proposed for audio recognition, either frame-based or segmental. Here, we propose a two-stage framework that combines modeling at these two levels. First, Gaussian Mixture Models (GMMs) are built on short-term features and a pre-classification is performed. Then, if the GMMs are not certain about the result, the system engages Support Vector Machines (SVMs) to refine the output hypothesis. In this second stage, the features are combined by taking the posterior estimates of the GMMs together with segmental features as the SVMs' input. Experiments on the sound dataset show that the proposed framework improves on the traditional methods.
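The control flow of the two-stage framework can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the GMMs' certainty is measured, so the sketch assumes equal class priors, frame-averaged log-likelihoods, and a maximum-posterior threshold. The names `class_posteriors`, `classify_two_stage`, `second_stage`, and `threshold` are hypothetical, and the second-stage classifier is passed in as a plain callable standing in for a trained SVM.

```python
import math
from typing import Callable, Dict, List, Sequence


def class_posteriors(frame_loglik: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Convert per-frame log-likelihoods under each class GMM into
    segment-level posteriors.

    Assumption (not from the paper): equal class priors, and the
    log-likelihood is averaged over the frames of the segment.
    """
    avg = {c: sum(v) / len(v) for c, v in frame_loglik.items()}
    peak = max(avg.values())  # subtract the maximum for numerical stability
    unnorm = {c: math.exp(a - peak) for c, a in avg.items()}
    total = sum(unnorm.values())
    return {c: u / total for c, u in unnorm.items()}


def classify_two_stage(
    frame_loglik: Dict[str, Sequence[float]],
    segment_features: Sequence[float],
    second_stage: Callable[[List[float]], str],
    threshold: float = 0.8,  # hypothetical confidence threshold
) -> str:
    """Stage 1: accept the GMM decision when it is confident.
    Stage 2: otherwise pass GMM posteriors plus segmental features
    to the second-stage classifier (an SVM in the paper)."""
    post = class_posteriors(frame_loglik)
    best = max(post, key=post.get)
    if post[best] >= threshold:
        return best
    # Fixed class order so the combined vector has a stable layout.
    combined = [post[c] for c in sorted(post)] + list(segment_features)
    return second_stage(combined)
```

In this sketch a segment whose best posterior reaches `threshold` never touches the second stage, so the cost of the second classifier is paid only for ambiguous segments. The value 0.8 is a placeholder and would need tuning on held-out data.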

Index Terms: environmental sound classification, model combination, GMMs, SVMs

Bibliographic reference. Hu, Pengfei / Liu, Wenju / Jiang, Wei (2012): "Combining frame and segment based models for environmental sound classification", in INTERSPEECH-2012, 2502-2505.