Subspace Pooling Based Temporal Features Extraction for Audio Event Recognition

Qiuying Shi, Hui Luo, Jiqing Han

Most current methods for Audio Event Recognition (AER) first split an audio event signal into multiple short segments and then pool the features of these segments for recognition. However, the temporal features between segments, which strongly affect the semantic representation of the signal, are usually discarded in this pooling step. How to introduce temporal features into the pooling step therefore requires further investigation. So far, only a few studies have addressed this problem. Moreover, effective temporal features should not only capture the temporal dynamics but also be able to reconstruct the signal, whereas most of these studies focus on the former and ignore the latter. In addition, the effective features of high-dimensional original signals usually lie in a low-dimensional subspace. We therefore propose two novel pooling-based methods that aim to account for both the temporal dynamics and the signal reconstruction ability of temporal features in the low-dimensional subspace. The proposed methods are evaluated on the AudioEvent database, and experimental results show that they outperform most of the typical methods.

DOI: 10.21437/Interspeech.2019-2047

Cite as: Shi, Q., Luo, H., Han, J. (2019) Subspace Pooling Based Temporal Features Extraction for Audio Event Recognition. Proc. Interspeech 2019, 3850-3854, DOI: 10.21437/Interspeech.2019-2047.

  author={Qiuying Shi and Hui Luo and Jiqing Han},
  title={{Subspace Pooling Based Temporal Features Extraction for Audio Event Recognition}},
  booktitle={Proc. Interspeech 2019},