Temporal Attentive Pooling for Acoustic Event Detection

Xugang Lu, Peng Shen, Sheng Li, Yu Tsao, Hisashi Kawai

Deep convolutional neural network (DCNN) based models have been successfully applied to acoustic event detection (AED) because they efficiently exploit temporal-frequency structure for feature representation. In most studies, the final representation is obtained with a temporal average- or max-pooling algorithm that accumulates local temporal features into a global representation for event classification. Temporal pooling in the DCNN rests on the assumption that the target label applies to all temporal locations (average pooling) or to only the one temporal location with the maximum response (max-pooling). However, acoustic event labels are holistic descriptions at the semantic level, and it is difficult or even impossible to decide which temporal locations contribute features to the perception of an event. In this study, we propose a weighted temporal-pooling algorithm to accumulate local temporal features for AED. The pooling algorithm uses global and local attention modules in a convolutional recurrent neural network to integrate temporal features. Experiments on an AED task were carried out to evaluate the proposed model. The results showed that the global and local attention modules yielded a large gain.
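The contrast between average pooling, max-pooling, and weighted temporal pooling can be illustrated with a minimal NumPy sketch. This is not the model from the paper: the abstract does not specify how the global and local attention modules are built, so the sketch assumes a single generic softmax attention over frames. The function names and the scoring vector `w` are hypothetical and introduced only for illustration.

```python
import numpy as np

def average_pool(h):
    """Average pooling: every frame contributes equally. h has shape (T, D)."""
    return h.mean(axis=0)

def max_pool(h):
    """Max-pooling: each feature dimension keeps only its maximum response."""
    return h.max(axis=0)

def attentive_pool(h, w):
    """Weighted temporal pooling with softmax attention (illustrative assumption).

    h: (T, D) local temporal features.
    w: (D,) hypothetical scoring vector; in a trained model it would be learned.
    Returns the pooled (D,) vector and the (T,) frame weights.
    """
    scores = h @ w                      # one scalar score per frame
    scores = scores - scores.max()      # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # weights sum to 1
    return alpha @ h, alpha             # weighted sum over frames

# Toy input: 3 frames, 2 feature dimensions.
h = np.array([[1.0, 0.0],
              [3.0, 2.0],
              [2.0, 4.0]])

# With a zero scoring vector the weights are uniform, so the result
# coincides with average pooling.
pooled_uniform, alpha_uniform = attentive_pool(h, np.zeros(2))

# With a very sharp scoring vector almost all weight falls on one frame,
# which approaches selecting a single temporal location.
pooled_sharp, alpha_sharp = attentive_pool(h, np.array([50.0, 0.0]))
```

In this sketch, average pooling and single-frame selection appear as the two limiting cases of the weighting. A learned weighting can settle anywhere between them, which is the motivation the abstract gives for replacing fixed pooling rules.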

DOI: 10.21437/Interspeech.2018-1552

Cite as: Lu, X., Shen, P., Li, S., Tsao, Y., Kawai, H. (2018) Temporal Attentive Pooling for Acoustic Event Detection. Proc. Interspeech 2018, 1354-1357, DOI: 10.21437/Interspeech.2018-1552.

  author={Xugang Lu and Peng Shen and Sheng Li and Yu Tsao and Hisashi Kawai},
  title={Temporal Attentive Pooling for Acoustic Event Detection},
  booktitle={Proc. Interspeech 2018},