ATReSN-Net: Capturing Attentive Temporal Relations in Semantic Neighborhood for Acoustic Scene Classification

Liwen Zhang, Jiqing Han, Ziqiang Shi


Convolutional Neural Networks (CNNs) have been widely investigated for Acoustic Scene Classification (ASC), where the convolutional operation extracts useful semantic content from a local receptive field in the input spectrogram within a certain Manhattan distance, i.e., the kernel size. Although stacking multiple convolution layers can enlarge the receptive field, without explicitly considering the temporal relations among different receptive fields, the enlarged range remains limited around the kernel. In this paper, we propose a 3D CNN for ASC, named ATReSN-Net, which can capture temporal relations among receptive fields at arbitrary time-frequency locations by mapping the semantic features obtained from the residual block into a semantic space. The ATReSN module has two primary components: a k-NN-based grouper that gathers a semantic neighborhood for each feature point in the feature maps, and an attentive pooling-based temporal relations aggregator that generates the temporal relations embedding of each feature point and its neighborhood. Experiments showed that ATReSN-Net outperforms most state-of-the-art CNN models. Our code is shared at ATReSN-Net.
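The following is a minimal PyTorch sketch of the two components named in the abstract, k-NN grouping in a semantic space followed by attentive pooling over each neighborhood. It is not the authors' released implementation: the module name ATReSNSketch, the linear attention scorer, the neighborhood size k, and the tensor shapes are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ATReSNSketch(nn.Module):
    """Illustrative sketch of the ATReSN module: a k-NN grouper over a
    semantic space, followed by attentive pooling over each neighborhood."""

    def __init__(self, channels: int, k: int = 8):
        super().__init__()
        self.k = k
        # Hypothetical attention scorer; the paper's exact form may differ.
        self.score = nn.Linear(channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, F) semantic feature maps from a residual block.
        b, c, t, f = x.shape
        pts = x.flatten(2).transpose(1, 2)              # (B, N, C), N = T*F
        # Pairwise distances in the semantic space; each point's k nearest
        # neighbors form its semantic neighborhood (the point itself included).
        dist = torch.cdist(pts, pts)                    # (B, N, N)
        idx = dist.topk(self.k, largest=False).indices  # (B, N, k)
        nbrs = torch.gather(
            pts.unsqueeze(1).expand(b, t * f, -1, -1),  # (B, N, N, C)
            2,
            idx.unsqueeze(-1).expand(-1, -1, -1, c),    # (B, N, k, C)
        )
        # Attentive pooling: softmax-weighted aggregation over the neighborhood.
        w = F.softmax(self.score(nbrs), dim=2)          # (B, N, k, 1)
        agg = (w * nbrs).sum(dim=2)                     # (B, N, C)
        return agg.transpose(1, 2).reshape(b, c, t, f)  # back to (B, C, T, F)
```

As a shape check, `ATReSNSketch(64)(torch.randn(4, 64, 32, 16))` returns a tensor of the same (B, C, T, F) shape, so the module can be dropped between convolutional stages.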


 DOI: 10.21437/Interspeech.2020-1151

Cite as: Zhang, L., Han, J., Shi, Z. (2020) ATReSN-Net: Capturing Attentive Temporal Relations in Semantic Neighborhood for Acoustic Scene Classification. Proc. Interspeech 2020, 1181-1185, DOI: 10.21437/Interspeech.2020-1151.


@inproceedings{Zhang2020,
  author={Liwen Zhang and Jiqing Han and Ziqiang Shi},
  title={{ATReSN-Net: Capturing Attentive Temporal Relations in Semantic Neighborhood for Acoustic Scene Classification}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1181--1185},
  doi={10.21437/Interspeech.2020-1151},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1151}
}