Memory Controlled Sequential Self Attention for Sound Recognition

Arjun Pankajakshan, Helen L. Bear, Vinod Subramanian, Emmanouil Benetos


In this paper we investigate the importance of the extent of memory in sequential self-attention for sound recognition. We propose a memory-controlled sequential self-attention mechanism on top of a convolutional recurrent neural network (CRNN) model for polyphonic sound event detection (SED). Experiments on the URBAN-SED dataset demonstrate the impact of the extent of memory on the recognition performance of the self-attention-based SED model. We extend the proposed idea with a multi-head self-attention mechanism in which each attention head processes the audio embedding with an explicit attention width value. The proposed memory-controlled sequential self-attention offers a way to induce relations among frames of sound event tokens. We show that our memory-controlled self-attention model achieves an event-based F-score of 33.92% on the URBAN-SED dataset, outperforming the 20.10% F-score of the same model without self-attention.
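
The abstract page does not include code, so the following is a minimal PyTorch sketch of how a memory-controlled (windowed) self-attention layer with explicit per-head attention widths could sit on top of frame-level CRNN embeddings. All names (memory_controlled_self_attention, attention_width, the example widths) and the distance-based masking strategy are illustrative assumptions, not the authors' implementation.

import torch


def memory_controlled_self_attention(x, attention_width):
    """Windowed (memory-controlled) dot-product self-attention over frames.

    x: (batch, time, dim) frame-level embeddings, e.g. CRNN outputs.
    attention_width: how many neighbouring frames each frame may attend to.
    """
    b, t, d = x.shape
    # Scaled dot-product similarity between every pair of frames.
    scores = torch.matmul(x, x.transpose(1, 2)) / d ** 0.5  # (b, t, t)

    # Mask out frame pairs farther apart than the attention width,
    # so each frame only "remembers" a limited temporal context.
    idx = torch.arange(t, device=x.device)
    dist = (idx[None, :] - idx[:, None]).abs()  # (t, t) frame distances
    scores = scores.masked_fill(dist > attention_width, float('-inf'))

    weights = torch.softmax(scores, dim=-1)  # attention over allowed frames
    return torch.matmul(weights, x)          # (b, t, d) attended embeddings


def multi_width_attention(x, widths=(4, 16, 64)):
    """Multi-head variant: each head uses its own explicit attention width."""
    heads = [memory_controlled_self_attention(x, w) for w in widths]
    return torch.cat(heads, dim=-1)  # concatenate head outputs


if __name__ == "__main__":
    frames = torch.randn(2, 100, 32)  # (batch, time, dim) dummy embeddings
    out = multi_width_attention(frames)
    print(out.shape)  # torch.Size([2, 100, 96])

A larger attention width lets each frame attend further along the sequence, while concatenating heads with different widths mirrors the multi-head extension described in the abstract, where each head sees the audio embedding at a different temporal extent.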


DOI: 10.21437/Interspeech.2020-1953

Cite as: Pankajakshan, A., Bear, H.L., Subramanian, V., Benetos, E. (2020) Memory Controlled Sequential Self Attention for Sound Recognition. Proc. Interspeech 2020, 831-835, DOI: 10.21437/Interspeech.2020-1953.


@inproceedings{Pankajakshan2020,
  author={Arjun Pankajakshan and Helen L. Bear and Vinod Subramanian and Emmanouil Benetos},
  title={{Memory Controlled Sequential Self Attention for Sound Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={831--835},
  doi={10.21437/Interspeech.2020-1953},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1953}
}