Gated Multi-Head Attention Pooling for Weakly Labelled Audio Tagging

Sixin Hong, Yuexian Zou, Wenwu Wang


Multiple instance learning (MIL) has recently been used for weakly labelled audio tagging, where the spectrogram of an audio signal is divided into segments that form the instances in a bag, and the low-dimensional features of these segments are then pooled for tagging. The choice of pooling scheme is key to exploiting weakly labelled data. However, traditional pooling schemes are usually fixed and cannot distinguish the contributions of individual segments, which makes it difficult for them to adapt to the characteristics of sound events. In this paper, a novel pooling algorithm for MIL is proposed, named gated multi-head attention pooling (GMAP), which is able to attend to the information of events from different heads at different positions. Each head allows the model to learn information from a different representation subspace. Furthermore, to avoid redundancy in the multi-head information, a gating mechanism is used to fuse the individual head features. The proposed GMAP increases the modeling power of single-head attention with no computational overhead. Experiments are carried out on AudioSet, a large-scale weakly labelled dataset, and show results superior to those of the non-adaptive pooling and vanilla attention pooling schemes.
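
The abstract does not give the equations, so the following is only a minimal NumPy sketch of one plausible reading of the idea: each head computes its own attention weights over the segments of a bag and pools segment-level class scores, and a gate then fuses the per-head results. All names (`gated_multihead_attention_pool`, `W_att`, `W_cls`, `W_gate`), the tensor shapes, and the choice of a softmax gate computed from the mean segment feature are assumptions made for illustration, not the authors' formulation.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def make_params(rng, dim, n_classes, n_heads):
    # Hypothetical parameter layout: one attention and one classifier
    # projection per head, plus a gate projection that scores the heads.
    return {
        "W_att": 0.1 * rng.standard_normal((n_heads, dim, n_classes)),
        "W_cls": 0.1 * rng.standard_normal((n_heads, dim, n_classes)),
        "W_gate": 0.1 * rng.standard_normal((dim, n_heads)),
    }


def gated_multihead_attention_pool(segments, params):
    """Pool one bag of segment features into clip-level tag probabilities.

    segments: array of shape (T, D), one row per spectrogram segment.
    Returns (probs, gate): probs has shape (C,), gate has shape (H,).
    """
    # Per-head attention over the T segments, separately for each class.
    att = softmax(segments @ params["W_att"], axis=1)   # (H, T, C)
    # Per-head segment-level class probabilities.
    cls = sigmoid(segments @ params["W_cls"])           # (H, T, C)
    # Attention-weighted pooling over segments, one result per head.
    heads = (att * cls).sum(axis=1)                     # (H, C)
    # Assumed gate: softmax over heads, driven by the mean segment feature.
    gate = softmax(segments.mean(axis=0) @ params["W_gate"], axis=0)  # (H,)
    # Fuse the heads as a convex combination.
    probs = gate @ heads                                # (C,)
    return probs, gate


# Example bag: 10 segments with 8-dimensional features, 5 tags, 4 heads.
rng = np.random.default_rng(0)
params = make_params(rng, dim=8, n_classes=5, n_heads=4)
bag = rng.standard_normal((10, 8))
probs, gate = gated_multihead_attention_pool(bag, params)
```

In this sketch both the attention weights and the gate are convex weights, so each clip-level output stays strictly between 0 and 1, and the result does not depend on the order of the segments in the bag.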


DOI: 10.21437/Interspeech.2020-1197

Cite as: Hong, S., Zou, Y., Wang, W. (2020) Gated Multi-Head Attention Pooling for Weakly Labelled Audio Tagging. Proc. Interspeech 2020, 816-820, DOI: 10.21437/Interspeech.2020-1197.


@inproceedings{Hong2020,
  author={Sixin Hong and Yuexian Zou and Wenwu Wang},
  title={{Gated Multi-Head Attention Pooling for Weakly Labelled Audio Tagging}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={816--820},
  doi={10.21437/Interspeech.2020-1197},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1197}
}