Two-Stage Polyphonic Sound Event Detection Based on Faster R-CNN-LSTM with Multi-Token Connectionist Temporal Classification

Inyoung Park, Hong Kook Kim


We propose a two-stage sound event detection (SED) model to deal with sound events that overlap in time-frequency. In the first stage, which consists of a faster R-CNN and an attention-LSTM, each log-mel spectrogram segment is divided into one or more proposed regions (PRs) according to the coordinates produced by a region proposal network. To efficiently train on polyphonic sound, we take only one PR for each sound event from a bounding-box regressor associated with the attention-LSTM. In the second stage, the original input image and the difference image between adjacent segments are separately pooled according to the coordinates of each PR predicted in the first stage. Then, the two feature maps extracted by CNNs are concatenated and processed further by an LSTM. Finally, CTC-based n-best SED is conducted using the softmax output from the CNN-LSTM, where the CTC has two tokens per event so that the start and end time frames are accurately detected. Experiments on SED using DCASE 2019 Task 3 show that the proposed two-stage model with multi-token CTC achieves an F1-score of 97.5%, while the first stage alone and the two-stage model with a conventional CTC yield F1-scores of 91.9% and 95.6%, respectively.
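The multi-token CTC idea can be illustrated with a minimal sketch of how target sequences might be built: each event class gets two CTC tokens, one marking its onset and one its offset, so that decoding recovers both boundaries. The token numbering below (class c mapped to onset token 2c+1 and offset token 2c+2, with 0 reserved for the CTC blank) is an assumption for illustration, not the paper's exact scheme.

```python
# Hypothetical sketch of multi-token CTC target construction:
# each sound-event class c is assigned two tokens, an onset token and an
# offset token, so the CTC decoder can localize both event boundaries.
# Token ids are assumed: onset = 2*c + 1, offset = 2*c + 2, blank = 0.

BLANK = 0

def onset_token(cls):
    return 2 * cls + 1

def offset_token(cls):
    return 2 * cls + 2

def multi_token_targets(events):
    """Convert [(class, onset_frame, offset_frame), ...] into a CTC target
    sequence of boundary tokens ordered by the frame at which they occur.
    Overlapping (polyphonic) events simply interleave their tokens."""
    boundaries = []
    for cls, onset, offset in events:
        boundaries.append((onset, onset_token(cls)))
        boundaries.append((offset, offset_token(cls)))
    boundaries.sort(key=lambda b: b[0])
    return [tok for _, tok in boundaries]

# Two overlapping events: class 0 active over frames [2, 10),
# class 1 active over frames [5, 8); their boundary tokens interleave.
print(multi_token_targets([(0, 2, 10), (1, 5, 8)]))  # [1, 3, 4, 2]
```

With a conventional single-token CTC, only one label per event would appear in the target sequence, leaving the onset/offset frames to be inferred indirectly; the two-token scheme makes both boundaries explicit supervision targets.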


DOI: 10.21437/Interspeech.2020-3097

Cite as: Park, I., Kim, H.K. (2020) Two-Stage Polyphonic Sound Event Detection Based on Faster R-CNN-LSTM with Multi-Token Connectionist Temporal Classification. Proc. Interspeech 2020, 856-860, DOI: 10.21437/Interspeech.2020-3097.


@inproceedings{Park2020,
  author={Inyoung Park and Hong Kook Kim},
  title={{Two-Stage Polyphonic Sound Event Detection Based on Faster R-CNN-LSTM with Multi-Token Connectionist Temporal Classification}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={856--860},
  doi={10.21437/Interspeech.2020-3097},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3097}
}