Detecting Media Sound Presence in Acoustic Scenes

Constantinos Papayiannis, Justice Amoh, Viktor Rozgic, Shiva Sundaram, Chao Wang

Using speech to interact with electronic devices and access services is becoming increasingly common. Deploying such applications in our households poses new challenges for speech and audio processing algorithms, which must perform robustly across a range of scenarios. Media devices are very commonly present in these scenarios and can interfere with user-device communication, either by contributing to the background noise or by being mistaken for user-issued voice commands. Detecting the presence of media sounds in the environment can help avoid such issues. In this work we propose a method for this task based on a parallel CNN-GRU-FC classifier architecture, which relies on multi-channel information to discriminate between media and live sources. Experiments performed using 378 hours of in-house audio recordings collected by volunteers show an F1 score of 71% with a recall of 72% in detecting active media sources. The use of information from multiple channels gave a relative improvement of 16% in F1 score compared to using information from only a single channel.
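The abstract describes a parallel CNN-GRU-FC architecture operating on multi-channel audio. The sketch below is a hypothetical illustration of that topology in PyTorch, not the authors' implementation: each input channel's time-frequency features pass through its own small CNN branch, the branch outputs are concatenated per time frame, a GRU summarizes the sequence, and fully connected layers produce a media-present/media-absent decision. All layer sizes, the feature dimensions (`n_mels`, time frames), and the channel count are assumptions for illustration.

```python
import torch
import torch.nn as nn


class ParallelCNNGRUFC(nn.Module):
    """Illustrative parallel CNN-GRU-FC binary classifier (sketch only).

    One CNN branch per microphone channel, run in parallel; branch outputs
    are concatenated frame-by-frame, a GRU summarizes them over time, and
    FC layers classify. Layer sizes are hypothetical, not from the paper.
    """

    def __init__(self, n_channels=4, n_mels=40, cnn_out=16, gru_hidden=64):
        super().__init__()
        # One small CNN branch per input channel (parallel front end).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, cnn_out, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d((2, 1)),  # pool over frequency, keep time axis
            )
            for _ in range(n_channels)
        ])
        feat_dim = cnn_out * (n_mels // 2)  # filters x pooled frequency bins
        self.gru = nn.GRU(n_channels * feat_dim, gru_hidden, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(gru_hidden, 32),
            nn.ReLU(),
            nn.Linear(32, 2),  # logits: media present vs. absent
        )

    def forward(self, x):
        # x: (batch, channels, n_mels, time)
        outs = []
        for c, branch in enumerate(self.branches):
            h = branch(x[:, c:c + 1])            # (batch, cnn_out, n_mels//2, time)
            b, f, m, t = h.shape
            outs.append(h.reshape(b, f * m, t))  # flatten filters x frequency
        h = torch.cat(outs, dim=1).transpose(1, 2)  # (batch, time, features)
        _, last = self.gru(h)                    # final GRU hidden state
        return self.fc(last.squeeze(0))          # (batch, 2) logits


model = ParallelCNNGRUFC()
logits = model(torch.randn(2, 4, 40, 100))  # 2 clips, 4 channels, 40 mels, 100 frames
print(logits.shape)
```

Running the parallel channel branches separately, rather than stacking the channels into a single convolution, is one way to let the classifier exploit inter-channel differences between diffuse live sources and point-like media playback.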

DOI: 10.21437/Interspeech.2018-2559

Cite as: Papayiannis, C., Amoh, J., Rozgic, V., Sundaram, S., Wang, C. (2018) Detecting Media Sound Presence in Acoustic Scenes. Proc. Interspeech 2018, 1363-1367, DOI: 10.21437/Interspeech.2018-2559.

@inproceedings{papayiannis18_interspeech,
  author={Constantinos Papayiannis and Justice Amoh and Viktor Rozgic and Shiva Sundaram and Chao Wang},
  title={Detecting Media Sound Presence in Acoustic Scenes},
  booktitle={Proc. Interspeech 2018},
  year={2018},
  pages={1363--1367},
  doi={10.21437/Interspeech.2018-2559}
}