Dual Stage Learning Based Dynamic Time-Frequency Mask Generation for Audio Event Classification

Donghyeon Kim, Jaihyun Park, David K. Han, Hanseok Ko


Audio-based event recognition becomes quite challenging in real-world noisy environments. To alleviate the noise issue, time-frequency mask based feature enhancement methods have been proposed. While these methods with fixed filter settings have been shown to be effective in familiar noise backgrounds, they become brittle when exposed to unexpected noise. To address the unknown-noise problem, we develop an approach based on dynamic filter generation learning. In particular, we propose a dual stage dynamic filter generator network that can be trained to generate a time-frequency mask specifically created for each input audio. Two alternative approaches to training the mask generator network are developed for feature enhancement in high-noise environments. Our proposed method shows improved performance and robustness in both clean and unseen noise environments.
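To make the core operation concrete, the following is a minimal NumPy sketch of input-dependent time-frequency masking: a toy per-bin generator produces a soft mask in (0, 1) from the input spectrogram, which is then applied element-wise. This is only an illustration of the masking mechanism; the generator here (a single affine transform plus sigmoid, with hypothetical weights `w` and `b`) stands in for the paper's trained dual stage generator network and is not its actual architecture.

```python
import numpy as np

def generate_tf_mask(spec, w=2.0, b=-1.0):
    """Toy dynamic mask generator.

    spec: (freq, time) non-negative magnitude spectrogram.
    w, b: hypothetical per-bin affine parameters; in the paper these
    would be replaced by a learned generator network conditioned on
    the input audio.
    """
    logits = w * spec + b                   # input-dependent logits per T-F bin
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid -> soft mask in (0, 1)

def enhance(spec):
    """Apply the dynamically generated mask element-wise to the features."""
    mask = generate_tf_mask(spec)
    return mask * spec                      # attenuates low-energy (noisy) bins

# Stand-in for an STFT / log-mel magnitude spectrogram (64 bins x 100 frames).
rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((64, 100)))
enhanced = enhance(spec)
```

Because the sigmoid mask lies strictly in (0, 1), the enhanced spectrogram never exceeds the original magnitude in any bin; a trained generator would learn to push the mask toward 1 on event-dominated bins and toward 0 on noise-dominated ones.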


DOI: 10.21437/Interspeech.2020-2152

Cite as: Kim, D., Park, J., Han, D.K., Ko, H. (2020) Dual Stage Learning Based Dynamic Time-Frequency Mask Generation for Audio Event Classification. Proc. Interspeech 2020, 836-840, DOI: 10.21437/Interspeech.2020-2152.


@inproceedings{Kim2020,
  author={Donghyeon Kim and Jaihyun Park and David K. Han and Hanseok Ko},
  title={{Dual Stage Learning Based Dynamic Time-Frequency Mask Generation for Audio Event Classification}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={836--840},
  doi={10.21437/Interspeech.2020-2152},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2152}
}