Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion

Jeno Szep, Salim Hariri


In this study, we address the ComParE 2020 Paralinguistics Mask sub-challenge, where the task is to detect, from short speech segments, whether the speaker is wearing a surgical mask. In our approach, we propose a computer-vision-based pipeline that leverages the deep convolutional neural network image classifiers developed in recent years and applies this technology to a specific class of spectrograms. Several linear- and logarithmic-scale spectrograms were tested, and the best performance is achieved on linear-scale, 3-channel spectrograms created from the audio segments. A single-model image classifier provided a 6.1% better result than the best single-dataset baseline model. An ensemble of our models further improves accuracy: it achieves 73.0% UAR when trained only on the 'train' dataset and reaches 80.1% UAR on the test set when training also includes the 'devel' dataset, a result 8.3% higher than the baseline. We also provide an activation-mapping analysis to identify frequency ranges that are critical in the 'mask' versus 'clear' classification.
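To make the front end of such a pipeline concrete, the sketch below computes a linear-scale magnitude spectrogram and stacks it with its first and second temporal differences to form a 3-channel "image". This is only a minimal, pure-Python illustration of the general idea; the channel construction (magnitude plus deltas) is a hypothetical choice, not the authors' exact feature extraction.

```python
import math

def stft_magnitude(samples, frame_len=64, hop=32):
    """Naive DFT-based magnitude spectrogram on a linear frequency scale."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Hann window to reduce spectral leakage at frame edges
        windowed = [x * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_len - 1)))
                    for n, x in enumerate(frame)]
        mags = []
        for k in range(frame_len // 2 + 1):  # keep non-negative frequencies only
            re = sum(x * math.cos(-2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(windowed))
            im = sum(x * math.sin(-2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(windowed))
            mags.append(math.hypot(re, im))
        frames.append(mags)
    return frames  # shape: [num_frames][num_bins]

def three_channel(spec):
    """Stack the spectrogram with its first and second temporal differences.

    This yields an RGB-like 3-tuple per time-frequency bin, one hypothetical
    way to build the 3-channel input an image classifier expects.
    """
    def diff(a, b):
        return [x - y for x, y in zip(a, b)]
    image = []
    for t, frame in enumerate(spec):
        prev = spec[t - 1] if t > 0 else frame
        prev2 = spec[t - 2] if t > 1 else prev
        d1 = diff(frame, prev)                  # first temporal difference
        d2 = diff(d1, diff(prev, prev2))        # second temporal difference
        image.append(list(zip(frame, d1, d2)))
    return image  # shape: [num_frames][num_bins][3]
```

For a pure sinusoid at 8 cycles per 64-sample frame, the magnitude peak falls in DFT bin 8, confirming the linear frequency mapping; in practice a library STFT (e.g., scipy or librosa) would replace the naive DFT loop for speed.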


DOI: 10.21437/Interspeech.2020-2857

Cite as: Szep, J., Hariri, S. (2020) Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion. Proc. Interspeech 2020, 2087-2091, DOI: 10.21437/Interspeech.2020-2857.


@inproceedings{Szep2020,
  author={Jeno Szep and Salim Hariri},
  title={{Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={2087--2091},
  doi={10.21437/Interspeech.2020-2857},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2857}
}