Visual Speech In Real Noisy Environments (VISION): A Novel Benchmark Dataset and Deep Learning-Based Baseline System

Mandar Gogate, Kia Dashtipour, Amir Hussain


In this paper, we present VIsual Speech In real nOisy eNvironments (VISION), a first-of-its-kind audio-visual (AV) corpus comprising 2500 utterances from 209 speakers, recorded in real noisy environments including social gatherings, streets, cafeterias and restaurants. While a number of speech enhancement frameworks that exploit AV cues have been proposed in the literature, there are no visual speech corpora recorded in real environments with a sufficient variety of speakers to enable evaluation of AV frameworks’ generalisation capability across a wide range of background visual and acoustic noises. The main purpose of our AV corpus is to foster research in the area of AV signal processing and to provide a benchmark corpus that can be used for reliable evaluation of AV speech enhancement systems in everyday noisy settings. In addition, we present a baseline deep neural network (DNN) based spectral mask estimation model for speech enhancement. Comparative simulation results with subjective listening tests demonstrate significant performance improvement of the baseline DNN compared to state-of-the-art speech enhancement approaches.


DOI: 10.21437/Interspeech.2020-2935

Cite as: Gogate, M., Dashtipour, K., Hussain, A. (2020) Visual Speech In Real Noisy Environments (VISION): A Novel Benchmark Dataset and Deep Learning-Based Baseline System. Proc. Interspeech 2020, 4521-4525, DOI: 10.21437/Interspeech.2020-2935.


@inproceedings{Gogate2020,
  author={Mandar Gogate and Kia Dashtipour and Amir Hussain},
  title={{Visual Speech In Real Noisy Environments (VISION): A Novel Benchmark Dataset and Deep Learning-Based Baseline System}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4521--4525},
  doi={10.21437/Interspeech.2020-2935},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2935}
}