NTCD-TIMIT: A New Database and Baseline for Noise-Robust Audio-Visual Speech Recognition

Ahmed Hussen Abdelaziz


Although audio-visual speech is well known to improve the robustness of automatic speech recognition (ASR) systems against noise, the realm of audio-visual ASR (AV-ASR) has not gathered the research momentum it deserves. This is mainly due to the lack of audio-visual corpora and the need to combine two fields of knowledge: ASR and computer vision. This paper describes the NTCD-TIMIT database and baseline, which can overcome these two barriers and attract more research interest to AV-ASR. The NTCD-TIMIT corpus has been created by adding six noise types at a range of signal-to-noise ratios to the speech material of the recently published TCD-TIMIT corpus. NTCD-TIMIT comprises visual features that have been extracted from the TCD-TIMIT video recordings using the visual front-end presented in this paper. The database also contains Kaldi scripts for training and decoding audio-only, video-only, and audio-visual ASR models. The baseline experiments and results obtained using these scripts are detailed in this paper.
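The corpus-creation step described above, mixing noise into clean speech at a target signal-to-noise ratio, can be sketched as follows. This is a minimal illustration of the general technique, not the exact procedure used to build NTCD-TIMIT; the function name and the noise-tiling policy are assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that adding it to `speech` yields the target SNR (dB).

    Hypothetical helper illustrating SNR-controlled mixing; the actual
    NTCD-TIMIT pipeline may differ in details (e.g. noise segment selection).
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Tile or truncate the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Average power of each signal.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)

    # Solve SNR_dB = 10 * log10(p_speech / (g^2 * p_noise)) for the gain g.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Applying this per utterance for each of the six noise types and each SNR level would produce the kind of noisy parallel corpus the paper describes.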


DOI: 10.21437/Interspeech.2017-860

Cite as: Abdelaziz, A.H. (2017) NTCD-TIMIT: A New Database and Baseline for Noise-Robust Audio-Visual Speech Recognition. Proc. Interspeech 2017, 3752-3756, DOI: 10.21437/Interspeech.2017-860.


@inproceedings{Abdelaziz2017,
  author={Ahmed Hussen Abdelaziz},
  title={NTCD-TIMIT: A New Database and Baseline for Noise-Robust Audio-Visual Speech Recognition},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3752--3756},
  doi={10.21437/Interspeech.2017-860},
  url={http://dx.doi.org/10.21437/Interspeech.2017-860}
}