Speaker Adaptive Audio-Visual Fusion for the Open-Vocabulary Section of AVICAR

Leda Sari, Mark Hasegawa-Johnson, Kumaran S, Georg Stemmer, Krishnakumar N Nair

This experimental study establishes the first audio-visual speech recognition baseline for the TIMIT sentence portion of AVICAR, a dataset recorded in a real, noisy car environment. We use an automatic speech recognizer trained on a larger dataset to generate an audio-only recognition baseline for AVICAR. We use the forced alignment of the audio modality of AVICAR to obtain training targets for the convolutional neural network-based visual front end. Based on our observation that visual features vary greatly between speakers, we apply feature-space maximum likelihood linear regression (fMLLR) based speaker adaptation to the visual features. We find that the quality of fMLLR is sensitive to the quality of the alignment probabilities used to compute it; experimental tests compare the quality of fMLLR trained using audio-visual versus audio-only alignment probabilities. We report the first audio-visual results for the TIMIT subset of AVICAR and show that the word error rate of the proposed audio-visual system is significantly lower than that of the audio-only system.
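For readers unfamiliar with the metric used to compare the audio-only and audio-visual systems, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper, and this is not the authors' scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; whitespace tokenization is an
    assumption and real scoring tools may normalize text differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one reference word is missing from the hypothesis.
example_wer = word_error_rate("she had your dark suit", "she had dark suit")
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.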

DOI: 10.21437/Interspeech.2018-2359

Cite as: Sari, L., Hasegawa-Johnson, M., S, K., Stemmer, G., Nair, K.N. (2018) Speaker Adaptive Audio-Visual Fusion for the Open-Vocabulary Section of AVICAR. Proc. Interspeech 2018, 3524-3528, DOI: 10.21437/Interspeech.2018-2359.

  author={Leda Sari and Mark Hasegawa-Johnson and Kumaran S and Georg Stemmer and Krishnakumar N Nair},
  title={Speaker Adaptive Audio-Visual Fusion for the Open-Vocabulary Section of AVICAR},
  booktitle={Proc. Interspeech 2018},