EUROSPEECH 2001 Scandinavia
We compare automatic recognition with human perception of audiovisual speech in the large-vocabulary continuous speech recognition (LVCSR) domain. Specifically, we study the benefit of the visual modality for both machines and humans when it is combined with audio degraded by speech-babble noise at various signal-to-noise ratios (SNRs). We first consider an automatic speechreading system with a pixel-based visual front end that uses feature fusion for bimodal integration, and we compare its performance with that of an audio-only LVCSR system. We then describe results of human speech perception experiments in which subjects are asked to transcribe audio-only and audio-visual utterances at various SNRs. For both machines and humans, we observe an effective SNR gain of approximately 6 dB over audio-only performance at 10 dB; however, these gains diverge significantly at other SNRs. Furthermore, automatic audio-visual recognition outperforms human audio-only speech perception at low SNRs.
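The feature-fusion approach to bimodal integration mentioned above amounts to concatenating time-synchronous audio and visual feature vectors into a single observation vector for the recognizer. A minimal sketch, with illustrative (hypothetical) feature dimensions and values not taken from the paper:

```python
def fuse_features(audio_frame, visual_frame):
    """Feature fusion: concatenate time-synchronous audio and visual
    feature vectors into one joint observation vector."""
    return list(audio_frame) + list(visual_frame)

# Illustrative example (values and dimensions are hypothetical):
audio = [0.1, 0.2, 0.3]   # e.g. acoustic cepstral coefficients for one frame
visual = [0.8, 0.9]       # e.g. pixel-based visual features for the same frame
fused = fuse_features(audio, visual)
print(fused)              # joint audio-visual observation vector
```

The fused vector is then modeled by the recognizer exactly as an audio-only feature vector would be, which is what distinguishes feature fusion from decision-level integration schemes.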
Bibliographic reference. Potamianos, Gerasimos / Neti, Chalapathy / Iyengar, Giridharan / Helmuth, Eric (2001): "Large-vocabulary audio-visual speech recognition by machines and humans", In EUROSPEECH-2001, 1027-1030.