EUROSPEECH 2001 Scandinavia
7th European Conference on Speech Communication and Technology

Aalborg, Denmark
September 3-7, 2001


Large-Vocabulary Audio-Visual Speech Recognition by Machines and Humans

Gerasimos Potamianos, Chalapathy Neti, Giridharan Iyengar, Eric Helmuth

IBM T.J. Watson Research Center, USA

We compare automatic recognition with human perception of audio-visual speech in the large-vocabulary continuous speech recognition (LVCSR) domain. Specifically, we study the benefit of the visual modality for both machines and humans when it is combined with audio degraded by speech-babble noise at various signal-to-noise ratios (SNRs). We first consider an automatic speechreading system with a pixel-based visual front end that uses feature fusion for bimodal integration, and we compare its performance with that of an audio-only LVCSR system. We then describe results of human speech perception experiments, in which subjects are asked to transcribe audio-only and audio-visual utterances at various SNRs. For both machines and humans, we observe approximately a 6 dB effective SNR gain over the audio-only performance at 10 dB; however, the gains diverge significantly at other SNRs. Furthermore, automatic audio-visual recognition outperforms human audio-only speech perception at low SNRs.
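The feature-fusion approach to bimodal integration mentioned above can be sketched as frame-synchronous concatenation of the audio and visual feature streams before decoding. A minimal illustration follows; the feature dimensions and function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fuse_features(audio_feats, visual_feats):
    """Feature fusion: concatenate frame-aligned audio and visual streams.

    audio_feats:  (T, Da) array of per-frame audio features
    visual_feats: (T, Dv) array of per-frame visual features
    returns:      (T, Da + Dv) fused feature array
    """
    # Both streams must be sampled at the same frame rate and aligned.
    assert audio_feats.shape[0] == visual_feats.shape[0], "streams must be frame-aligned"
    return np.concatenate([audio_feats, visual_feats], axis=1)

# Illustrative dimensions: 100 frames, 39-dim audio, 41-dim visual features.
audio = np.random.randn(100, 39)
visual = np.random.randn(100, 41)
fused = fuse_features(audio, visual)
print(fused.shape)  # (100, 80)
```

In practice such fused vectors would feed a single recognizer (e.g., an HMM-based LVCSR decoder), in contrast to decision-fusion schemes that combine separate audio and visual classifiers.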


Bibliographic reference. Potamianos, Gerasimos / Neti, Chalapathy / Iyengar, Giridharan / Helmuth, Eric (2001): "Large-vocabulary audio-visual speech recognition by machines and humans", in EUROSPEECH-2001, 1027-1030.