ESCA Workshop on Audio-Visual Speech Processing (AVSP'97)

September 26-27, 1997
Rhodes, Greece

Combining Noise Compensation With Visual Information in Speech Recognition

Stephen Cox, Iain Matthews, Andrew Bangham

School of Information Systems, University of East Anglia, Norwich, UK

The addition of visual information derived from the speaker's lip movements to a speech recogniser (speechreading) can significantly enhance the performance of the recogniser when it is operating under adverse signal-to-noise ratios. However, processing of video signals imposes a large computational demand on the system, and there is little point in using speechreading techniques if similar performance gains can be obtained using less computationally expensive techniques which operate only on the audio signal. In this paper, we show that combining visual information with an audio noise compensation technique (spectral subtraction) leads to performance significantly higher than that obtained using speechreading only or noise compensation only. The optimum method for speech recognition in the presence of noise is to use speech models that are matched to the input speech, and we show that the addition of visual information also gives a performance gain when matched models are used. We also describe a method of "late" integration which uses a measure of confidence derived from information output by the audio recogniser to achieve performance which is close to optimum.
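As a rough illustration of the two ideas named in the abstract, the sketch below shows a basic magnitude spectral subtraction on one frame and a confidence-weighted late integration of per-class audio and visual scores. It is not the authors' implementation: the function names, the over-subtraction factor `alpha`, the spectral `floor`, and in particular the entropy-based confidence measure are all assumptions chosen for illustration, since the abstract does not specify how the confidence is computed.

```python
import math


def spectral_subtract(noisy_mag, noise_mag, alpha=1.0, floor=0.01):
    """Magnitude spectral subtraction for a single frame.

    Subtracts a scaled noise estimate from each bin and clamps the result
    to a small fraction of the noisy magnitude so no bin goes negative.
    alpha and floor are illustrative defaults, not values from the paper.
    """
    return [max(y - alpha * n, floor * y) for y, n in zip(noisy_mag, noise_mag)]


def audio_confidence(audio_loglik):
    """Hypothetical confidence in [0, 1] from the audio recogniser's scores.

    Assumption: confidence is 1 minus the normalised entropy of a softmax
    over the per-class audio log-likelihoods. Needs at least two classes.
    """
    top = max(audio_loglik)
    exps = [math.exp(s - top) for s in audio_loglik]  # shift for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(audio_loglik))


def late_integrate(audio_loglik, visual_loglik):
    """Combine per-class scores after each recogniser has run separately.

    Returns (index of the winning class, audio weight used).
    """
    w = audio_confidence(audio_loglik)
    combined = [w * a + (1.0 - w) * v for a, v in zip(audio_loglik, visual_loglik)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, w
```

Under these assumptions, an audio recogniser whose scores are nearly flat across classes (as might happen at low signal-to-noise ratios) receives a weight near zero, so the decision falls to the visual scores; sharply peaked audio scores receive a weight near one.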


Bibliographic reference. Cox, Stephen / Matthews, Iain / Bangham, Andrew (1997): "Combining noise compensation with visual information in speech recognition", in AVSP-1997, 53-56.