Auditory-Visual Speech Processing (AVSP) 2010

Hakone, Kanagawa, Japan
September 30-October 3, 2010

Real-Time Audio-Visual Voice Activity Detection for Speech Recognition in Noisy Environments

Carlos T. Ishi (1), Miki Sato (1), Norihiro Hagita (1), Shihong Lao (2)

(1) ATR Intelligent Robotics and Communication Labs.; (2) OMRON Corporation, Japan

Voice activity detection (VAD) is one of the most critical factors in the performance degradation of speech recognition in noisy environments. A real-time VAD was developed that uses face parameters (eye and lip contours) as a front-end to the traditional audio-based method built on speech and noise GMMs. The speech recognition performance of the audio-visual VAD is shown to be comparable to that of audio-only VAD under shopping mall background noise. Advantages and limitations of introducing visual information are discussed.

Index Terms: voice activity detection, audio-visual, speech recognition, noisy environment, real-time.


Bibliographic reference. Ishi, Carlos T. / Sato, Miki / Hagita, Norihiro / Lao, Shihong (2010): "Real-time audio-visual voice activity detection for speech recognition in noisy environments", in AVSP-2010, paper P5.