INTERSPEECH 2006 - ICSLP
This paper examines the problem of estimating stream weights for a multistream audio-visual speech recogniser in the context of a simultaneous-speaker task. The task is challenging because the signal-to-noise ratio (SNR) cannot be readily inferred from the acoustics alone. The proposed method employs artificial neural networks (ANNs) to estimate the SNR from HMM state-likelihoods; the SNR estimate is then converted to a stream weight using a mapping optimised on development data. The method yields audio-visual recognition performance better than both the audio-only and video-only baselines across a wide range of SNRs. Performance using SNR estimates based on audio state-likelihoods alone is compared with that obtained using both audio and visual likelihoods. Although the audio-visual SNR estimator outperforms the audio-only estimator, the resulting recognition benefit is small. Ideas for making fuller use of the visual information are discussed.
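To make the abstract's pipeline concrete, the sketch below illustrates the general idea of SNR-dependent stream weighting in a multistream recogniser: an estimated SNR is mapped to an audio stream weight, which then interpolates between the audio and video state log-likelihoods. The sigmoid mapping and its `midpoint`/`slope` parameters are illustrative assumptions, not the paper's optimised mapping, and the function names are hypothetical.

```python
import math

def snr_to_weight(snr_db, midpoint=0.0, slope=0.3):
    """Map an estimated SNR (dB) to an audio stream weight in (0, 1).

    A sigmoid is one plausible monotone mapping (an assumption here,
    standing in for the paper's mapping optimised on development data):
    at high SNR the audio stream dominates; at low SNR the weight
    shifts towards the video stream.
    """
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint)))

def combined_log_likelihood(log_p_audio, log_p_video, snr_db):
    """Weighted combination of per-state log-likelihoods:
    log p = w * log p_audio + (1 - w) * log p_video,
    with w derived from the estimated SNR.
    """
    w = snr_to_weight(snr_db)
    return w * log_p_audio + (1.0 - w) * log_p_video
```

For example, at 0 dB estimated SNR this mapping gives equal weight to the two streams, while at +20 dB the audio stream is weighted almost exclusively.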
Bibliographic reference. Shao, Xu / Barker, Jon (2006): "Audio-visual speech recognition in the presence of a competing speaker", In INTERSPEECH-2006, paper 1589-Tue3WeO.6.