4th International Conference on Spoken Language Processing
Philadelphia, PA, USA
There has recently been increasing interest in enhancing speech recognition with visual information derived from the face of the talker. This paper demonstrates the use of nonlinear image decomposition, in the form of a ‘sieve’, applied to the task of visual speech recognition. Information derived from the mouth region is used in visual and audiovisual speech recognition of a database of the letters A-Z for four talkers. A scale histogram is generated directly from the grayscale pixels of a window containing the talker's mouth on a per-frame basis. Results are presented for visual-only, audio-only, and a simple audiovisual case.
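As a rough illustration of the idea of a scale histogram, the sketch below implements a simple 1-D alternating opening/closing sieve in Python. This is an assumption-laden reconstruction, not the authors' implementation: the function names, the choice of flat morphological openings/closings, and the window-to-scale mapping are all hypothetical. At each scale s it removes signal extrema of extent up to s and records the total amplitude removed, which forms one bin of the scale histogram.

```python
import numpy as np

def _opening(x, w):
    """Flat greyscale opening with a length-w window: erosion, then
    dilation with the reflected window. Removes maxima narrower than w."""
    n = len(x)
    xp = np.concatenate([x, np.full(w - 1, x[-1])])          # pad right (edge value)
    e = np.array([xp[i:i + w].min() for i in range(n)])      # erosion
    ep = np.concatenate([np.full(w - 1, e[0]), e])           # pad left (edge value)
    return np.array([ep[i:i + w].max() for i in range(n)])   # dilation

def sieve_scale_histogram(signal, max_scale):
    """Hypothetical sketch of a sieve-style decomposition: for each
    scale s, remove extrema of extent <= s with an opening followed by
    a closing, and record the total amplitude removed at that scale."""
    x = np.asarray(signal, dtype=float)
    hist = np.zeros(max_scale)
    for s in range(1, max_scale + 1):
        w = s + 1                        # extrema of extent <= s fit in a window of s+1
        opened = _opening(x, w)          # remove narrow maxima
        closed = -_opening(-opened, w)   # closing = opening of the negated signal
        hist[s - 1] = np.abs(x - closed).sum()
        x = closed                       # pass the residual to the next scale
    return hist
```

For image data such as the mouth window described above, the same idea would be applied per frame (e.g. along raster-scanned pixel rows), with the per-scale totals used as the feature vector.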
Bibliographic reference: Matthews, I. A. / Bangham, J. / Cox, S. J. (1996): "Audiovisual speech recognition using multiscale nonlinear image decomposition", in ICSLP-1996, 38-41.