4th International Conference on Spoken Language Processing

Philadelphia, PA, USA
October 3-6, 1996

Using the Visual Component in Automatic Speech Recognition

N. M. Brooke

Media Technology Research Centre, School of Mathematical Sciences, University of Bath, Bath, UK

The movements of talkers' faces are known to convey visual cues that can improve speech intelligibility, especially where there is noise or hearing impairment. This suggests that visible facial gestures could be exploited to enhance speech intelligibility in automatic systems. Handling the volume of data represented by images of talkers' faces implies some form of data compression. Rather than using conventional feature-extraction approaches, image coding and compression can be achieved using data-driven, statistically oriented techniques such as artificial neural networks (ANNs) or principal component analysis (PCA). A major issue is how to combine the audio and visual data so that the best use can be made of the two modalities together. Perceptual experiments may offer guidance on suitable machine architectures, many of which currently use hidden Markov models (HMMs).
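As a rough illustration of the PCA-based coding the abstract mentions, the sketch below compresses a set of flattened face-region images into a few principal-component coefficients and reconstructs approximations from them. The data, image dimensions, and component count are purely illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in dataset: 200 "mouth-region" images, each 16x12 pixels,
# flattened to 192-element vectors. Real systems would use actual
# video frames of a talker's face.
images = rng.normal(size=(200, 16 * 12))

# Centre the data and obtain principal components via SVD.
mean = images.mean(axis=0)
centred = images - mean
U, S, Vt = np.linalg.svd(centred, full_matrices=False)

# Keep only the first k components: each image is then coded by
# k coefficients instead of 192 pixel values.
k = 10
codes = centred @ Vt[:k].T        # (200, k) compressed representation

# Approximate reconstruction from the low-dimensional codes.
reconstructed = codes @ Vt[:k] + mean
error = np.linalg.norm(images - reconstructed) / np.linalg.norm(images)
print(codes.shape, reconstructed.shape)
```

The k-dimensional codes, rather than raw pixels, would then serve as the visual observations fed to a recognizer such as an HMM.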


Bibliographic reference.  Brooke, N. M. (1996): "Using the visual component in automatic speech recognition", In ICSLP-1996, 1656-1659.