5th European Conference on Speech Communication and Technology

Rhodes, Greece
September 22-25, 1997

Improved Bimodal Speech Recognition using Tied-Mixture HMMs and 5000 Word Audio-Visual Synchronous Database

Satoshi Nakamura, Ron Nagai, Kiyohiro Shikano

Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma-shi, Nara, Japan

This paper presents methods that improve speech recognition accuracy by incorporating automatic lip reading. Lip-reading accuracy is improved through three approaches: 1) collection of synchronous image and speech data for 5240 words, 2) feature extraction using 2-dimensional power spectra of the region around the mouth, and 3) sub-word unit HMMs with tied-mixture distributions (tied-mixture HMMs). In a 100-word recognition test, lip reading alone achieves a recognition rate of 85%. Tied-mixture HMMs are also shown to improve lip-reading accuracy. Speech recognition experiments that integrate audio and visual information are carried out at various SNRs, and the results show that integration always outperforms recognition using either audio or visual information alone.
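The three ingredients named above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the mouth-region size, the number of spectral coefficients kept, the codebook size, or the audio-visual integration rule. The function names, the `keep` parameter, and the `audio_weight` stream weight are hypothetical, and the weighted log-likelihood fusion is a common scheme swapped in for illustration. It is not taken from the paper. One motivation for power spectra is that the magnitude of the 2-D Fourier transform is unchanged by a circular shift of the image, which gives some robustness to small mouth-position errors.

```python
import numpy as np


def mouth_power_spectrum(mouth_roi: np.ndarray, keep: int = 8) -> np.ndarray:
    """Log 2-D power spectrum of a grayscale mouth region (sketch).

    Returns the keep x keep block of lowest spatial frequencies, flattened.
    `keep` is a hypothetical parameter (assumed even); the paper's value is unknown.
    """
    roi = np.asarray(mouth_roi, dtype=float)
    power = np.abs(np.fft.fft2(roi)) ** 2          # |F(u, v)|^2, shift-invariant magnitude
    power = np.fft.fftshift(power)                 # move the DC term to the centre
    cy, cx = power.shape[0] // 2, power.shape[1] // 2
    half = keep // 2
    low = power[cy - half:cy + half, cx - half:cx + half]
    return np.log(low + 1e-10).ravel()             # log compresses the dynamic range


def tied_mixture_log_likelihood(o, means, variances, weights) -> float:
    """log b_j(o) = log sum_k c_jk * N(o; mu_k, diag(sigma_k^2)).

    means, variances: (K, D) Gaussian codebook shared by every HMM state.
    weights: (K,) mixture weights c_jk that belong to a single state j.
    """
    o = np.asarray(o, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    diff = o - means                               # (K, D) via broadcasting
    log_gauss = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                        + np.sum(diff ** 2 / variances, axis=1))
    with np.errstate(divide="ignore"):
        log_w = np.log(np.asarray(weights, dtype=float))  # zero weight -> -inf
    # log-sum-exp over the shared Gaussians, numerically stable
    return float(np.logaddexp.reduce(log_w + log_gauss))


def combine_streams(log_p_audio: float, log_p_visual: float, audio_weight: float) -> float:
    """Weighted log-likelihood fusion (assumed scheme, not from the paper).

    audio_weight lies in [0, 1]; it would typically be lowered as the SNR drops.
    """
    return audio_weight * log_p_audio + (1.0 - audio_weight) * log_p_visual


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    roi = rng.random((32, 48))                     # stand-in for a cropped mouth image
    feat = mouth_power_spectrum(roi)
    print(feat.shape)                              # (64,)
```

In a tied-mixture HMM every state scores an observation against the same Gaussian codebook and differs only in its mixture weights. This is why the sketch passes one shared `means`/`variances` pair plus a per-state `weights` vector. Sharing the Gaussians lets a limited amount of visual training data support many sub-word states.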


Bibliographic reference.  Nakamura, Satoshi / Nagai, Ron / Shikano, Kiyohiro (1997): "Improved bimodal speech recognition using tied-mixture HMMs and 5000 word audio-visual synchronous database", In EUROSPEECH-1997, 1623-1626.