Deep Lip Reading: A Comparison of Models and an Online Application

Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

The goal of this paper is to develop state-of-the-art models for lip reading (visual speech recognition). We develop three architectures and compare their accuracy and training times: (i) a recurrent model using LSTMs; (ii) a fully convolutional model; and (iii) the recently proposed transformer model. The recurrent and fully convolutional models are trained with a Connectionist Temporal Classification loss and use an explicit language model for decoding, whereas the transformer is a sequence-to-sequence model. Our best performing model improves the state-of-the-art word error rate on the challenging BBC-Oxford Lip Reading Sentences 2 (LRS2) benchmark dataset by over 20 percent. As a further contribution, we investigate the fully convolutional model when used for online (real-time) lip reading of continuous speech and show that it achieves high performance with low latency.
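
The headline result above is stated in terms of word error rate. As a reminder of how that metric is conventionally defined, the following is a minimal sketch that computes it as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration; this is not the evaluation code used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a hypothesis that drops one word from a six-word reference scores 1/6, and the value can exceed 1 when the hypothesis contains many inserted words.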

 DOI: 10.21437/Interspeech.2018-1943

Cite as: Afouras, T., Chung, J.S., Zisserman, A. (2018) Deep Lip Reading: A Comparison of Models and an Online Application. Proc. Interspeech 2018, 3514-3518, DOI: 10.21437/Interspeech.2018-1943.

  author={Triantafyllos Afouras and Joon Son Chung and Andrew Zisserman},
  title={Deep Lip Reading: A Comparison of Models and an Online Application},
  booktitle={Proc. Interspeech 2018},