Building Large-vocabulary Speaker-independent Lipreading Systems

Kwanchiva Thangthai, Richard Harvey

Constructing a viable lipreading system is a challenge because it is claimed that only 30% of the information in speech production is visible on the lips. Nevertheless, several studies have reported high accuracies on small-vocabulary tasks, whereas investigations of larger-vocabulary tasks are much rarer. This work examines the construction of a large-vocabulary lipreading system using an approach based on Deep Neural Network Hidden Markov Models (DNN-HMMs). We tackle the problem of lipreading an unseen speaker. We investigate the effect of several pre-processing steps applied to the visual features. We also examine the contribution of language modelling to a lipreading system, using longer n-grams to recognise visual speech. Our lipreading system is built on the 6000-word vocabulary TCD-TIMIT audiovisual speech corpus. The results show that visual speech recognition can reach 50% word accuracy on large vocabularies: using bigrams, we achieved a mean word accuracy of 53.83%, measured via three-fold cross-validation in the speaker-independent setting of the TCD-TIMIT corpus.

DOI: 10.21437/Interspeech.2018-2112

Cite as: Thangthai, K., Harvey, R. (2018) Building Large-vocabulary Speaker-independent Lipreading Systems. Proc. Interspeech 2018, 2648-2652, DOI: 10.21437/Interspeech.2018-2112.

  author={Kwanchiva Thangthai and Richard Harvey},
  title={Building Large-vocabulary Speaker-independent Lipreading Systems},
  booktitle={Proc. Interspeech 2018},