Now You’re Speaking My Language: Visual Language Identification

Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman


The goal of this work is to train models that identify a spoken language purely from the speaker’s lip movements. Our contributions are the following: (i) we show that models can learn to discriminate among 14 different languages using only visual speech information; (ii) we compare different designs for sequence modelling and utterance-level aggregation to determine the best architecture for this task; (iii) we investigate the factors that contribute discriminative cues and show that our model indeed solves the problem by finding temporal patterns in mouth movements rather than by exploiting spurious correlations. We demonstrate this further by evaluating our models on challenging examples from bilingual speakers.
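The pipeline implied by contribution (ii) — frame-wise visual features, a sequence model, utterance-level aggregation, then a 14-way language classifier — can be sketched minimally as below. This is a hedged illustration, not the paper's implementation: the feature dimension, hidden size, random weights, and mean-pooling aggregation are all assumptions chosen for simplicity (the paper compares several sequence and aggregation designs).

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LANGUAGES = 14   # number of languages discriminated in the paper
FEAT_DIM = 512       # hypothetical per-frame lip-region feature size
HID_DIM = 64         # hypothetical hidden size

# Hypothetical (untrained) weights standing in for a learned
# sequence encoder and a linear language classifier.
W_seq = rng.standard_normal((FEAT_DIM, HID_DIM)) * 0.01
W_cls = rng.standard_normal((HID_DIM, NUM_LANGUAGES)) * 0.01

def classify_utterance(frame_feats: np.ndarray) -> int:
    """frame_feats: (T, FEAT_DIM) array of per-frame visual speech features."""
    hidden = np.tanh(frame_feats @ W_seq)   # frame-wise sequence encoding
    pooled = hidden.mean(axis=0)            # utterance-level aggregation (mean pool)
    logits = pooled @ W_cls                 # 14-way language logits
    return int(np.argmax(logits))

feats = rng.standard_normal((75, FEAT_DIM))  # e.g. 3 s of video at 25 fps
pred = classify_utterance(feats)
print(pred)
```

Mean pooling is only one of the aggregation choices one could plug in here; the design comparison in the paper is exactly about which sequence model and which aggregation work best for this task.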


 DOI: 10.21437/Interspeech.2020-2921

Cite as: Afouras, T., Chung, J.S., Zisserman, A. (2020) Now You’re Speaking My Language: Visual Language Identification. Proc. Interspeech 2020, 2402-2406, DOI: 10.21437/Interspeech.2020-2921.


@inproceedings{Afouras2020,
  author={Triantafyllos Afouras and Joon Son Chung and Andrew Zisserman},
  title={{Now You’re Speaking My Language: Visual Language Identification}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={2402--2406},
  doi={10.21437/Interspeech.2020-2921},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2921}
}