Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion

Hong Liu, Zhan Chen, Bing Yang


Current studies have shown that extracting representative visual features and efficiently fusing the audio and visual modalities are vital for audio-visual speech recognition (AVSR), yet both remain challenging. To this end, we propose a lip graph assisted AVSR method with bidirectional synchronous fusion. First, a hybrid visual stream combines an image branch and a graph branch to capture discriminative visual features. Specifically, the lip graph exploits the natural and dynamic connections between lip key points to model the lip shape, and the temporal evolution of the lip graph is captured by graph convolutional networks followed by bidirectional gated recurrent units. Second, the hybrid visual stream is combined with the audio stream through an attention-based bidirectional synchronous fusion, which allows bidirectional information interaction to resolve the asynchrony between the two modalities during fusion. Experimental results on the LRW-BBC dataset show that our method outperforms an end-to-end AVSR baseline in both clean and noisy conditions.
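The pipeline above can be illustrated with a minimal NumPy sketch: a graph convolution over lip key points feeding a visual stream, fused with an audio stream by attention in both directions. This is not the authors' implementation: the closed-contour lip graph, the layer sizes, the concatenation fusion head, and the random "features" are illustrative assumptions, and the BiGRU temporal modeling stage is omitted for brevity.

```python
import numpy as np

def normalized_adjacency(edges, n):
    # Symmetric adjacency with self-loops, normalized as D^{-1/2} A D^{-1/2}.
    A = np.eye(n)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def gcn_layer(H, A_hat, W):
    # One graph convolution: aggregate neighbor features, project, ReLU.
    return np.maximum(A_hat @ H @ W, 0.0)

def cross_attention(Q, K):
    # Scaled dot-product attention from one modality's frames onto the other's.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ K

rng = np.random.default_rng(0)
n_points, feat_dim, hid_dim, T = 20, 2, 16, 5

# Hypothetical lip graph: 20 key points chained into a closed contour.
edges = [(i, (i + 1) % n_points) for i in range(n_points)]
A_hat = normalized_adjacency(edges, n_points)

# Per-frame graph features: (x, y) coordinates of each key point, here random.
W = rng.standard_normal((feat_dim, hid_dim)) * 0.1
visual = np.stack([
    gcn_layer(rng.standard_normal((n_points, feat_dim)), A_hat, W).mean(axis=0)
    for _ in range(T)
])                                          # (T, hid_dim) visual stream
audio = rng.standard_normal((T, hid_dim))   # (T, hid_dim) audio stream

# Bidirectional fusion: each stream attends to the other, then concatenate.
fused = np.concatenate([cross_attention(audio, visual),
                        cross_attention(visual, audio)], axis=-1)
print(fused.shape)  # (5, 32)
```

Letting each modality attend to the other (rather than one fixed direction) is what allows temporally misaligned audio and visual frames to find their counterparts during fusion.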


DOI: 10.21437/Interspeech.2020-3146

Cite as: Liu, H., Chen, Z., Yang, B. (2020) Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion. Proc. Interspeech 2020, 3520-3524, DOI: 10.21437/Interspeech.2020-3146.


@inproceedings{Liu2020,
  author={Hong Liu and Zhan Chen and Bing Yang},
  title={{Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={3520--3524},
  doi={10.21437/Interspeech.2020-3146},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3146}
}