Fusion Architectures for Word-Based Audiovisual Speech Recognition

Michael Wand, Jürgen Schmidhuber


In this study we investigate architectures for modality fusion in audiovisual speech recognition, where one aims to alleviate the adverse effect of acoustic noise on speech recognition accuracy by using video images of the speaker’s face as an additional modality. Starting from an established neural network fusion system, we substantially improve the recognition accuracy by taking single-modality losses into account: late fusion (at the level of the output logits) is considerably more robust than the baseline, in particular for unseen acoustic noise, at the expense of having to determine the optimal weighting of the input streams. The latter requirement can be removed by making the fusion itself a trainable part of the network.
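The late-fusion idea described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation; it is a hypothetical NumPy example assuming two per-stream logit vectors and a single scalar stream weight `alpha`. It also shows how parametrizing `alpha` as a sigmoid of a free weight `w` makes the fusion trainable by ordinary gradient descent, which is the spirit of removing the manual stream-weighting step.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(audio_logits, video_logits, alpha):
    """Late fusion at the logits level: alpha in [0, 1] weights the audio stream."""
    return alpha * audio_logits + (1.0 - alpha) * video_logits

# Toy logits for a 3-word vocabulary (illustrative values only).
audio = np.array([2.0, 0.5, 0.1])   # audio stream favours word 0
video = np.array([0.2, 0.1, 1.5])   # video stream favours word 2

# With a fixed weight, the right alpha depends on the noise condition:
probs_clean = softmax(late_fusion(audio, video, alpha=0.9))  # trust audio
probs_noisy = softmax(late_fusion(audio, video, alpha=0.2))  # trust video

# Trainable fusion: alpha = sigmoid(w), learned from the cross-entropy loss.
def sigmoid(w):
    return 1.0 / (1.0 + np.exp(-w))

w = 0.0                 # free parameter, alpha = sigmoid(0) = 0.5
target = 2              # suppose the true word is the one the video favours
for _ in range(50):
    alpha = sigmoid(w)
    probs = softmax(late_fusion(audio, video, alpha))
    # dL/dlogits = probs - onehot(target); chain rule through alpha and w.
    grad_logits = probs.copy()
    grad_logits[target] -= 1.0
    grad_w = grad_logits @ (audio - video) * alpha * (1.0 - alpha)
    w -= 0.5 * grad_w   # gradient descent step
```

In this toy setting, training pushes `alpha` down (toward the video stream) because the target agrees with the video logits; in a real network the fusion weights would be optimized jointly with the rest of the model.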


DOI: 10.21437/Interspeech.2020-2117

Cite as: Wand, M., Schmidhuber, J. (2020) Fusion Architectures for Word-Based Audiovisual Speech Recognition. Proc. Interspeech 2020, 3491-3495, DOI: 10.21437/Interspeech.2020-2117.


@inproceedings{Wand2020,
  author={Michael Wand and Jürgen Schmidhuber},
  title={{Fusion Architectures for Word-Based Audiovisual Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3491--3495},
  doi={10.21437/Interspeech.2020-2117},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2117}
}