Multimodal Sign Language Recognition via Temporal Deformable Convolutional Sequence Learning

Katerina Papadimitriou, Gerasimos Potamianos


In this paper we address the challenging problem of sign language recognition (SLR) from videos, introducing an end-to-end deep learning approach that relies on the fusion of multiple spatio-temporal feature streams and a fully convolutional encoder-decoder for prediction. Specifically, we examine the contribution of optical flow, human skeletal features, and appearance features of handshapes and mouthing, in conjunction with a temporal deformable convolutional attention-based encoder-decoder for SLR. To our knowledge, this is the first use in this task of a fully convolutional multi-step attention-based encoder-decoder employing temporal deformable convolutional block structures. We conduct experiments on three sign language datasets and compare our approach to existing state-of-the-art SLR methods, demonstrating its superiority.
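The paper itself provides no code, but the core building block it names can be illustrated. As a minimal sketch (the function name, shapes, and single-output-channel simplification are all assumptions, not the authors' implementation), a temporal deformable 1D convolution is an ordinary convolution whose kernel taps sample the sequence at learned fractional time offsets, resolved by linear interpolation between neighboring frames:

```python
import numpy as np

def temporal_deformable_conv1d(x, weights, offsets):
    """Illustrative temporal deformable 1D convolution.

    x:       (T, C) input feature sequence
    weights: (K, C) kernel weights, one weight vector per tap
    offsets: (T, K) learned fractional offsets per output step and tap
    Returns: (T,) output sequence (one output channel, for simplicity)
    """
    T, C = x.shape
    K = weights.shape[0]
    center = K // 2
    y = np.zeros(T)
    for t in range(T):
        for k in range(K):
            # base sampling position of tap k, shifted by a learned offset
            p = np.clip(t + (k - center) + offsets[t, k], 0, T - 1)
            lo, hi = int(np.floor(p)), int(np.ceil(p))
            frac = p - lo
            # linear interpolation between the two neighboring frames
            sample = (1 - frac) * x[lo] + frac * x[hi]
            y[t] += sample @ weights[k]
    return y
```

With all offsets set to zero this reduces to a standard 1D convolution; nonzero offsets let each output step attend to temporally shifted frames, which is what allows the block to adapt its receptive field to the varying speed of signing.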


DOI: 10.21437/Interspeech.2020-2691

Cite as: Papadimitriou, K., Potamianos, G. (2020) Multimodal Sign Language Recognition via Temporal Deformable Convolutional Sequence Learning. Proc. Interspeech 2020, 2752-2756, DOI: 10.21437/Interspeech.2020-2691.


@inproceedings{Papadimitriou2020,
  author={Katerina Papadimitriou and Gerasimos Potamianos},
  title={{Multimodal Sign Language Recognition via Temporal Deformable Convolutional Sequence Learning}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2752--2756},
  doi={10.21437/Interspeech.2020-2691},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2691}
}