Large Scale Weakly and Semi-Supervised Learning for Low-Resource Video ASR

Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed


Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high-quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large-scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only Connectionist Temporal Classification (CTC) based, and encoder-decoder speech recognition systems on Dutch and Romanian, using 27,000 and 58,000 hours of unlabeled audio respectively. Although all approaches improved upon their respective baseline word error rates (WER) by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline.
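To make the two distillation regimes concrete, the sketch below illustrates (a) a frame-level loss that matches a student's per-frame posteriors to a teacher's, and (b) sequence-level self-labeling, where the teacher's greedy CTC decode becomes a hard pseudo-transcript for supervised training. This is a minimal NumPy illustration under assumed shapes; the function names, the temperature parameter, and the greedy decoder are illustrative, not the paper's exact recipe.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def frame_level_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean per-frame KL(teacher || student) over one utterance.

    Both inputs: (num_frames, num_classes). Zero when the
    student exactly matches the teacher.
    """
    t = softmax(teacher_logits / temperature)              # teacher posteriors
    log_s = np.log(softmax(student_logits / temperature))  # student log-posteriors
    kl = (t * (np.log(t) - log_s)).sum(axis=-1)            # KL divergence per frame
    return kl.mean()

def sequence_level_pseudo_labels(teacher_logits, blank_id=0):
    """Greedy CTC decode of the teacher: argmax per frame,
    collapse repeats, drop blanks. The result is treated as a
    ground-truth transcript when training the student."""
    ids = teacher_logits.argmax(axis=-1)
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(int(i))
        prev = i
    return out
```

In the sequence-level setting the student never sees the teacher's soft posteriors, only the decoded pseudo-transcripts, which is what allows it to be applied across heterogeneous architectures (hybrid, CTC, encoder-decoder).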


 DOI: 10.21437/Interspeech.2020-1917

Cite as: Singh, K., Manohar, V., Xiao, A., Edunov, S., Girshick, R., Liptchinsky, V., Fuegen, C., Saraf, Y., Zweig, G., Mohamed, A. (2020) Large Scale Weakly and Semi-Supervised Learning for Low-Resource Video ASR. Proc. Interspeech 2020, 3770-3774, DOI: 10.21437/Interspeech.2020-1917.


@inproceedings{Singh2020,
  author={Kritika Singh and Vimal Manohar and Alex Xiao and Sergey Edunov and Ross Girshick and Vitaliy Liptchinsky and Christian Fuegen and Yatharth Saraf and Geoffrey Zweig and Abdelrahman Mohamed},
  title={{Large Scale Weakly and Semi-Supervised Learning for Low-Resource Video ASR}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3770--3774},
  doi={10.21437/Interspeech.2020-1917},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1917}
}