Investigating Self-Supervised Pre-Training for End-to-End Speech Translation

Ha Nguyen, Fethi Bougares, Natalia Tomashenko, Yannick Estève, Laurent Besacier


Self-supervised learning from raw speech has proven beneficial for improving automatic speech recognition (ASR). Here we investigate its impact on end-to-end automatic speech translation (AST) performance. We use a contrastive predictive coding (CPC) model pre-trained on unlabeled speech as a feature extractor for a downstream AST task. We show that self-supervised pre-training is particularly effective in low-resource settings and that fine-tuning CPC models on the AST training data further improves performance. Even in higher-resource settings, ensembling AST models trained on filter-bank and CPC representations leads to near state-of-the-art models without any ASR pre-training. This may be particularly beneficial when one needs to develop a system that translates from speech in a language with a poorly standardized orthography, or even from speech in an unwritten language.
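
As a rough illustration of the pipeline described in the abstract, the PyTorch sketch below wires a CPC-style encoder (strided convolutions followed by an autoregressive context network, as in van den Oord et al.'s CPC) as a frozen feature extractor in front of a small attention-based encoder-decoder for speech translation. All module names, dimensions, and the toy Transformer used for the AST model are illustrative assumptions, not the authors' exact architecture or training setup; unfreezing the extractor's parameters corresponds to the fine-tuning on AST data mentioned above.

# Minimal sketch (assumptions only): CPC-style feature extractor feeding an
# attention-based AST model, replacing filter-bank input features.
import torch
import torch.nn as nn


class CPCFeatureExtractor(nn.Module):
    """Strided conv encoder + GRU context network (CPC-style, illustrative)."""

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        # Five strided convolutions downsample 16 kHz waveform to ~100 Hz frames.
        strides, kernels = [5, 4, 2, 2, 2], [10, 8, 4, 4, 4]
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers += [nn.Conv1d(in_ch, hidden_dim, k, stride=s, padding=k // 2),
                       nn.ReLU()]
            in_ch = hidden_dim
        self.encoder = nn.Sequential(*layers)
        self.context = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) raw audio -> (batch, frames, hidden_dim)
        z = self.encoder(wav.unsqueeze(1)).transpose(1, 2)
        c, _ = self.context(z)
        return c


class SpeechTranslationModel(nn.Module):
    """Downstream AST model consuming CPC features instead of filter-banks."""

    def __init__(self, feature_extractor: nn.Module, vocab_size: int = 1000,
                 d_model: int = 256, freeze_features: bool = True):
        super().__init__()
        self.feature_extractor = feature_extractor
        if freeze_features:  # frozen extractor; set False to fine-tune on AST data
            for p in self.feature_extractor.parameters():
                p.requires_grad = False
        self.seq2seq = nn.Transformer(d_model=d_model, nhead=4,
                                      num_encoder_layers=3, num_decoder_layers=3,
                                      batch_first=True)
        self.tgt_embed = nn.Embedding(vocab_size, d_model)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, wav: torch.Tensor, tgt_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.feature_extractor(wav)               # (B, T, d_model)
        dec = self.seq2seq(feats, self.tgt_embed(tgt_tokens))
        return self.out_proj(dec)                         # (B, U, vocab)


if __name__ == "__main__":
    model = SpeechTranslationModel(CPCFeatureExtractor())
    wav = torch.randn(2, 16000)                # 1 s of dummy audio per utterance
    tgt = torch.randint(0, 1000, (2, 12))      # dummy target-token prefixes
    print(model(wav, tgt).shape)               # torch.Size([2, 12, 1000])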


DOI: 10.21437/Interspeech.2020-1835

Cite as: Nguyen, H., Bougares, F., Tomashenko, N., Estève, Y., Besacier, L. (2020) Investigating Self-Supervised Pre-Training for End-to-End Speech Translation. Proc. Interspeech 2020, 1466-1470, DOI: 10.21437/Interspeech.2020-1835.


@inproceedings{Nguyen2020,
  author={Ha Nguyen and Fethi Bougares and Natalia Tomashenko and Yannick Estève and Laurent Besacier},
  title={{Investigating Self-Supervised Pre-Training for End-to-End Speech Translation}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1466--1470},
  doi={10.21437/Interspeech.2020-1835},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1835}
}