Self-Supervised Representations Improve End-to-End Speech Translation

Anne Wu, Changhan Wang, Juan Pino, Jiatao Gu


End-to-end speech-to-text translation can yield a simpler and smaller system, but it faces the challenge of data scarcity. Pre-training methods can leverage unlabeled data and have been shown to be effective in data-scarce settings. In this work, we explore whether self-supervised pre-trained speech representations can benefit the speech translation task in both high- and low-resource settings, whether they transfer well to other languages, and whether they can be effectively combined with other common methods for improving low-resource end-to-end speech translation, such as using a pre-trained high-resource speech recognition system. We demonstrate that self-supervised pre-trained features consistently improve translation performance, and that cross-lingual transfer extends these gains to a variety of languages with little or no tuning.


DOI: 10.21437/Interspeech.2020-3094

Cite as: Wu, A., Wang, C., Pino, J., Gu, J. (2020) Self-Supervised Representations Improve End-to-End Speech Translation. Proc. Interspeech 2020, 1491-1495, DOI: 10.21437/Interspeech.2020-3094.


@inproceedings{Wu2020,
  author={Anne Wu and Changhan Wang and Juan Pino and Jiatao Gu},
  title={{Self-Supervised Representations Improve End-to-End Speech Translation}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={1491--1495},
  doi={10.21437/Interspeech.2020-3094},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3094}
}