Self-Training for End-to-End Speech Translation

Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, Yun Tang


One of the main challenges for end-to-end speech translation is data scarcity. We leverage pseudo-labels generated from unlabeled audio by a cascade and by an end-to-end speech translation model. This provides gains of 8.3 and 5.7 BLEU over a strong semi-supervised baseline on the MuST-C English-French and English-German datasets, reaching state-of-the-art performance. We investigate the effect of pseudo-label quality and show that our approach is more effective than simply pre-training the encoder on the speech recognition task. Finally, we demonstrate the effectiveness of self-training by directly generating pseudo-labels with an end-to-end model instead of a cascade model.
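
As a rough illustration of the pipeline the abstract describes, the sketch below shows the core self-training loop: a trained teacher (either a cascade or an end-to-end speech translation model) decodes unlabeled audio into pseudo-labels, which are then mixed with the labeled data to train the student. All names here (Translator, generate_pseudo_labels, train_student, and so on) are hypothetical placeholders for illustration, not the authors' released code.

from typing import List, Protocol, Tuple

class Translator(Protocol):
    """Hypothetical interface: anything mapping an audio utterance to target text."""
    def translate(self, audio: str) -> str: ...

def generate_pseudo_labels(teacher: Translator,
                           unlabeled_audio: List[str]) -> List[Tuple[str, str]]:
    # Decode each unlabeled utterance with the teacher. Per the abstract, the
    # teacher is either a cascade (ASR followed by MT) or an end-to-end speech
    # translation model; pseudo-label quality depends on this choice.
    return [(audio, teacher.translate(audio)) for audio in unlabeled_audio]

def self_train(teacher: Translator,
               train_student,  # hypothetical training routine for the student model
               labeled: List[Tuple[str, str]],
               unlabeled_audio: List[str]):
    # Mix real (audio, translation) pairs with pseudo-labeled ones and train
    # the end-to-end student model on the combined corpus.
    pseudo = generate_pseudo_labels(teacher, unlabeled_audio)
    return train_student(labeled + pseudo)

Using the end-to-end model itself as the teacher corresponds to the self-training setup highlighted in the abstract's final sentence, as opposed to distilling pseudo-labels from a cascade.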


DOI: 10.21437/Interspeech.2020-2938

Cite as: Pino, J., Xu, Q., Ma, X., Dousti, M.J., Tang, Y. (2020) Self-Training for End-to-End Speech Translation. Proc. Interspeech 2020, 1476-1480, DOI: 10.21437/Interspeech.2020-2938.


@inproceedings{Pino2020,
  author={Juan Pino and Qiantong Xu and Xutai Ma and Mohammad Javad Dousti and Yun Tang},
  title={{Self-Training for End-to-End Speech Translation}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={1476--1480},
  doi={10.21437/Interspeech.2020-2938},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2938}
}