Adapting Transformer to End-to-End Spoken Language Translation

Mattia A. Di Gangi, Matteo Negri, Marco Turchi

Neural end-to-end architectures for sequence-to-sequence learning represent the state of the art in machine translation (MT) and automatic speech recognition (ASR). Their use is also promising for end-to-end spoken language translation (SLT), which combines the main challenges of ASR and MT. Exploiting existing neural architectures, however, requires task-specific adaptations. A network that has obtained state-of-the-art results in MT with reduced training time is Transformer. However, its direct application to speech input is hindered by two limitations of the self-attention network on which it is based: quadratic memory complexity and no explicit modeling of short-range dependencies between input features. High memory complexity constrains the size of the models that can be trained on a GPU, while the inadequate modeling of local dependencies harms final translation quality. This paper presents an adaptation of Transformer to end-to-end SLT that consists of: i) downsampling the input with convolutional neural networks to make the training process feasible on GPUs, ii) modeling the two-dimensional nature of a spectrogram, and iii) adding a distance penalty to the attention, so as to bias it towards local context. SLT experiments on 8 language directions show that, with our adaptation, Transformer outperforms a strong RNN-based baseline with a significant reduction in training time.

 DOI: 10.21437/Interspeech.2019-3045

Cite as: Di Gangi, M.A., Negri, M., Turchi, M. (2019) Adapting Transformer to End-to-End Spoken Language Translation. Proc. Interspeech 2019, 1133-1137, DOI: 10.21437/Interspeech.2019-3045.

  author={Mattia A. Di Gangi and Matteo Negri and Marco Turchi},
  title={{Adapting Transformer to End-to-End Spoken Language Translation}},
  booktitle={Proc. Interspeech 2019},