Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model

Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, Yonghui Wu

We present an attention-based sequence-to-sequence neural network that directly translates speech in one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map source speech spectrograms to target spectrograms in the other language that correspond to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets. Although the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, the results demonstrate the feasibility of the approach on this very challenging task.

DOI: 10.21437/Interspeech.2019-1951

Cite as: Jia, Y., Weiss, R.J., Biadsy, F., Macherey, W., Johnson, M., Chen, Z., Wu, Y. (2019) Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model. Proc. Interspeech 2019, 1123-1127, DOI: 10.21437/Interspeech.2019-1951.

  author={Ye Jia and Ron J. Weiss and Fadi Biadsy and Wolfgang Macherey and Melvin Johnson and Zhifeng Chen and Yonghui Wu},
  title={{Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model}},
  booktitle={Proc. Interspeech 2019},