Evaluating and Optimizing Prosodic Alignment for Automatic Dubbing

Marcello Federico, Yogesh Virkar, Robert Enyedi, Roberto Barra-Chicote

Automatic dubbing aims to replace all speech in a video with speech in a different language, so that the result sounds and looks as natural as the original. Hence, in addition to conveying the same content as the original utterance (the typical objective of speech translation), dubbed speech should ideally also match its duration, the lip movements and gestures in the video, the timbre, emotion, and prosody of the speaker, and finally the background noise and reverberation of the environment. In this paper, after describing our dubbing architecture, we focus on recent progress on the prosodic alignment component, which synchronizes the translated transcript with the original utterances. We present empirical results for English-to-Italian dubbing on a publicly available collection of TED Talks. Our new prosodic alignment model, which allows for small relaxations in synchronicity, significantly improves both prosodic alignment accuracy and overall subjective dubbing quality over previous work.
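To make the prosodic alignment task concrete: it can be framed as segmenting the translated word sequence into as many contiguous segments as there are pause-delimited segments in the original utterance, so that segment durations match as closely as possible. The sketch below is an illustrative dynamic-programming formulation under simple assumptions (per-word duration estimates, a squared relative-mismatch cost); the function name and cost model are hypothetical, not the paper's actual implementation.

```python
def align(word_durs, target_durs):
    """Illustrative segmentation by dynamic programming: split word_durs
    into len(target_durs) contiguous segments, minimizing the sum of
    squared relative duration mismatches against target_durs.
    Returns the word index at which each segment ends."""
    n, m = len(word_durs), len(target_durs)

    # Prefix sums give O(1) lookup of any candidate segment's duration.
    prefix = [0.0]
    for d in word_durs:
        prefix.append(prefix[-1] + d)

    INF = float("inf")
    # cost[j][k]: best cost of covering the first j words with k segments.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # position of the previous break
                dur = prefix[j] - prefix[i]
                mismatch = (dur - target_durs[k - 1]) / target_durs[k - 1]
                c = cost[i][k - 1] + mismatch ** 2
                if c < cost[j][k]:
                    cost[j][k], back[j][k] = c, i

    # Backtrack to recover the chosen break points.
    breaks, j = [], n
    for k in range(m, 0, -1):
        breaks.append(j)
        j = back[j][k]
    return breaks[::-1]


# Example: two original segments of 0.5s and 1.2s; the best split puts
# the first two translated words (0.3 + 0.2 = 0.5s) in segment one.
print(align([0.3, 0.2, 0.4, 0.3, 0.5], [0.5, 1.2]))  # → [2, 5]
```

The paper's relaxation of strict synchronicity could be modeled in such a formulation by adding a small tolerance band inside which mismatches incur no cost.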

 DOI: 10.21437/Interspeech.2020-2983

Cite as: Federico, M., Virkar, Y., Enyedi, R., Barra-Chicote, R. (2020) Evaluating and Optimizing Prosodic Alignment for Automatic Dubbing. Proc. Interspeech 2020, 1481-1485, DOI: 10.21437/Interspeech.2020-2983.

@inproceedings{federico20_interspeech,
  author={Marcello Federico and Yogesh Virkar and Robert Enyedi and Roberto Barra-Chicote},
  title={{Evaluating and Optimizing Prosodic Alignment for Automatic Dubbing}},
  booktitle={Proc. Interspeech 2020},
  year={2020},
  pages={1481--1485},
  doi={10.21437/Interspeech.2020-2983}
}