Hush-Hush Speak: Speech Reconstruction Using Silent Videos

Shashwat Uttam, Yaman Kumar, Dhruva Sahrawat, Mansi Aggarwal, Rajiv Ratn Shah, Debanjan Mahata, Amanda Stent

Speech Reconstruction is the task of recreation of speech using silent videos as input. In the literature, it is also referred to as lipreading. In this paper, we design an encoder-decoder architecture which takes silent videos as input and outputs an audio spectrogram of the reconstructed speech. The model, despite being a speaker-independent model, achieves comparable results on speech reconstruction to the current state-of-the-art speaker-dependent model. We also perform user studies to infer speech intelligibility. Additionally, we test the usability of the trained model using bilingual speech.

 DOI: 10.21437/Interspeech.2019-3269

Cite as: Uttam, S., Kumar, Y., Sahrawat, D., Aggarwal, M., Shah, R.R., Mahata, D., Stent, A. (2019) Hush-Hush Speak: Speech Reconstruction Using Silent Videos. Proc. Interspeech 2019, 136-140, DOI: 10.21437/Interspeech.2019-3269.

  author={Shashwat Uttam and Yaman Kumar and Dhruva Sahrawat and Mansi Aggarwal and Rajiv Ratn Shah and Debanjan Mahata and Amanda Stent},
  title={{Hush-Hush Speak: Speech Reconstruction Using Silent Videos}},
  booktitle={Proc. Interspeech 2019},