Neural Speech Completion

Kazuki Tsunematsu, Johanes Effendi, Sakriani Sakti, Satoshi Nakamura


During a conversation, humans often predict the end of a sentence even before the other person has finished speaking. In contrast, most current automatic speech recognition systems remain limited to passively recognizing what is being said. However, applications such as voice search, simultaneous speech translation, and spoken language communication may require a system that not only recognizes what has been said but also predicts what will be said. This paper proposes a speech completion system based on deep learning and discusses its construction in text-to-text, speech-to-text, and speech-to-speech frameworks. We evaluate our system on domain-specific sentences with synthesized speech utterances that are only 25%, 50%, or 75% complete. Our proposed systems provide more natural suggestions than the Bidirectional Encoder Representations from Transformers (BERT) language representation model.
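The evaluation described above uses utterances that are only 25%, 50%, or 75% complete. As a minimal sketch of how such partial inputs might be prepared, the following assumes word-level truncation by fraction of sentence length; the abstract does not specify the exact truncation procedure, so this is an illustrative assumption, not the authors' method.

```python
def truncate_sentence(sentence: str, fraction: float) -> str:
    """Keep roughly the first `fraction` of a sentence's words.

    Illustrative word-level truncation; the paper may truncate
    differently (e.g., at the acoustic-frame or subword level).
    """
    words = sentence.split()
    # Keep at least one word so the completion model has some context.
    keep = max(1, int(len(words) * fraction))
    return " ".join(words[:keep])


if __name__ == "__main__":
    sentence = "please find me a flight from tokyo to san francisco tomorrow"
    for frac in (0.25, 0.50, 0.75):
        print(f"{int(frac * 100)}% complete: {truncate_sentence(sentence, frac)}")
```

A completion system would then be asked to predict the remainder of each truncated input, and the suggestions can be compared against those of a baseline such as BERT.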


 DOI: 10.21437/Interspeech.2020-2110

Cite as: Tsunematsu, K., Effendi, J., Sakti, S., Nakamura, S. (2020) Neural Speech Completion. Proc. Interspeech 2020, 2742-2746, DOI: 10.21437/Interspeech.2020-2110.


@inproceedings{Tsunematsu2020,
  author={Kazuki Tsunematsu and Johanes Effendi and Sakriani Sakti and Satoshi Nakamura},
  title={{Neural Speech Completion}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={2742--2746},
  doi={10.21437/Interspeech.2020-2110},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2110}
}