Joint Prediction of Punctuation and Disfluency in Speech Transcripts

Binghuai Lin, Liyuan Wang

Spoken language transcripts generated by automatic speech recognition (ASR) often contain many disfluencies and lack punctuation. Restoring punctuation and removing disfluencies from the transcripts can facilitate downstream tasks such as machine translation, information extraction and syntactic analysis [1]. Various studies have shown that these two tasks influence each other and have therefore modeled them in a multi-task learning (MTL) framework [2, 3], which learns general representations in the shared layers and separate representations in the task-specific layers. However, dependencies between the tasks are normally ignored in the task-specific layers. To model these task dependencies, we propose an attention-based structure in the task-specific layers of the MTL framework, incorporating the pretrained BERT (a state-of-the-art NLP model) [4]. Experimental results on the English IWSLT dataset and the Switchboard dataset show that the proposed architecture outperforms both separate modeling methods and traditional MTL methods.
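The abstract's core idea, a shared encoder with task-specific layers whose dependencies are coupled via attention, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the projection matrices, the dot-product attention form, the concatenation-based fusion and the four punctuation classes are all illustrative assumptions, and random vectors stand in for the pretrained BERT encoder output.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d = 8, 16  # sequence length, hidden size
H = rng.standard_normal((n, d))  # stand-in for shared (BERT) encoder states

# Task-specific projections (hypothetical parameterization).
W_punct = rng.standard_normal((d, d)) / np.sqrt(d)
W_disfl = rng.standard_normal((d, d)) / np.sqrt(d)
P = np.tanh(H @ W_punct)  # punctuation-specific representation, (n, d)
D = np.tanh(H @ W_disfl)  # disfluency-specific representation, (n, d)

# Cross-task attention: punctuation states query the disfluency states,
# so punctuation decisions can condition on likely disfluent regions.
scores = P @ D.T / np.sqrt(d)  # (n, n) scaled dot-product scores
A = softmax(scores, axis=-1)   # attention weights, rows sum to 1
P_aug = np.concatenate([P, A @ D], axis=-1)  # fused features, (n, 2d)

# Per-token punctuation logits (assumed classes: none/comma/period/question).
W_out = rng.standard_normal((2 * d, 4)) / np.sqrt(2 * d)
logits = P_aug @ W_out  # (n, 4)
```

A symmetric attention block (disfluency queries attending over punctuation states) would complete the bidirectional task coupling; in practice all matrices would be trained jointly with the shared encoder under the MTL objective.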

 DOI: 10.21437/Interspeech.2020-1277

Cite as: Lin, B., Wang, L. (2020) Joint Prediction of Punctuation and Disfluency in Speech Transcripts. Proc. Interspeech 2020, 716-720, DOI: 10.21437/Interspeech.2020-1277.

@inproceedings{lin2020joint,
  author={Binghuai Lin and Liyuan Wang},
  title={{Joint Prediction of Punctuation and Disfluency in Speech Transcripts}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={716--720},
  doi={10.21437/Interspeech.2020-1277}
}