Joint Prediction of Punctuation and Disfluency in Speech Transcripts

Binghuai Lin, Liyuan Wang


Spoken language transcripts generated by automatic speech recognition (ASR) often contain a large portion of disfluencies and lack punctuation symbols. Punctuation restoration and disfluency removal in such transcripts can facilitate downstream tasks such as machine translation, information extraction and syntactic analysis [1]. Various studies have shown the mutual influence between these two tasks and have therefore modeled them within a multi-task learning (MTL) framework [2, 3], which learns general representations in shared layers and separate representations in task-specific layers. However, the dependencies between the tasks are normally ignored in the task-specific layers. To model these dependencies, we propose an attention-based structure in the task-specific layers of the MTL framework, incorporating the pretrained BERT model (a state-of-the-art NLP model) [4]. Experimental results on the English IWSLT dataset and the Switchboard dataset show that the proposed architecture outperforms both separate modeling methods and traditional MTL methods.
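
As a rough illustration of the architecture described in the abstract, the sketch below shows one plausible PyTorch reading: a shared pretrained BERT encoder feeding two task-specific heads, with cross-attention between the heads so that each task can condition on the other's representations. This is not the authors' released code; the label counts, layer sizes, and exact attention wiring are assumptions for illustration only.

# Minimal sketch (assumptions: bert-base encoder, 4 punctuation tags, 2 disfluency tags,
# cross-attention between the two task-specific heads to model task dependencies).
import torch
import torch.nn as nn
from transformers import BertModel


class JointPunctDisfluencyTagger(nn.Module):
    def __init__(self, n_punct_labels=4, n_disfl_labels=2, hidden=768, heads=8):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # shared layers
        # Task-specific projections on top of the shared representations.
        self.punct_proj = nn.Linear(hidden, hidden)
        self.disfl_proj = nn.Linear(hidden, hidden)
        # Cross-attention lets each task attend to the other task's representations.
        self.punct_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.disfl_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.punct_out = nn.Linear(hidden, n_punct_labels)
        self.disfl_out = nn.Linear(hidden, n_disfl_labels)

    def forward(self, input_ids, attention_mask):
        shared = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h_punct = torch.relu(self.punct_proj(shared))
        h_disfl = torch.relu(self.disfl_proj(shared))
        pad_mask = attention_mask == 0
        # Each head queries the other task's representation before tagging.
        p_ctx, _ = self.punct_attn(h_punct, h_disfl, h_disfl, key_padding_mask=pad_mask)
        d_ctx, _ = self.disfl_attn(h_disfl, h_punct, h_punct, key_padding_mask=pad_mask)
        punct_logits = self.punct_out(h_punct + p_ctx)   # per-token punctuation tags
        disfl_logits = self.disfl_out(h_disfl + d_ctx)   # per-token disfluency tags
        return punct_logits, disfl_logits

In this reading, the two per-token outputs would be trained jointly with a summed cross-entropy loss, which is one common way to realize the MTL setup the abstract describes.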


DOI: 10.21437/Interspeech.2020-1277

Cite as: Lin, B., Wang, L. (2020) Joint Prediction of Punctuation and Disfluency in Speech Transcripts. Proc. Interspeech 2020, 716-720, DOI: 10.21437/Interspeech.2020-1277.


@inproceedings{Lin2020,
  author={Binghuai Lin and Liyuan Wang},
  title={{Joint Prediction of Punctuation and Disfluency in Speech Transcripts}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={716--720},
  doi={10.21437/Interspeech.2020-1277},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1277}
}