Joint Detection of Sentence Stress and Phrase Boundary for Prosody

Binghuai Lin, Liyuan Wang, Xiaoli Feng, Jinsong Zhang

Prosodic event detection plays an important role in spoken language processing tasks and Computer-Assisted Pronunciation Training (CAPT) systems [1]. Traditional methods for detecting sentence stress and phrase boundaries rely on machine learning models that capture only limited contextual information and take little account of the interaction between these two prosodic events. In this paper, we propose a hierarchical network, based on bidirectional Long Short-Term Memory (BLSTM), that models contextual factors at the phoneme, syllable and word levels. Moreover, to account for the inherent connection between sentence stress and phrase boundaries, we model these two prosodic events jointly in a multitask learning (MTL) framework that shares common prosodic features. We evaluate the network on the Aix-Machine Readable Spoken English Corpus (Aix-MARSEC). Experimental results show that the proposed method achieves an F1-measure of 90% for sentence stress detection and 91% for phrase boundary detection, outperforming the conditional random field (CRF) baseline by about 4% and 9%, respectively.
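The results above are reported as F1-measures, one per prosodic event. As a minimal sketch of how such a score could be computed for binary per-unit labels (1 = event present, 0 = absent), the following uses the standard definition F1 = 2PR / (P + R). The function name `f1_measure` and the toy label sequences are hypothetical illustrations; they are not taken from the paper or from Aix-MARSEC, and the paper's exact scoring protocol is not stated in the abstract.

```python
from typing import Sequence


def f1_measure(reference: Sequence[int], predicted: Sequence[int]) -> float:
    """Return the F1-measure for binary event labels (1 = event, 0 = no event)."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # correctly detected events
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false alarms
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # missed events
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for one six-unit utterance (assumed, for illustration only).
stress_ref = [1, 0, 0, 1, 0, 1]
stress_hyp = [1, 0, 1, 1, 0, 0]
boundary_ref = [0, 0, 1, 0, 0, 1]
boundary_hyp = [0, 0, 1, 0, 0, 1]

stress_f1 = f1_measure(stress_ref, stress_hyp)
boundary_f1 = f1_measure(boundary_ref, boundary_hyp)
```

Scoring each event separately, as here, mirrors the way the abstract reports one figure for sentence stress and another for phrase boundaries, even though the two are predicted jointly.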

DOI: 10.21437/Interspeech.2020-1284

Cite as: Lin, B., Wang, L., Feng, X., Zhang, J. (2020) Joint Detection of Sentence Stress and Phrase Boundary for Prosody. Proc. Interspeech 2020, 4392-4396, DOI: 10.21437/Interspeech.2020-1284.

  author={Binghuai Lin and Liyuan Wang and Xiaoli Feng and Jinsong Zhang},
  title={{Joint Detection of Sentence Stress and Phrase Boundary for Prosody}},
  booktitle={Proc. Interspeech 2020},