Word-based Neural Prosody Modeling with ToBI

Hee Hwang, Kristine Yu

We present a neural model of American English intonation using the discrete tonal transcription system MAE-ToBI. The model uses the words and tonal sequences of the MAE-ToBI annotated portion of the Boston University Radio Speech Corpus. We took as a starting point Dainora's probabilistic finite-state grammar of the tonal sequences of two speakers in the corpus. We extended Dainora's grammar to cover all six speakers in the corpus and found that the bigram probabilities, and the differences in the distribution of tones between pre-nuclear and nuclear intermediate phrases, showed the same patterns as Dainora's results. To expand beyond her work, we built a word-based Long Short-Term Memory (LSTM) neural model that predicts the MAE-ToBI sequence within an intermediate phrase. We used both randomly initialized word embeddings and pre-trained word embeddings from BERT, a bidirectional transformer. With BERT embeddings, the model achieved 80.58%, 99.79%, and 90.74% accuracy in detecting pitch accents, intermediate phrase boundaries, and intonational phrase boundaries, respectively. These results demonstrate that it is possible to predict prosody given only text and discrete prosodic labels, without acoustic information. The improvement with BERT demonstrates how adding unannotated text data to a small prosodically annotated corpus could be leveraged for prosodic modeling in low-resource languages.
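As an illustration of the bigram probabilities over tonal sequences mentioned in the abstract, the following is a minimal sketch of maximum-likelihood bigram estimation over tone labels. It is not the authors' implementation: the function name `bigram_probabilities`, the boundary markers `<s>` and `</s>`, and the toy phrases are assumptions made up for this example and are not drawn from the Boston University Radio Speech Corpus.

```python
from collections import Counter, defaultdict


def bigram_probabilities(phrases):
    """Estimate P(next_tone | tone) by relative frequency.

    `phrases` is a list of tone-label sequences, one per phrase.
    Hypothetical helper for illustration only.
    """
    pair_counts = Counter()
    context_counts = Counter()
    for tones in phrases:
        # Pad with assumed start/end markers so phrase edges are modeled too.
        padded = ["<s>"] + list(tones) + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            pair_counts[(prev, nxt)] += 1
            context_counts[prev] += 1

    probs = defaultdict(dict)
    for (prev, nxt), count in pair_counts.items():
        probs[prev][nxt] = count / context_counts[prev]
    return dict(probs)


# Toy, made-up sequences of ToBI-style labels (not corpus data).
toy_phrases = [
    ["H*", "H*", "L-"],
    ["H*", "L+H*", "L-"],
    ["L+H*", "H-"],
    ["H*", "L-"],
]

probs = bigram_probabilities(toy_phrases)
for prev in sorted(probs):
    for nxt in sorted(probs[prev]):
        print(f"P({nxt} | {prev}) = {probs[prev][nxt]:.2f}")
```

Each row of the resulting table sums to 1 by construction, so it can be read as the transition distribution of a simple probabilistic finite-state model over tones.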

DOI: 10.21437/SpeechProsody.2020-208

Cite as: Hwang, H., Yu, K. (2020) Word-based Neural Prosody Modeling with ToBI. Proc. 10th International Conference on Speech Prosody 2020, 1019-1023, DOI: 10.21437/SpeechProsody.2020-208.

  author={Hee Hwang and Kristine Yu},
  title={{Word-based Neural Prosody Modeling with ToBI}},
  booktitle={Proc. 10th International Conference on Speech Prosody 2020},