A Template-Based Approach for Speech Synthesis Intonation Generation Using LSTMs

Srikanth Ronanki, Gustav Eje Henter, Zhizheng Wu, Simon King

The absence of convincing intonation makes current parametric speech synthesis systems sound dull and lifeless, even when trained on expressive speech data. Typically, these systems use regression techniques to predict the fundamental frequency (F0) frame by frame. This approach leads to overly smooth pitch contours and fails to construct an appropriate prosodic structure across the full utterance. To capture and reproduce larger-scale pitch patterns, this paper proposes a template-based approach for automatic F0 generation, where per-syllable pitch-contour templates (from a small, automatically learned set) are predicted by a recurrent neural network (RNN). The use of syllable templates mitigates the over-smoothing problem and can reproduce pitch patterns observed in the data. The use of an RNN, paired with connectionist temporal classification (CTC), enables the prediction of structure in the pitch contour spanning the entire utterance. This novel F0 prediction system is used alongside separate LSTMs for predicting phone durations and the other acoustic features, to construct a complete text-to-speech system. We report the results of objective and subjective tests on an expressive speech corpus of children’s audiobooks, and include comparisons to a conventional baseline that predicts F0 directly at the frame level.
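As a rough illustration of the template idea described in the abstract, the sketch below resamples variable-length per-syllable F0 contours to a fixed length, learns a small set of templates, and maps each syllable to its nearest template. The abstract does not say how the template set is learned, so the plain k-means clustering used here is an assumption, as are the function names (`resample_contour`, `learn_templates`, `nearest_template`), the 10-point contour length, and the toy data. The RNN with CTC that predicts the template sequence is not shown.

```python
import numpy as np

def resample_contour(f0, n_points=10):
    """Linearly resample a variable-length F0 contour to n_points samples."""
    f0 = np.asarray(f0, dtype=float)
    src = np.linspace(0.0, 1.0, num=len(f0))
    dst = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(dst, src, f0)

def learn_templates(contours, n_templates=3, n_iters=20, seed=0):
    """Cluster fixed-length contours with plain k-means (an assumed choice)."""
    rng = np.random.default_rng(seed)
    data = np.stack(contours)
    # Initialise centroids from randomly chosen contours.
    start = rng.choice(len(data), size=n_templates, replace=False)
    centroids = data[start].copy()
    for _ in range(n_iters):
        # Squared Euclidean distance from every contour to every centroid.
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_templates):
            members = data[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids

def nearest_template(contour, templates):
    """Return the index of the template closest to a fixed-length contour."""
    dists = ((templates - contour) ** 2).sum(axis=1)
    return int(dists.argmin())

# Toy, hypothetical data: rising, falling, and flat syllable contours
# of different durations (values stand in for normalised F0).
toy_syllables = (
    [np.linspace(-1.0, 1.0, n) for n in (6, 9, 14)]
    + [np.linspace(1.0, -1.0, n) for n in (5, 11, 17)]
    + [np.zeros(n) for n in (7, 8, 12)]
)
contours = [resample_contour(s) for s in toy_syllables]
templates = learn_templates(contours, n_templates=3)
labels = [nearest_template(c, templates) for c in contours]
```

In a system of this kind, the `labels` sequence would serve as the per-syllable target that a sequence model is trained to predict; at synthesis time the predicted templates would be stretched back to each syllable's duration to form the utterance-level F0 contour.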

DOI: 10.21437/Interspeech.2016-96

Cite as

Ronanki, S., Henter, G.E., Wu, Z., King, S. (2016) A Template-Based Approach for Speech Synthesis Intonation Generation Using LSTMs. Proc. Interspeech 2016, 2463-2467.

author={Srikanth Ronanki and Gustav Eje Henter and Zhizheng Wu and Simon King},
title={A Template-Based Approach for Speech Synthesis Intonation Generation Using LSTMs},
booktitle={Interspeech 2016},