This paper proposes a novel neural network structure for speech synthesis, in which spectrum, F0 and duration parameters are simultaneously modeled in a unified framework. In conventional neural network approaches, spectrum and F0 parameters are predicted by neural networks, while phone and/or state durations are given by external duration predictors. In order to consistently model not only spectrum and F0 parameters but also durations, we adopt a special type of mixture density network (MDN) structure, which models utterance-level probability density functions conditioned on the corresponding input feature sequence. This is achieved by modeling the conditional probability distribution of utterance-level output features, given input features, with a hidden semi-Markov model whose parameters are generated by a neural network trained with a log-likelihood-based loss function. Variations of the proposed neural network structure are also discussed. Subjective listening test results show that the proposed approach improves the naturalness of synthesized speech.
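The core training idea in an MDN is to have the network emit the parameters of a probability distribution and minimize the negative log-likelihood of the observed targets. As a minimal illustration (a per-frame 1-D Gaussian mixture sketch, not the paper's utterance-level HSMM formulation, and with hypothetical function names), the loss can be computed as:

```python
import math

def mdn_nll(weights, means, log_vars, target):
    """Negative log-likelihood of a scalar target under a 1-D Gaussian
    mixture whose parameters (mixture weights, means, log-variances) are
    the outputs of a neural network. Illustrative sketch only."""
    # Per-component log density: log w_k + log N(target; mu_k, sigma_k^2)
    log_probs = [
        math.log(w)
        - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * lv
        - 0.5 * (target - m) ** 2 / math.exp(lv)
        for w, m, lv in zip(weights, means, log_vars)
    ]
    # Log-sum-exp over components for numerical stability
    mx = max(log_probs)
    log_mix = mx + math.log(sum(math.exp(lp - mx) for lp in log_probs))
    return -log_mix

# A single component centered on the target with unit variance gives the
# minimum attainable NLL for that variance, 0.5 * log(2*pi):
loss = mdn_nll([1.0], [0.0], [0.0], 0.0)
```

In the paper's setting, the same principle applies at the utterance level: the network outputs HSMM parameters, and the loss is the negative log-likelihood of the whole output feature sequence under that HSMM, so durations are marginalized rather than supplied by a separate predictor.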

DOI: `10.21437/SSW.2016-18`

Cite as

Tokuda, K., Hashimoto, K., Oura, K., Nankaku, Y. (2016) Temporal modeling in neural network based statistical parametric speech synthesis. Proc. 9th ISCA Speech Synthesis Workshop, 106-111.

Bibtex

@inproceedings{Tokuda+2016,
  author    = {Keiichi Tokuda and Kei Hashimoto and Keiichiro Oura and Yoshihiko Nankaku},
  title     = {Temporal modeling in neural network based statistical parametric speech synthesis},
  year      = {2016},
  booktitle = {9th ISCA Speech Synthesis Workshop},
  pages     = {106--111},
  doi       = {10.21437/SSW.2016-18},
  url       = {http://dx.doi.org/10.21437/SSW.2016-18}
}