Contextual Representation using Recurrent Neural Network Hidden State for Statistical Parametric Speech Synthesis

Sivanand Achanta, Rambabu Banoth, Ayushi Pandey, Anandaswarup Vadapalli, Suryakanth V Gangashetty


In this paper, we propose to use the hidden state vector obtained from a recurrent neural network (RNN) as a context-vector representation for deep neural network (DNN) based statistical parametric speech synthesis. In a typical DNN-based system there is a hierarchy of text features, from the phone level to the utterance level, but they are usually in a 1-hot-k encoded representation. Our hypothesis is that supplementing the conventional text features with a continuous, frame-level, acoustically guided representation would improve the acoustic modeling. The hidden state from an RNN trained to predict acoustic features is used as the additional contextual information. A dataset consisting of two Indian languages (Telugu and Hindi) from the Blizzard Challenge 2015 was used in our experiments. Both the subjective listening tests and the objective scores indicate that the proposed approach performs significantly better than the baseline DNN system.
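The pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the simple Elman RNN, and the random parameters (standing in for an RNN already trained to predict acoustic features) are all assumptions. The key step is concatenating the per-frame RNN hidden state with the conventional frame-level text features to form the DNN input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 100 frame-level
# linguistic features, 64 RNN hidden units, 200 frames.
feat_dim, hidden_dim, n_frames = 100, 64, 200

# Frame-level linguistic features for one utterance (T x feat_dim).
x = rng.standard_normal((n_frames, feat_dim))

# Parameters of an Elman RNN assumed to have been trained to predict
# acoustic features from x (training loop omitted for brevity).
W_xh = rng.standard_normal((feat_dim, hidden_dim)) * 0.01
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.01
b_h = np.zeros(hidden_dim)

def rnn_hidden_states(x):
    """Run the RNN forward and collect the hidden state at every frame."""
    h = np.zeros(hidden_dim)
    states = []
    for x_t in x:
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)  # (T, hidden_dim)

# Augment the conventional text features with the RNN context vectors;
# this concatenation becomes the input to the DNN acoustic model.
context = rnn_hidden_states(x)
dnn_input = np.concatenate([x, context], axis=1)
print(dnn_input.shape)  # (200, 164)
```

Because the RNN was trained against acoustic targets, its hidden state provides a continuous, acoustically guided summary of the left context at each frame, complementing the discrete 1-hot-k text features.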


DOI: 10.21437/SSW.2016-28

Cite as

Achanta, S., Banoth, R., Pandey, A., Vadapalli, A., Gangashetty, S.V. (2016) Contextual Representation using Recurrent Neural Network Hidden State for Statistical Parametric Speech Synthesis. Proc. 9th ISCA Speech Synthesis Workshop, 172-177.

BibTeX
@inproceedings{Achanta+2016,
  author={Sivanand Achanta and Rambabu Banoth and Ayushi Pandey and Anandaswarup Vadapalli and Suryakanth V Gangashetty},
  title={Contextual Representation using Recurrent Neural Network Hidden State for Statistical Parametric Speech Synthesis},
  year={2016},
  booktitle={9th ISCA Speech Synthesis Workshop},
  doi={10.21437/SSW.2016-28},
  url={http://dx.doi.org/10.21437/SSW.2016-28},
  pages={172--177}
}