Speaker Adaptation of Various Components in Deep Neural Network based Speech Synthesis

Shinji Takaki, SangJin Kim, Junichi Yamagishi


In this paper, we investigate the effectiveness of speaker adaptation for various essential components in deep neural network (DNN) based speech synthesis, including acoustic models, acoustic feature extraction, and post-filters. In general, a speaker adaptation technique, e.g., maximum likelihood linear regression (MLLR) for HMMs or learning hidden unit contributions (LHUC) for DNNs, is applied to the acoustic modeling part to change voice characteristics or speaking styles. However, since we have proposed a multiple DNN-based speech synthesis system, in which several components are represented by feed-forward DNNs, a speaker adaptation technique can be applied not only to the acoustic modeling part but also to the other components represented by DNNs. In experiments using a small amount of adaptation data, we performed adaptation based on LHUC and simple additional fine-tuning for DNN-based acoustic models, deep auto-encoder based feature extraction, and DNN-based post-filter models, and compared them with HMM-based speech synthesis systems using MLLR.
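
For readers unfamiliar with LHUC, the following is a minimal sketch of the idea as it is commonly described: each hidden unit of a speaker-independent feed-forward layer is rescaled by a speaker-dependent amplitude 2*sigmoid(r), and only the parameters r are learned from the adaptation data. The abstract does not specify the exact formulation used in the paper, so the amplitude function, the function names (`hidden_layer`, `lhuc_scale`), and all numeric values here are illustrative assumptions, and the training of r is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_layer(inputs, weights, biases):
    """Speaker-independent hidden layer: sigmoid(W x + b)."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def lhuc_scale(hidden, r):
    """Rescale each hidden unit by a speaker-dependent amplitude in (0, 2).

    Assumption: the amplitude is 2 * sigmoid(r_k), the commonly described
    LHUC parameterisation; r_k = 0 gives amplitude 1 (no change).
    """
    return [2.0 * sigmoid(r_k) * h_k for r_k, h_k in zip(r, hidden)]


# Toy, made-up values for a layer with 2 inputs and 2 hidden units.
weights = [[0.5, -0.2], [0.1, 0.3]]
biases = [0.0, 0.1]
x = [1.0, 2.0]

h = hidden_layer(x, weights, biases)

# Before adaptation: r = 0, so the layer output is unchanged.
r_init = [0.0, 0.0]
h_init = lhuc_scale(h, r_init)

# After adaptation (hypothetical learned values): the first unit is
# amplified and the second is attenuated for this speaker.
r_adapted = [0.8, -0.5]
h_adapted = lhuc_scale(h, r_adapted)
```

Because only one scalar per hidden unit is updated while the weights stay fixed, this style of adaptation has few free parameters, which is why it is often considered when adaptation data are scarce.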


DOI: 10.21437/SSW.2016-25

Cite as

Takaki, S., Kim, S., Yamagishi, J. (2016) Speaker Adaptation of Various Components in Deep Neural Network based Speech Synthesis. Proc. 9th ISCA Speech Synthesis Workshop, 153-159.

Bibtex
@inproceedings{Takaki+2016,
author={Shinji Takaki and SangJin Kim and Junichi Yamagishi},
title={Speaker Adaptation of Various Components in Deep Neural Network based Speech Synthesis},
year=2016,
booktitle={9th ISCA Speech Synthesis Workshop},
doi={10.21437/SSW.2016-25},
url={http://dx.doi.org/10.21437/SSW.2016-25},
pages={153--159}
}