On the impact of phoneme alignment in DNN-based speech synthesis

Mei Li, Zhizheng Wu, Lei Xie


Recently, deep neural networks (DNNs) have significantly improved the performance of acoustic modeling in statistical parametric speech synthesis (SPSS). However, in current implementations, training a DNN-based speech synthesis system requires the phonetic transcript to be aligned with the corresponding speech frames to obtain the phonetic segmentation, a step called phoneme alignment. Such an alignment is usually obtained by forced alignment based on hidden Markov models (HMMs), since manual alignment is labor-intensive and time-consuming. In this work, we study the impact of phoneme alignment on DNN-based speech synthesis. Specifically, we compare DNN-based speech synthesis systems built with manual alignment and with HMM-based forced alignment from three types of labels: mono-phone, tri-phone and full-context. Objective and subjective evaluations are conducted in terms of the naturalness of the synthesized speech to compare the performance of the different alignments.
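HMM-based forced alignment constrains the Viterbi search to the known phone sequence of the utterance: each frame either stays in the current phone state or advances to the next one, and the best-scoring monotonic path gives the segmentation. As an illustrative sketch only (not the paper's implementation), assuming toy per-frame log-likelihoods are already available for each state in the transcript, the constrained search could look like:

```python
import math

def forced_align(log_likes):
    """Toy forced alignment by constrained Viterbi search.

    log_likes[t][s] is the log-likelihood of frame t under state s,
    where states correspond to the known phone sequence and must be
    visited strictly in order. Returns one state index per frame.
    """
    T, S = len(log_likes), len(log_likes[0])
    NEG = float("-inf")
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_likes[0][0]  # alignment must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s]                       # remain in state s
            adv = score[t - 1][s - 1] if s > 0 else NEG  # advance from s-1
            prev, frm = (adv, s - 1) if adv > stay else (stay, s)
            score[t][s] = prev + log_likes[t][s]
            back[t][s] = frm
    # backtrace from the final state, which the alignment must end in
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

For example, four frames over two states, where the first two frames favor state 0 and the last two favor state 1, align as `[0, 0, 1, 1]`; the phone boundary is read off where the state index changes. Real toolkits work with context-dependent HMM states and Gaussian or DNN observation models, but the monotonic path constraint is the same.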


DOI: 10.21437/SSW.2016-32

Cite as

Li, M., Wu, Z., Xie, L. (2016) On the impact of phoneme alignment in DNN-based speech synthesis. Proc. 9th ISCA Speech Synthesis Workshop, 196-201.

BibTeX
@inproceedings{Li+2016,
  author={Mei Li and Zhizheng Wu and Lei Xie},
  title={On the impact of phoneme alignment in {DNN}-based speech synthesis},
  year={2016},
  booktitle={9th ISCA Speech Synthesis Workshop},
  doi={10.21437/SSW.2016-32},
  url={http://dx.doi.org/10.21437/SSW.2016-32},
  pages={196--201}
}