Objective Evaluation Methods for Chinese Text-To-Speech Systems

Teng Zhang, Zhipeng Chen, Ji Wu, Sam Lai, Wenhui Lei, Carsten Isert

To objectively evaluate the performance of text-to-speech (TTS) systems, many studies have taken the straightforward approach of comparing synthesized speech with natural speech after alignment. However, in most situations no natural speech is available for comparison. In this paper, we focus on machine learning approaches to TTS evaluation. We exploit a subspace decomposition method to separate the different components of speech, which automatically generates distinctive acoustic features. Furthermore, a pairwise Support Vector Machine (SVM) model is used to evaluate TTS systems. With the original prosodic acoustic features and a Support Vector Regression model, we obtain a ranking relevance of 0.7709. With the proposed oblique matrix projection method and the pairwise SVM model, we achieve a much better result of 0.9115.
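
The pairwise idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional feature vectors and the listener scores are invented toy values, the function names are hypothetical, and a plain subgradient-descent trainer stands in for whatever SVM solver the paper used. The sketch turns per-system data into labelled difference vectors, fits a linear SVM on them, and ranks systems by the learned utility.

```python
from itertools import combinations


def pairwise_differences(features, scores):
    """Turn per-system (feature, score) data into labelled difference vectors."""
    pairs = []
    for i, j in combinations(range(len(features)), 2):
        if scores[i] == scores[j]:
            continue  # ties carry no ordering information
        diff = [a - b for a, b in zip(features[i], features[j])]
        label = 1 if scores[i] > scores[j] else -1
        pairs.append((diff, label))
        # Mirrored copy so that both classes are always present.
        pairs.append(([-d for d in diff], -label))
    return pairs


def train_linear_svm(pairs, lam=0.01, epochs=200, lr=0.1):
    """Subgradient descent on the L2-regularised hinge loss (no bias term)."""
    w = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        for x, y in pairs:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            for k in range(len(w)):
                hinge_grad = -y * x[k] if margin < 1 else 0.0
                w[k] -= lr * (lam * w[k] + hinge_grad)
    return w


def rank_systems(features, w):
    """Return system indices sorted from best to worst predicted utility."""
    utility = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
    return sorted(range(len(features)), key=lambda i: utility[i], reverse=True)


# Toy data (assumed, not from the paper): one row per TTS system.
features = [[0.6, 0.4], [0.1, 0.9], [0.9, 0.2], [0.3, 0.5]]
scores = [3.6, 2.1, 4.2, 2.9]  # hypothetical subjective ratings

w = train_linear_svm(pairwise_differences(features, scores))
ranking = rank_systems(features, w)
print(ranking)
```

Because the model only learns from the sign of each score difference, it targets the ordering of systems directly rather than their absolute scores, which is the usual motivation for preferring a pairwise formulation over regression when the goal is ranking.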

DOI: 10.21437/Interspeech.2016-421

Cite as

Zhang, T., Chen, Z., Wu, J., Lai, S., Lei, W., Isert, C. (2016) Objective Evaluation Methods for Chinese Text-To-Speech Systems. Proc. Interspeech 2016, 332-336.

author={Teng Zhang and Zhipeng Chen and Ji Wu and Sam Lai and Wenhui Lei and Carsten Isert},
title={Objective Evaluation Methods for Chinese Text-To-Speech Systems},
booktitle={Interspeech 2016},