How to select a good voice for TTS

Sunhee Kim

Although the perceived quality of a speaker's natural voice does not necessarily guarantee the quality of the synthesized speech, a certain number of candidates must be selected on the basis of their natural voice before moving to the evaluation of synthesized sentences. This paper describes a male speaker selection procedure for unit selection synthesis systems in English and Japanese, based on perceptual evaluation and acoustic measurements of the speakers' natural voices. A perceptual evaluation is performed on eight professional voice talents for each language. A total of twenty native-speaker listeners are recruited in both languages, and each listener is asked to rate the speakers on eight analytical factors using a five-point scale and to rank the three best speakers. The acoustic measurement focuses on voice quality and extracts two measures from the Long Term Average Spectrum (LTAS): the so-called Speaker's Formant (SPF), which corresponds to the peak intensity between 3 kHz and 4 kHz, and the Alpha ratio (AR), the level difference between the 0 to 1 kHz and 1 to 4 kHz ranges. The perceptual evaluation results show a very strong correlation between the total score and the preference in both languages: 0.9183 in English and 0.8589 in Japanese. The correlations between the perceptual evaluation and the acoustic measurements are moderate with respect to SPF and AR: 0.473 and -0.494 in English, and 0.288 and -0.263 in Japanese.
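The two acoustic measures named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the Hann-windowed frame averaging, the frame and hop sizes, and the sign convention of the Alpha ratio (level of the 1 to 4 kHz band minus level of the 0 to 1 kHz band) are all assumptions made for illustration, since the abstract gives only the band limits.

```python
import numpy as np


def ltas_db(signal, sample_rate, frame_len=1024, hop=512):
    """Long Term Average Spectrum in dB: mean power spectrum over Hann-windowed frames.

    frame_len and hop are illustrative defaults, not values from the paper.
    """
    window = np.hanning(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    power = np.mean(np.abs(np.fft.rfft(frames, axis=1)) ** 2, axis=0)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    # Small floor avoids log(0) in empty bins.
    return freqs, 10.0 * np.log10(power + 1e-12)


def band_level_db(freqs, spectrum_db, lo_hz, hi_hz):
    """Overall level (dB) of the band [lo_hz, hi_hz), summing power across bins."""
    mask = (freqs >= lo_hz) & (freqs < hi_hz)
    return 10.0 * np.log10(np.sum(10.0 ** (spectrum_db[mask] / 10.0)))


def alpha_ratio_db(freqs, spectrum_db):
    """Level difference between the 1-4 kHz and 0-1 kHz bands.

    Assumed sign convention: high band minus low band.
    """
    high = band_level_db(freqs, spectrum_db, 1000.0, 4000.0)
    low = band_level_db(freqs, spectrum_db, 0.0, 1000.0)
    return high - low


def speakers_formant_db(freqs, spectrum_db):
    """Peak LTAS level (dB) between 3 kHz and 4 kHz."""
    mask = (freqs >= 3000.0) & (freqs <= 4000.0)
    return float(np.max(spectrum_db[mask]))
```

Given one such pair of values per speaker, the reported correlations with the perceptual scores could then be computed with a standard Pearson coefficient, for example `np.corrcoef(scores, measures)[0, 1]`, where `scores` and `measures` are hypothetical per-speaker arrays.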

DOI: 10.21437/SSW.2016-15

Cite as

Kim, S. (2016) How to select a good voice for TTS. Proc. 9th ISCA Speech Synthesis Workshop, 88-92.

author={Sunhee Kim},
title={How to select a good voice for TTS},
booktitle={9th ISCA Speech Synthesis Workshop},