EUROSPEECH 2001 Scandinavia
7th European Conference on Speech Communication and Technology

Aalborg, Denmark
September 3-7, 2001


Perceptual Cost Functions for Unit Searching in Large Corpus-based Text-to-Speech

Minkyu Lee

Bell Labs, Lucent Technologies, USA

In large corpus-based concatenative Text-to-Speech, unit selection is critical to the quality of synthetic speech. Dynamic programming algorithms have been used for unit searching by minimizing a total cost that combines (1) the cost between the target specification and candidate units and (2) the cost between candidate units for concatenation. The cost function is often a weighted sum of sub-costs, one for each acoustic and phonetic feature of the units. The weights control the individual contribution of each sub-cost to the total cost. They also reflect the relative sensitivity of each feature to quality degradation when signal processing is applied to modify that feature. However, determining the weights for the cost function has not been a simple task. In this paper, we propose a new method for unit searching based on a perceptual preference test. The proposed algorithm is designed to find the weights in a more systematic and meaningful way. It searches for a set of weights that produces a ranking of renditions close to the perceptual test results. The downhill simplex method is used for the multi-dimensional search over the weights. A dissimilarity measure is proposed to evaluate the closeness of two rankings. In about 83 percent of the cases, the unit selection algorithm using the optimal set of weights chooses the same rendition that human listeners prefer. The results show that the proposed weight optimization algorithm can successfully predict the human preference pattern. Synthetic speech using the optimal weights consistently showed smoother transitions and higher voice quality than speech using manually determined weights.
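The weighted-sum cost and the ranking comparison described in the abstract can be sketched as follows. This is a minimal illustration and not the paper's implementation: the abstract does not define the dissimilarity measure, so the sum of absolute rank differences (Spearman's footrule) is assumed here, and the rendition names, sub-cost values, and listener ranking are hypothetical.

```python
from typing import Dict, List, Sequence


def total_cost(sub_costs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the sub-costs of one rendition."""
    if len(sub_costs) != len(weights):
        raise ValueError("sub_costs and weights must have the same length")
    return sum(w * c for w, c in zip(weights, sub_costs))


def rank_by_cost(renditions: Dict[str, Sequence[float]],
                 weights: Sequence[float]) -> List[str]:
    """Rendition names ordered from lowest to highest total cost."""
    return sorted(renditions, key=lambda name: total_cost(renditions[name], weights))


def ranking_dissimilarity(a: List[str], b: List[str]) -> int:
    """Assumed measure: sum of absolute differences in rank position."""
    position_in_b = {name: i for i, name in enumerate(b)}
    return sum(abs(i - position_in_b[name]) for i, name in enumerate(a))


# Hypothetical data: two sub-costs per rendition and a listener ranking
# (most preferred rendition first).
renditions = {
    "r1": [0.2, 0.9],
    "r2": [0.5, 0.1],
    "r3": [0.8, 0.5],
}
listener_ranking = ["r2", "r1", "r3"]


def objective(weights: Sequence[float]) -> int:
    """Dissimilarity between the cost-based ranking and the listener ranking."""
    return ranking_dissimilarity(rank_by_cost(renditions, weights), listener_ranking)


print(objective([1.0, 0.0]))  # ranking r1, r2, r3 differs from the listeners'
print(objective([0.5, 0.5]))  # ranking r2, r1, r3 matches the listeners'
```

A downhill simplex search would then look for the weights that minimize `objective`. One readily available implementation is `scipy.optimize.minimize` with `method="Nelder-Mead"`; whether the paper used a comparable routine is not stated in the abstract. Because this objective is piecewise constant in the weights, a derivative-free method is a natural fit, although the search may need several starting points.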


Bibliographic reference.  Lee, Minkyu (2001): "Perceptual cost functions for unit searching in large corpus-based text-to-speech", In EUROSPEECH-2001, 2227-2230.