INTERSPEECH 2006 - ICSLP
We review the approaches that have been used to define the target cost in unit selection speech synthesis and show that the existing definitions differ and are sometimes incompatible. We propose that this cost should be thought of as a measure of how similar two units sound to a human listener. We discuss which features should be used in unit selection, and the pros and cons of using derived features such as F0. We then examine some algorithms used to calculate target costs and show that none is ideal for the problem. Finally, we propose a new solution that uses a neural network to synthesise points in acoustic space, around which we can build new clusters of units at run time.
Bibliographic reference. Taylor, Paul (2006): "The target cost formulation in unit selection speech synthesis", In INTERSPEECH-2006, paper 1455-Wed3BuP.4.