September 22-25, 1997
To eliminate trial and error in selecting a speech database as the voice source for concatenative speech synthesis, and to determine which acoustic and prosodic characteristics best predict the 'appeal' or perceived 'quality' of the synthesised speech, we performed tests to evaluate listener preferences over a range of different synthesised voices. We found that the best correlates of listener preference were the variation of fundamental frequency in the source database and the open quotient of the glottis as measured by joint estimation (ARX). To our surprise, there was very little correlation between the scores for 'intelligibility' and those for 'naturalness' in the test data; the former correlated closely with durational characteristics, and the latter with pitch and loudness.
Bibliographic reference. Campbell, Nick / Itoh, Yoshiharu / Ding, Wen / Higuchi, Norio (1997): "Factors affecting perceived quality and intelligibility in the CHATR concatenative speech synthesiser", In EUROSPEECH-1997, 2635-2638.