Second ESCA/IEEE Workshop on Speech Synthesis

September 12-15, 1994
Mohonk Mountain House, New Paltz, NY, USA

Perceptual Evaluation of Synthetic Speech: What Have We Learned over the Last 15 Years and Where Are We Going in the Future?

David B. Pisoni

Speech Research Laboratory, Department of Psychology, Indiana University, Bloomington, IN, USA

In this presentation, I will first summarize the results we have obtained over the last 15 years on the perception of synthetic speech produced by rule. A wide variety of behavioral studies have been carried out on phoneme intelligibility, word recognition and comprehension to learn more about how human listeners perceive and understand various kinds of synthetic speech produced by rule. While some of this research, particularly the studies on segmental intelligibility, has been directed toward applied issues dealing with perceptual evaluation and assessment of different synthesis systems, other aspects of our research program over the years have been more theoretically motivated and were designed to learn more about speech perception and spoken language comprehension. Our findings have shown that the perception of synthetic speech depends on several general factors, including the acoustic-phonetic properties of the speech signal, the specific cognitive demands of the information-processing task the listener is asked to perform, and the previous background and experience of the listener. Some suggestions for future research on improving naturalness, intelligibility and comprehension will be offered in light of several recent findings from our laboratory on the role of stimulus variability and the contribution of indexical factors to speech perception and spoken word recognition. These findings demonstrate that human listeners encode and preserve very fine talker-specific details in memory; these details appear to be employed in the perceptual analysis of natural speech and may need to be retained and carefully reproduced in the next generation of synthesis-by-rule systems in order to achieve levels of intelligibility and naturalness that are comparable to natural speech. Our perceptual findings have shown the importance of behavioral testing with human listeners as an integral component of evaluation and assessment techniques in synthesis research.
To improve synthetic speech in the future, we believe it will be necessary to incorporate much more acoustic-phonetic detail into the synthesis rules. Variability is useful and informative, and human listeners employ it as a source of information in speech perception and spoken word recognition.

