EUROSPEECH 2001 Scandinavia
7th European Conference on Speech Communication and Technology

Aalborg, Denmark
September 3-7, 2001


Feature Extraction by Auditory Modeling for Unit Selection in Concatenative Speech Synthesis

Minoru Tsuzaki

ATR Spoken Language Translation Research Laboratories, Japan

A comprehensive computational model of the human auditory periphery was applied to extract basic features of speech sounds. In addition to the features produced by the auditory place coding mechanism, which have traditionally been used as spectral features, the model extracts features produced by the auditory temporal coding mechanism. It also takes the nonlinearity of human auditory responses into account. Several speech databases from different talkers, built for a concatenative synthesis system, were analyzed with the auditory model, and segmental characteristics were estimated by calculating the averages, standard deviations, and trends of the individual feature parameters. The results were compared with those obtained by a physical model. A preliminary perceptual test suggested an advantage of auditory-based distances over physical distances.
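The segmental statistics mentioned above (average, standard deviation, and trend of each feature parameter) can be illustrated with a minimal sketch. It is not the author's implementation. The function name `segment_stats` is hypothetical, the input is assumed to be a list of per-frame feature vectors for one segment, the standard deviation is taken as the population form, and "trend" is assumed here to mean the least-squares slope of a parameter over the frame index, which the abstract does not specify.

```python
from statistics import fmean, pstdev


def segment_stats(frames):
    """Summarize one segment given as a list of equal-length feature vectors.

    Returns one (mean, std, trend) tuple per feature parameter.
    Assumption: "trend" is the least-squares slope against the frame index.
    """
    if not frames:
        raise ValueError("segment has no frames")
    n = len(frames)
    t_mean = (n - 1) / 2.0
    # Sum of squared deviations of the frame index; zero for a one-frame segment.
    t_var = sum((t - t_mean) ** 2 for t in range(n))
    stats = []
    # zip(*frames) regroups the data so each item is one parameter's trajectory.
    for values in zip(*frames):
        mean = fmean(values)
        std = pstdev(values)  # population standard deviation (an assumption)
        if t_var > 0:
            slope = sum((t - t_mean) * (v - mean)
                        for t, v in enumerate(values)) / t_var
        else:
            slope = 0.0  # a single frame carries no trend information
        stats.append((mean, std, slope))
    return stats
```

For a three-frame segment such as `[[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]`, the first parameter rises by one unit per frame and the second is constant, so their slopes are 1.0 and 0.0 respectively.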

