First International Conference on Spoken Language Processing (ICSLP 90)
This paper describes a study comparing several signal representations for context-independent vowel classification. It is the first step in our investigation of a distinctive-feature-based approach to phonetic recognition. Six signal representations were investigated, including the outputs of Seneff's Auditory Model (SAM), mel-scale representations, and the conventional Fourier transform. To make the comparison fair and meaningful, the mel-frequency filters were carefully designed to resemble the filters of SAM, and the dimensionality of the feature vectors was constrained to be equal across representations. The representations were compared on the task of classifying 16 American English vowels. Experiments were also conducted with speech degraded by additive white noise. Our results are based on over 22,000 vowel tokens excised from 2,750 sentences spoken by 550 speakers. The combined Synchronous and Mean Rate responses from SAM outperformed all other representations on both undegraded and noisy speech, yielding top-choice accuracies of 66% and 54%, respectively.
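The abstract does not give the mel-scale formula, the number of filters, or the noise levels used, so the following is only a minimal sketch under stated assumptions. It uses the widely used mapping mel = 2595 log10(1 + f/700), which may differ from the authors' design. It places filter center frequencies equally spaced on that scale and degrades a waveform with Gaussian white noise scaled to a target signal-to-noise ratio (SNR). All function names and parameters here are hypothetical illustrations, not the authors' code.

```python
import math
import random


def hz_to_mel(f_hz: float) -> float:
    """Map frequency in Hz to mels (assumed common 2595*log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel: map mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_min: float, f_max: float) -> list[float]:
    """Return n_filters center frequencies (Hz) equally spaced on the mel scale,
    strictly between the band edges f_min and f_max."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_filters + 1)
    return [mel_to_hz(m_lo + step * (i + 1)) for i in range(n_filters)]


def add_white_noise(signal: list[float], snr_db: float, seed: int = 0) -> list[float]:
    """Add zero-mean Gaussian white noise scaled so that the added noise
    has exactly the requested SNR (in dB) relative to the signal power."""
    rng = random.Random(seed)
    p_signal = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    p_noise = sum(n * n for n in noise) / len(noise)
    # Choose scale so that 10*log10(p_signal / (scale**2 * p_noise)) == snr_db.
    scale = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [x + scale * n for x, n in zip(signal, noise)]
```

For example, `mel_center_frequencies(40, 0.0, 8000.0)` yields 40 illustrative centers that crowd together at low frequencies and spread out at high frequencies. This is the property that lets a mel filterbank roughly approximate an auditory model's channel spacing. The filter count and band edges in that call are placeholders, not values taken from the paper.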
Bibliographic reference. Meng, Helen M. / Zue, Victor W. (1990): "A comparative study of acoustic representations of speech for vowel classification using multi-layer perceptrons", In ICSLP-1990, 1053-1056.