5th International Conference on Spoken Language Processing

Sydney, Australia
November 30 - December 4, 1998

Speech Intelligibility Derived From Exceedingly Sparse Spectral Information

Steven Greenberg (1), Takayuki Arai (2), Rosaria Silipo (1)

(1) International Computer Science Institute, USA
(2) Sophia University, Japan

Traditional models of speech assume that a detailed analysis of the acoustic spectrum is essential for understanding spoken language. The validity of this assumption was tested by partitioning the spectrum of spoken sentences into 1/3-octave channels ("slits") and measuring the intelligibility associated with each channel presented alone and in concert with the others. Four spectral channels, distributed over the speech-audio range (0.3-6 kHz), are sufficient for human listeners to decode sentential material with nearly 90% accuracy, although more than 70% of the spectrum is missing. Word recognition often remains relatively high (60-83%) when just two or three channels are presented concurrently, even though the intelligibility of these same slits, presented in isolation, is less than 9%. Such data suggest that intelligibility is derived from a compound "image" of the modulation spectrum distributed across the frequency spectrum. Because intelligibility degrades severely when slits are desynchronized by more than 25 ms, this image is probably derived from both the amplitude and phase components of the modulation spectrum.
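As an illustration of the band-partitioning ("slit") paradigm described above, the following sketch band-passes a speech waveform into 1/3-octave channels and recombines a subset of them, with an optional delay for desynchronizing an individual slit. The sampling rate, center frequencies, and filter order are illustrative assumptions, not the authors' exact stimulus parameters.

```python
# Minimal sketch of the 1/3-octave "slit" paradigm (not the authors' implementation).
# Center frequencies, filter order, and sampling rate are assumed for illustration.
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 16000                            # assumed sampling rate (Hz)
CENTERS_HZ = [330, 850, 2100, 5300]   # hypothetical centers spanning 0.3-6 kHz

def third_octave_slit(x, fc, fs=FS, order=4):
    """Band-pass x through a 1/3-octave band centered at fc."""
    lo = fc * 2 ** (-1 / 6)           # lower band edge (1/3 octave below/above fc)
    hi = fc * 2 ** (1 / 6)
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def sparse_stimulus(x, centers=CENTERS_HZ):
    """Sum the selected slits; most of the spectrum is discarded."""
    return np.sum([third_octave_slit(x, fc) for fc in centers], axis=0)

def desynchronize(slit, shift_ms, fs=FS):
    """Delay one slit by shift_ms (the abstract reports degradation beyond 25 ms)."""
    n = int(round(shift_ms * fs / 1000))
    return np.concatenate([np.zeros(n), slit[:len(slit) - n]])
```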


Bibliographic reference. Greenberg, Steven / Arai, Takayuki / Silipo, Rosaria (1998): "Speech intelligibility derived from exceedingly sparse spectral information", In ICSLP-1998, paper 0074.