Wide Learning for Auditory Comprehension

Elnaz Shafaei-Bajestan, R. Harald Baayen

Classical linguistic, cognitive and engineering models for speech recognition and human auditory comprehension posit representations for sounds and words that mediate between the acoustic signal and interpretation. Recent advances in automatic speech recognition have shown, using deep learning, that state-of-the-art performance can be obtained without such units. We present a cognitive model of auditory comprehension based on wide rather than deep learning that was trained on 20 to 80 hours of TV news broadcasts. Like deep network models, our model is an end-to-end system that does not make use of phonemes and phonological wordform representations. Nevertheless, it performs well on the difficult task of single word identification (model accuracy 11.37%, Mozilla DeepSpeech: 4.45%). The architecture of the model is a simple two-layered wide neural network with weighted connections between acoustic frequency band features as inputs and lexical outcomes (pointers to semantic vectors) as outputs. Model performance shows hardly any degradation when trained on speech in noise rather than on clean speech. Performance was further enhanced by adding a second network to a standard wide network. The present word recognition module is designed to become part of a larger system modeling the comprehension of running speech.
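As a rough illustration of the two-layered architecture described in the abstract, the toy sketch below connects input cues directly to lexical outcomes through a single weight matrix, with no hidden layer. It is not the authors' implementation: the abstract does not state the learning rule or the feature extraction procedure, so the delta-rule update, the integer cue indices standing in for frequency band features, the toy data, and the names `train_wide_network` and `recognize` are all assumptions made for illustration.

```python
# Hypothetical toy sketch of a two-layer (input -> output) "wide" network.
# The delta-rule update and all names are illustrative assumptions,
# not the method reported in the paper.

def train_wide_network(events, n_cues, n_outcomes, rate=0.1, epochs=50):
    """Train on events given as (active_cue_indices, outcome_index) pairs."""
    # One weight per (cue, outcome) connection; there is no hidden layer.
    weights = [[0.0] * n_outcomes for _ in range(n_cues)]
    for _ in range(epochs):
        for cues, target in events:
            # Activation of an outcome is the summed weight from active cues.
            activation = [sum(weights[c][o] for c in cues)
                          for o in range(n_outcomes)]
            for o in range(n_outcomes):
                # Error is the gap between the desired output (1 for the
                # correct outcome, 0 otherwise) and the current activation.
                error = (1.0 if o == target else 0.0) - activation[o]
                for c in cues:
                    weights[c][o] += rate * error
    return weights


def recognize(weights, cues):
    """Return the index of the outcome with the highest activation."""
    n_outcomes = len(weights[0])
    activation = [sum(weights[c][o] for c in cues) for o in range(n_outcomes)]
    return max(range(n_outcomes), key=activation.__getitem__)


# Toy data: cue indices stand in for discretized acoustic features,
# outcome indices for lexical outcomes.
events = [([0, 1, 2], 0), ([2, 3, 4], 1), ([0, 4, 5], 2)]
weights = train_wide_network(events, n_cues=6, n_outcomes=3)
predictions = [recognize(weights, cues) for cues, _ in events]
```

In this sketch the network is "wide" only in the sense that all learning happens in one large input-to-output weight matrix; with realistic data the number of cues and outcomes would be far larger than in the toy example.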

DOI: 10.21437/Interspeech.2018-2420

Cite as: Shafaei-Bajestan, E., Baayen, R.H. (2018) Wide Learning for Auditory Comprehension. Proc. Interspeech 2018, 966-970, DOI: 10.21437/Interspeech.2018-2420.

  author={Elnaz Shafaei-Bajestan and R. Harald Baayen},
  title={Wide Learning for Auditory Comprehension},
  booktitle={Proc. Interspeech 2018},