Multi-channel Attention for End-to-End Speech Recognition

Stefan Braun, Daniel Neil, Jithendar Anumula, Enea Ceolini, Shih-Chii Liu

Recent end-to-end models for automatic speech recognition use sensory attention to integrate multiple input channels within a single neural network. However, these attention models are sensitive to the ordering of the channels used during training. This work proposes a sensory attention mechanism that is invariant to the channel ordering and increases the overall parameter count by only 0.09%. We demonstrate that, even without re-training, our attention-equipped end-to-end model can handle arbitrary numbers of input channels during inference. In comparison to a recent related model with sensory attention, our model, when tested on the real noisy recordings of the multi-channel CHiME-4 dataset, achieves relative character error rate (CER) improvements of 40.3% to 42.9%. In a two-channel configuration experiment, the attention signal identifies the lower signal-to-noise ratio (SNR) sensor with 97.7% accuracy.
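The channel-order invariance described above can be illustrated with a minimal sketch: score each channel with a shared function, softmax the scores across channels, and take the weighted sum. Because every channel is scored by the same function, permuting the channels permutes the weights identically and the output is unchanged, for any channel count. This is an illustrative stand-in, not the paper's implementation; the scoring vector `score_w` is a hypothetical placeholder for the learned attention network.

```python
import numpy as np

def channel_attention(features, score_w):
    """Attend over an arbitrary number of input channels.

    features: (num_channels, feature_dim) per-channel features at one time step
    score_w:  (feature_dim,) shared scoring vector (hypothetical stand-in
              for a learned attention network)
    """
    scores = features @ score_w             # one scalar score per channel
    weights = np.exp(scores - scores.max()) # numerically stable softmax
    weights /= weights.sum()                # normalize over channels
    return weights @ features               # attention-weighted channel mix

# The same function works for 2, 4, or any number of channels, and
# shuffling the channel order leaves the output unchanged.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                 # 4 channels, 8-dim features
w = rng.normal(size=8)
out = channel_attention(x, w)
out_perm = channel_attention(x[rng.permutation(4)], w)
assert np.allclose(out, out_perm)           # permutation-invariant
```

Because the softmax weights are tied to channel content rather than channel index, the same weights also serve as a per-channel quality signal, which is how the abstract's SNR-identification experiment reads out the attention.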

DOI: 10.21437/Interspeech.2018-1301

Cite as: Braun, S., Neil, D., Anumula, J., Ceolini, E., Liu, S. (2018) Multi-channel Attention for End-to-End Speech Recognition. Proc. Interspeech 2018, 17-21, DOI: 10.21437/Interspeech.2018-1301.

@inproceedings{braun2018multichannel,
  author={Stefan Braun and Daniel Neil and Jithendar Anumula and Enea Ceolini and Shih-Chii Liu},
  title={Multi-channel Attention for End-to-End Speech Recognition},
  booktitle={Proc. Interspeech 2018},
  year={2018},
  pages={17--21},
  doi={10.21437/Interspeech.2018-1301}
}