FusionRNN: Shared Neural Parameters for Multi-Channel Distant Speech Recognition

Titouan Parcollet, Xinchi Qiu, Nicholas D. Lane


Distant speech recognition remains a challenging application for modern deep-learning-based Automatic Speech Recognition (ASR) systems due to complex recording conditions involving noise and reverberation. Multiple microphones are commonly combined with well-known speech processing techniques to enhance the original signals and thus improve speech recognizer performance. These multi-channel signals follow similar input distributions with respect to the global speech information, but they also contain a substantial amount of noise. Consequently, the robustness of the input representation is key to obtaining reasonable recognition rates. In this work, we propose a Fusion Layer (FL) based on shared neural parameters. We use it to produce an expressive embedding of multiple microphone signals that can easily be combined with any existing ASR pipeline. The proposed model, called FusionRNN, showed promising results on a multi-channel distant speech recognition task and consistently outperformed baseline models while maintaining an equal training time.
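The abstract does not give the layer's equations, so the following is only a minimal sketch of what "shared neural parameters" across microphones can mean. It rests on our own assumptions: every channel's feature vector is passed through the same projection `ReLU(W x + b)`, and the per-channel outputs are then averaged. The function `shared_fusion_layer`, the ReLU, and the mean-pooling fusion step are hypothetical choices for illustration, not the authors' FusionRNN formulation.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product W x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def shared_fusion_layer(channels: List[Vector], w: Matrix, b: Vector) -> Vector:
    """Hypothetical fusion layer: apply the SAME weights (w, b) to every
    microphone channel, then average the projected channels.

    Assumptions (not from the paper): ReLU activation and mean pooling.
    """
    if not channels:
        raise ValueError("need at least one channel")
    # ReLU(W x_c + b) for each channel c, all using the shared w and b
    projected = [
        [max(0.0, v + bi) for v, bi in zip(matvec(w, x), b)]
        for x in channels
    ]
    n = len(projected)
    # Element-wise mean across channels gives one fused embedding
    return [sum(col) / n for col in zip(*projected)]


# Usage: one frame of 4-dim features from each of 2 microphones
random.seed(0)
in_dim, out_dim, n_channels = 4, 3, 2
w = [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
b = [0.0] * out_dim
frames = [[random.random() for _ in range(in_dim)] for _ in range(n_channels)]
emb = shared_fusion_layer(frames, w, b)
print(emb)
```

Because `w` and `b` are reused for every channel, the parameter count in this sketch does not grow with the number of microphones, and averaging makes the fused embedding independent of channel order.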


DOI: 10.21437/Interspeech.2020-2102

Cite as: Parcollet, T., Qiu, X., Lane, N.D. (2020) FusionRNN: Shared Neural Parameters for Multi-Channel Distant Speech Recognition. Proc. Interspeech 2020, 1678-1682, DOI: 10.21437/Interspeech.2020-2102.


@inproceedings{Parcollet2020,
  author={Titouan Parcollet and Xinchi Qiu and Nicholas D. Lane},
  title={{FusionRNN: Shared Neural Parameters for Multi-Channel Distant Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1678--1682},
  doi={10.21437/Interspeech.2020-2102},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2102}
}