An End-to-End Architecture of Online Multi-Channel Speech Separation

Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Edward Lin, Yi Luo, Lei Xie

Multi-speaker speech recognition has been one of the key challenges in conversation transcription because it breaks the single-active-speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered a remedy to this problem. Previously, we introduced a system, called unmixing, fixed-beamformer and extraction (UFE), that was shown to be effective in addressing the speech overlap problem in conversation transcription. With UFE, an input mixed signal is processed by fixed beamformers, followed by neural network post-filtering. Although promising results were obtained, the system consists of multiple individually developed modules, potentially leading to sub-optimal performance. In this work, we introduce an end-to-end modeling version of UFE. To enable gradient propagation through the whole pipeline, an attentional selection module is proposed, in which an attentional weight is learned for each beamformer and for each spatial feature sampled over space. Experimental results show that the proposed system achieves performance comparable to the original pipeline of separately developed modules in an offline evaluation, while producing remarkable improvements in an online evaluation.
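The attentional selection idea in the abstract can be illustrated as a softmax-weighted, differentiable combination of the fixed-beamformer outputs, which replaces a hard (non-differentiable) choice of a single beam. The following is a minimal NumPy sketch under assumed shapes; the function name, the `proj` parameterization, and the sizes are hypothetical stand-ins for whatever the actual network learns, not the paper's implementation:

```python
import numpy as np

def attentional_selection(beam_feats, query, proj):
    """Soft selection over B fixed-beamformer outputs.

    beam_feats: (B, D) array, one feature row per beamformer direction
    query:      (D,) query vector summarizing the input mixture
    proj:       (D, D) learned projection (hypothetical parameterization)
    """
    # One scalar score per beamformer: how well that direction matches the query.
    scores = beam_feats @ proj @ query          # shape (B,)
    # Softmax turns scores into attentional weights summing to 1, so the
    # selection stays differentiable for end-to-end training.
    w = np.exp(scores - scores.max())
    w /= w.sum()
    # Weighted combination instead of a hard argmax pick of one beam.
    return w @ beam_feats                       # shape (D,)

# Illustration with arbitrary sizes (not the paper's configuration).
rng = np.random.default_rng(0)
B, D = 6, 8
feats = rng.standard_normal((B, D))
fused = attentional_selection(feats, rng.standard_normal(D), np.eye(D))
print(fused.shape)
```

In this sketch the gradient of a downstream loss flows through the softmax weights back into `proj` and the beamformer features, which is the property that lets the separation front end be trained jointly with the rest of the system.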

DOI: 10.21437/Interspeech.2020-1981

Cite as: Wu, J., Chen, Z., Li, J., Yoshioka, T., Tan, Z., Lin, E., Luo, Y., Xie, L. (2020) An End-to-End Architecture of Online Multi-Channel Speech Separation. Proc. Interspeech 2020, 81-85, DOI: 10.21437/Interspeech.2020-1981.

@inproceedings{wu2020endtoend,
  author={Jian Wu and Zhuo Chen and Jinyu Li and Takuya Yoshioka and Zhili Tan and Edward Lin and Yi Luo and Lei Xie},
  title={{An End-to-End Architecture of Online Multi-Channel Speech Separation}},
  booktitle={Proc. Interspeech 2020},
  year={2020},
  pages={81--85},
  doi={10.21437/Interspeech.2020-1981}
}