Multi-Channel Block-Online Source Extraction Based on Utterance Adaptation

Juan M. Martín-Doñas, Jens Heitkaemper, Reinhold Haeb-Umbach, Angel M. Gomez, Antonio M. Peinado

This paper deals with multi-channel speech recognition in scenarios with multiple speakers. Recently, the spectral characteristics of a target speaker, extracted from an adaptation utterance, have been used to guide a neural network mask estimator to focus on that speaker. In this work we present two variants of speaker-aware neural networks, which exploit both spectral and spatial information to allow better discrimination between target and interfering speakers. To this end, we introduce either a spatial pre-processing stage prior to the mask estimation or a spatial plus spectral speaker characterization block whose output is fed directly into the neural mask estimator. The target speaker’s spectral and spatial signature is extracted from an adaptation utterance recorded at the beginning of a session. We further adapt the architecture for low-latency processing by means of block-online beamforming that recursively updates the signal statistics. Experimental results show that the additional spatial information clearly improves source extraction, particularly in the same-gender case, and that our proposal achieves state-of-the-art performance in terms of distortion reduction and recognition accuracy.
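As an illustration of what block-online beamforming with recursively updated signal statistics can look like, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the beamformer, the update rule, or any parameter values, so everything here is an assumption: the MVDR formulation, the exponential forgetting factor `alpha`, the block length `block_len`, the identity initialisation of the covariance matrices, and the function names `update_psd`, `mvdr_weights` and `block_online_extract`. The target mask is taken as given; in the paper it would come from the speaker-aware neural mask estimator.

```python
import numpy as np


def update_psd(psd, obs, mask, alpha=0.95):
    """Recursively update a spatial covariance estimate with one block.

    psd:  (F, D, D) running estimate
    obs:  (F, T, D) STFT frames of the current block
    mask: (F, T) time-frequency mask with values in [0, 1]
    alpha: forgetting factor (assumed value, not from the paper)
    """
    # Mask-weighted sum of outer products over the frames of the block.
    block = np.einsum('ft,ftd,fte->fde', mask, obs, obs.conj())
    # Normalise by the mask mass so the result is a weighted average.
    block /= np.maximum(mask.sum(axis=1), 1e-10)[:, None, None]
    return alpha * psd + (1.0 - alpha) * block


def mvdr_weights(psd_target, psd_noise, ref=0, eps=1e-10):
    """MVDR weights per frequency: (Phi_n^-1 Phi_s) u_ref / trace(Phi_n^-1 Phi_s)."""
    num_ch = psd_noise.shape[-1]
    # Small diagonal loading keeps the linear solve well conditioned.
    loaded = psd_noise + eps * np.eye(num_ch)
    ratio = np.linalg.solve(loaded, psd_target)      # (F, D, D)
    trace = np.trace(ratio, axis1=-2, axis2=-1)      # (F,)
    return ratio[..., ref] / (trace[:, None] + eps)  # (F, D)


def block_online_extract(stft, target_mask, block_len=10, alpha=0.95):
    """Process a multi-channel STFT of shape (F, T, D) block by block."""
    num_freq, num_frames, num_ch = stft.shape
    # Identity initialisation is an assumption made for this sketch.
    psd_s = np.tile(np.eye(num_ch, dtype=complex), (num_freq, 1, 1))
    psd_n = psd_s.copy()
    out = np.zeros((num_freq, num_frames), dtype=complex)
    for start in range(0, num_frames, block_len):
        stop = min(start + block_len, num_frames)
        obs = stft[:, start:stop]
        mask = target_mask[:, start:stop]
        # Update target and interference statistics from the current block only.
        psd_s = update_psd(psd_s, obs, mask, alpha)
        psd_n = update_psd(psd_n, obs, 1.0 - mask, alpha)
        w = mvdr_weights(psd_s, psd_n)
        # Apply w^H y to every frame of the block.
        out[:, start:stop] = np.einsum('fd,ftd->ft', w.conj(), obs)
    return out
```

Because each block only needs its own frames plus the running statistics, the algorithmic latency of this sketch is bounded by the block length rather than by the utterance length.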

DOI: 10.21437/Interspeech.2019-2244

Cite as: Martín-Doñas, J.M., Heitkaemper, J., Haeb-Umbach, R., Gomez, A.M., Peinado, A.M. (2019) Multi-Channel Block-Online Source Extraction Based on Utterance Adaptation. Proc. Interspeech 2019, 96-100, DOI: 10.21437/Interspeech.2019-2244.

  author={Juan M. Martín-Doñas and Jens Heitkaemper and Reinhold Haeb-Umbach and Angel M. Gomez and Antonio M. Peinado},
  title={{Multi-Channel Block-Online Source Extraction Based on Utterance Adaptation}},
  booktitle={Proc. Interspeech 2019},