Integrating Spectral and Spatial Features for Multi-Channel Speaker Separation

Zhong-Qiu Wang, DeLiang Wang

This paper tightly integrates spectral and spatial information for deep learning based multi-channel speaker separation. The key idea is to localize individual speakers so that an enhancement network can be used to separate the speaker from an estimated direction and with specific spectral characteristics. To determine the direction of the speaker of interest, we identify time-frequency (T-F) units dominated by that speaker and only use them for direction of arrival (DOA) estimation. The speaker dominance at each T-F unit is determined by a two-channel permutation invariant training network, which combines spectral and interchannel phase patterns at the input feature level. In addition, beamforming is tightly integrated in the proposed system by exploiting the magnitudes and phase produced by T-F masking based beamforming. Strong separation performance has been observed on a spatialized reverberant version of the wsj0-2mix corpus.
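The idea of using only speaker-dominated T-F units for DOA estimation can be illustrated with a small sketch. The code below is not the authors' estimator: it assumes a mask-weighted, phase-only (GCC-PHAT-style) delay search over a two-microphone pair, a far-field model, and a speed of sound of 343 m/s. The function names (`masked_delay_estimate`, `delay_to_doa_deg`, `toy_mixture`) and the toy data are hypothetical, and the dominance masks here come from an oracle rather than from a permutation invariant training network.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed value)


def masked_delay_estimate(Y1, Y2, mask, freqs_hz, candidate_delays):
    """Pick the inter-microphone delay that best explains the observed
    phase differences, counting only T-F units weighted by `mask`.

    Y1, Y2: complex STFTs of shape (freq, frame).
    mask: non-negative weights of shape (freq, frame).
    Convention: a positive delay means the source reaches mic 2 later.
    """
    phase_diff = np.angle(Y1 * np.conj(Y2))  # observed phase difference
    scores = np.empty(len(candidate_delays))
    for i, tau in enumerate(candidate_delays):
        expected = 2.0 * np.pi * freqs_hz[:, None] * tau  # model phase
        scores[i] = np.sum(mask * np.cos(phase_diff - expected))
    return candidate_delays[np.argmax(scores)], scores


def delay_to_doa_deg(tau, spacing_m):
    """Convert a delay to an angle in [0, 180] degrees (far-field model)."""
    cos_theta = np.clip(SPEED_OF_SOUND * tau / spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))


def toy_mixture(seed=0, spacing_m=0.2, fs=16000, n_fft=512, n_frames=60):
    """Synthetic two-speaker mixture built directly in the STFT domain."""
    rng = np.random.default_rng(seed)
    n_freq = n_fft // 2 + 1
    freqs = np.arange(n_freq) * fs / n_fft
    tau_max = spacing_m / SPEED_OF_SOUND
    delays = np.linspace(-tau_max, tau_max, 181)
    tau_a, tau_b = delays[40], delays[130]  # true delays of speakers a, b

    def cgauss():
        shape = (n_freq, n_frames)
        return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

    # Each T-F unit is mostly owned by one speaker (sparse toy sources).
    owner_a = rng.random((n_freq, n_frames)) < 0.5
    S_a = cgauss() * np.where(owner_a, 1.0, 0.1)
    S_b = cgauss() * np.where(owner_a, 0.1, 1.0)

    def shift(tau):
        return np.exp(-2j * np.pi * freqs[:, None] * tau)

    Y1 = S_a + S_b
    Y2 = S_a * shift(tau_a) + S_b * shift(tau_b)
    mask_a = (np.abs(S_a) > np.abs(S_b)).astype(float)  # oracle dominance
    return Y1, Y2, mask_a, 1.0 - mask_a, freqs, delays, tau_a, tau_b


if __name__ == "__main__":
    Y1, Y2, mask_a, mask_b, freqs, delays, tau_a, tau_b = toy_mixture()
    for name, mask in (("a", mask_a), ("b", mask_b)):
        tau_hat, _ = masked_delay_estimate(Y1, Y2, mask, freqs, delays)
        print(name, tau_hat, delay_to_doa_deg(tau_hat, 0.2))
```

Swapping the mask changes which speaker is localized from the same two-channel mixture, which is the behavior the abstract relies on before the direction is passed to an enhancement network.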

 DOI: 10.21437/Interspeech.2018-1940

Cite as: Wang, Z., Wang, D. (2018) Integrating Spectral and Spatial Features for Multi-Channel Speaker Separation. Proc. Interspeech 2018, 2718-2722, DOI: 10.21437/Interspeech.2018-1940.

  author={Zhong-Qiu Wang and DeLiang Wang},
  title={Integrating Spectral and Spatial Features for Multi-Channel Speaker Separation},
  booktitle={Proc. Interspeech 2018},