Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding

Jianshu Zhao, Shengzhou Gao, Takahiro Shinozaki

Target-speaker speech separation, owing to its importance in industrial applications, has long been an active research area. The key criterion for evaluating a separation algorithm remains separation performance, i.e., the quality of the separated voice. In this paper, we present WaveFilter, a novel high-performance time-domain, waveform-based target-speaker speech separation architecture. Unlike most previous studies, which adopted time-frequency-based approaches, WaveFilter applies Convolutional Neural Network (CNN) based feature extractors directly to the raw time-domain audio, for both the speech separation network and the auxiliary target-speaker feature extraction network. We achieved a 10.46 dB Signal-to-Noise Ratio (SNR) improvement on the WSJ0 2-mix dataset and a 10.44 dB SNR improvement on the Librispeech dataset, substantially higher than existing approaches. Our method also achieved a 4.9 dB SNR improvement on the WSJ0 3-mix data. This demonstrates that WaveFilter can separate the target speaker's voice from multi-speaker mixtures without knowing the exact number of speakers in advance, which in turn indicates the readiness of our method for real-world applications.
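The SNR-improvement figures above compare the SNR of the separated output against the SNR of the unprocessed mixture, both measured against the clean target. The following is a minimal sketch of how such a figure can be computed; the toy sinusoid target, the Gaussian interferer, and the 0.05 residual factor standing in for a separator's output are illustrative assumptions, not the paper's actual setup or data.

```python
import numpy as np

def snr_db(reference, estimate):
    # SNR in dB: ratio of reference energy to residual (error) energy.
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# Toy example (assumed for illustration): a clean target buried in interference.
rng = np.random.default_rng(0)
target = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
interference = rng.normal(scale=0.5, size=8000)
mixture = target + interference

# Hypothetical separator output: most interference removed, 5% residual left.
separated = target + 0.05 * interference

# SNR improvement = SNR of the separated output minus SNR of the raw mixture.
improvement = snr_db(target, separated) - snr_db(target, mixture)
print(f"SNR improvement: {improvement:.2f} dB")
```

With a residual factor of 0.05, the error energy shrinks by a factor of 400, so the improvement is 10·log10(400) ≈ 26.02 dB regardless of the random seed; real separators are evaluated the same way on held-out mixtures.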

 DOI: 10.21437/Interspeech.2020-2108

Cite as: Zhao, J., Gao, S., Shinozaki, T. (2020) Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding. Proc. Interspeech 2020, 1436-1440, DOI: 10.21437/Interspeech.2020-2108.

@inproceedings{zhao20_interspeech,
  author={Jianshu Zhao and Shengzhou Gao and Takahiro Shinozaki},
  title={{Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding}},
  booktitle={Proc. Interspeech 2020},
  year={2020},
  pages={1436--1440},
  doi={10.21437/Interspeech.2020-2108}
}