SpEx+: A Complete Time Domain Speaker Extraction Network

Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

Speaker extraction aims to extract the target speech signal from a multi-talker environment given a target speaker’s reference speech. We recently proposed a time-domain solution, SpEx, that avoids the phase estimation required by frequency-domain approaches. Unfortunately, SpEx is not a fully time-domain solution: it performs time-domain speech encoding for speaker extraction, but takes a frequency-domain speaker embedding as the reference. The analysis window size for the time-domain input also differs from that for the frequency-domain input. This mismatch has an adverse effect on system performance. To eliminate it, we propose a complete time-domain speaker extraction solution called SpEx+. Specifically, we tie the weights of two identical speech encoder networks, one for the encoder-extractor-decoder pipeline and the other as part of the speaker encoder. Experiments show that SpEx+ achieves 0.8 dB and 2.1 dB SDR improvement over the state-of-the-art SpEx baseline under different-gender and same-gender conditions, respectively, on the WSJ0-2mix-extr database.
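The weight tying described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy single-scale 1-D convolutional encoder with ReLU, and the class name `SpeechEncoder`, the filter count, filter length, stride, and the mean-pooling `speaker_embedding` helper are all hypothetical choices made for illustration only.

```python
import random


class SpeechEncoder:
    """Toy 1-D convolutional encoder: waveform -> frames of non-negative features."""

    def __init__(self, num_filters=4, filter_len=8, stride=4, seed=0):
        rng = random.Random(seed)
        self.stride = stride
        self.filter_len = filter_len
        # One row of weights per filter (sizes are illustrative assumptions).
        self.filters = [
            [rng.uniform(-1.0, 1.0) for _ in range(filter_len)]
            for _ in range(num_filters)
        ]

    def encode(self, waveform):
        frames = []
        for start in range(0, len(waveform) - self.filter_len + 1, self.stride):
            window = waveform[start:start + self.filter_len]
            # Dot product of the window with each filter, followed by ReLU.
            frames.append([
                max(0.0, sum(w * x for w, x in zip(f, window)))
                for f in self.filters
            ])
        return frames


def speaker_embedding(frames):
    """Mean-pool frame features over time (a stand-in for a speaker encoder)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


# Weight tying: both paths hold a reference to the SAME encoder object,
# so they share one set of filters and one analysis window.
shared_encoder = SpeechEncoder()
mixture_encoder = shared_encoder      # encoder-extractor-decoder path
reference_encoder = shared_encoder    # speaker-encoder path

rng = random.Random(1)
mixture = [rng.uniform(-1.0, 1.0) for _ in range(64)]    # synthetic mixture
reference = [rng.uniform(-1.0, 1.0) for _ in range(40)]  # synthetic reference

mixture_frames = mixture_encoder.encode(mixture)
embedding = speaker_embedding(reference_encoder.encode(reference))
```

Because both paths use the same filters and window size in this sketch, the mixture frames and the reference embedding are computed in the same latent space, which is the kind of consistency the abstract attributes to tying the two speech encoders.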

 DOI: 10.21437/Interspeech.2020-1397

Cite as: Ge, M., Xu, C., Wang, L., Chng, E.S., Dang, J., Li, H. (2020) SpEx+: A Complete Time Domain Speaker Extraction Network. Proc. Interspeech 2020, 1406-1410, DOI: 10.21437/Interspeech.2020-1397.

  author={Meng Ge and Chenglin Xu and Longbiao Wang and Eng Siong Chng and Jianwu Dang and Haizhou Li},
  title={{SpEx+: A Complete Time Domain Speaker Extraction Network}},
  booktitle={Proc. Interspeech 2020},