A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments

Yunzhe Hao, Jiaming Xu, Jing Shi, Peng Zhang, Lei Qin, Bo Xu

Speech recognition in single-talker scenes has matured in recent years. However, recognition performance degrades significantly in noisy environments, especially in multi-talker scenes. To address the cocktail party problem, we propose a unified time-domain target speaker extraction framework. In this framework, we obtain a voiceprint from a clean utterance of the target speaker and then use that voiceprint to extract the same speaker's speech from a mixture. Conditioning on the voiceprint avoids the permutation problem. In addition, a time-domain model avoids the phase reconstruction problem of traditional time-frequency domain models. Our framework suits scenes where the set of speakers is relatively fixed and their voiceprints are easily registered, such as a car, a home, or a meeting room. The proposed global model, based on the dual-path recurrent neural network (DPRNN) block, achieved state-of-the-art performance on the speaker extraction task on the WSJ0-2mix dataset. We also built corresponding low-latency models. Results showed comparable performance and a much shorter upper-limit latency than time-frequency domain models. We found that the performance of the low-latency model gradually decreased as latency decreased, which is an important consideration when deploying models in real application scenarios.
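The enroll-then-extract data flow described in the abstract can be illustrated with a small sketch. This is not the authors' DPRNN-based model: the weights are untrained random matrices, and every name and size here (`FRAME`, `HOP`, `DIM`, `enroll`, `extract`) is a hypothetical stand-in. It only shows that the system operates directly on waveforms and returns a single stream tied to the enrolled voiceprint, so no output permutation has to be resolved.

```python
import numpy as np

FRAME = 16   # samples per frame (hypothetical value)
HOP = 8      # hop size, 50% overlap (hypothetical value)
DIM = 32     # feature / voiceprint size (hypothetical value)

rng = np.random.default_rng(0)
# Untrained stand-ins for what would be learned encoder/decoder weights.
W_ENC = rng.standard_normal((FRAME, DIM)) / np.sqrt(FRAME)
W_DEC = rng.standard_normal((DIM, FRAME)) / np.sqrt(DIM)


def frame_signal(wave):
    """Split a 1-D waveform into overlapping frames, zero-padding the tail."""
    n_frames = max(1, int(np.ceil((len(wave) - FRAME) / HOP)) + 1)
    padded = np.zeros((n_frames - 1) * HOP + FRAME)
    padded[: len(wave)] = wave
    idx = np.arange(FRAME)[None, :] + HOP * np.arange(n_frames)[:, None]
    return padded[idx]                      # shape (n_frames, FRAME)


def encode(wave):
    """Time-domain encoder: frames -> non-negative features, shape (T, DIM)."""
    return np.maximum(frame_signal(wave) @ W_ENC, 0.0)


def enroll(clean_wave):
    """Voiceprint: mean-pooled, unit-norm features of a clean utterance."""
    v = encode(clean_wave).mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)


def extract(mixture, voiceprint):
    """Return ONE waveform estimate, conditioned on the enrolled voiceprint."""
    feats = encode(mixture)                               # (T, DIM)
    mask = 1.0 / (1.0 + np.exp(-(feats * voiceprint)))    # sigmoid gate
    frames = (feats * mask) @ W_DEC                       # (T, FRAME)
    out = np.zeros((len(frames) - 1) * HOP + FRAME)
    for t, fr in enumerate(frames):                       # overlap-add
        out[t * HOP : t * HOP + FRAME] += fr
    return out[: len(mixture)]


# Toy usage with random signals standing in for real audio.
demo = np.random.default_rng(1)
clean_utterance = demo.standard_normal(400)
mixed_speech = demo.standard_normal(1000)
voiceprint = enroll(clean_utterance)
estimate = extract(mixed_speech, voiceprint)
```

In a trained system the mask would come from a separator network (the abstract names DPRNN blocks for this role); here a simple element-wise gate stands in so the shapes and the single-output interface are visible.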

DOI: 10.21437/Interspeech.2020-2085

Cite as: Hao, Y., Xu, J., Shi, J., Zhang, P., Qin, L., Xu, B. (2020) A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments. Proc. Interspeech 2020, 1431-1435, DOI: 10.21437/Interspeech.2020-2085.

  author={Yunzhe Hao and Jiaming Xu and Jing Shi and Peng Zhang and Lei Qin and Bo Xu},
  title={{A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments}},
  booktitle={Proc. Interspeech 2020},