Speaker Diarization with Enhancing Speech for the First DIHARD Challenge

Lei Sun, Jun Du, Chao Jiang, Xueyang Zhang, Shan He, Bing Yin, Chin-Hui Lee

We design a novel speaker diarization system for the first DIHARD challenge by integrating several key modules: speech denoising, speech activity detection (SAD), i-vector design, and scoring strategy. One main contribution is the proposed long short-term memory (LSTM) based speech denoising model. By fully utilizing diversified simulated training data and an advanced network architecture that uses progressive multitask learning with a dense structure, the denoising model demonstrates strong generalization capability to realistic noisy environments. The enhanced speech boosts the performance of the subsequent SAD, segmentation, and clustering. To the best of our knowledge, this is the first time that deep learning based single-channel speech enhancement has been shown to yield significant improvements over state-of-the-art diarization systems in highly mismatched conditions. For i-vector extraction, we adopt a residual convolutional neural network trained on a large dataset of more than 30,000 speakers. Finally, by score fusion of different i-vectors based on all these techniques, our systems yield diarization error rates (DERs) of 24.56% and 36.05% on the evaluation sets of Track1 and Track2, ranking second among the 14 and 11 participating teams, respectively.
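For readers unfamiliar with the metric, DER is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below illustrates only that arithmetic; the function name and the example durations are hypothetical and are not taken from the paper or from the challenge scoring tool.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_reference):
    """Return DER as a percentage.

    All arguments are durations in seconds. The names are illustrative
    assumptions, not the API of any official scoring tool.
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_reference


# Hypothetical durations, chosen only to show the calculation.
der = diarization_error_rate(
    missed=60.0, false_alarm=30.0, confusion=90.0, total_reference=1000.0
)
print(f"DER = {der:.2f}%")
```

Because false alarms are counted in the numerator but not in the denominator, DER can exceed 100% for a very poor system.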

DOI: 10.21437/Interspeech.2018-1742

Cite as: Sun, L., Du, J., Jiang, C., Zhang, X., He, S., Yin, B., Lee, C. (2018) Speaker Diarization with Enhancing Speech for the First DIHARD Challenge. Proc. Interspeech 2018, 2793-2797, DOI: 10.21437/Interspeech.2018-1742.

  author={Lei Sun and Jun Du and Chao Jiang and Xueyang Zhang and Shan He and Bing Yin and Chin-Hui Lee},
  title={Speaker Diarization with Enhancing Speech for the First DIHARD Challenge},
  booktitle={Proc. Interspeech 2018},