New Advances in Speaker Diarization

Hagai Aronowitz, Weizhong Zhu, Masayuki Suzuki, Gakuto Kurata, Ron Hoory

Recently, speaker diarization based on speaker embeddings has shown excellent results in many works. In this paper we propose several enhancements throughout the diarization pipeline. This work addresses two clustering frameworks: agglomerative hierarchical clustering (AHC) and spectral clustering (SC).

First, we use multiple speaker embeddings. We show that fusion of x-vectors and d-vectors boosts accuracy significantly. Second, we train neural networks to leverage both acoustic and duration information for scoring similarity of segments or clusters. Third, we introduce a novel method to guide the AHC clustering mechanism using a neural network. Fourth, we handle short duration segments in SC by deemphasizing their effect on setting the number of speakers.
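
To make the first enhancement concrete, here is a minimal sketch of embedding fusion. It assumes the simplest possible rule: length-normalize each embedding, concatenate them, and score segment pairs with cosine similarity. The abstract does not say how the x-vectors and d-vectors are actually combined, and the paper scores similarity with trained neural networks, so the function names and the fusion rule below are hypothetical illustrations, not the authors' method.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]


def fuse_embeddings(x_vector, d_vector):
    """Hypothetical fusion rule: concatenate the two length-normalized embeddings.

    Normalizing first keeps either embedding type from dominating the
    fused vector just because it has a larger scale or dimension.
    """
    return l2_normalize(x_vector) + l2_normalize(d_vector)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


# Toy two-dimensional "embeddings" for two segments (made-up values).
segment_a = fuse_embeddings([3.0, 4.0], [1.0, 0.0])
segment_b = fuse_embeddings([4.0, 3.0], [0.0, 1.0])
score = cosine_similarity(segment_a, segment_b)
```

Under this assumed rule, the fused cosine score is the average of the two per-embedding cosine scores whenever both embeddings are nonzero, which makes it a convenient baseline for comparing against learned scoring.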

Finally, we propose a novel method for estimating the number of clusters in the SC framework. The method takes each eigenvalue and analyzes the projections of the SC similarity matrix on the corresponding eigenvector.
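
The abstract does not give the details of the proposed estimator, so the sketch below does not reproduce it. For orientation, it shows the common eigengap heuristic instead: compute the eigenvalues of a symmetric segment-similarity matrix and pick the number of clusters at the largest drop between consecutive eigenvalues. The function names, the toy matrix, and the use of a plain Jacobi eigenvalue routine are all assumptions made for illustration.

```python
import math


def symmetric_eigenvalues(matrix, max_sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, sorted descending."""
    n = len(matrix)
    a = [list(row) for row in matrix]
    for _ in range(max_sweeps):
        # Stop once the off-diagonal mass is negligible.
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                diff = a[q][q] - a[p][p]
                if diff == 0.0:
                    theta = math.pi / 4.0
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                # Apply the rotation to columns p and q, then to rows p and q.
                for k in range(n):
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def estimate_num_speakers(similarity):
    """Eigengap heuristic: choose k at the largest gap between consecutive eigenvalues."""
    eigenvalues = symmetric_eigenvalues(similarity)
    if len(eigenvalues) < 2:
        return 1
    gaps = [eigenvalues[k - 1] - eigenvalues[k] for k in range(1, len(eigenvalues))]
    return gaps.index(max(gaps)) + 1


# Toy similarity matrix (made-up values): segments 0-2 resemble each other,
# segments 3-4 resemble each other, and cross-group similarity is low.
toy_similarity = [
    [1.0, 1.0, 1.0, 0.1, 0.1],
    [1.0, 1.0, 1.0, 0.1, 0.1],
    [1.0, 1.0, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 1.0],
    [0.1, 0.1, 0.1, 1.0, 1.0],
]
num_speakers = estimate_num_speakers(toy_similarity)
```

On real recordings the eigenvalue spectrum rarely shows such a clean gap, which is the kind of difficulty that motivates more careful estimators such as the one the paper proposes.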

We evaluated our system on NIST SRE 2000 CALLHOME using cross-validation and achieved an error rate of 5.1%, improving on the state of the art in speaker diarization.

DOI: 10.21437/Interspeech.2020-1879

Cite as: Aronowitz, H., Zhu, W., Suzuki, M., Kurata, G., Hoory, R. (2020) New Advances in Speaker Diarization. Proc. Interspeech 2020, 279-283, DOI: 10.21437/Interspeech.2020-1879.

  author={Hagai Aronowitz and Weizhong Zhu and Masayuki Suzuki and Gakuto Kurata and Ron Hoory},
  title={{New Advances in Speaker Diarization}},
  booktitle={Proc. Interspeech 2020},