Self-Attentive Similarity Measurement Strategies in Speaker Diarization

Qingjian Lin, Yu Hou, Ming Li

Speaker diarization can be described as the process of extracting sequential speaker embeddings from an audio stream and clustering them according to speaker identity. Deep neural network based approaches such as x-vector are now widely adopted for speaker embedding extraction. In the clustering back-end, however, probabilistic linear discriminant analysis (PLDA) is still the dominant algorithm for similarity measurement. PLDA scores each pair of embeddings independently, and so may ignore the positional correlation between adjacent speaker embeddings. To address this issue, our previous work proposed a long short-term memory (LSTM) based scoring model, followed by the spectral clustering algorithm. In this paper, we further propose two enhanced methods based on the self-attention mechanism, which no longer focus on local correlation but search the whole sequence for similar speaker embeddings. The first approach achieves state-of-the-art performance on the DIHARD II Eval Set (18.44% DER after resegmentation), while the second one operates with higher efficiency.
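The back-end described above (contextualize each embedding against the whole sequence, score all pairs, then cluster spectrally) can be illustrated with a toy sketch. This is not the scoring model proposed in the paper, which is a trained neural network: the attention here is parameter-free scaled dot-product attention, the score is a clipped cosine similarity, and the clustering is a two-speaker split on the Fiedler vector. The function names `self_attention`, `attention_affinity`, and `two_speaker_spectral_split` are hypothetical and chosen only for illustration.

```python
import numpy as np


def self_attention(x):
    """Parameter-free scaled dot-product self-attention (assumption: no
    learned projections). x has shape (n_segments, dim)."""
    dim = x.shape[1]
    scores = x @ x.T / np.sqrt(dim)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each output row is a weighted mix of ALL embeddings in the sequence.
    return weights @ x


def attention_affinity(x):
    """Affinity matrix from cosine similarity of attended embeddings,
    with negative similarities clipped to zero."""
    ctx = self_attention(x)
    ctx = ctx / np.linalg.norm(ctx, axis=1, keepdims=True)
    return np.clip(ctx @ ctx.T, 0.0, None)


def two_speaker_spectral_split(affinity):
    """Split segments into two clusters using the sign of the Fiedler
    vector of the symmetric normalized graph Laplacian."""
    n = affinity.shape[0]
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(n) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                 # second-smallest eigenvalue
    return (fiedler > 0).astype(int)


if __name__ == "__main__":
    # Synthetic "embeddings" for two speakers; not real x-vectors.
    rng = np.random.default_rng(0)
    truth = np.array([0, 0, 0, 1, 1, 0, 0, 1, 1, 1])
    centers = np.array([[3.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0]])
    embeddings = centers[truth] + rng.normal(scale=0.1, size=(10, 4))
    print(two_speaker_spectral_split(attention_affinity(embeddings)))
```

Cluster labels from a spectral split are arbitrary, so the printed labels may be the ground truth with 0 and 1 swapped. A realistic system would also need to estimate the number of speakers rather than assume two.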

DOI: 10.21437/Interspeech.2020-1908

Cite as: Lin, Q., Hou, Y., Li, M. (2020) Self-Attentive Similarity Measurement Strategies in Speaker Diarization. Proc. Interspeech 2020, 284-288, DOI: 10.21437/Interspeech.2020-1908.

  author={Qingjian Lin and Yu Hou and Ming Li},
  title={{Self-Attentive Similarity Measurement Strategies in Speaker Diarization}},
  booktitle={Proc. Interspeech 2020},