Text-Independent Speaker Verification with Dual Attention Network

Jingyu Li, Tan Lee

This paper presents a novel attention model design for text-independent speaker verification. The model takes a pair of input utterances and generates an utterance-level embedding to represent the speaker-specific characteristics of each utterance. The input utterances are expected to have highly similar embeddings if they are from the same speaker. The proposed attention model consists of a self-attention module and a mutual-attention module, which jointly contribute to the generation of the utterance-level embedding. The self-attention weights are computed from the utterance itself, while the mutual-attention weights are computed with the involvement of the other utterance in the input pair. As a result, each utterance is represented by a self-attention weighted embedding and a mutual-attention weighted embedding. The similarity between the embeddings is measured by a cosine distance score and a binary classifier output score. The whole model, named Dual Attention Network, is trained end-to-end on the VoxCeleb database. The evaluation results on the VoxCeleb1 test set show that the Dual Attention Network significantly outperforms the baseline systems. The best result yields an equal error rate of 1.6%.
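The two pooling paths described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the dot-product scoring, the hypothetical learned vector `query`, the use of the other utterance's mean frame as the mutual-attention query, and the concatenation of the two pooled vectors. The binary classifier score is omitted; only the cosine score is shown.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_mean(frames, weights):
    """Pool frame-level vectors into one vector using attention weights."""
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def self_attention_pool(frames, query):
    # Weights depend only on the utterance itself.
    # `query` stands in for a learned parameter (hypothetical here).
    return weighted_mean(frames, softmax([dot(f, query) for f in frames]))


def mutual_attention_pool(frames, other_frames):
    # Weights depend on the other utterance in the pair.
    # Assumption: the other utterance is summarised by its mean frame.
    dim = len(frames[0])
    summary = [sum(f[d] for f in other_frames) / len(other_frames) for d in range(dim)]
    return weighted_mean(frames, softmax([dot(f, summary) for f in frames]))


def embed_pair(frames_a, frames_b, query):
    # Assumption: the two pooled vectors are concatenated per utterance.
    emb_a = self_attention_pool(frames_a, query) + mutual_attention_pool(frames_a, frames_b)
    emb_b = self_attention_pool(frames_b, query) + mutual_attention_pool(frames_b, frames_a)
    return emb_a, emb_b


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


if __name__ == "__main__":
    random.seed(0)
    dim = 4
    # Random stand-ins for frame-level features of two utterances.
    utt_a = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
    utt_b = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(9)]
    query = [random.gauss(0, 1) for _ in range(dim)]
    emb_a, emb_b = embed_pair(utt_a, utt_b, query)
    print("cosine score:", cosine(emb_a, emb_b))
```

In this sketch the two utterances may have different numbers of frames, since each pooling step reduces a variable-length sequence to a fixed-size vector. A verification decision would then compare the cosine score against a threshold chosen on held-out data.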

DOI: 10.21437/Interspeech.2020-2031

Cite as: Li, J., Lee, T. (2020) Text-Independent Speaker Verification with Dual Attention Network. Proc. Interspeech 2020, 956-960, DOI: 10.21437/Interspeech.2020-2031.

  author={Jingyu Li and Tan Lee},
  title={{Text-Independent Speaker Verification with Dual Attention Network}},
  booktitle={Proc. Interspeech 2020},