Vector-Based Attentive Pooling for Text-Independent Speaker Verification

Yanfeng Wu, Chenkai Guo, Hongcan Gao, Xiaolei Hou, Jing Xu

The pooling mechanism plays an important role in deep neural network based systems for text-independent speaker verification: it aggregates the variable-length sequence of frame-level vectors into a fixed-dimensional utterance-level representation. Previous attentive pooling methods assign a single scalar attention weight to each frame-level vector, which results in insufficient collection of discriminative information. To address this issue, this paper proposes a vector-based attentive pooling method, which adopts vectorial attention instead of scalar attention. The vectorial attention can extract fine-grained features for discriminating between speakers. In addition, the vector-based attentive pooling is extended in a multi-head way to obtain better speaker embeddings from multiple aspects. The proposed pooling method is evaluated with the x-vector baseline system. Experiments are conducted on two public datasets, VoxCeleb and Speakers in the Wild (SITW). The results show that the vector-based attentive pooling method achieves superior performance compared with statistics pooling and three state-of-the-art attentive pooling methods, with best equal error rates (EER) of 2.734 and 3.062 on SITW and a best EER of 2.466 on VoxCeleb.
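The difference between scalar and vectorial attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the tanh scoring function, the parameter names `W`, `v`, and `V`, and the attention dimension `A` are assumptions chosen for illustration, and the abstract does not specify the exact formulation. The sketch only shows the structural idea: scalar attention gives each frame one weight, while vectorial attention gives each frame a separate weight per feature dimension.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scalar_attentive_pooling(H, W, v):
    """Pool frames H (T, D) with one scalar weight per frame.

    W (D, A) and v (A,) are hypothetical attention parameters.
    """
    scores = np.tanh(H @ W) @ v      # (T,): one score per frame
    alpha = softmax(scores, axis=0)  # weights sum to 1 over time
    return alpha @ H                 # (D,): weighted mean of frames


def vector_attentive_pooling(H, W, V):
    """Pool frames H (T, D) with one weight per frame AND per dimension.

    W (D, A) and V (A, D) are hypothetical attention parameters.
    """
    scores = np.tanh(H @ W) @ V      # (T, D): one score per frame and dimension
    alpha = softmax(scores, axis=0)  # each column sums to 1 over time
    return (alpha * H).sum(axis=0)   # (D,): per-dimension weighted mean


def multi_head_vector_pooling(H, heads):
    """Concatenate the outputs of several independent vector-attention heads.

    `heads` is a list of (W, V) pairs; concatenation is an assumed way
    of combining the heads.
    """
    return np.concatenate([vector_attentive_pooling(H, W, V) for W, V in heads])


# Small demonstration with random data (values are illustrative only).
rng = np.random.default_rng(0)
D, A = 8, 4
W = rng.standard_normal((D, A))
v = rng.standard_normal(A)
V = rng.standard_normal((A, D))

for T in (50, 120):  # utterances of different lengths
    H = rng.standard_normal((T, D))
    print(T, scalar_attentive_pooling(H, W, v).shape,
          vector_attentive_pooling(H, W, V).shape)
```

In both functions the output dimension is independent of the number of frames `T`, which is the property the pooling layer needs. If all scores are equal (for example, `V` set to zeros), the vector-based variant reduces to plain mean pooling over time.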

DOI: 10.21437/Interspeech.2020-1422

Cite as: Wu, Y., Guo, C., Gao, H., Hou, X., Xu, J. (2020) Vector-Based Attentive Pooling for Text-Independent Speaker Verification. Proc. Interspeech 2020, 936-940, DOI: 10.21437/Interspeech.2020-1422.

  author={Yanfeng Wu and Chenkai Guo and Hongcan Gao and Xiaolei Hou and Jing Xu},
  title={{Vector-Based Attentive Pooling for Text-Independent Speaker Verification}},
  booktitle={Proc. Interspeech 2020},