Self-Attention Encoding and Pooling for Speaker Recognition

Pooyan Safari, Miquel India, Javier Hernando


The limited computing power of mobile devices constrains end-user applications in terms of storage size, processing, memory, and energy consumption. These limitations motivate researchers to design more efficient deep models. Meanwhile, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capability and strong performance on a variety of Natural Language Processing (NLP) tasks. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances. SAEP is a stack of identical blocks relying solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings for text-independent speaker verification. We have evaluated this approach on both the VoxCeleb1 and VoxCeleb2 datasets. The proposed architecture outperforms the baseline x-vector and shows performance competitive with other convolution-based benchmarks, with a significant reduction in model size. It employs 94%, 95%, and 73% fewer parameters than ResNet-34, ResNet-50, and x-vector, respectively. This indicates that the proposed fully attention-based architecture is more efficient at extracting time-invariant features from speaker utterances.
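
As a rough illustration of the architecture the abstract describes, below is a minimal PyTorch-style sketch of a SAEP-like model: a stack of identical self-attention plus position-wise feed-forward blocks applied to frame-level spectral features, followed by an attention pooling layer that collapses the variable-length sequence into a fixed-size speaker embedding. All layer sizes, the number of blocks and heads, the learnable-query pooling variant, and the speaker count are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch of a SAEP-style model (assumed hyperparameters, not the paper's).
import torch
import torch.nn as nn
import torch.nn.functional as F


class EncoderBlock(nn.Module):
    """Self-attention + position-wise feed-forward block with residual connections."""

    def __init__(self, d_model=256, n_heads=4, d_ff=512, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, frames, d_model)
        a, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.drop(a))
        x = self.norm2(x + self.drop(self.ff(x)))
        return x


class AttentionPooling(nn.Module):
    """Collapse variable-length frame features into one utterance-level vector."""

    def __init__(self, d_model=256):
        super().__init__()
        self.score = nn.Linear(d_model, 1, bias=False)  # learnable scoring vector

    def forward(self, x):
        # x: (batch, frames, d_model) -> (batch, d_model)
        w = F.softmax(self.score(x), dim=1)
        return (w * x).sum(dim=1)


class SAEP(nn.Module):
    def __init__(self, feat_dim=30, d_model=256, n_blocks=2, n_speakers=1211):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)           # frame-level projection
        self.blocks = nn.ModuleList(
            [EncoderBlock(d_model) for _ in range(n_blocks)])
        self.pool = AttentionPooling(d_model)
        self.classifier = nn.Linear(d_model, n_speakers)   # training-time speaker head

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) short-term spectral features
        x = self.proj(feats)
        for blk in self.blocks:
            x = blk(x)
        embedding = self.pool(x)           # fixed-size speaker embedding
        return embedding, self.classifier(embedding)


# Usage example: a batch of 3-second utterances (300 frames of 30-dim features).
model = SAEP()
emb, logits = model(torch.randn(8, 300, 30))
print(emb.shape, logits.shape)  # torch.Size([8, 256]) torch.Size([8, 1211])

At test time only the pooled embedding would be kept and compared (e.g., by cosine scoring) for verification; the classification head serves only to train the encoder.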


DOI: 10.21437/Interspeech.2020-1446

Cite as: Safari, P., India, M., Hernando, J. (2020) Self-Attention Encoding and Pooling for Speaker Recognition. Proc. Interspeech 2020, 941-945, DOI: 10.21437/Interspeech.2020-1446.


@inproceedings{Safari2020,
  author={Pooyan Safari and Miquel India and Javier Hernando},
  title={{Self-Attention Encoding and Pooling for Speaker Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={941--945},
  doi={10.21437/Interspeech.2020-1446},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1446}
}