Deep Embedding Learning for Text-Dependent Speaker Verification

Peng Zhang, Peng Hu, Xueliang Zhang


In this paper we present an effective deep embedding learning architecture for the speaker verification task. Compared with the widely used residual neural network (ResNet) and time-delay neural network (TDNN) based architectures, two main improvements are proposed: 1) We use a densely connected convolutional network (DenseNet) to encode the short-term context information of the speaker. 2) A bidirectional attentive pooling strategy is proposed to further model the long-term temporal context and aggregate the important frames that reflect the speaker identity. We evaluate the proposed architecture on the text-dependent speaker verification task of the Interspeech 2020 Far Field Speaker Verification Challenge (FFSVC2020). Results show that the proposed algorithm outperforms the official FFSVC2020 baseline, achieving relative reductions of 8.06% and 19.70% in minDCF and 9.26% and 16.16% in EER on the evaluation sets of Task 1 and Task 3, respectively.
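The abstract does not spell out the bidirectional attentive pooling formulation, so as background, here is a minimal NumPy sketch of standard attentive pooling, the building block such a strategy extends: each frame-level feature vector receives a learned attention score, and the utterance-level embedding is the attention-weighted sum of the frames. All dimensions and parameter names (`W`, `v`, `T`, `D`, `A`) are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pooling(H, W, v):
    """Aggregate frame-level features H (T x D) into a single
    utterance-level embedding by weighting frames with learned
    attention scores (a generic sketch, not the paper's exact model)."""
    scores = np.tanh(H @ W) @ v   # (T,): one scalar score per frame
    alpha = softmax(scores)       # attention weights, sum to 1
    return alpha @ H              # (D,): weighted sum over frames

# toy example with assumed sizes: 50 frames, 8-dim features, 4-dim attention
rng = np.random.default_rng(0)
T, D, A = 50, 8, 4
H = rng.standard_normal((T, D))
W = rng.standard_normal((D, A))
v = rng.standard_normal(A)
embedding = attentive_pooling(H, W, v)
print(embedding.shape)  # (8,)
```

A bidirectional variant would presumably compute such attention over context encoded in both temporal directions (e.g. forward and backward recurrent states) before aggregation; the precise design is detailed in the paper itself.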


DOI: 10.21437/Interspeech.2020-1354

Cite as: Zhang, P., Hu, P., Zhang, X. (2020) Deep Embedding Learning for Text-Dependent Speaker Verification. Proc. Interspeech 2020, 3461-3465, DOI: 10.21437/Interspeech.2020-1354.


@inproceedings{Zhang2020,
  author={Peng Zhang and Peng Hu and Xueliang Zhang},
  title={{Deep Embedding Learning for Text-Dependent Speaker Verification}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3461--3465},
  doi={10.21437/Interspeech.2020-1354},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1354}
}