x-Vector DNN Refinement with Full-Length Recordings for Speaker Recognition

Daniel Garcia-Romero, David Snyder, Gregory Sell, Alan McCree, Daniel Povey, Sanjeev Khudanpur

State-of-the-art text-independent speaker recognition systems for long recordings (a few minutes) are based on deep neural network (DNN) speaker embeddings. Current implementations of this paradigm use short speech segments (a few seconds) to train the DNN. This introduces a mismatch between training and inference when extracting embeddings for long-duration recordings. To address this, we present a DNN refinement approach that updates a subset of the DNN parameters with full recordings to reduce this mismatch. At the same time, we also modify the DNN architecture to produce embeddings optimized for cosine distance scoring. This is accomplished using a large-margin strategy with angular softmax. Experimental validation shows that our approach is capable of producing embeddings that achieve record performance on the SITW benchmark.
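As an illustration of the two ideas named in the abstract, the following is a minimal NumPy sketch of cosine scoring between two embeddings and of a large-margin angular softmax loss. The abstract does not specify the exact margin formulation, so the additive-margin variant shown here is an assumption, as are the function names and the `scale` and `margin` values; this is not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosines."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def cosine_score(enroll, test):
    """Cosine similarity between an enrollment and a test embedding."""
    return float(np.dot(l2_normalize(enroll), l2_normalize(test)))


def additive_margin_softmax_loss(embeddings, class_weights, labels,
                                 scale=30.0, margin=0.2):
    """Cross-entropy over scaled cosines with a margin on the target class.

    Assumed variant (additive margin); scale and margin are illustrative.
    embeddings:    (batch, dim)
    class_weights: (num_speakers, dim)
    labels:        (batch,) integer speaker indices
    """
    # Cosine between every embedding and every speaker weight vector.
    cos = l2_normalize(embeddings) @ l2_normalize(class_weights).T
    rows = np.arange(len(labels))
    # Penalize the target-class cosine so it must win by a margin.
    cos_m = cos.copy()
    cos_m[rows, labels] -= margin
    logits = scale * cos_m
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[rows, labels].mean())
```

Because both the embeddings and the class weights are length-normalized during training, the loss depends only on angles, which is what makes the resulting embeddings a natural fit for cosine scoring at test time.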

DOI: 10.21437/Interspeech.2019-2205

Cite as: Garcia-Romero, D., Snyder, D., Sell, G., McCree, A., Povey, D., Khudanpur, S. (2019) x-Vector DNN Refinement with Full-Length Recordings for Speaker Recognition. Proc. Interspeech 2019, 1493-1496, DOI: 10.21437/Interspeech.2019-2205.

  author={Daniel Garcia-Romero and David Snyder and Gregory Sell and Alan McCree and Daniel Povey and Sanjeev Khudanpur},
  title={{x-Vector DNN Refinement with Full-Length Recordings for Speaker Recognition}},
  booktitle={Proc. Interspeech 2019},