Training Speaker Enrollment Models by Network Optimization

Victoria Mingote, Antonio Miguel, Alfonso Ortega, Eduardo Lleida


In this paper, we present a new approach to the enrollment process in a deep neural network (DNN) system, which learns the speaker model through an optimization process. Most Speaker Verification (SV) systems extract representations, called embeddings, for both the enrollment and test utterances, and then apply a similarity metric or complex back-ends to carry out the verification process. Unlike previous works, we propose to take advantage of the knowledge acquired by a DNN to model the speakers from the training set, since the last layer of the DNN can be seen as an embedding dictionary representing the training speakers. Thus, after the initial training phase, we introduce a new learnable vector for each enrollment speaker. Furthermore, to guide this training process, we employ a loss function more appropriate for verification: the approximated Detection Cost Function (aDCF) loss. The new strategy for producing enrollment models for each target speaker was tested on the RSR-Part II database for text-dependent speaker verification, where the proposed approach outperforms the reference system based on directly averaging the embeddings extracted by the network from the enrollment data and applying cosine similarity.
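The core idea above can be sketched in a few lines: instead of averaging enrollment embeddings, treat the enrollment model as a learnable vector and optimize it with a smooth, DCF-like objective. The sketch below is a minimal numpy illustration under stated assumptions: the embeddings are random toy vectors (in the paper they come from the trained DNN), the `adcf_like_loss` function is a simple sigmoid approximation of miss and false-alarm rates standing in for the paper's aDCF loss, and the numerical-gradient descent loop replaces backpropagation through the network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings (hypothetical stand-ins for DNN outputs; in the paper these
# are produced by the speaker-embedding network).
dim = 8
target = rng.normal(size=dim)
target /= np.linalg.norm(target)
enroll = np.stack([target + 0.3 * rng.normal(size=dim) for _ in range(3)])
impostors = rng.normal(size=(20, dim))  # non-target embeddings

# Baseline: average the enrollment embeddings, score with cosine similarity.
avg_model = enroll.mean(axis=0)

def adcf_like_loss(w, pos, neg, threshold=0.5, alpha=10.0):
    """Smooth DCF-style objective: sigmoid-softened miss + false-alarm rates.

    This is an illustrative stand-in for the aDCF loss, not the paper's
    exact formulation.
    """
    wn = w / np.linalg.norm(w)
    pos_scores = pos @ wn / np.linalg.norm(pos, axis=1)  # cosine per row
    neg_scores = neg @ wn / np.linalg.norm(neg, axis=1)
    p_miss = 1.0 / (1.0 + np.exp(alpha * (pos_scores - threshold)))
    p_fa = 1.0 / (1.0 + np.exp(-alpha * (neg_scores - threshold)))
    return p_miss.mean() + p_fa.mean()

loss_avg = adcf_like_loss(avg_model, enroll, impostors)

# Optimize the enrollment vector with plain gradient descent, using a
# central-difference numerical gradient for brevity.
w = avg_model.copy()
eps, lr = 1e-4, 0.05
for _ in range(300):
    grad = np.zeros_like(w)
    for i in range(dim):
        e = np.zeros(dim)
        e[i] = eps
        grad[i] = (adcf_like_loss(w + e, enroll, impostors)
                   - adcf_like_loss(w - e, enroll, impostors)) / (2 * eps)
    w -= lr * grad

loss_opt = adcf_like_loss(w, enroll, impostors)
```

After optimization, `loss_opt` should be no larger than `loss_avg`: the learned vector trades off separating the target speaker's embeddings from impostors under the verification-style objective, rather than only averaging the enrollment data.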


DOI: 10.21437/Interspeech.2020-2325

Cite as: Mingote, V., Miguel, A., Ortega, A., Lleida, E. (2020) Training Speaker Enrollment Models by Network Optimization. Proc. Interspeech 2020, 3810-3814, DOI: 10.21437/Interspeech.2020-2325.


@inproceedings{Mingote2020,
  author={Victoria Mingote and Antonio Miguel and Alfonso Ortega and Eduardo Lleida},
  title={{Training Speaker Enrollment Models by Network Optimization}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3810--3814},
  doi={10.21437/Interspeech.2020-2325},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2325}
}