Optimizing a Speaker Embedding Extractor Through Backend-Driven Regularization

Luciana Ferrer, Mitchell McLaren

State-of-the-art speaker verification systems use deep neural networks (DNNs) to extract highly discriminative representations of the samples, commonly called speaker embeddings. The networks are trained to minimize the cross-entropy between the estimated posteriors and the speaker labels. The pre-activations from one of the last layers in that network are used as embeddings. These sample-level vectors are then used as input to a backend that generates the final scores. The most successful backend for speaker verification to date is the probabilistic linear discriminant analysis (PLDA) backend. The full process consists of a linear discriminant analysis (LDA) projection of the embeddings, followed by mean and length normalization, ending with PLDA for score computation. While this procedure works very well compared to other approaches, it seems to be inherently suboptimal, since the embedding extractor is not directly trained to optimize the performance of the embeddings when the PLDA backend is used for scoring. In this work, we propose one way to encourage the DNN to generate embeddings that are optimized for use in the PLDA backend, by adding a secondary objective designed to measure the performance of such a backend within the network. We show modest but consistent gains across several speaker recognition datasets.
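The backend pipeline described above (LDA projection, mean and length normalization, PLDA scoring) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names (`fit_lda`, `normalize`, `fit_plda`, `plda_llr`) are hypothetical, and the scoring step assumes a simplified two-covariance PLDA model with between-speaker covariance `B` and within-speaker covariance `W`. The secondary training objective proposed in the paper is not shown here.

```python
import numpy as np

def fit_lda(X, y, dim):
    """Return a (d, dim) LDA projection estimated from embeddings X and labels y."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-speaker scatter
    Sb = np.zeros((d, d))  # between-speaker scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    Sw += 1e-6 * np.eye(d)  # small ridge for numerical stability (assumption)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    top = np.argsort(-evals.real)[:dim]  # directions with the largest ratio
    return evecs[:, top].real

def normalize(X, lda, mean):
    """Project with LDA, subtract the mean, then scale each vector to unit length."""
    Z = X @ lda - mean
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def fit_plda(Z, y):
    """Estimate the global mean and the between/within covariances (two-covariance model)."""
    classes = np.unique(y)
    means = np.stack([Z[y == c].mean(axis=0) for c in classes])
    m = Z.mean(axis=0)
    B = np.cov(means, rowvar=False)
    W = sum(np.cov(Z[y == c], rowvar=False) for c in classes) / len(classes)
    return m, B, W

def _logpdf0(v, S):
    """Log-density of a zero-mean Gaussian with covariance S, evaluated at v."""
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (len(v) * np.log(2 * np.pi) + logdet + v @ np.linalg.solve(S, v))

def plda_llr(z1, z2, plda):
    """Log-likelihood ratio: same-speaker hypothesis vs. different-speaker hypothesis."""
    m, B, W = plda
    v = np.concatenate([z1 - m, z2 - m])
    T = B + W
    zero = np.zeros_like(B)
    same = np.block([[T, B], [B, T]])        # shared speaker variable couples the pair
    diff = np.block([[T, zero], [zero, T]])  # independent speakers
    return _logpdf0(v, same) - _logpdf0(v, diff)
```

In this sketch a trial is scored by passing two normalized embeddings to `plda_llr`; higher values favor the same-speaker hypothesis. The LDA projection, mean, and PLDA parameters would normally be estimated on held-out backend training data rather than on the trial data themselves.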

DOI: 10.21437/Interspeech.2019-1820

Cite as: Ferrer, L., McLaren, M. (2019) Optimizing a Speaker Embedding Extractor Through Backend-Driven Regularization. Proc. Interspeech 2019, 4350-4354, DOI: 10.21437/Interspeech.2019-1820.

  author={Luciana Ferrer and Mitchell McLaren},
  title={{Optimizing a Speaker Embedding Extractor Through Backend-Driven Regularization}},
  booktitle={Proc. Interspeech 2019},