Deep Speaker Recognition: Modular or Monolithic?

Gautam Bhattacharya, Jahangir Alam, Patrick Kenny

Speaker recognition has made extraordinary progress with the advent of deep neural networks. In this work, we analyze the performance of end-to-end deep speaker recognizers on two popular text-independent tasks: NIST-SRE 2016 and VoxCeleb. Through a combination of a deep convolutional feature extractor, self-attentive pooling and large-margin loss functions, we achieve state-of-the-art performance on VoxCeleb. Our best individual and ensemble models show relative improvements of 70% and 82%, respectively, over the best reported results on this task.
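As an illustration of the self-attentive pooling step mentioned above, the following is a minimal sketch of one common formulation, in which each frame-level feature receives a scalar score, the scores are normalized with a softmax, and the utterance-level vector is the weighted mean of the frames. The abstract does not specify the exact form used in the paper, so the scoring function `v . tanh(W h + b)` and the names `frames`, `W`, `b` and `v` are assumptions made for illustration only.

```python
import math


def self_attentive_pool(frames, W, b, v):
    """Pool a sequence of frame vectors into one vector using attention weights.

    Assumed (hypothetical) scoring function: score_t = v . tanh(W h_t + b).
    frames: list of T frame vectors, each of dimension D
    W:      list of K rows, each of dimension D
    b:      list of K biases
    v:      list of K scoring weights
    """
    scores = []
    for h in frames:
        hidden = [
            math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)
        ]
        scores.append(sum(v_i * u_i for v_i, u_i in zip(v, hidden)))

    # Softmax over time, shifted by the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted mean of the frame vectors.
    dim = len(frames[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# With all-zero parameters every frame gets the same score, so the pooling
# reduces to a plain average over time.
frames = [[1.0, 0.0], [0.0, 1.0]]
zero_W = [[0.0, 0.0], [0.0, 0.0]]
pooled, weights = self_attentive_pool(frames, zero_W, [0.0, 0.0], [0.0, 0.0])
```

In a trained network the parameters would be learned jointly with the feature extractor, so that frames judged more informative about the speaker receive larger weights.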

On the challenging NIST-SRE 2016 task, our proposed end-to-end models show good performance but are unable to match a strong i-vector baseline. State-of-the-art systems for this task use a modular framework that combines neural network embeddings with a probabilistic linear discriminant analysis (PLDA) classifier. Drawing inspiration from this approach, we propose to replace the PLDA classifier with a neural network. Our modular neural network approach outperforms the i-vector baseline while using cosine distance to score verification trials.
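Cosine scoring of a verification trial can be sketched as follows: the enrollment and test embeddings are compared by the cosine of the angle between them, and the trial is accepted when the score reaches a threshold. The function names and the threshold value of 0.5 are hypothetical and chosen only for illustration; in practice the threshold would be tuned on development data.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between an enrollment and a test embedding."""
    dot = sum(a * b for a, b in zip(enroll, test))
    norm = math.sqrt(sum(a * a for a in enroll)) * math.sqrt(sum(b * b for b in test))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def accept_trial(enroll, test, threshold=0.5):
    """Accept the trial as same-speaker if the score reaches the threshold.

    The default threshold is an illustrative placeholder, not a tuned value.
    """
    return cosine_score(enroll, test) >= threshold


# Embeddings pointing in the same direction score 1.0; orthogonal ones score 0.0.
same = cosine_score([1.0, 0.0], [2.0, 0.0])
different = cosine_score([1.0, 0.0], [0.0, 3.0])
```

Because the score depends only on the angle between embeddings, it needs no trained back-end classifier, which is what distinguishes it from PLDA scoring.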

DOI: 10.21437/Interspeech.2019-3146

Cite as: Bhattacharya, G., Alam, J., Kenny, P. (2019) Deep Speaker Recognition: Modular or Monolithic?. Proc. Interspeech 2019, 1143-1147, DOI: 10.21437/Interspeech.2019-3146.

  author={Gautam Bhattacharya and Jahangir Alam and Patrick Kenny},
  title={{Deep Speaker Recognition: Modular or Monolithic?}},
  booktitle={Proc. Interspeech 2019},