Learning Discriminative Features for Speaker Identification and Verification

Sarthak Yadav, Atul Rai

The success of any text-independent speaker identification or verification system depends on its ability to learn discriminative features. In this paper we propose a Convolutional Neural Network (CNN) architecture based on the popular Very Deep VGG [1] CNNs, with key modifications that accommodate variable-length spectrogram inputs and reduce both the model’s disk space requirements and its number of parameters, resulting in a significant reduction in training time. We also propose a unified deep learning system for both text-independent speaker identification and speaker verification: the proposed network is trained under the joint supervision of Softmax loss and Center loss [2] to obtain highly discriminative deep features suited to both tasks. We benchmark our approach on the recently released VoxCeleb dataset [3], which contains hundreds of thousands of real-world utterances from over 1200 celebrities of various ethnicities. Our best CNN model achieved a Top-1 accuracy of 84.6%, a 4% absolute improvement over VoxCeleb’s approach, whereas training in conjunction with Center loss improved the Top-1 accuracy to 89.5%, a 9% absolute improvement over VoxCeleb’s approach.
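The joint supervision mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual formulation of Center loss from [2], in which the total objective is the Softmax (cross-entropy) loss plus a weighted penalty on the squared distance between each deep feature and the center of its class. The function names, the batch averaging, and the weight `lam` are illustrative assumptions, not values or code taken from the paper.

```python
import math

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of logit rows."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]
    return total / len(labels)

def center_loss(features, labels, centers):
    """Half the squared distance from each feature to its class center,
    averaged over the batch (the averaging is an assumption of this sketch)."""
    total = 0.0
    for x, y in zip(features, labels):
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(x, centers[y]))
    return total / len(labels)

def joint_loss(logits, features, labels, centers, lam=0.01):
    """Softmax loss plus lam times Center loss; lam is a hypothetical weight."""
    return (softmax_cross_entropy(logits, labels)
            + lam * center_loss(features, labels, centers))
```

In a real training loop the class centers would be updated alongside the network weights. This sketch only evaluates the objective for fixed centers, which is enough to show how the Center loss term pulls features of the same speaker toward a common point while the Softmax term keeps different speakers separable.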

DOI: 10.21437/Interspeech.2018-1015

Cite as: Yadav, S., Rai, A. (2018) Learning Discriminative Features for Speaker Identification and Verification. Proc. Interspeech 2018, 2237-2241, DOI: 10.21437/Interspeech.2018-1015.

  author={Sarthak Yadav and Atul Rai},
  title={Learning Discriminative Features for Speaker Identification and Verification},
  booktitle={Proc. Interspeech 2018},