Densely Connected Time Delay Neural Network for Speaker Verification

Ya-Qi Yu, Wu-Jun Li


Time delay neural networks (TDNNs) have been widely used in speaker verification tasks. Recently, two TDNN-based models, extended TDNN (E-TDNN) and factorized TDNN (F-TDNN), have been proposed to improve the accuracy of vanilla TDNN. However, both E-TDNN and F-TDNN have more parameters than vanilla TDNN because they use deeper networks. In this paper, we propose a novel TDNN-based model, called densely connected TDNN (D-TDNN), which adopts bottleneck layers and dense connectivity. D-TDNN has fewer parameters than existing TDNN-based models. Furthermore, we propose an improved variant of D-TDNN, called D-TDNN-SS, which employs multiple TDNN branches with short-term and long-term contexts. D-TDNN-SS integrates the information from these branches with a newly designed channel-wise selection mechanism called statistics-and-selection (SS). Experiments on the VoxCeleb datasets show that both D-TDNN and D-TDNN-SS outperform existing models, achieving state-of-the-art accuracy with fewer parameters, and that D-TDNN-SS achieves better accuracy than D-TDNN.
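As a rough illustration of why bottleneck layers combined with dense connectivity can keep the parameter count low, the sketch below counts the weights of a densely connected block and of a plain stack of wide TDNN layers, treating each TDNN layer as a 1-D convolution over time. All channel widths, the growth rate, the kernel size, and the helper names are hypothetical values chosen for the example; they are not the configuration reported in the paper.

```python
def conv1d_params(in_channels, out_channels, kernel_size, bias=True):
    """Parameter count of a 1-D convolution over time."""
    weights = in_channels * out_channels * kernel_size
    return weights + (out_channels if bias else 0)


def dense_block_params(in_channels, growth, bottleneck, kernel_size, num_layers):
    """Parameter count of a densely connected block with bottleneck layers.

    Each layer receives the concatenation of the block input and all earlier
    layer outputs, reduces it to `bottleneck` channels with a kernel-1
    convolution, then applies a context convolution that emits `growth`
    new channels. Returns (total parameters, output channel count).
    """
    total = 0
    channels = in_channels
    for _ in range(num_layers):
        total += conv1d_params(channels, bottleneck, 1)            # bottleneck layer
        total += conv1d_params(bottleneck, growth, kernel_size)    # context layer
        channels += growth                                         # dense concatenation
    return total, channels


def plain_stack_params(channels, kernel_size, num_layers):
    """Parameter count of a plain stack of equally wide TDNN layers."""
    return num_layers * conv1d_params(channels, channels, kernel_size)


# Hypothetical sizes, chosen only so that both designs end at 512 channels.
dense_total, dense_out = dense_block_params(
    in_channels=128, growth=64, bottleneck=128, kernel_size=3, num_layers=6
)
plain_total = plain_stack_params(channels=512, kernel_size=3, num_layers=6)

print(f"dense block: {dense_total} parameters, {dense_out} output channels")
print(f"plain stack: {plain_total} parameters")
```

Under these assumed sizes, the dense block reaches the same output width with far fewer weights, because each context convolution only operates on the narrow bottleneck output rather than on the full concatenated feature map.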


DOI: 10.21437/Interspeech.2020-1275

Cite as: Yu, Y., Li, W. (2020) Densely Connected Time Delay Neural Network for Speaker Verification. Proc. Interspeech 2020, 921-925, DOI: 10.21437/Interspeech.2020-1275.


@inproceedings{Yu2020,
  author={Ya-Qi Yu and Wu-Jun Li},
  title={{Densely Connected Time Delay Neural Network for Speaker Verification}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={921--925},
  doi={10.21437/Interspeech.2020-1275},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1275}
}