ARET: Aggregated Residual Extended Time-Delay Neural Networks for Speaker Verification

Ruiteng Zhang, Jianguo Wei, Wenhuan Lu, Longbiao Wang, Meng Liu, Lin Zhang, Jiayu Jin, Junhai Xu


The time-delay neural network (TDNN) is widely used in speaker verification to extract long-term temporal features of speakers. Although common TDNN approaches capture time-sequential information well, they lack the fine-grained transformations needed for deep representations. To address this problem, we propose two TDNN architectures: RET, which integrates shortcut connections into conventional time-delay blocks, and ARET, which adopts a split-transform-merge strategy to extract more discriminative representations. Experiments on the VoxCeleb datasets without data augmentation show that ARET achieves equal error rates (EER) of 1.389%, 1.520%, and 2.614% on the VoxCeleb1 test set, VoxCeleb1-E, and VoxCeleb1-H, respectively. Compared to state-of-the-art results on these test sets, RET achieves a 23% to 43% relative reduction in EER, and ARET achieves 32% to 45%.
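
The two ideas named in the abstract, a shortcut connection around a time-delay block and a split-transform-merge block, can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the layer widths, the context offsets {t-2, t, t+2}, the number of branches, the ReLU placement, and all function names (`time_delay_layer`, `residual_block`, `aggregated_block`, `random_weights`) are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import random

def time_delay_layer(x, weights, offsets):
    """Apply one time-delay (dilated 1-D convolution) layer.

    x: list of T frames, each a list of C_in floats.
    weights[k][o][i]: weight from input channel i to output channel o
    for the k-th context offset in `offsets`.
    Frames outside the utterance are treated as zeros, so the output
    keeps T frames and can be added to a shortcut.
    """
    n_frames = len(x)
    c_out = len(weights[0])
    out = []
    for t in range(n_frames):
        frame = [0.0] * c_out
        for k, off in enumerate(offsets):
            s = t + off
            if 0 <= s < n_frames:
                for o in range(c_out):
                    frame[o] += sum(w * v for w, v in zip(weights[k][o], x[s]))
        out.append(frame)
    return out

def relu(x):
    return [[max(0.0, v) for v in frame] for frame in x]

def add(a, b):
    return [[u + v for u, v in zip(fa, fb)] for fa, fb in zip(a, b)]

def residual_block(x, weights, offsets):
    """Shortcut connection (assumed form): y = ReLU(x + F(x))."""
    return relu(add(x, time_delay_layer(x, weights, offsets)))

def aggregated_block(x, branches, offsets):
    """Split-transform-merge (assumed form): y = ReLU(x + sum_b F_b(x)).

    Each branch is (w_reduce, w_tdnn, w_expand): a pointwise projection
    to a narrow width, a time-delay layer at that width, and a pointwise
    projection back to the input width. Branch outputs are summed.
    """
    merged = [[0.0] * len(x[0]) for _ in x]
    for w_reduce, w_tdnn, w_expand in branches:
        h = relu(time_delay_layer(x, w_reduce, (0,)))
        h = relu(time_delay_layer(h, w_tdnn, offsets))
        merged = add(merged, time_delay_layer(h, w_expand, (0,)))
    return relu(add(x, merged))

def random_weights(rng, n_taps, c_out, c_in, scale=0.1):
    """Hypothetical helper: untrained weights for illustration only."""
    return [[[rng.uniform(-scale, scale) for _ in range(c_in)]
             for _ in range(c_out)] for _ in range(n_taps)]

rng = random.Random(0)
frames, channels, width, cardinality = 10, 8, 2, 4  # illustrative sizes
offsets = (-2, 0, 2)  # context {t-2, t, t+2}
x = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(frames)]

ret_out = residual_block(x, random_weights(rng, 3, channels, channels), offsets)
branches = [(random_weights(rng, 1, width, channels),
             random_weights(rng, 3, width, width),
             random_weights(rng, 1, channels, width))
            for _ in range(cardinality)]
aret_out = aggregated_block(x, branches, offsets)
print(len(ret_out), len(ret_out[0]), len(aret_out), len(aret_out[0]))
```

Both blocks preserve the frame count and channel width of their input, which is what allows the shortcut to be added element-wise. In the aggregated block, each branch works at a narrow width, so several parallel transformations can be merged for a parameter budget comparable to that of a single wide layer.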


DOI: 10.21437/Interspeech.2020-1626

Cite as: Zhang, R., Wei, J., Lu, W., Wang, L., Liu, M., Zhang, L., Jin, J., Xu, J. (2020) ARET: Aggregated Residual Extended Time-Delay Neural Networks for Speaker Verification. Proc. Interspeech 2020, 946-950, DOI: 10.21437/Interspeech.2020-1626.


@inproceedings{Zhang2020,
  author={Ruiteng Zhang and Jianguo Wei and Wenhuan Lu and Longbiao Wang and Meng Liu and Lin Zhang and Jiayu Jin and Junhai Xu},
  title={{ARET: Aggregated Residual Extended Time-Delay Neural Networks for Speaker Verification}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={946--950},
  doi={10.21437/Interspeech.2020-1626},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1626}
}