An Effective Deep Embedding Learning Architecture for Speaker Verification

Yiheng Jiang, Yan Song, Ian McLoughlin, Zhifu Gao, Li-Rong Dai

In this paper, we present an effective deep embedding learning architecture for speaker verification (SV) that combines densely connected dilated convolutional layers with a gating mechanism. Compared with the widely used time-delay neural network (TDNN) based architecture, we propose two main improvements. (1) Dilated filters are designed to capture time-frequency context information effectively, and the convolutional layer outputs are then used for embedding learning. Specifically, we adopt the idea behind DenseNet and collect context information through dense connections from each layer to every subsequent layer in a feed-forward fashion. (2) A gating mechanism is introduced to provide channel-wise attention by exploiting inter-dependencies across channels. Motivated by squeeze-and-excitation networks (SENet), it uses global time-frequency information for this feature recalibration. To evaluate the proposed architecture, we conduct extensive experiments on noisy and unconstrained SV tasks, namely Speakers in the Wild (SITW) and VoxCeleb1. The results demonstrate state-of-the-art SV performance. Specifically, compared with the TDNN-based method, our proposed method reduces the equal error rate (EER) by 25% on SITW and 27% on VoxCeleb1.
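The two ideas named in the abstract, dense connections between dilated convolutional layers and SENet-style channel gating, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: for brevity it convolves over time only (the paper works with time-frequency context), and every function name, weight value, layer count, and dilation rate below is a hypothetical choice made for illustration.

```python
import math

def dilated_conv1d(x, kernels, dilation):
    """Dilated 1-D convolution with zero 'same' padding and ReLU.
    x: list of C_in channels, each a list of T floats.
    kernels: list of C_out filters, each shaped [C_in][K] with odd K."""
    T = len(x[0])
    out = []
    for filt in kernels:
        K = len(filt[0])
        half = (K - 1) // 2
        row = []
        for t in range(T):
            acc = 0.0
            for c, taps in enumerate(filt):
                for k, w in enumerate(taps):
                    idx = t + (k - half) * dilation  # taps spaced `dilation` apart
                    if 0 <= idx < T:
                        acc += w * x[c][idx]
            row.append(max(acc, 0.0))  # ReLU
        out.append(row)
    return out

def dense_block(x, layer_kernels, dilations):
    """DenseNet-style block: each layer sees the input concatenated with
    the outputs of all earlier layers (concatenation along channels)."""
    features = list(x)
    for kernels, d in zip(layer_kernels, dilations):
        features = features + dilated_conv1d(features, kernels, d)
    return features

def se_gate(x, w1, w2):
    """Squeeze-and-excitation style gate. w1: [H][C], w2: [C][H]."""
    squeezed = [sum(ch) / len(ch) for ch in x]  # squeeze: global average per channel
    hidden = [max(sum(w * s for w, s in zip(row, squeezed)), 0.0) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]  # excitation: one sigmoid gate per channel
    gated = [[g * v for v in ch] for g, ch in zip(gates, x)]
    return gated, gates

# Toy input: 2 channels, 6 time steps (all values are made up).
x = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0.5] * 6]
layer_kernels = [
    [[[0.1, 0.2, 0.1], [0.3, 0.0, -0.3]]],                      # layer 1: 2 -> 1 channel
    [[[0.1, 0.0, 0.1], [0.0, 0.2, 0.0], [0.05, 0.05, 0.05]]],   # layer 2: 3 -> 1 channel
]
dilations = [1, 2]
feats = dense_block(x, layer_kernels, dilations)

w1 = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]     # hypothetical bottleneck weights
w2 = [[0.5, -0.5], [0.2, 0.2], [-0.3, 0.1], [0.0, 0.0]]
gated, gates = se_gate(feats, w1, w2)
```

In this sketch the dense block grows the channel count by one per layer, so the second layer receives both the raw input and the first layer's output, and the gate then rescales each of the resulting channels by a value between 0 and 1 computed from global averages. A practical system would use a deep learning framework, learned weights, and two-dimensional time-frequency filters.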

DOI: 10.21437/Interspeech.2019-1606

Cite as: Jiang, Y., Song, Y., McLoughlin, I., Gao, Z., Dai, L. (2019) An Effective Deep Embedding Learning Architecture for Speaker Verification. Proc. Interspeech 2019, 4040-4044, DOI: 10.21437/Interspeech.2019-1606.

  author={Yiheng Jiang and Yan Song and Ian McLoughlin and Zhifu Gao and Li-Rong Dai},
  title={{An Effective Deep Embedding Learning Architecture for Speaker Verification}},
  booktitle={Proc. Interspeech 2019},