On the Usage of Multi-Feature Integration for Speaker Verification and Language Identification

Zheng Li, Miao Zhao, Jing Li, Lin Li, Qingyang Hong


In this paper, we study the integration of multiple acoustic features for Automatic Speaker Verification (ASV) and Language Identification (LID). In contrast to score-level fusion, a common method for integrating subsystems built upon various acoustic features, we explore a new integration strategy that combines multiple acoustic features within the x-vector framework. Integration at the frame level, statistics pooling level, segment level, and embedding level is investigated in this study. Our results indicate that frame-level integration of multiple acoustic features achieves the best performance in both speaker and language recognition tasks, and that the multi-feature integration strategy generalizes across both classification tasks. Furthermore, we introduce a time-restricted attention mechanism into the frame-level integration structure to further improve performance. Experiments are conducted on VoxCeleb 1 for ASV and AP-OLR-17 for LID, yielding 28% and 19% relative improvements in Equal Error Rate (EER) on the ASV and LID tasks, respectively.
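The frame-level integration described above can be illustrated with a minimal sketch: per-frame feature streams (e.g. MFCCs and filterbanks) are concatenated along the feature dimension before the frame-level layers of an x-vector style network, and statistics pooling later collapses the time axis into mean and standard-deviation statistics. The feature dimensions and the pooling function below are illustrative assumptions, not the authors' exact architecture.

```python
import numpy as np

def frame_level_integration(feature_streams):
    """Concatenate multiple per-frame acoustic feature streams.

    Each stream is a (num_frames, feat_dim) array; all streams are
    assumed to share the same frame rate and number of frames. The
    result is a (num_frames, sum_of_dims) matrix that could feed the
    frame-level layers of an x-vector style network.
    """
    num_frames = feature_streams[0].shape[0]
    assert all(s.shape[0] == num_frames for s in feature_streams)
    return np.concatenate(feature_streams, axis=1)

def statistics_pooling(frames):
    """Mean + standard-deviation pooling over the time axis, in the
    spirit of the x-vector statistics pooling layer."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# Toy example: 200 frames of hypothetical 24-dim MFCCs and 40-dim filterbanks.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((200, 24))
fbank = rng.standard_normal((200, 40))
integrated = frame_level_integration([mfcc, fbank])   # shape (200, 64)
pooled = statistics_pooling(integrated)               # shape (128,)
```

Concatenating at the frame level lets a single network learn joint representations across feature types, in contrast to score-level fusion, where each feature trains its own subsystem and only the output scores are combined.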


DOI: 10.21437/Interspeech.2020-1960

Cite as: Li, Z., Zhao, M., Li, J., Li, L., Hong, Q. (2020) On the Usage of Multi-Feature Integration for Speaker Verification and Language Identification. Proc. Interspeech 2020, 457-461, DOI: 10.21437/Interspeech.2020-1960.


@inproceedings{Li2020,
  author={Zheng Li and Miao Zhao and Jing Li and Lin Li and Qingyang Hong},
  title={{On the Usage of Multi-Feature Integration for Speaker Verification and Language Identification}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={457--461},
  doi={10.21437/Interspeech.2020-1960},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1960}
}