Segment-Level Effects of Gender, Nationality and Emotion Information on Text-Independent Speaker Verification

Kai Li, Masato Akagi, Yibo Wu, Jianwu Dang


Speaker embeddings extracted from neural networks (NNs) achieve excellent performance on general speaker verification (SV) tasks. Most current SV systems are trained with speaker labels only, so the interaction between different types of domain information decreases the prediction accuracy of SV. To overcome this weakness and improve SV performance, we propose four SV systems that use gender, nationality, and emotion information to add constraints in the NN training stage. Specifically, three multitask learning (MTL) systems, multitask gender (MTG), multitask nationality (MTN), and multitask gender and nationality (MTGN), are used to enhance the learning of gender and nationality information. One domain adversarial training (DAT) system, emotion domain adversarial training (EDAT), is used to suppress the learning of emotion information. Experimental results indicate that encouraging gender and nationality information and suppressing emotion information both improve SV performance. Our proposed systems achieved relative improvements in equal error rate of 16.4% and 22.9% for the MTL- and DAT-based systems, respectively.
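The two training strategies named in the abstract can be illustrated with a minimal sketch of how the losses might be combined. This is not the authors' implementation: the abstract does not give the exact objectives, so the function names, the weights `aux_weight` and `adv_weight`, and the sign convention for the adversarial term (the usual effect of a gradient reversal layer on the embedding extractor) are all assumptions made for illustration. The helper `relative_improvement` shows the standard way a relative equal error rate (EER) improvement is computed.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (logits is a list of floats)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def multitask_loss(spk_loss, aux_losses, aux_weight=0.1):
    """MTL-style objective (assumed form): speaker loss plus weighted
    auxiliary losses, e.g. gender and/or nationality classification."""
    return spk_loss + aux_weight * sum(aux_losses)


def adversarial_encoder_loss(spk_loss, emo_loss, adv_weight=0.1):
    """DAT-style objective as seen by the embedding extractor (assumed form):
    the emotion loss enters with a negative sign, so the extractor is pushed
    to make emotion hard to predict while keeping speakers separable."""
    return spk_loss - adv_weight * emo_loss


def relative_improvement(baseline_eer, system_eer):
    """Relative EER improvement in percent: 100 * (baseline - system) / baseline."""
    return 100.0 * (baseline_eer - system_eer) / baseline_eer


if __name__ == "__main__":
    # Hypothetical logits and labels, for illustration only.
    spk = cross_entropy([2.0, 0.5, -1.0], target=0)
    gender = cross_entropy([0.3, 0.9], target=1)
    emotion = cross_entropy([0.1, 0.2, 0.4], target=2)
    print("MTL loss:", multitask_loss(spk, [gender]))
    print("DAT extractor loss:", adversarial_encoder_loss(spk, emotion))
```

In such a setup the auxiliary classifiers would be discarded after training, and only the embedding extractor would be used for verification.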


DOI: 10.21437/Interspeech.2020-1700

Cite as: Li, K., Akagi, M., Wu, Y., Dang, J. (2020) Segment-Level Effects of Gender, Nationality and Emotion Information on Text-Independent Speaker Verification. Proc. Interspeech 2020, 2987-2991, DOI: 10.21437/Interspeech.2020-1700.


@inproceedings{Li2020,
  author={Kai Li and Masato Akagi and Yibo Wu and Jianwu Dang},
  title={{Segment-Level Effects of Gender, Nationality and Emotion Information on Text-Independent Speaker Verification}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2987--2991},
  doi={10.21437/Interspeech.2020-1700},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1700}
}