Multimodal Association for Speaker Verification

Suwon Shon, James Glass


In this paper, we propose a multimodal association approach that fine-tunes a speaker verification system using both voice and face. Inspired by neuroscientific findings, the proposed approach mimics how a unimodal perception system benefits from the multisensory association of stimulus pairs. To verify this, we run experiments using the SRE18 evaluation protocol and use out-of-domain data, VoxCeleb, for the proposed multimodal fine-tuning. Although the proposed approach relies on paired voice-face multimodal data during the training phase, the face modality is no longer needed once training is complete, and only speech audio is used by the speaker verification system. In the experiments, we observed that the unimodal model, i.e., the speaker verification model, benefits from the multimodal association of voice and face and generalizes better by learning a channel-invariant speaker representation.
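
The abstract describes fine-tuning only the voice branch with paired voice-face data and discarding the face branch at test time. Below is a minimal sketch of that idea, not the authors' implementation: the encoder architectures, feature dimensions, and the cosine-based association loss are all assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Toy stand-in for a pretrained speaker embedding network (fine-tuned)."""
    def __init__(self, feat_dim=40, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        return self.net(x).mean(dim=1)    # average-pool over frames

class FaceEncoder(nn.Module):
    """Toy stand-in for a pretrained face embedding network (kept frozen)."""
    def __init__(self, in_dim=512, emb_dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, emb_dim)

    def forward(self, x):                 # x: (batch, in_dim) face descriptors
        return self.proj(x)

voice_enc, face_enc = SpeakerEncoder(), FaceEncoder()
for p in face_enc.parameters():           # only the speaker branch is updated
    p.requires_grad = False

opt = torch.optim.Adam(voice_enc.parameters(), lr=1e-4)

# One fine-tuning step on a batch of paired voice/face data (random tensors
# here; in practice these would come from a paired corpus such as VoxCeleb).
voice_feats = torch.randn(8, 200, 40)     # e.g. 200 frames of 40-dim features
face_feats = torch.randn(8, 512)          # precomputed face descriptors

v = F.normalize(voice_enc(voice_feats), dim=-1)
f = F.normalize(face_enc(face_feats), dim=-1)

# Association loss: pull each utterance embedding toward the face embedding
# of the same person (cosine distance). A speaker-classification loss would
# typically accompany this term during fine-tuning.
assoc_loss = (1.0 - (v * f).sum(dim=-1)).mean()

opt.zero_grad()
assoc_loss.backward()
opt.step()

# At test time the face branch is discarded: verification scores are computed
# from voice embeddings alone (e.g., cosine similarity between enrollment and
# test utterances).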


DOI: 10.21437/Interspeech.2020-1996

Cite as: Shon, S., Glass, J. (2020) Multimodal Association for Speaker Verification. Proc. Interspeech 2020, 2247-2251, DOI: 10.21437/Interspeech.2020-1996.


@inproceedings{Shon2020,
  author={Suwon Shon and James Glass},
  title={{Multimodal Association for Speaker Verification}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2247--2251},
  doi={10.21437/Interspeech.2020-1996},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1996}
}