Multi-Modality Matters: A Performance Leap on VoxCeleb

Zhengyang Chen, Shuai Wang, Yanmin Qian


Information from different modalities is usually complementary. In this paper, we use the audio and visual data in the VoxCeleb dataset for person verification. We explore different information fusion strategies and loss functions for the audio-visual person verification system at the embedding level. System performance is evaluated using the public trial lists of the VoxCeleb1 dataset. Our best system, which exploits audio-visual knowledge at the embedding level, achieves 0.585%, 0.427% and 0.735% EER on the three official trial lists of VoxCeleb1, which are the best reported results on this dataset. Moreover, to imitate a more complex test environment in which one modality is corrupted or missing, we construct a noisy evaluation set based on the VoxCeleb1 dataset. We use a data augmentation strategy at the embedding level to help our audio-visual system distinguish noisy from clean embeddings. With this augmentation strategy, the proposed audio-visual person verification system is more robust on the noisy evaluation set.


DOI: 10.21437/Interspeech.2020-2229

Cite as: Chen, Z., Wang, S., Qian, Y. (2020) Multi-Modality Matters: A Performance Leap on VoxCeleb. Proc. Interspeech 2020, 2252-2256, DOI: 10.21437/Interspeech.2020-2229.


@inproceedings{Chen2020,
  author={Zhengyang Chen and Shuai Wang and Yanmin Qian},
  title={{Multi-Modality Matters: A Performance Leap on VoxCeleb}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2252--2256},
  doi={10.21437/Interspeech.2020-2229},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2229}
}