Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision

Soo-Whan Chung, Hong-Goo Kang, Joon Son Chung


The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities. The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a significant margin.
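To make the training strategy described above more concrete, the following is a minimal sketch of a combined objective: a cross-modal matching term over paired audio and video embeddings plus an intra-modal separation term that pushes apart embeddings of different clips within each modality. This is an illustrative PyTorch sketch, not the authors' implementation; the function name cross_modal_losses and the temperature and margin values are assumptions made for the example.

import torch
import torch.nn.functional as F

def cross_modal_losses(audio_emb, video_emb, temperature=0.07, margin=0.4):
    # audio_emb, video_emb: (N, D) embeddings of N segments, where row i of
    # each tensor comes from the same source clip (natural cross-modal pairing).
    # L2-normalise so dot products correspond to cosine similarity.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)

    # Cross-modal term: each audio embedding should match its own video
    # embedding more closely than any other clip in the batch, and vice versa.
    logits = a @ v.t() / temperature                      # (N, N) similarities
    targets = torch.arange(a.size(0), device=a.device)    # matching pairs lie on the diagonal
    cross_loss = 0.5 * (F.cross_entropy(logits, targets) +
                        F.cross_entropy(logits.t(), targets))

    # Intra-modal term: penalise pairs of *different* clips within the same
    # modality whose similarity exceeds a margin, keeping the features
    # discriminative for uni-modal downstream tasks.
    def separation(x):
        sim = x @ x.t()                                   # (N, N) cosine similarities
        off_diag = ~torch.eye(x.size(0), dtype=torch.bool, device=x.device)
        return F.relu(sim[off_diag] - margin).mean()

    intra_loss = separation(a) + separation(v)
    return cross_loss + intra_loss

# Toy usage with random tensors standing in for encoder outputs.
audio = torch.randn(8, 512)
video = torch.randn(8, 512)
print(cross_modal_losses(audio, video))

The two terms are complementary: the cross-modal term alone would suffice for synchronisation or biometric matching, while the added intra-modal term is what the abstract attributes the gains on lip reading and speaker recognition to.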


 DOI: 10.21437/Interspeech.2020-1113

Cite as: Chung, S., Kang, H., Chung, J.S. (2020) Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision. Proc. Interspeech 2020, 3486-3490, DOI: 10.21437/Interspeech.2020-1113.


@inproceedings{Chung2020,
  author={Soo-Whan Chung and Hong-Goo Kang and Joon Son Chung},
  title={{Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3486--3490},
  doi={10.21437/Interspeech.2020-1113},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1113}
}