Audiovisual Correspondence Learning in Humans and Machines

Venkat Krishnamohan, Akshara Soman, Anshul Gupta, Sriram Ganapathy


Audiovisual correspondence learning is the task of acquiring the association between images and their corresponding audio. In this paper, we propose a novel experimental paradigm in which unfamiliar pseudo-images and spoken pseudowords are introduced to both humans and machine systems. The task is to learn the association between image-audio pairs, which is later evaluated with a retrieval task. The machine system used in the study is pretrained on the ImageNet corpus along with the corresponding audio labels, and this model is then transfer learned on the new image-audio pairs. Using the proposed paradigm, we perform a direct comparison of one-shot, two-shot, and three-shot learning performance for humans and machine systems. The human behavioral experiment confirms that the majority of the correspondence learning happens during the first exposure to an audio-visual pair. This paper proposes a machine model that performs on par with humans in audiovisual correspondence learning. However, compared to the machine model, humans exhibited better generalization to new input samples after a single exposure.
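The abstract does not specify the model architecture or training objective, so the following is only a minimal sketch of the general setup it describes: a two-tower embedding model over image and audio features, trained with a contrastive correspondence loss and evaluated with cosine-similarity retrieval. All names, dimensions, and hyperparameters below are hypothetical, not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    """Maps precomputed image and audio features into a shared embedding space."""
    def __init__(self, img_dim=2048, aud_dim=1024, emb_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)  # image tower (hypothetical dims)
        self.aud_proj = nn.Linear(aud_dim, emb_dim)  # audio tower (hypothetical dims)

    def forward(self, img_feat, aud_feat):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_aud = F.normalize(self.aud_proj(aud_feat), dim=-1)
        return z_img, z_aud

def correspondence_loss(z_img, z_aud, temperature=0.07):
    # Symmetric cross-entropy over the batch: each image should match its
    # paired audio and vice versa (an InfoNCE-style objective).
    logits = z_img @ z_aud.t() / temperature
    targets = torch.arange(z_img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def retrieval_accuracy(z_img, z_aud):
    # Audio-to-image retrieval: an audio query counts as correct when its
    # paired image has the highest cosine similarity in the batch.
    sims = z_aud @ z_img.t()
    return (sims.argmax(dim=1) == torch.arange(sims.size(0))).float().mean().item()

# Few-shot transfer: starting from pretrained towers, take a small number of
# gradient steps on the new pairs, then evaluate with the retrieval task.
model = TwoTowerModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
img_feat, aud_feat = torch.randn(8, 2048), torch.randn(8, 1024)  # dummy features
for _ in range(3):  # e.g. three "exposures" to the same pairs
    z_img, z_aud = model(img_feat, aud_feat)
    loss = correspondence_loss(z_img, z_aud)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
with torch.no_grad():
    acc = retrieval_accuracy(*model(img_feat, aud_feat))

Under this reading, retrieval accuracy measured after the first, second, and third update loosely mirrors the one-, two-, and three-shot comparison described in the abstract.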


DOI: 10.21437/Interspeech.2020-2674

Cite as: Krishnamohan, V., Soman, A., Gupta, A., Ganapathy, S. (2020) Audiovisual Correspondence Learning in Humans and Machines. Proc. Interspeech 2020, 4462-4466, DOI: 10.21437/Interspeech.2020-2674.


@inproceedings{Krishnamohan2020,
  author={Venkat Krishnamohan and Akshara Soman and Anshul Gupta and Sriram Ganapathy},
  title={{Audiovisual Correspondence Learning in Humans and Machines}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4462--4466},
  doi={10.21437/Interspeech.2020-2674},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2674}
}