Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks

Masood S. Mortazavi


Semantically-aligned (speech, image) datasets can be used to explore "visually-grounded speech". In a majority of existing investigations, features of the image signal are extracted using neural networks "pre-trained" on other tasks (e.g., classification on ImageNet). In still others, pre-trained networks are used to extract audio features prior to semantic embedding. Without "transfer learning" through pre-trained initialization or pre-trained feature extraction, previous results have tended to show low recall rates in speech → image and image → speech queries.

By choosing appropriate neural architectures for the encoders in the speech and image branches and using large datasets, one can obtain competitive recall rates without reliance on any pre-trained initialization or feature extraction: (speech, image) semantic alignment and speech → image and image → speech retrieval are canonical tasks worthy of investigation in their own right, and they allow one to explore further questions — e.g., the size of the audio embedder can be reduced significantly with little loss in recall for speech → image and image → speech queries.
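The retrieval evaluation mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes paired speech and image embeddings where index i in one list matches index i in the other, ranks candidates by cosine similarity, and reports recall@k (the fraction of queries whose true match appears in the top k).

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recall_at_k(queries, targets, k):
    # Fraction of queries whose ground-truth match (same index)
    # ranks within the top-k most similar targets.
    hits = 0
    for i, q in enumerate(queries):
        sims = [cosine(q, t) for t in targets]
        ranked = sorted(range(len(targets)), key=lambda j: -sims[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Toy aligned embeddings (illustrative only): speech → image direction.
speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(recall_at_k(speech, image, 1))  # → 1.0
```

Swapping the roles of `queries` and `targets` gives the image → speech direction; in practice both are reported at several values of k (e.g., R@1, R@5, R@10).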


DOI: 10.21437/Interspeech.2020-3024

Cite as: Mortazavi, M.S. (2020) Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks. Proc. Interspeech 2020, 3515-3519, DOI: 10.21437/Interspeech.2020-3024.


@inproceedings{Mortazavi2020,
  author={Masood S. Mortazavi},
  title={{Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3515--3519},
  doi={10.21437/Interspeech.2020-3024},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3024}
}