Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets

Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass


We propose a data-expansion method for learning a multilingual semantic embedding model from disjoint datasets of images and their multilingual audio captions. Here, "disjoint" means that no images are shared among the datasets for the different languages, in contrast to existing work on multilingual semantic embeddings based on visually grounded speech audio, which assumes that each image is associated with spoken captions in multiple languages. Although learning from disjoint datasets is more challenging, we consider it crucial in practical situations. Our main idea is to refer to other paired data when evaluating the loss for an anchor image. We call this scheme "pair expansion". The motivation behind this idea is to exploit even disjoint pairs by finding similarities, or commonalities, that may exist between different images. Specifically, we examine two approaches to calculating these similarities: one uses image embedding vectors, and the other uses object recognition results. Our experiments show that the expanded pairs improve cross-modal and cross-lingual retrieval accuracy compared with non-expanded cases. They also show that similarities measured with image embedding vectors yield better accuracy than those based on object recognition results.
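To make the image-embedding variant of the idea concrete, the following is a minimal sketch of how disjoint pairs might be expanded by matching each anchor image to its most similar image in the other language's dataset via cosine similarity of embedding vectors. The function name, the nearest-neighbor matching rule, and the similarity threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def expand_pairs(emb_a, emb_b, threshold=0.7):
    """Match each image embedding in dataset A to its most similar
    image in dataset B; accept the match as an expanded pair when the
    cosine similarity exceeds `threshold`.

    NOTE: illustrative sketch only; the matching rule and threshold
    are assumptions, not the procedure described in the paper.
    """
    # L2-normalize rows so that dot products equal cosine similarities.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = a @ b.T                # (num_a, num_b) cosine-similarity matrix
    best = sims.argmax(axis=1)    # best B-side match for each A-side anchor
    return [(i, int(j)) for i, j in enumerate(best) if sims[i, j] >= threshold]

# Toy usage: two tiny 2-D "embedding" sets from disjoint datasets.
emb_a = np.array([[1.0, 0.0], [0.0, 1.0]])
emb_b = np.array([[0.9, 0.1], [0.0, 1.0]])
pairs = expand_pairs(emb_a, emb_b)
```

Each expanded pair `(i, j)` can then be treated as if image `i` in one dataset and image `j` in the other shared the same visual content, allowing their spoken captions in different languages to contribute to a common loss term.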


DOI: 10.21437/Interspeech.2020-3078

Cite as: Ohishi, Y., Kimura, A., Kawanishi, T., Kashino, K., Harwath, D., Glass, J. (2020) Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets. Proc. Interspeech 2020, 1486-1490, DOI: 10.21437/Interspeech.2020-3078.


@inproceedings{Ohishi2020,
  author={Yasunori Ohishi and Akisato Kimura and Takahito Kawanishi and Kunio Kashino and David Harwath and James Glass},
  title={{Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={1486--1490},
  doi={10.21437/Interspeech.2020-3078},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3078}
}