Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels

Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi, Noboru Harada


Recent advancements in representation learning enable crossmodal retrieval by modeling audio-visual co-occurrence in a single aspect, such as physical or linguistic. Unfortunately, in real-world media data, co-occurrences of various aspects are complexly mixed, making it difficult to distinguish a specific target co-occurrence from the many non-target co-occurrences, which causes crossmodal retrieval to fail. To overcome this problem, we propose a triplet-loss-based representation learning method that incorporates an awareness mechanism. We adopt weakly-supervised event detection, which constrains the representation learning so that our method can “be aware” of a specific target audio-visual co-occurrence and discriminate it from other non-target co-occurrences. We evaluated the performance of our method by applying it to a sound effect retrieval task using recorded TV broadcast data, in which a sound effect appropriate for a given video input should be retrieved. We then conducted objective and subjective evaluations, the results indicating that the proposed method produces significantly better associations of sound and visual effects than baselines with no awareness mechanism.


 DOI: 10.21437/Interspeech.2020-2445

Cite as: Yasuda, M., Ohishi, Y., Koizumi, Y., Harada, N. (2020) Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels. Proc. Interspeech 2020, 1446-1450, DOI: 10.21437/Interspeech.2020-2445.


@inproceedings{Yasuda2020,
  author={Masahiro Yasuda and Yasunori Ohishi and Yuma Koizumi and Noboru Harada},
  title={{Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1446--1450},
  doi={10.21437/Interspeech.2020-2445},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2445}
}