Sound-Image Grounding Based Focusing Mechanism for Efficient Automatic Spoken Language Acquisition

Mingxin Zhang, Tomohiro Tanaka, Wenxin Hou, Shengzhou Gao, Takahiro Shinozaki


The process of spoken language acquisition based on sound-image grounding has attracted strong interest from linguists and human scientists for decades. To understand this process and open new possibilities for intelligent robots, we designed a spoken-language acquisition task in which a software robot learns to fulfill its desire by correctly identifying and uttering the name of its preferred object from given images, without relying on any labeled dataset. We propose an unsupervised vision-based focusing strategy and a pre-training approach based on sound-image grounding to improve the efficiency of reinforcement learning. These ideas are motivated by the observation that human babies first observe the world and then try actions to realize their desires. Our experiments show that the software robot successfully acquires spoken language from spoken indications paired with images and dialogues. Moreover, its reinforcement learning converges significantly faster than several baseline approaches.


DOI: 10.21437/Interspeech.2020-2027

Cite as: Zhang, M., Tanaka, T., Hou, W., Gao, S., Shinozaki, T. (2020) Sound-Image Grounding Based Focusing Mechanism for Efficient Automatic Spoken Language Acquisition. Proc. Interspeech 2020, 4183-4187, DOI: 10.21437/Interspeech.2020-2027.


@inproceedings{Zhang2020,
  author={Mingxin Zhang and Tomohiro Tanaka and Wenxin Hou and Shengzhou Gao and Takahiro Shinozaki},
  title={{Sound-Image Grounding Based Focusing Mechanism for Efficient Automatic Spoken Language Acquisition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4183--4187},
  doi={10.21437/Interspeech.2020-2027},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2027}
}