Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation

Chenda Li, Yanmin Qian


Solving the cocktail party problem with multi-modal approaches has become popular in recent years. When listening to multi-talker mixed speech, humans can focus on the speech they are interested in by hearing the mixture, watching the speaker, and understanding the context of what the speaker is talking about. In this paper, we attempt to solve the speaker-independent speech separation problem with all three audio-visual-contextual modalities for the first time: hearing the speech, watching the speaker, and understanding the contextual language. In contrast to previous methods that use only the audio modality or audio-visual modalities, a dedicated model is designed to extract contextual language information for all target speakers directly from the speech mixture. The extracted contextual knowledge is then incorporated into the multi-modal speech separation architecture through an attention mechanism. Experiments show that a significant performance improvement can be observed with the newly proposed audio-visual-contextual speech separation.
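The abstract does not include code; as a rough illustration of the fusion idea it describes, the following is a minimal PyTorch-style sketch of attending over per-speaker contextual embeddings and combining the result with audio-visual features before mask estimation. All module names, dimensions, and the single-head attention layout here are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): fuse contextual-language embeddings
# with audio-visual features via attention, then estimate separation masks.
import torch
import torch.nn as nn


class AVContextFusion(nn.Module):
    def __init__(self, av_dim=512, ctx_dim=256, n_freq=257, n_speakers=2):
        super().__init__()
        # Audio-visual frames act as queries over the contextual embeddings
        # (e.g., token-level language representations extracted from the mixture).
        self.query = nn.Linear(av_dim, ctx_dim)
        self.mask_net = nn.Sequential(
            nn.Linear(av_dim + ctx_dim, av_dim),
            nn.ReLU(),
            nn.Linear(av_dim, n_freq * n_speakers),  # per-speaker T-F masks
            nn.Sigmoid(),
        )

    def forward(self, av_feats, ctx_embeds):
        # av_feats:   (batch, frames, av_dim)   audio-visual features of the mixture
        # ctx_embeds: (batch, tokens, ctx_dim)  contextual embeddings per speaker
        q = self.query(av_feats)                              # (B, T, ctx_dim)
        scores = q @ ctx_embeds.transpose(1, 2)               # (B, T, tokens)
        attn = torch.softmax(scores / ctx_embeds.size(-1) ** 0.5, dim=-1)
        ctx = attn @ ctx_embeds                               # (B, T, ctx_dim)
        fused = torch.cat([av_feats, ctx], dim=-1)            # (B, T, av_dim+ctx_dim)
        return self.mask_net(fused)                           # (B, T, n_freq*n_speakers)
```

The attention step lets each time frame select the contextual tokens most relevant to it, which is one plausible way to realize the "appropriate attention mechanism" mentioned in the abstract; the actual architecture in the paper may differ.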


 DOI: 10.21437/Interspeech.2020-2028

Cite as: Li, C., Qian, Y. (2020) Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation. Proc. Interspeech 2020, 1426-1430, DOI: 10.21437/Interspeech.2020-2028.


@inproceedings{Li2020,
  author={Chenda Li and Yanmin Qian},
  title={{Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1426--1430},
  doi={10.21437/Interspeech.2020-2028},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2028}
}