Audio-Visual Multi-Speaker Tracking Based on the GLMB Framework

Shoufeng Lin, Xinyuan Qian

Multi-speaker tracking using both audio and video modalities is a key task in human-robot interaction and video conferencing. The complementary nature of audio and video signals improves tracking robustness against noise and outliers compared with uni-modal approaches. However, online tracking of multiple speakers via audio-visual fusion, especially without prior knowledge of the number of targets, remains an open challenge. In this paper, we propose a Generalized Labelled Multi-Bernoulli (GLMB)-based framework that jointly estimates the number of targets and their respective states online. Experimental results on the AV16.3 dataset demonstrate the effectiveness of the proposed method.
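The GLMB filter builds on Bernoulli random-finite-set recursions, in which each potential target carries an existence probability that is predicted and updated alongside its kinematic state, so the number of targets need not be known a priori. The following sketch is not the authors' implementation; it only illustrates, with hypothetical parameter values, the standard Bernoulli existence-probability prediction and measurement update (with detection probability `p_detect` and measurement-to-clutter likelihood ratios) that the GLMB framework generalizes to labelled multi-target densities.

```python
def predict_existence(r, p_survive=0.95, p_birth=0.05):
    """Predict the existence probability of a potential target.

    r: posterior existence probability at the previous step.
    p_survive / p_birth: illustrative survival and birth probabilities.
    """
    return p_birth * (1.0 - r) + p_survive * r


def update_existence(r_pred, likelihood_ratios, p_detect=0.9):
    """Bayes update of the existence probability.

    likelihood_ratios: g(z|x) / kappa(z) for each received measurement z,
    i.e. the measurement likelihood over the clutter intensity.
    With no supporting measurements, existence probability decreases.
    """
    delta = 1.0 - p_detect + p_detect * sum(likelihood_ratios)
    return (r_pred * delta) / (1.0 - r_pred + r_pred * delta)
```

For example, a missed detection (`update_existence(0.5, [])`) pulls the existence probability down, while a measurement with a high likelihood-to-clutter ratio pushes it up; the GLMB filter maintains such hypotheses jointly over labelled target sets.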

DOI: 10.21437/Interspeech.2020-1969

Cite as: Lin, S., Qian, X. (2020) Audio-Visual Multi-Speaker Tracking Based on the GLMB Framework. Proc. Interspeech 2020, 3082-3086, DOI: 10.21437/Interspeech.2020-1969.

@inproceedings{lin2020glmb,
  author={Shoufeng Lin and Xinyuan Qian},
  title={{Audio-Visual Multi-Speaker Tracking Based on the GLMB Framework}},
  booktitle={Proc. Interspeech 2020},
  year={2020},
  pages={3082--3086},
  doi={10.21437/Interspeech.2020-1969}
}