Audio-Visual Multi-Speaker Tracking Based on the GLMB Framework

Shoufeng Lin, Xinyuan Qian


Multi-speaker tracking using both audio and video modalities is a key task in human-robot interaction and video conferencing. The complementary nature of audio and video signals improves tracking robustness against noise and outliers compared with uni-modal approaches. However, online tracking of multiple speakers via audio-visual fusion, especially without prior knowledge of the number of targets, remains an open challenge. In this paper, we propose a Generalized Labelled Multi-Bernoulli (GLMB)-based framework that jointly estimates the number of targets and their respective states online. Experimental results on the AV16.3 dataset demonstrate the effectiveness of the proposed method.
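To make the abstract's core idea concrete, the Python sketch below illustrates the labelled-Bernoulli bookkeeping that a GLMB-style filter builds on: each track carries a label, an existence probability, and a Gaussian state, so the number of speakers and their states are estimated jointly. This is not the authors' implementation; it assumes linear-Gaussian models, a single fused audio-visual position measurement per step, and a nearest-neighbour association in place of the ranked hypothesis enumeration a full GLMB filter performs. All parameter values and names are illustrative.

import numpy as np

DT = 1.0                                        # frame period, illustrative
F = np.block([[np.eye(2), DT * np.eye(2)],
              [np.zeros((2, 2)), np.eye(2)]])   # constant-velocity motion
Q = 0.1 * np.eye(4)                             # process noise (assumed)
H = np.hstack([np.eye(2), np.zeros((2, 2))])    # position-only observation
R = 4.0 * np.eye(2)                             # fused AV measurement noise (assumed)
P_S, P_D, KAPPA = 0.99, 0.9, 1e-3               # survival/detection prob., clutter density

class Track:
    """One labelled Bernoulli component: label, existence probability r,
    and a Gaussian state (mean m, covariance P)."""
    def __init__(self, label, r, m, P):
        self.label, self.r, self.m, self.P = label, r, m, P

def predict(tracks):
    """Kalman prediction plus survival thinning for every labelled track."""
    for t in tracks:
        t.r *= P_S
        t.m = F @ t.m
        t.P = F @ t.P @ F.T + Q
    return tracks

def gauss_pdf(e, S):
    """Gaussian likelihood of innovation e with covariance S."""
    d = e.size
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
    return float(np.exp(-0.5 * e @ np.linalg.solve(S, e)) / norm)

def update(tracks, z):
    """Update tracks with one fused position measurement z.

    A full GLMB filter enumerates ranked association hypotheses over all
    tracks and measurements; this sketch keeps only the single
    nearest-neighbour hypothesis to stay short."""
    if not tracks:
        return tracks
    best = max(tracks, key=lambda t: gauss_pdf(z - H @ t.m,
                                               H @ t.P @ H.T + R))
    S = H @ best.P @ H.T + R
    g = gauss_pdf(z - H @ best.m, S)
    K = best.P @ H.T @ np.linalg.inv(S)
    best.m = best.m + K @ (z - H @ best.m)
    best.P = (np.eye(4) - K @ H) @ best.P
    # simplified Bernoulli existence update against constant clutter KAPPA
    best.r = best.r * P_D * g / (KAPPA + best.r * P_D * g)
    return tracks

def estimate(tracks, thresh=0.5):
    """Report labels and positions of tracks deemed to exist; the size of
    the reported set is the cardinality (speaker-count) estimate."""
    return [(t.label, t.m[:2].copy()) for t in tracks if t.r > thresh]

# one filtering step for a single hypothetical track born at the origin
tracks = [Track(label=(0, 1), r=0.5, m=np.zeros(4), P=np.eye(4))]
tracks = update(predict(tracks), z=np.array([0.3, -0.2]))
print(estimate(tracks))

In the actual GLMB recursion, birth tracks and multiple association hypotheses are propagated jointly with their weights, which is what lets the filter estimate the number of speakers online without a target-number prior.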


DOI: 10.21437/Interspeech.2020-1969

Cite as: Lin, S., Qian, X. (2020) Audio-Visual Multi-Speaker Tracking Based on the GLMB Framework. Proc. Interspeech 2020, 3082-3086, DOI: 10.21437/Interspeech.2020-1969.


@inproceedings{Lin2020,
  author={Shoufeng Lin and Xinyuan Qian},
  title={{Audio-Visual Multi-Speaker Tracking Based on the GLMB Framework}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={3082--3086},
  doi={10.21437/Interspeech.2020-1969},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1969}
}