Detecting and Counting Overlapping Speakers in Distant Speech Scenarios

Samuele Cornell, Maurizio Omologo, Stefano Squartini, Emmanuel Vincent

We consider the problem of detecting speech activity and counting overlapping speakers in distant-microphone recordings. We treat supervised Voice Activity Detection (VAD), Overlapped Speech Detection (OSD), joint VAD+OSD, and speaker counting as instances of a general Overlapped Speech Detection and Counting (OSDC) task, and we design a Temporal Convolutional Network (TCN) based method to address it. We show that TCNs significantly outperform state-of-the-art methods on two real-world distant speech datasets. In particular, for OSD, our best architecture obtains absolute improvements in Average Precision of 29.1% and 25.5% over previous techniques on the AMI and CHiME-6 datasets, respectively. Furthermore, we find that generalization for joint VAD+OSD improves when the model is trained with a speaker counting objective rather than a VAD+OSD objective. We also study the effectiveness of forced-alignment-based labeling and data augmentation, and show that both can improve OSD performance.
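To illustrate how a speaker counting output can serve joint VAD+OSD, the following is a minimal sketch, not the authors' implementation. It assumes a model that emits, for each frame, a posterior over speaker-count classes (index 0 = no speaker, 1 = one speaker, 2 and above = overlap). The function name, the class layout, and the example values are hypothetical.

```python
from typing import List, Tuple


def count_posteriors_to_vad_osd(
    frame_posteriors: List[List[float]],
) -> Tuple[List[float], List[float]]:
    """Map per-frame speaker-count posteriors to VAD and OSD scores.

    Assumed class layout (hypothetical): index k holds the probability
    that k speakers are active; the last index may stand for "k or more".
    """
    vad_scores: List[float] = []
    osd_scores: List[float] = []
    for posterior in frame_posteriors:
        if len(posterior) < 3:
            raise ValueError("need at least the classes 0, 1 and 2+ speakers")
        total = sum(posterior)
        if total <= 0.0:
            raise ValueError("posterior must have positive mass")
        # Renormalize so that small numerical drift does not bias the scores.
        probs = [p / total for p in posterior]
        # Speech is present whenever at least one speaker is active.
        vad_scores.append(1.0 - probs[0])
        # Overlap is present whenever two or more speakers are active.
        osd_scores.append(sum(probs[2:]))
    return vad_scores, osd_scores


# Three illustrative frames: mostly silence, one speaker, overlap.
example_posteriors = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
]
vad, osd = count_posteriors_to_vad_osd(example_posteriors)
```

Under these assumptions, both detection scores are obtained by marginalizing the same count posterior, so a single counting model can be thresholded separately for VAD and for OSD.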

DOI: 10.21437/Interspeech.2020-2671

Cite as: Cornell, S., Omologo, M., Squartini, S., Vincent, E. (2020) Detecting and Counting Overlapping Speakers in Distant Speech Scenarios. Proc. Interspeech 2020, 3107-3111, DOI: 10.21437/Interspeech.2020-2671.

  author={Samuele Cornell and Maurizio Omologo and Stefano Squartini and Emmanuel Vincent},
  title={{Detecting and Counting Overlapping Speakers in Distant Speech Scenarios}},
  booktitle={Proc. Interspeech 2020},