Simulating Realistically-Spatialised Simultaneous Speech Using Video-Driven Speaker Detection and the CHiME-5 Dataset

Jack Deadman, Jon Barker


Simulated data plays a crucial role in the development and evaluation of novel distant-microphone ASR techniques. However, commonly used simulated datasets adopt uninformed and potentially unrealistic speaker location distributions. We wish to generate more realistic simulations driven by recorded human behaviour. Using devices that pair a microphone array with a camera, we analyse unscripted dinner-party scenarios (CHiME-5) to estimate the distribution of speaker separation in a realistic setting. We deploy face-detection and pose-detection techniques on 114 cameras to automatically locate speakers in 20 dinner-party sessions. Our analysis found that, on average, the separation between speakers was only 17 degrees. We use this analysis to create datasets with realistic spatial distributions and compare them with commonly used datasets of simulated signals. By changing the positions of the speakers, we show that the word error rate can increase by over 73.5% relative when using a strong speech enhancement and ASR system.
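As a rough illustration of how video-based face detection can yield angular speaker separations of the kind analysed here, the sketch below maps a detected face's horizontal pixel position to an azimuth angle under a simple pinhole-camera model and computes the separation between two detections. The function names, the 110-degree field of view, and the pixel coordinates are illustrative assumptions, not details taken from the paper.

```python
import math

def pixel_to_azimuth(x_px, image_width, horizontal_fov_deg):
    """Map a horizontal pixel coordinate to an azimuth angle (degrees)
    relative to the camera's optical axis, assuming a pinhole camera
    with the given horizontal field of view. (Illustrative model only.)"""
    # Focal length in pixels implied by the horizontal field of view.
    f_px = (image_width / 2) / math.tan(math.radians(horizontal_fov_deg / 2))
    return math.degrees(math.atan((x_px - image_width / 2) / f_px))

def angular_separation(x1_px, x2_px, image_width, horizontal_fov_deg):
    """Angular separation (degrees) between two detected faces."""
    a1 = pixel_to_azimuth(x1_px, image_width, horizontal_fov_deg)
    a2 = pixel_to_azimuth(x2_px, image_width, horizontal_fov_deg)
    return abs(a1 - a2)

# Hypothetical example: two faces detected at pixel columns 800 and 1100
# in a 1920-pixel-wide frame from a camera with a 110-degree horizontal FOV.
print(round(angular_separation(800, 1100, 1920, 110), 1))
```

In practice the paper's analysis would also need to account for camera placement and speaker identity across frames; this sketch covers only the geometric step from detection to angle.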


DOI: 10.21437/Interspeech.2020-2807

Cite as: Deadman, J., Barker, J. (2020) Simulating Realistically-Spatialised Simultaneous Speech Using Video-Driven Speaker Detection and the CHiME-5 Dataset. Proc. Interspeech 2020, 349-353, DOI: 10.21437/Interspeech.2020-2807.


@inproceedings{Deadman2020,
  author={Jack Deadman and Jon Barker},
  title={{Simulating Realistically-Spatialised Simultaneous Speech Using Video-Driven Speaker Detection and the CHiME-5 Dataset}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={349--353},
  doi={10.21437/Interspeech.2020-2807},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2807}
}