Evaluating Audiovisual Source Separation in the Context of Video Conferencing

Berkay İnan, Milos Cernak, Helmut Grabner, Helena Peic Tukuljac, Rodrigo C.G. Pena, Benjamin Ricaud

Source separation from mono-channel audio is a challenging problem, in particular for speech separation, where source contributions overlap in both time and frequency. This task is of high interest for applications such as video conferencing. Recent progress in machine learning has shown that adding visual cues from the video can improve source separation performance. Starting from a recently designed deep neural network, we assess its ability and robustness to separate the visible speakers’ speech from other interfering speech or signals. We test it on different configurations of video recordings in which the speaker’s face may not be fully visible. We also assess the performance of the network with respect to different sets of visual features from the speakers’ faces.

DOI: 10.21437/Interspeech.2019-2671

Cite as: İnan, B., Cernak, M., Grabner, H., Tukuljac, H.P., Pena, R.C., Ricaud, B. (2019) Evaluating Audiovisual Source Separation in the Context of Video Conferencing. Proc. Interspeech 2019, 4579-4583, DOI: 10.21437/Interspeech.2019-2671.

  author={Berkay İnan and Milos Cernak and Helmut Grabner and Helena Peic Tukuljac and Rodrigo C.G. Pena and Benjamin Ricaud},
  title={{Evaluating Audiovisual Source Separation in the Context of Video Conferencing}},
  booktitle={Proc. Interspeech 2019},