2nd International Workshop on Speech, Language and Audio in Multimedia (SLAM2014)
Accurately detecting when a person whose face is visible in an audio-visual medium is the audible speaker is an enabling technology with a number of useful applications. These include fused audio/visual (AV) speaker recognition, AV segmentation and diarization, and AV synchronization. The likelihood-ratio test formulation and feature signal processing employed here allow the use of high-dimensional feature sets in the audio and visual domains, and the approach appears to have good detection performance for AV segments as short as a few seconds. Computation costs for the resulting algorithm are modest, typically much lower than those of the front-end face-detection system. While the resulting system requires model training, only true-condition training (i.e., video where the talking speaker is audible) is required.
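As an illustration of how a likelihood-ratio test can be trained from true-condition data alone, the following is a minimal sketch and not the authors' method. It assumes Gaussian models, frame-aligned feature matrices, and a null hypothesis in which audio and video are statistically independent, so both hypotheses can be estimated from the same matched training data. All function names, feature dimensions, and the synthetic data are hypothetical.

```python
import numpy as np

def fit_gaussian(x):
    """Mean and covariance of the rows of x, with a small ridge for stability."""
    mu = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False)) + 1e-6 * np.eye(x.shape[1])
    return mu, cov

def gaussian_loglik(x, mu, cov):
    """Per-row log density of a multivariate Gaussian."""
    d = x.shape[1]
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def train(audio, video):
    """Fit joint and marginal models from true-condition (matched) frames only."""
    joint = np.hstack([audio, video])
    return {
        "joint": fit_gaussian(joint),   # H1: audio and video are dependent
        "audio": fit_gaussian(audio),   # H0: product of the two marginals
        "video": fit_gaussian(video),
    }

def llr_score(model, audio, video):
    """Average per-frame log-likelihood ratio log p(a, v) - log p(a) p(v)."""
    joint = np.hstack([audio, video])
    h1 = gaussian_loglik(joint, *model["joint"])
    h0 = (gaussian_loglik(audio, *model["audio"])
          + gaussian_loglik(video, *model["video"]))
    return float(np.mean(h1 - h0))

def synth(rng, n, w, matched=True):
    """Hypothetical synthetic features: video depends on audio only if matched."""
    audio = rng.normal(size=(n, w.shape[0]))
    driver = audio if matched else rng.normal(size=(n, w.shape[0]))
    video = driver @ w + 0.5 * rng.normal(size=(n, w.shape[1]))
    return audio, video

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])  # assumed audio-to-video coupling
    model = train(*synth(rng, 2000, w, matched=True))
    print("matched segment score:   ", llr_score(model, *synth(rng, 200, w, True)))
    print("mismatched segment score:", llr_score(model, *synth(rng, 200, w, False)))
```

Under these assumptions a segment is declared a talking head when its average score exceeds a threshold. The independence null is one simple way to avoid needing false-condition training data; the paper's actual models and features may differ.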
Bibliographic reference. Quillen, Carl / Greenfield, Kara / Campbell, William (2014): "Talking head detection by likelihood-ratio test", In SLAM-2014, 9-13.