13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Unsupervised Speaker Identification using Overlaid Texts in TV Broadcast

Johann Poignant (1), Hervé Bredin (2), Viet Bac Le (3), Laurent Besacier (1), Claude Barras (2), Georges Quénot (1)

(1) UJF-Grenoble 1 / UPMF-Grenoble 2 / Grenoble INP/CNRS, LIG UMR 5217, Grenoble, France
(2) Univ Paris-Sud, LIMSI-CNRS, Spoken Language Processing Group, Orsay, France
(3) Vocapia Research, Parc Orsay Université, Orsay, France

We propose an approach for unsupervised speaker identification in TV broadcast videos that combines acoustic speaker diarization with person names obtained via video OCR from overlaid texts. Three methods for propagating the overlaid names to the speech turns are compared, taking into account the co-occurrence duration between the speaker clusters and the names provided by the video OCR and using a task-adapted variant of the TF-IDF information retrieval coefficient. These methods were tested on the REPERE dry-run evaluation corpus, which contains 3 hours of annotated videos. Our best unsupervised system reaches an F-measure of 69.1% when considering all the speakers, and 80.9% if anchor speakers are left out. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models, trained on matching development data and additional TV and radio data, only reached a 57.5% F-measure when considering all the speakers and 45.9% without anchors.
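
The abstract does not spell out the three propagation methods or the exact form of the TF-IDF variant, so the following Python sketch is only one plausible reading of the idea, not the authors' implementation. It accumulates the co-occurrence duration between each speaker cluster and each OCR name, then names each cluster with the candidate that maximizes a TF-IDF-like score. The function names (`overlap`, `name_clusters`), the input tuple layouts, and the particular weighting formula are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Duration (in seconds) shared by two time intervals, 0.0 if disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def name_clusters(speech_turns, ocr_names):
    """Assign one OCR name to each speaker cluster (hypothetical scheme).

    speech_turns: list of (start, end, cluster_id) from speaker diarization.
    ocr_names:    list of (start, end, name) from video OCR.
    Returns a dict cluster_id -> name; clusters that never co-occur
    with any overlaid name are left out (they stay anonymous).
    """
    # Co-occurrence duration between every cluster and every name.
    cooc = defaultdict(lambda: defaultdict(float))
    for t_start, t_end, cluster in speech_turns:
        for n_start, n_end, name in ocr_names:
            duration = overlap(t_start, t_end, n_start, n_end)
            if duration > 0:
                cooc[cluster][name] += duration

    n_clusters = len({cluster for _, _, cluster in speech_turns})

    # Number of distinct clusters each name co-occurs with ("document frequency").
    clusters_per_name = defaultdict(int)
    for names in cooc.values():
        for name in names:
            clusters_per_name[name] += 1

    assignment = {}
    for cluster, names in cooc.items():
        total = sum(names.values())
        scores = {
            # Assumed weighting: share of the cluster's co-occurrence time (TF-like)
            # times a penalty for names spread over many clusters (IDF-like).
            name: (duration / total)
            * math.log(1 + n_clusters / clusters_per_name[name])
            for name, duration in names.items()
        }
        assignment[cluster] = max(scores, key=scores.get)
    return assignment


if __name__ == "__main__":
    # Toy data, invented for this example only.
    turns = [(0, 10, "A"), (10, 20, "B"), (20, 30, "A"), (30, 40, "B")]
    names = [(2, 6, "Alice Martin"), (12, 15, "Bob Durand"), (28, 32, "Alice Martin")]
    print(name_clusters(turns, names))
```

In the toy data, the name "Alice Martin" spills over into a turn of cluster B, but the IDF-like factor favors "Bob Durand" for that cluster because this name co-occurs with cluster B only. Every speech turn of a named cluster would then inherit the cluster's name, which is how the overlaid names reach turns where no text is on screen.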

Index Terms: unsupervised speaker identification, multimodal fusion, speaker diarization, optical character recognition, reproducible results


Bibliographic reference. Poignant, Johann / Bredin, Hervé / Le, Viet Bac / Besacier, Laurent / Barras, Claude / Quénot, Georges (2012): "Unsupervised speaker identification using overlaid texts in TV broadcast", in INTERSPEECH-2012, 2650-2653.