First Workshop on Speech, Language and Audio in Multimedia (SLAM 2013)

Marseille, France
August 22-23, 2013

Towards a Better Integration of Written Names for Unsupervised Speakers Identification in Videos

Johann Poignant (1), Hervé Bredin (2), Laurent Besacier (1), Georges Quénot (1), Claude Barras (2)

(1) UJF-Grenoble 1 / UPMF-Grenoble 2 / Grenoble INP / CNRS, LIG UMR 5217, Grenoble, France
(2) Univ Paris-Sud, LIMSI-CNRS, Spoken Language Processing Group, Orsay, France

Existing methods for unsupervised identification of speakers in TV broadcasts usually rely on the output of a speaker diarization module and try to name each cluster using names provided by another source of information: we call this “late naming”. With this approach, written names extracted from title blocks tend to lead to high-precision identification, but they cannot correct errors made during the clustering step.
   In this paper, we extend our previous “late naming” approach in two ways: “integrated naming” and “early naming”. While “late naming” relies on a speaker diarization module optimized for speaker diarization, “integrated naming” jointly optimizes speaker diarization and name propagation in terms of identification errors. “Early naming” modifies the speaker diarization module by adding constraints that prevent two clusters with different written names from being merged.
   While “integrated naming” yields identification performance similar to that of “late naming” (with better precision), “early naming” improves over this baseline in terms of both identification error rate and stability of the clustering stopping criterion.
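
   The “early naming” constraint can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the average-linkage distance, the dictionary-based distance table and the threshold-based stopping rule are all assumptions made for illustration. The only element taken from the description above is that two clusters carrying different written names are never merged.

```python
from itertools import combinations


def can_merge(name_a, name_b):
    # Assumed form of the constraint: a merge is blocked only when both
    # clusters carry a written name and the two names differ.
    return name_a is None or name_b is None or name_a == name_b


def constrained_clustering(dist, names, threshold):
    """Agglomerative clustering with a written-name constraint (hypothetical sketch).

    dist: dict mapping frozenset({i, j}) to the distance between segments i and j
    names: list giving the written name attached to each segment, or None
    threshold: clustering stops when the closest allowed pair is farther than this
    """
    # Each cluster is a pair (set of segment indices, written name or None).
    clusters = [({i}, name) for i, name in enumerate(names)]

    def linkage(a, b):
        # Average linkage over segment pairs (an illustrative choice).
        values = [dist[frozenset((i, j))] for i in a for j in b]
        return sum(values) / len(values)

    while len(clusters) > 1:
        # Only pairs that satisfy the name constraint are merge candidates.
        candidates = [
            (linkage(clusters[x][0], clusters[y][0]), x, y)
            for x, y in combinations(range(len(clusters)), 2)
            if can_merge(clusters[x][1], clusters[y][1])
        ]
        if not candidates:
            break
        d, x, y = min(candidates)
        if d > threshold:
            break
        # The merged cluster inherits whichever written name is available.
        merged = (clusters[x][0] | clusters[y][0], clusters[x][1] or clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters
```

   In this sketch, two acoustically close segments that carry different written names remain in separate clusters, whereas an unnamed segment can still join either of them and inherit its name.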

Index Terms: speaker identification, speaker diarization, written names, multimodal fusion, TV broadcast.

