First Workshop on Speech, Language and Audio in Multimedia (SLAM 2013)
Existing methods for unsupervised identification of speakers in TV broadcast usually rely on the output of a speaker diarization module and try to name each cluster using names provided by another source of information: we call this late naming. With this approach, written names extracted from title blocks tend to lead to high-precision identification, although they cannot correct errors made during the clustering step.
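As an illustration of the late naming idea described above, the following is a minimal sketch, not the authors' implementation. It assumes the diarization output is a list of `(start, end, cluster_id)` speech turns, that written names arrive as `(start, end, name)` intervals, and that each cluster is given the name it co-occurs with the longest. The function names and the co-occurrence rule are assumptions made for this example.

```python
from collections import defaultdict


def overlap(a, b):
    """Duration (seconds) of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def late_naming(speech_turns, written_names):
    """Name each diarization cluster after clustering is finished.

    speech_turns:  list of (start, end, cluster_id)  -- assumed diarization output
    written_names: list of (start, end, name)        -- assumed title-block output
    Returns a dict cluster_id -> name, or None when no written name co-occurs.
    """
    # cooc[cluster][name] = total co-occurrence duration in seconds
    cooc = defaultdict(lambda: defaultdict(float))
    for start, end, cluster in speech_turns:
        for w_start, w_end, name in written_names:
            duration = overlap((start, end), (w_start, w_end))
            if duration > 0:
                cooc[cluster][name] += duration

    clusters = {cluster for _, _, cluster in speech_turns}
    return {
        c: (max(cooc[c], key=cooc[c].get) if c in cooc else None)
        for c in clusters
    }


turns = [(0, 10, "A"), (10, 20, "B"), (20, 30, "A"), (30, 35, "C")]
names = [(2, 6, "Alice"), (12, 15, "Bob"), (19, 21, "Alice")]
result = late_naming(turns, names)
```

In this sketch the clusters are fixed before naming starts, so if the diarization step has wrongly merged two speakers, both inherit the same name: the naming step has no way to undo the merge.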
In this paper, we extend our previous late naming approach in two ways: integrated naming and early naming. While late naming relies on a speaker diarization module optimized for speaker diarization, integrated naming jointly optimizes speaker diarization and name propagation in terms of identification errors. Early naming modifies the speaker diarization module by adding constraints that prevent two clusters with different written names from being merged.
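The early naming constraint can be sketched as a cannot-link rule inside agglomerative clustering. The code below is a simplified illustration under stated assumptions, not the system described in the paper: speech turns are represented by hypothetical feature vectors, distances are average-linkage Euclidean distances, each turn carries at most one written name (or `None`), and merging stops at a fixed distance threshold.

```python
import math


def early_naming_clustering(features, names, threshold):
    """Agglomerative clustering that never merges clusters with different names.

    features:  list of equal-length numeric tuples, one per speech turn (assumed)
    names:     list of str or None, the written name attached to each turn
    threshold: stop merging when the closest allowed pair is farther than this
    Returns a list of dicts {"members": [turn indices], "name": str or None}.
    """
    clusters = [{"members": [i], "name": names[i]} for i in range(len(features))]

    def distance(a, b):
        # Average linkage: mean pairwise Euclidean distance between members.
        total = sum(
            math.dist(features[i], features[j])
            for i in a["members"]
            for j in b["members"]
        )
        return total / (len(a["members"]) * len(b["members"]))

    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest allowed pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                # Early naming constraint: two different written names may not merge.
                if a["name"] and b["name"] and a["name"] != b["name"]:
                    continue
                d = distance(a, b)
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None or best[0] > threshold:
            break
        _, i, j = best
        merged = {
            "members": clusters[i]["members"] + clusters[j]["members"],
            # A named cluster propagates its name to the unnamed one it absorbs.
            "name": clusters[i]["name"] or clusters[j]["name"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


feats = [(0.0,), (0.1,), (0.3,), (5.0,)]
constrained = early_naming_clustering(feats, ["Alice", None, "Bob", None], 1.0)
unconstrained = early_naming_clustering(feats, [None, None, None, None], 1.0)
```

In this toy example, turns 0, 1 and 2 are acoustically close, so the unconstrained run groups them together. With the names attached, turn 2 ("Bob") is kept apart from the cluster that already carries "Alice", which is the kind of merge a late naming step could not prevent.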
While integrated naming yields identification performance similar to that of late naming (with better precision), early naming improves over this baseline both in terms of identification error rate and stability of the clustering stopping criterion.
Index Terms: speaker identification, speaker diarization, written names, multimodal fusion, TV broadcast.
Bibliographic reference. Poignant, Johann / Bredin, Hervé / Besacier, Laurent / Quénot, Georges / Barras, Claude (2013): "Towards a better integration of written names for unsupervised speakers identification in videos", In SLAM-2013, 84-89.