INTERSPEECH 2006 - ICSLP
Differentiating speakers participating in telephone conversations is a challenging task in speech processing because only short consecutive utterances can be examined for each speaker. Research has shown that, given only brief utterances (1 second or less), humans can recognize speakers with an accuracy of about 54% on average. The task becomes even more challenging when no information about the speakers is known a priori. In this paper, a technique for determining whether there are two or three speakers participating in a telephone conversation is presented. This approach assumes no knowledge or information about any of the participating speakers. The technique is based on comparing short utterances within the conversation and deciding whether or not they belong to the same speaker. Applications of this research include 3-way call detection and speaker tracking, and the approach could be extended to speaker change-point detection and indexing. The proposed method involves an elimination process in which speech segments matching two reference models are sequentially removed from the conversation. Models are formed using the mean vectors and covariance matrices of Linear Predictive Cepstral Coefficients of voiced segments in each conversation. Hotelling’s T² statistic is used to determine whether two models belong to the same speaker or to different speakers, based on likelihood ratio testing. The relative amount of residual speech is observed after the elimination process to determine if a third speaker is present. The proposed technique yielded an equal error rate of 20% when tested on artificially simulated conversations from the HTIMIT database and an error rate of 23% when tested on actual telephone conversations.
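The segment comparison described in the abstract can be illustrated with a minimal sketch of the standard two-sample Hotelling's T² statistic, computed from the mean vectors and pooled covariance matrix of two sets of feature frames. This is not the authors' implementation: the function names are hypothetical, LPCC extraction and voicing detection are omitted, and synthetic Gaussian vectors stand in for LPCC frames. The paper's decision threshold is not given in the abstract, so the sketch only compares T² values.

```python
import random

def mean_vector(frames):
    """Per-dimension mean of a list of equal-length feature vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def covariance(frames, mean):
    """Unbiased sample covariance matrix of the frames."""
    n, p = len(frames), len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for x in frames:
        d = [xi - mi for xi, mi in zip(x, mean)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j]
    return [[c / (n - 1) for c in row] for row in cov]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def hotelling_t2(seg_a, seg_b):
    """Two-sample Hotelling's T^2 between two segments of feature frames."""
    n1, n2 = len(seg_a), len(seg_b)
    m1, m2 = mean_vector(seg_a), mean_vector(seg_b)
    s1, s2 = covariance(seg_a, m1), covariance(seg_b, m2)
    p = len(m1)
    # Pooled covariance, weighting each segment by its degrees of freedom.
    pooled = [[((n1 - 1) * s1[i][j] + (n2 - 1) * s2[i][j]) / (n1 + n2 - 2)
               for j in range(p)] for i in range(p)]
    diff = [a - b for a, b in zip(m1, m2)]
    w = solve(pooled, diff)  # pooled^{-1} (m1 - m2)
    return (n1 * n2 / (n1 + n2)) * sum(d * wi for d, wi in zip(diff, w))

def synthetic_segment(rng, center, n_frames=200):
    """Stand-in for LPCC frames: unit-variance Gaussian vectors (assumption)."""
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n_frames)]

rng = random.Random(0)
speaker_a_1 = synthetic_segment(rng, [0.0, 0.0, 0.0])
speaker_a_2 = synthetic_segment(rng, [0.0, 0.0, 0.0])
speaker_b = synthetic_segment(rng, [2.0, 2.0, 2.0])

t2_same = hotelling_t2(speaker_a_1, speaker_a_2)  # segments from one source
t2_diff = hotelling_t2(speaker_a_1, speaker_b)    # segments from two sources
print(t2_same, t2_diff)
```

A small T² suggests the two segments could share a speaker model, while a large value suggests different speakers. In practice the statistic would be compared against a threshold, for example one derived from its F-distribution relationship, which this sketch leaves out.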
Bibliographic reference. Ofoegbu, Uchechukwu O. / Iyer, Ananth N. / Yantorno, Robert E. / Wenndt, Stanley J. (2006): "Detection of a third speaker in telephone conversations", In INTERSPEECH-2006, paper 1133-Wed3CaP.1.