Automatic Detection of Multi-speaker Fragments with High Time Resolution

Evdokia Kazimirova, Andrey Belyaev

Interruptions and simultaneous talking are important patterns of speech behavior, yet few approaches exist for detecting them automatically in continuous audio data. We have developed a method for automatic labeling of multi-speaker fragments based on the analysis of harmonic traces. Because harmonic traces in multi-speaker intervals form an irregular pattern, unlike the structured pattern typical of a single speaker, we apply computer vision methods to detect multi-speaker fragments. A convolutional neural network was trained on synthetic material to distinguish single-speaker from multi-speaker fragments. The method was evaluated on the SSPNet Conflict Corpus, using the manual diarization provided with it. We also examined factors affecting the algorithm's performance. The main advantages of the proposed method are computational simplicity and high time resolution: it can detect segments as short as 0.5 seconds. The proposed method demonstrates highly accurate results and may be used for speech segmentation, speaker tracking, content analysis such as conflict detection, and other practical purposes.
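To make the idea of training on synthetic material more concrete, the following is a minimal sketch of how labeled 0.5-second spectrogram patches might be generated. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how the synthetic material was built, so the toy harmonic "voices", the 16 kHz sampling rate, the STFT parameters, and the names `harmonic_voice`, `spectrogram`, and `make_dataset` are all hypothetical.

```python
import numpy as np

SR = 16000  # assumed sampling rate in Hz (not stated in the abstract)


def harmonic_voice(f0_start, f0_end, duration=0.5, sr=SR, n_harmonics=8):
    """Toy 'speaker': harmonics of a fundamental gliding from f0_start to f0_end."""
    n = int(duration * sr)
    f0 = np.linspace(f0_start, f0_end, n)
    phase = 2 * np.pi * np.cumsum(f0) / sr  # integrate frequency to get phase
    return sum(np.sin(k * phase) / k for k in range(1, n_harmonics + 1))


def spectrogram(signal, n_fft=512, hop=128):
    """Magnitude spectrogram of shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def make_dataset(n_examples, seed=0):
    """Alternate single-speaker (label 0) and two-speaker (label 1) patches."""
    rng = np.random.default_rng(seed)
    patches, labels = [], []
    for i in range(n_examples):
        is_multi = i % 2
        signal = harmonic_voice(*rng.uniform(80, 250, size=2))
        if is_multi:
            # Overlapping speech is simulated by summing a second toy voice.
            signal = signal + harmonic_voice(*rng.uniform(80, 250, size=2))
        patches.append(np.log1p(spectrogram(signal)))  # log-compressed image
        labels.append(is_multi)
    return np.stack(patches), np.array(labels)


X, y = make_dataset(4)
print(X.shape, y)
```

Each patch is a small image in which a single voice shows up as a regular set of parallel harmonic traces, while a mixture shows two interleaved sets. Arrays like `X` and `y` could then be passed to any image classifier; the network architecture is not described in the abstract, so none is sketched here.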

DOI: 10.21437/Interspeech.2018-1878

Cite as: Kazimirova, E., Belyaev, A. (2018) Automatic Detection of Multi-speaker Fragments with High Time Resolution. Proc. Interspeech 2018, 1388-1392, DOI: 10.21437/Interspeech.2018-1878.

  author={Evdokia Kazimirova and Andrey Belyaev},
  title={Automatic Detection of Multi-speaker Fragments with High Time Resolution},
  booktitle={Proc. Interspeech 2018},