This paper compares turn-taking, in terms of timing and prediction, in human-human conversations under two conditions: when participants have eye contact and when they do not, as found in the HCRC Map Task corpus. By measuring between-speaker intervals, we found that a larger proportion of speaker shifts occurred in overlap in the no-eye-contact condition. For prediction, we used prosodic and spectral features parametrized by time-varying, length-invariant discrete cosine coefficients. Using Gaussian mixture modeling and variations of classifier fusion schemes, we explored the task of predicting, at the end of an utterance (EOU) with a pause lag of 200 ms, whether an upcoming speaker change (SC) occurs or not (hold). The SC label was further split into listener responses (LRs, e.g. back-channels) and other turn-shifts. Prediction was found to be somewhat easier in the eye-contact condition, for which the average recall rates were 60.57%, 66.35% and 62.00% for turn-shifts, LRs and SC, respectively.
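The key property of the DCT parametrization mentioned above is length invariance: contours of different durations are mapped to feature vectors of a fixed size. The following sketch illustrates that idea only; the paper's exact time-varying feature extraction is not reproduced here, and the function name, coefficient count, and normalization are illustrative assumptions.

```python
import numpy as np

def dct_coefficients(contour, num_coeffs=8):
    """Project a variable-length contour (e.g. a pitch track) onto its
    first DCT-II basis functions, giving a fixed-length representation.

    Hypothetical sketch: num_coeffs and the 1/n normalization are
    assumptions, not the paper's actual configuration.
    """
    x = np.asarray(contour, dtype=float)
    n = len(x)
    # DCT-II basis: cos(pi/n * (sample_index + 0.5) * coefficient_index)
    ks = np.arange(num_coeffs)
    ns = np.arange(n)
    basis = np.cos(np.pi / n * np.outer(ks, ns + 0.5))
    coeffs = basis @ x
    # Divide by length so contours of different durations are comparable
    return coeffs / n

# Two contours of different lengths map to vectors of the same size
short = np.sin(np.linspace(0.0, np.pi, 20))
long_ = np.sin(np.linspace(0.0, np.pi, 50))
print(len(dct_coefficients(short)), len(dct_coefficients(long_)))
```

Because the basis is evaluated relative to each contour's own length, the resulting coefficients describe the contour's shape rather than its duration, which is what allows utterance-final stretches of varying length to feed a fixed-input classifier such as a GMM.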
Bibliographic reference. Neiberg, Daniel / Gustafson, Joakim (2011): "Predicting speaker changes and listener responses with and without eye-contact", In INTERSPEECH-2011, 1565-1568.