Augmenting Turn-Taking Prediction with Wearable Eye Activity During Conversation

Hang Li, Siyuan Chen, Julien Epps

In many conversational contexts, accurately predicting when a participant is about to speak can help improve computer-mediated human-human communication. Although humans readily perceive turn-taking intent in conversation, the task has remained challenging for computers. In this study, we used eye activity acquired from low-cost wearable hardware during natural conversation and studied how pupil diameter, blink and gaze direction can complement speech features in voice activity and turn-taking prediction. Experiments on a new 2-hour corpus of natural conversational speech between six pairs of speakers wearing near-field eye video glasses showed that, with fused eye and speech features, the F1 score for predicting voice activity (speech versus non-speech) up to 1 s ahead of the current instant can exceed 80%. Further, extracting features synchronously from both interlocutors provides a relative reduction in error rate of 8.5% compared with a system based on a single speaker. Prediction of four turn-taking states derived from the predicted voice activity also achieved F1 scores significantly higher than chance level. These findings suggest that wearable eye activity can play a role in future speech communication systems.
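The two figures of merit quoted in the abstract, F1 score and relative reduction in error rate, can be illustrated with a minimal sketch. The function names, the toy frame labels and the error rates below are assumptions made purely for illustration; they are not the paper's data, results or evaluation code.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one class, computed from per-frame labels (1 = speech, 0 = non-speech)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error rate removed by the new system."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical frame-level labels, not taken from the corpus.
y_true = [1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 1]
toy_f1 = f1_score(y_true, y_pred)

# Hypothetical error rates chosen only to show how a relative figure arises.
toy_reduction = relative_error_reduction(0.200, 0.183)
```

Note that a relative reduction is measured against the baseline error rate, so it is not the same as an absolute difference in percentage points.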

DOI: 10.21437/Interspeech.2020-3204

Cite as: Li, H., Chen, S., Epps, J. (2020) Augmenting Turn-Taking Prediction with Wearable Eye Activity During Conversation. Proc. Interspeech 2020, 676-680, DOI: 10.21437/Interspeech.2020-3204.

  author={Hang Li and Siyuan Chen and Julien Epps},
  title={{Augmenting Turn-Taking Prediction with Wearable Eye Activity During Conversation}},
  booktitle={Proc. Interspeech 2020},