Head Motion Generation with Synthetic Speech: A Data Driven Approach

Najmeh Sadoughi, Carlos Busso

To have believable head movements for conversational agents (CAs), the natural coupling between speech and head movements needs to be preserved, even when the CA uses synthetic speech. To incorporate the relation between speech and head movements, studies have learned these couplings from real recordings, where speech is used to derive head movements. However, relying on recorded speech for every sentence that a virtual agent utters constrains the versatility and scalability of the interface, so most practical solutions for CAs use text-to-speech. While we can generate head motion using rule-based models, the head movements may become repetitive, spanning only a limited range of behaviors. This paper proposes strategies to leverage speech-driven models for head motion generation in cases relying on synthetic speech. The straightforward approach is to drive the speech-based models with synthetic speech, which creates a mismatch between the train and test conditions. Instead, we propose to create a parallel corpus of synthetic speech aligned with natural recordings for which we have motion capture data. We use this parallel corpus to either retrain or adapt the speech-based models with synthetic speech. Objective and subjective metrics show significant improvements of the proposed approaches over the mismatched condition.
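Building the parallel corpus requires frame-level alignment between the natural recording and the synthetic rendition of each sentence. The abstract does not specify the alignment method, but a standard choice for this kind of task is dynamic time warping (DTW) over acoustic feature sequences. The sketch below is an illustrative assumption, not the paper's implementation; the function name and the choice of Euclidean frame distance are the author's of this sketch, not the paper's.

```python
import numpy as np

def dtw_align(nat_feats, syn_feats):
    """Align two feature sequences (frames x dims) with dynamic time warping.

    Returns the warping path as a list of (natural_idx, synthetic_idx) pairs,
    which can be used to map synthetic-speech frames onto the motion-capture
    frames of the natural recording.
    """
    n, m = len(nat_feats), len(syn_feats)
    # Accumulated-cost matrix with an extra border row/column for initialization.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(nat_feats[i - 1] - syn_feats[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # Backtrack from the end to recover the optimal path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy example: two identical 1-D feature sequences align on the diagonal.
a = np.arange(5, dtype=float).reshape(5, 1)
path = dtw_align(a, a)
# path -> [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
```

In practice the features would be acoustic descriptors (e.g. MFCCs) extracted at the same frame rate from both signals, and the resulting path would transfer the motion-capture labels of the natural frames to the synthetic frames used for retraining or adaptation.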

DOI: 10.21437/Interspeech.2016-419

Cite as:

Sadoughi, N., Busso, C. (2016) Head Motion Generation with Synthetic Speech: A Data Driven Approach. Proc. Interspeech 2016, 52-56.

@inproceedings{sadoughi16_interspeech,
  author={Najmeh Sadoughi and Carlos Busso},
  title={Head Motion Generation with Synthetic Speech: A Data Driven Approach},
  booktitle={Interspeech 2016},
  year={2016},
  pages={52--56},
  doi={10.21437/Interspeech.2016-419}
}