Joint Learning of Facial Expression and Head Pose from Speech

David Greenwood, Iain Matthews, Stephen Laycock

Natural movement plays a significant role in realistic speech animation, and numerous studies have demonstrated the contribution that visual cues make to the degree to which human observers find an animation acceptable. Natural, expressive, emotive and prosodic speech exhibits motion patterns that are difficult to predict, with considerable variation in the visual modality. Recently, there have been some impressive demonstrations of face animation derived in some way from the speech signal. Each of these methods has taken a unique approach, but none have included rigid head pose in their predicted output. We observe a high degree of correspondence between facial activity and rigid head pose during speech, and exploit this observation to jointly learn full-face animation together with head pose rotation and translation. From our own corpus, we train Deep Bi-Directional LSTMs (BLSTMs), capable of learning long-term structure in language, to model the relationship between speech and the complex activity of the face. We define a model architecture that encourages learning of rigid head motion via the latent space of the speaker's facial activity. The result is a model that can predict lip sync and other facial motion, along with rigid head motion, directly from audible speech.
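The joint architecture described above — a deep bidirectional LSTM whose latent representation of facial activity also drives rigid head pose — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions, layer sizes, and the single shared output head are assumptions chosen only to show how expression and 6-DoF pose can be predicted jointly from one latent space.

```python
import torch
import torch.nn as nn

class SpeechToFaceAndPose(nn.Module):
    """Illustrative BLSTM mapping per-frame audio features to facial
    expression parameters plus 6-DoF head pose (3 rotation, 3
    translation). All dimensions are hypothetical, not the paper's."""

    def __init__(self, audio_dim=26, face_dim=30, pose_dim=6,
                 hidden=256, layers=3):
        super().__init__()
        self.blstm = nn.LSTM(audio_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        # A single linear head predicts expression and pose together,
        # so head motion is learned via the same latent space that
        # encodes the speaker's facial activity.
        self.head = nn.Linear(2 * hidden, face_dim + pose_dim)
        self.face_dim = face_dim

    def forward(self, audio):
        # audio: (batch, time, audio_dim) acoustic features
        latent, _ = self.blstm(audio)
        out = self.head(latent)
        face = out[..., :self.face_dim]   # expression parameters
        pose = out[..., self.face_dim:]   # rotation + translation
        return face, pose

model = SpeechToFaceAndPose()
face, pose = model(torch.randn(2, 100, 26))  # 2 clips, 100 frames
```

In this sketch, jointness comes from the shared BLSTM trunk and a single output projection; training would minimise a combined loss over both outputs so that pose prediction is shaped by the facial latent space.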

 DOI: 10.21437/Interspeech.2018-2587

Cite as: Greenwood, D., Matthews, I., Laycock, S. (2018) Joint Learning of Facial Expression and Head Pose from Speech. Proc. Interspeech 2018, 2484-2488, DOI: 10.21437/Interspeech.2018-2587.

@inproceedings{greenwood18_interspeech,
  author={David Greenwood and Iain Matthews and Stephen Laycock},
  title={Joint Learning of Facial Expression and Head Pose from Speech},
  booktitle={Proc. Interspeech 2018},
  year={2018},
  pages={2484--2488},
  doi={10.21437/Interspeech.2018-2587}
}