The Effect of Real-Time Constraints on Automatic Speech Animation

Danny Websdale, Sarah Taylor, Ben Milner

Machine learning has previously been applied successfully to speech-driven facial animation. To account for carry-over and anticipatory coarticulation, a common approach is to predict the facial pose using a symmetric window of acoustic speech that includes both past and future context. Using future context limits this approach for animating the faces of characters in real-time and networked applications, such as online gaming. An acceptable latency for conversational speech is 200ms, and typically network transmission times will consume a significant part of this. Consequently, we consider asymmetric windows by investigating the extent to which decreasing the future context affects the quality of predicted animation using both deep neural networks (DNNs) and bi-directional LSTM recurrent neural networks (BiLSTMs). Specifically, we investigate future contexts from 170ms (fully-symmetric) to 0ms (fully-asymmetric). We find that a BiLSTM trained using 70ms of future context is able to predict facial motion of quality equivalent to a DNN trained with 170ms, while adding only 5ms of processing time. Subjective tests using the BiLSTM show that reducing the future context from 170ms to 50ms does not significantly decrease perceived realism. Below 50ms, perceived realism begins to deteriorate, creating a trade-off between realism and latency.
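The windowing scheme described above can be sketched as follows. This is an illustrative example only: the 10ms frame hop, 13-dimensional features, and the function name are assumptions, not details taken from the paper.

```python
import numpy as np

def asymmetric_window(features, t, past_ms=170, future_ms=50, hop_ms=10):
    """Stack acoustic frames around frame index t using an asymmetric
    context window: past_ms of carry-over context and future_ms of
    anticipatory context. Edge frames are replicated as padding.
    (Illustrative sketch; the paper's exact frame rate and feature
    type are assumptions here.)"""
    past = past_ms // hop_ms      # number of past-context frames
    future = future_ms // hop_ms  # number of future-context frames
    n_frames = len(features)
    # Clip indices so windows near the utterance edges repeat the boundary frame
    idx = np.clip(np.arange(t - past, t + future + 1), 0, n_frames - 1)
    return features[idx].reshape(-1)  # flattened input vector for a DNN

# 100 frames of 13-dim acoustic features (e.g. MFCCs) at a 10ms hop
feats = np.random.randn(100, 13)
x_sym = asymmetric_window(feats, 50, past_ms=170, future_ms=170)  # fully-symmetric
x_asym = asymmetric_window(feats, 50, past_ms=170, future_ms=0)   # fully-asymmetric (causal)
print(x_sym.shape, x_asym.shape)
```

Shrinking `future_ms` directly reduces the frames of look-ahead the predictor must wait for, which is what trades prediction quality against latency in the paper's experiments.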

DOI: 10.21437/Interspeech.2018-2066

Cite as: Websdale, D., Taylor, S., Milner, B. (2018) The Effect of Real-Time Constraints on Automatic Speech Animation. Proc. Interspeech 2018, 2479-2483, DOI: 10.21437/Interspeech.2018-2066.

@inproceedings{websdale18_interspeech,
  author={Danny Websdale and Sarah Taylor and Ben Milner},
  title={The Effect of Real-Time Constraints on Automatic Speech Animation},
  booktitle={Proc. Interspeech 2018},
  year={2018},
  pages={2479--2483},
  doi={10.21437/Interspeech.2018-2066}
}