INTERSPEECH 2006 - ICSLP
This paper describes our initial work in developing a real-time audio-visual Chinese speech synthesizer with a 3D expressive avatar. The avatar model is parameterized according to the MPEG-4 facial animation standard. This standard offers a compact set of facial animation parameters (FAPs) and feature points (FPs) that enable the realization of 20 Chinese visemes and 7 facial expressions (i.e., 27 target facial configurations). The Xface open-source toolkit enables us to define the influence zone for each FP and the deformation function that relates them. Hence we can easily animate a large number of coordinates in the 3D model by specifying values for a small set of FAPs and their FPs. FAP values for the 27 target facial configurations were estimated from available corpora. We extended the dominance blending approach to effect animations for coarticulated visemes superposed with expression changes. We selected six sentiment-carrying text messages and synthesized expressive visual speech (for all expressions, in randomized order) with neutral audio speech. A perceptual experiment involving 11 subjects shows that they can identify the facial expression that matches the text message's sentiment 85% of the time.
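As background for the dominance blending mentioned above, the following is a minimal sketch of the classic Cohen-Massaro style scheme it builds on: each speech segment contributes a target FAP value weighted by a dominance function that peaks at the segment's center and decays away from it, and the rendered FAP at any time is the dominance-weighted average of all active targets. The function names, parameters, and the negative-exponential dominance shape here are illustrative assumptions, not the authors' actual implementation.

```python
import math

def dominance(t, center, alpha, theta, c=1.0):
    # Negative-exponential dominance function: peaks at `center`,
    # decays with rate `theta`; `alpha` scales the segment's strength.
    return alpha * math.exp(-theta * abs(t - center) ** c)

def blend_fap(t, segments):
    # segments: list of (target_fap_value, center_time, alpha, theta),
    # one entry per viseme (or superposed expression) active near time t.
    # Returns the dominance-weighted average of the target FAP values.
    num = 0.0
    den = 0.0
    for target, center, alpha, theta in segments:
        d = dominance(t, center, alpha, theta)
        num += d * target
        den += d
    return num / den if den > 0.0 else 0.0
```

With a single segment the blend simply returns that segment's target; with two equally weighted segments, the output transitions smoothly between their targets as time moves from one center to the other, which is how coarticulated visemes (and superposed expression targets) are interpolated without discontinuities.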
Bibliographic reference. Wu, Zhiyong / Zhang, Shen / Cai, Lianhong / Meng, Helen M. (2006): "Real-time synthesis of Chinese visual speech and facial expressions using MPEG-4 FAP features in a three-dimensional avatar", In INTERSPEECH-2006, paper 1823-Wed2BuP.4.