ESCA Workshop on Audio-Visual Speech Processing (AVSP'97)
September 26-27, 1997
We have developed a visual speech synthesizer for unrestricted French text and synchronized it with an audio text-to-speech synthesizer also developed at the ICP (Le Goff & Benoit, 1996). The front-end of our synthesizer is a 3-D model of the face whose speech gestures are controlled by eight parameters: five for the lips, one for the chin, and two for the tongue. In contrast to most existing systems, which rely on a limited set of prestored facial images, we have adopted the parametric approach to coarticulation first proposed by Cohen and Massaro (1993). We have thus implemented a coarticulation model based on spline-like dominance functions, each defined by three coefficients and applied to each target in a library of 16 French visemes. Unlike Cohen & Massaro (1993), however, we adopted a data-driven approach to automatically identify the many coefficients needed to model coarticulation; to do so, we systematically analyzed an ad hoc corpus uttered by a French male speaker. An intelligibility test was run to quantify the benefit of seeing the synthetic face in addition to hearing the synthetic voice under several conditions of background noise (Le Goff, 1997). Here, we extend this evaluation to audiovisual material in which the same corpus was acoustically uttered by a male speaker and synchronized with the synthetic head.
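The parametric coarticulation scheme described above can be illustrated with a minimal sketch of Cohen & Massaro-style dominance blending: each viseme contributes a target value for an articulatory parameter (e.g. lip aperture), weighted by a dominance function that peaks at the viseme's temporal center and decays on both sides. The exponential dominance shape, the coefficient values, and all names below are illustrative assumptions, not the paper's actual fitted model.

```python
import math

def dominance(t, center, magnitude=1.0, rate=4.0):
    # Illustrative exponential dominance function: strongest at the
    # viseme's temporal center, decaying symmetrically with distance.
    return magnitude * math.exp(-rate * abs(t - center))

def blend(t, targets):
    # targets: list of (center_time, target_value) pairs, one per viseme.
    # The parameter trajectory at time t is the dominance-weighted average
    # of the viseme targets, so neighboring visemes pull on each other
    # (coarticulation).
    weights = [dominance(t, c) for c, _ in targets]
    return sum(w * v for w, (_, v) in zip(weights, targets)) / sum(weights)

# Hypothetical lip-aperture targets for three successive visemes.
targets = [(0.0, 0.8), (0.2, 0.1), (0.4, 0.6)]
value = blend(0.2, targets)  # pulled toward 0.1, but biased by neighbors
```

At the middle viseme's center the blended value is not the target 0.1 itself but a compromise shifted toward the flanking targets, which is exactly the contextual smoothing the model is meant to capture.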
Bibliographic reference. Le Goff, Bertrand / Benoît, Christian (1997): "A French-speaking synthetic head", In AVSP-1997, 145-148.