Auditory-Visual Speech Processing (AVSP) 2010
Hakone, Kanagawa, Japan
This paper presents an initial bimodal acoustic-visual synthesis system able to generate concurrently the speech signal and a 3D animation of the speaker's face. This is done by concatenating bimodal diphone units that consist of both acoustic and visual information. The latter is acquired using a stereovision technique. The proposed method addresses the problems of asynchrony and incoherence inherent in classic approaches to audiovisual synthesis. Unit selection is based on the classic target and join costs from acoustic-only synthesis, augmented with a visual join cost. Preliminary results indicate the benefits of this approach, since both the synthesized speech signal and the face animation are of good quality.
Index Terms: audiovisual speech synthesis, talking head, bimodal unit concatenation, diphones
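The selection scheme described in the abstract, a target cost plus acoustic and visual join costs over candidate bimodal diphones, can be sketched as a standard Viterbi search. The sketch below is illustrative only: the feature fields (`'ac'`, `'vis'`), the Euclidean distances, and the cost weights are assumptions, not the paper's actual features or weighting.

```python
import numpy as np

def select_units(candidates, targets, w_target=1.0, w_join_ac=1.0, w_join_vis=1.0):
    """Select one bimodal diphone unit per target position by Viterbi search.

    candidates[i] : list of units for position i; each unit is a dict with
        'ac'  -- acoustic feature vector at the join (illustrative)
        'vis' -- visual feature vector at the join, e.g. 3D face points (illustrative)
    targets[i]    : acoustic target specification for position i
    Returns the index of the chosen candidate at each position.
    """
    n = len(candidates)
    # cost[i][k]: best cumulative cost ending in candidate k at position i
    cost = [np.full(len(candidates[i]), np.inf) for i in range(n)]
    back = [np.zeros(len(candidates[i]), dtype=int) for i in range(n)]

    # First position: target cost only (no join yet)
    for k, u in enumerate(candidates[0]):
        cost[0][k] = w_target * np.linalg.norm(u['ac'] - targets[0])

    for i in range(1, n):
        for k, u in enumerate(candidates[i]):
            t_cost = w_target * np.linalg.norm(u['ac'] - targets[i])
            best, best_j = np.inf, 0
            for j, prev in enumerate(candidates[i - 1]):
                # Acoustic join cost plus the visual join cost that
                # extends the classic acoustic-only formulation
                join = (w_join_ac * np.linalg.norm(u['ac'] - prev['ac'])
                        + w_join_vis * np.linalg.norm(u['vis'] - prev['vis']))
                c = cost[i - 1][j] + join
                if c < best:
                    best, best_j = c, j
            cost[i][k] = best + t_cost
            back[i][k] = best_j

    # Backtrack the cheapest path through the candidate lattice
    path = [int(np.argmin(cost[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    path.reverse()
    return path
```

Because both join terms are evaluated on the same bimodal unit, a unit whose audio joins well but whose visual trajectory would jump is penalized, which is one way to read the abstract's claim of avoiding asynchrony and incoherence between the two channels.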
Bibliographic reference. Toutios, Asterios / Musti, Utpala / Ouni, Slim / Colotte, Vincent / Wrobel-Dautcourt, Brigitte / Berger, Marie-Odile (2010): "Towards a true acoustic-visual speech synthesis", In AVSP-2010, paper P8.