The Seventh ISCA Tutorial and Research Workshop on Speech Synthesis

Kyoto, Japan
September 22-24, 2010

Photo-Real Lips Synthesis with Trajectory-Guided Sample Selection

Lijuan Wang (1), Xiaojun Qian (2), Wei Han (3), Frank K. Soong (1)

(1) Microsoft Research Asia, Beijing, China
(2) Department of Systems Engineering, Chinese University of Hong Kong, China
(3) Department of Computer Science, Shanghai Jiao Tong University, China

In this paper, we propose an HMM trajectory-guided approach to photo-real talking head synthesis that concatenates real image samples. It renders a smooth and natural video of the articulators in sync with given speech signals. An audio-visual database is first used to train a statistical Hidden Markov Model (HMM) of lip movement, and the trained model is then used to generate a visual parameter trajectory of lip movement for the given speech signals, both in the maximum likelihood sense. The HMM-generated trajectory then serves as a guide for selecting, from the original training database, an optimal sequence of mouth images, which are stitched back into a background head video. The whole procedure is fully automatic and data-driven. With as little as 20 minutes of audio/video footage from a speaker, the proposed system can synthesize a highly photo-real video in sync with the given speech signals. This system won first place in the Audio-Visual match contest of the LIPS2009 Challenge, which was perceptually evaluated by recruited human subjects.
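The sample selection step described above can be illustrated with a minimal sketch. It assumes that each stored mouth image is represented by a feature vector, that the target cost is the Euclidean distance to the guide trajectory, and that the concatenation cost is the Euclidean distance between consecutive selected samples. The abstract does not specify the cost functions, so these choices, the function name `select_samples`, and the parameter `concat_weight` are hypothetical and serve only to show how a guide trajectory can drive a Viterbi-style search over a sample library.

```python
import math


def select_samples(guide, library, concat_weight=1.0):
    """Pick one library sample per frame with a Viterbi-style search.

    guide:   list of T feature vectors (stand-in for the HMM-generated trajectory).
    library: list of N feature vectors (one per stored mouth image).
    Returns a list of T indices into `library`.

    Assumption: both costs are plain Euclidean distances; the paper's
    actual cost definitions are not given in the abstract.
    """
    if not guide or not library:
        return []
    n = len(library)

    def target(t, i):
        # Target cost: how far candidate i is from the guide at frame t.
        return math.dist(guide[t], library[i])

    # Concatenation cost between every pair of samples, precomputed once.
    concat = [
        [concat_weight * math.dist(library[i], library[j]) for j in range(n)]
        for i in range(n)
    ]

    cost = [target(0, i) for i in range(n)]  # best cost ending in sample i
    back = []                                # back-pointers per frame
    for t in range(1, len(guide)):
        new_cost, pointers = [], []
        for j in range(n):
            best_i = min(range(n), key=lambda i: cost[i] + concat[i][j])
            new_cost.append(cost[best_i] + concat[best_i][j] + target(t, j))
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path back from the last frame to the first.
    j = min(range(n), key=lambda i: cost[i])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path
```

Calling `select_samples(guide, library, concat_weight=0.0)` reduces the search to a per-frame nearest-neighbour lookup, while a large `concat_weight` favours smooth sequences over close tracking of the guide. The search is quadratic in the library size per frame, so a practical system would likely prune candidates first.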

Index Terms: visual speech synthesis, photo-real, talking head, trajectory-guided

Bibliographic reference. Wang, Lijuan / Qian, Xiaojun / Han, Wei / Soong, Frank K. (2010): "Photo-real lips synthesis with trajectory-guided sample selection", in SSW7-2010, 217-222.