Auditory-Visual Speech Processing
In this paper, we propose a corpus-based lip-sync algorithm for natural face animation. An audio-visual (AV) corpus was constructed from video recordings of an announcer's facial close-up while reading texts selected from newspapers. To obtain lip parameters, we attached 19 markers to the speaker's face and extracted the marker positions by color filtering followed by a center-of-gravity method. The spoken utterances were labeled with HTK, and prosodic information such as duration, pitch, and intensity was extracted as parameters. By combining this audio information with the lip parameters, we constructed the audio-visual corpus.
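The two-step marker extraction described above (color filtering, then center of gravity) can be sketched as follows. This is not the authors' implementation; it is a minimal illustration assuming a single marker color range per pass, with NumPy as the only dependency. The function name `marker_centroid` and the RGB bounds are hypothetical.

```python
import numpy as np

def marker_centroid(frame, lower, upper):
    """Estimate one marker position in an RGB frame of shape (H, W, 3).

    Illustrative two-step extraction:
      1) color filtering: keep pixels whose RGB values fall inside
         [lower, upper] componentwise;
      2) center of gravity: average the coordinates of the surviving
         pixels to obtain a single (x, y) position.
    The paper tracks 19 markers, which would require one such pass per
    marker color (or a connected-component step to separate markers).
    """
    mask = np.all((frame >= lower) & (frame <= upper), axis=-1)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None  # marker not visible in this frame
    return float(xs.mean()), float(ys.mean())
```

Applied per frame, this yields a time series of lip-parameter positions that can then be aligned with the labeled audio.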
Based on this AV corpus, we propose a concatenation method for AV units, similar to corpus-based text-to-speech synthesis. As the basic synthesis unit for the AV unit search, we used CVC syllables. Lip parameters for given text and speech are obtained in two steps. First, the top-N candidates for each required CVC unit are selected using two proposed distance measures: a phonetic-environment distance and a prosodic distance. Second, the best path through the top-N AV unit candidates is estimated with a Viterbi search. Simulation results show that not only duration but also pitch and intensity information is useful for enhancing lip-sync performance. The reconstructed lip parameters are almost identical to the originals.
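The best-path step can be illustrated with a generic Viterbi search over the top-N candidate lattice. This is a sketch, not the paper's code: the per-candidate `target_costs` stand in for the combined phonetic-environment and prosodic distances, and `join_cost` is a hypothetical concatenation cost (e.g., lip-parameter discontinuity at the unit boundary).

```python
def viterbi_unit_selection(target_costs, join_cost):
    """Select one candidate per target unit minimizing total cost.

    target_costs: list over target CVC units; entry t is a list of
      target costs, one per top-N candidate of unit t.
    join_cost(j, i): cost of concatenating candidate j of unit t-1
      with candidate i of unit t (assumed symmetric in indices here
      purely for illustration).
    Returns (best_path, best_total_cost).
    """
    T = len(target_costs)
    cost = [list(target_costs[0])]       # cost[t][i]: best cost ending at (t, i)
    back = [[-1] * len(target_costs[0])] # backpointers for path recovery
    for t in range(1, T):
        row, ptr = [], []
        for i, tc in enumerate(target_costs[t]):
            best_j = min(range(len(cost[t - 1])),
                         key=lambda j: cost[t - 1][j] + join_cost(j, i))
            row.append(cost[t - 1][best_j] + join_cost(best_j, i) + tc)
            ptr.append(best_j)
        cost.append(row)
        back.append(ptr)
    # backtrack from the cheapest final candidate
    i = min(range(len(cost[-1])), key=lambda k: cost[-1][k])
    total = cost[-1][i]
    path = [i]
    for t in range(T - 1, 0, -1):
        i = back[t][i]
        path.append(i)
    return path[::-1], total
```

For example, with two target units of two candidates each, `viterbi_unit_selection([[1, 5], [4, 1]], lambda j, i: 0 if j == i else 2)` trades a small join penalty for the cheaper second-unit candidate and returns the path `[0, 1]` with total cost 4.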
Bibliographic reference. Kim, Jinyoung / Choi, Seungho / Lee, Joohun (2001): "Development of a lip-sync algorithm based on an audio-visual corpus", In AVSP-2001, 110-114.