ESCA Workshop on Audio-Visual Speech Processing (AVSP'97)
September 26-27, 1997
Synthesized lip movement images can compensate for a lack of auditory information for hearing-impaired people, and can also help give computer agents a human-like face. We propose a novel method that synthesizes lip movement by mapping from input speech using hidden Markov models (HMMs). This paper compares the HMM method with conventional methods that use vector quantization (VQ) or an artificial neural network (ANN) to convert speech to lip movement images. In the experiment, the error and the time difference error between the synthesized lip movement images and the original ones are used for evaluation. The results show that the error of the HMM method is 8.6% smaller than that of the VQ method. Moreover, the HMM method reduces the time difference error by 34.8% relative to the VQ method. The results also show that the errors are mostly caused by the phonemes /h/ and /Q/. Since those phonemes depend on the succeeding phoneme, context-dependent synthesis is applied to the HMM method to reduce the error. The context-dependent HMM method reduces the error by 11.3% and the time difference error by 8.9% compared with the original HMM method.
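As an illustration only, the VQ baseline mentioned above can be pictured as a frame-by-frame codebook lookup: each speech frame is assigned to its nearest speech codeword, and the lip parameters paired with that codeword are output. The sketch below is a minimal version under stated assumptions and is not the authors' implementation. The function names, the paired-codebook layout, the toy feature values, and the reading of "X% smaller" as a relative reduction are all hypothetical.

```python
# Minimal sketch of a VQ-style speech-to-lip mapping (illustrative assumptions only).
# The codebooks, feature dimensions, and values are hypothetical, not from the paper.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(frame, speech_codebook):
    """Index of the speech codeword closest to the given speech frame."""
    return min(
        range(len(speech_codebook)),
        key=lambda i: squared_distance(frame, speech_codebook[i]),
    )


def vq_speech_to_lip(frames, speech_codebook, lip_codebook):
    """Map each speech frame to the lip parameters paired with its nearest codeword."""
    return [lip_codebook[nearest_codeword(f, speech_codebook)] for f in frames]


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is smaller than `baseline` (assumed reading)."""
    return 100.0 * (baseline - improved) / baseline


# Toy data: two speech codewords, each paired with a one-dimensional lip parameter.
speech_codebook = [[0.0, 0.0], [1.0, 1.0]]
lip_codebook = [[0.1], [0.9]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.4, 0.3]]

lip_track = vq_speech_to_lip(frames, speech_codebook, lip_codebook)
```

Because each frame is mapped independently here, a lookup of this kind carries no information about neighbouring frames, which is one way to see why a sequence model such as an HMM might track the timing of lip movement more closely.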
Bibliographic reference. Yamamoto, Eli / Nakamura, Satoshi / Shikano, Kiyohiro (1997): "Speech to lip movement synthesis by HMM", In AVSP-1997, 137-140.