Auditory-Visual Speech Processing (AVSP'98)
December 4-6, 1998
This paper describes a technique for synthesizing lip movements synchronized with an input speech signal. The technique is based on an algorithm for parameter generation from HMMs with dynamic features, which has been successfully applied to text-to-speech synthesis. Audio-visual speech unit HMMs, namely syllable HMMs, are trained on parameter vector sequences that represent both auditory and visual speech features. Input speech is recognized using the syllable HMMs and converted into a transcription and a state sequence. A sentence HMM is constructed by concatenating the syllable HMMs corresponding to the transcription of the input speech. An optimum visual speech parameter sequence is then generated from the sentence HMM in the maximum-likelihood (ML) sense. Since the generated parameter sequence reflects statistics of both static and dynamic features over several phonemes before and after the current phoneme, the synthetic lip motion is smooth and realistic. We show experimental results which demonstrate the effectiveness of the proposed technique.
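The core of the generation step can be illustrated with a small numerical sketch. Under the usual formulation of ML parameter generation with dynamic features (diagonal covariances, a fixed delta regression window), the smooth static trajectory c maximizing the HMM output probability solves the linear system (WᵀΣ⁻¹W)c = WᵀΣ⁻¹μ, where W stacks the identity and delta windows and μ, Σ come from the state sequence. The function name and the 1-D single-stream simplification below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def mlpg(means, variances, delta_win=(-0.5, 0.0, 0.5)):
    """ML parameter generation for a single 1-D feature stream (sketch).

    means, variances : (T, 2) arrays of per-frame state statistics for
    the static and delta features, taken from the decoded state sequence.
    delta_win        : regression window used to compute delta features.
    Returns the length-T static trajectory maximizing the likelihood.
    """
    T = means.shape[0]
    L = len(delta_win) // 2
    # W maps the static sequence c (length T) to the stacked
    # observation [static_0, delta_0, static_1, delta_1, ...].
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                       # static row: identity
        for k, w in enumerate(delta_win):       # delta row: regression window
            tau = min(max(t + k - L, 0), T - 1)  # clip at sequence boundaries
            W[2 * t + 1, tau] += w
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)          # diagonal inverse covariance
    A = W.T @ (prec[:, None] * W)               # W' Sigma^-1 W
    b = W.T @ (prec * mu)                       # W' Sigma^-1 mu
    return np.linalg.solve(A, b)                # smooth static trajectory
```

Because each output frame is coupled to its neighbors through the delta rows of W, abrupt jumps in the per-state static means are smoothed into continuous transitions, which is what makes the generated lip motion realistic compared with piecewise-constant state means.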
Bibliographic reference. Tamura, Masatsune / Masuko, Takashi / Kobayashi, Takao / Tokuda, Keiichi (1998): "Visual speech synthesis based on parameter generation from HMM: speech-driven and text-and-speech-driven approaches", In AVSP-1998, 221-226.
Accompanying video files (QuickTime, 537 KB each):
- av98_221_1.mov: Real lip movements.
- av98_221_2.mov: Synthetic lip movements using the speech-driven approach with dynamic features.
- av98_221_3.mov: Synthetic lip movements using the text-and-speech-driven approach with dynamic features.
- av98_221_4.mov: Synthetic lip movements using the speech-driven approach without dynamic features.
- av98_221_5.mov: Synthetic lip movements using the text-and-speech-driven approach without dynamic features.