Auditory-Visual Speech Processing (AVSP) 2011
In this paper, we present a method for taking visual information into account during the selection process in an acoustic-visual synthesizer. The acoustic-visual speech synthesizer is based on the selection and concatenation of synchronous bimodal diphone units, i.e., the speech signal together with the 3D facial movements of the speaker's face. The visual speech information is acquired using a stereovision technique. Unit selection for synthesis is driven by a classical target cost consisting of linguistic and phonological features. We compare several methods of incorporating the visual articulatory context into this target cost, and present an objective evaluation of the synthesis results based on the correlation between the actual and the synthesized visual speech trajectories.
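The correlation-based objective evaluation mentioned in the abstract can be sketched as follows. This is an illustrative implementation only, assuming trajectories are stored as frame-by-parameter arrays; the function name, array shapes, and the toy sinusoidal data are assumptions, not taken from the paper:

```python
import numpy as np

def trajectory_correlation(actual, synthesized):
    """Pearson correlation between an actual and a synthesized
    visual speech trajectory, computed per parameter dimension.

    actual, synthesized: arrays of shape (n_frames, n_params).
    Returns an array of n_params correlation coefficients.
    """
    actual = np.asarray(actual, dtype=float)
    synthesized = np.asarray(synthesized, dtype=float)
    # Center each parameter trajectory around its mean.
    a = actual - actual.mean(axis=0)
    s = synthesized - synthesized.mean(axis=0)
    # Pearson r = covariance / (product of standard deviations),
    # evaluated independently for each parameter dimension.
    num = (a * s).sum(axis=0)
    den = np.sqrt((a ** 2).sum(axis=0) * (s ** 2).sum(axis=0))
    return num / den

# Illustrative use (synthetic data, not from the paper): a lightly
# perturbed copy of a trajectory should correlate highly with it.
t = np.linspace(0.0, 1.0, 200)
actual = np.stack([np.sin(2 * np.pi * 3 * t),
                   np.cos(2 * np.pi * 2 * t)], axis=1)
rng = np.random.default_rng(0)
synthesized = actual + 0.05 * rng.standard_normal(actual.shape)
r = trajectory_correlation(actual, synthesized)
```

A per-dimension correlation of this kind gives one score per visual parameter, which can then be averaged to summarize how closely the synthesized facial movement tracks the recorded one.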
Index Terms. speech synthesis, unit selection, target costs.
Bibliographic reference. Musti, Utpala / Colotte, Vincent / Toutios, Asterios / Ouni, Slim (2011): "Introducing visual target cost within an acoustic-visual unit-selection speech synthesizer", In AVSP-2011, 49-55.