International Conference on Auditory-Visual Speech Processing 2008
Tangalooma Wild Dolphin Resort,
Moreton Island, Queensland, Australia
This paper describes issues relating to the subjective evaluation of synthesised visual speech. Two approaches to synthesis are compared: a text-driven synthesiser and a speech-driven synthesiser. Both synthesisers are trained using the same data, and both use the same model for rendering the synthesised visual speech. Naturalness is used as a performance metric, and the naturalness of real visual speech re-rendered on the same model is used as a benchmark. The naturalness of the text-driven synthesiser is significantly better than that of the speech-driven synthesiser, but neither synthesiser can yet achieve the naturalness of real visual speech. The impact of likely sources of error apparent in the synthesised visual speech is investigated. Similar forms of error are introduced into real visual speech sequences, and the degradation in naturalness is measured using the same naturalness ratings used to evaluate the performance of the synthesisers. We find that the overall perception of sentence-level utterances is severely degraded when only a small region of an otherwise perfect rendering of the visual sequence is incorrect. For example, if the visual gesture for only a single syllable in an utterance is incorrect, the overall naturalness of this real sequence is rated lower than that of the text-driven synthesiser.
Bibliographic reference. Theobald, Barry-John / Wilkinson, Nicholas / Matthews, Iain (2008): "On evaluating synthesised visual speech", in AVSP-2008, 7-12.