Auditory-Visual Speech Processing (AVSP) 2009
University of East Anglia, Norwich, UK
This paper proposes a 2D audiovisual text-to-speech synthesis system that constructs the output signal by selecting and concatenating multimodal segments containing natural combinations of audio and video. We describe the experiments that were conducted to assess the impact of this joint audio/video synthesis technique on the perceived quality of the synthetic speech. The experiments indicate that maximizing the audiovisual coherence of the output speech improves the perceived quality compared to the traditional approach of synthesizing the visual signal separately from the audio. In addition, we measured that the maximum allowable desynchronization between the audio and the image sequence is the same irrespective of whether the degree of desynchronization is constant or time-varying. This tolerance is used in the synthesizer to further optimize the segment cutting points in the audio and in the video mode.
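The joint audiovisual unit selection described above can be sketched as a standard dynamic-programming search over candidate segments, where each candidate unit carries both its audio and its video so their natural coherence is preserved. The following is a minimal illustrative sketch, not the paper's implementation; the function name `select_units`, the cost functions, and the toy inventory are all assumptions introduced for illustration.

```python
# Illustrative sketch of multimodal unit selection (not the paper's code):
# pick one audiovisual unit per target so that the sum of target costs
# (how well a unit matches the target spec) and join costs (how smoothly
# consecutive units concatenate in both modalities) is minimal.

def select_units(targets, inventory, target_cost, join_cost):
    """targets: list of target specs (e.g. phoneme labels).
    inventory: dict mapping a target spec to its candidate units.
    Returns the cheapest unit sequence via dynamic programming."""
    # layers[i][u] = (cumulative cost of best path ending in unit u, backpointer)
    prev = {u: (target_cost(targets[0], u), None) for u in inventory[targets[0]]}
    layers = [prev]
    for t in targets[1:]:
        cur = {}
        for u in inventory[t]:
            # best predecessor for unit u, including the concatenation cost
            p, (c, _) = min(prev.items(),
                            key=lambda kv: kv[1][0] + join_cost(kv[0], u))
            cur[u] = (c + join_cost(p, u) + target_cost(t, u), p)
        layers.append(cur)
        prev = cur
    # backtrack from the cheapest final unit
    u = min(prev, key=lambda k: prev[k][0])
    path = [u]
    for i in range(len(layers) - 1, 0, -1):
        u = layers[i][u][1]
        path.append(u)
    return list(reversed(path))
```

Because each unit is an indivisible audio+video pair, the search cannot introduce audiovisual incoherence within a unit; the desynchronization tolerance measured in the paper would additionally allow the audio and video cut points of a join to be optimized separately within that tolerance.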
Index Terms: audiovisual speech synthesis, multimodal unit selection, audiovisual synchrony
Bibliographic reference. Mattheyses, Wesley / Latacz, Lukas / Verhelst, Werner (2009): "Multimodal coherency issues in designing and optimizing audiovisual speech synthesis techniques", In AVSP-2009, 47-52.