Auditory-Visual Speech Processing (AVSP) 2013
This study concerns the perception of prominence in auditory-visual speech. We constructed A/V stimuli from five-syllable sentences in which every syllable was a candidate for receiving stress. All syllables were of uniform length, and the F0 contours were manipulated using the Fujisaki model, moving an F0 peak from the beginning to the end of the utterance. The peak was aligned either with the center of a syllable or with the boundary between syllables, yielding a total of nine positions. Likewise, a video showing the upper part of a speaker's face exhibiting a single eyebrow raise was aligned with the audio, yielding nine positions for the visual cue, with the maximum displacement of the eyebrows coinciding with syllable centers or boundaries. Another series of stimuli was produced with head nods as the visual cue. In addition, stimuli with constant F0, with or without video, were created. Twenty-two native speakers of German rated the strength of each of the five syllables in a stimulus on a scale from 1 to 3. Results show that acoustic prominence outweighs visual prominence, and that the integration of both in a single syllable is strongest when the movement as well as the F0 peak are aligned with the center of the syllable. However, F0 peaks aligned with the right boundary of the accented syllable, as well as visual peaks aligned with the left one, also boost prominence considerably. Nods had an effect similar in magnitude to that of eyebrow movements; however, the results suggest that they need to be aligned with the right boundary of the syllable rather than the left one.
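As an illustration of how nine alignment positions can arise from five syllables of uniform length, the following minimal sketch enumerates the five syllable centers and the four boundaries between adjacent syllables. The function name `peak_positions` and the syllable duration of 0.2 s are hypothetical and not taken from the paper; the reading of "nine positions" as five centers plus four internal boundaries is likewise an assumption based on the description above.

```python
def peak_positions(n_syllables=5, syllable_dur=0.2):
    """List (label, time in seconds) for each candidate peak position.

    Assumes uniform syllable length; syllable_dur is an illustrative
    value, not one reported in the study.
    """
    positions = []
    for i in range(n_syllables):
        # Peak aligned with the center of syllable i + 1.
        positions.append((f"center of syllable {i + 1}", (i + 0.5) * syllable_dur))
        # Peak aligned with the boundary to the next syllable, if any.
        if i < n_syllables - 1:
            positions.append((f"boundary {i + 1}/{i + 2}", (i + 1) * syllable_dur))
    return positions


if __name__ == "__main__":
    for label, time in peak_positions():
        print(f"{label}: {time:.2f} s")
```

Under these assumptions, the same list of positions would apply to both the F0 peak and the maximum displacement of the visual cue.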
Index Terms: Prominence, auditory-visual integration, F0 modeling
Bibliographic reference. Mixdorff, Hansjörg / Hönemann, Angelika / Fagel, Sascha (2013): "Integration of acoustic and visual cues in prominence perception", In AVSP-2013, 111-116.