13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Speech Pattern Discovery using Audio-Visual Fusion and Canonical Correlation Analysis

Lei Xie (1), Yinqing Xu (1), Lilei Zheng (1), Qiang Huang (2), Bingfeng Li (1)

(1) School of Computer Science, Northwestern Polytechnical University, Xi'an, China
(2) School of Computing Sciences, University of East Anglia, Norwich, UK

In this paper, we propose a speech pattern discovery approach using audio-visual information fusion. We first align the audio and visual feature sequences using canonical correlation analysis (CCA) to account for the temporal asynchrony between the audio and visual speech modalities. We then search for candidate patterns, called paths, by applying unbounded dynamic time warping (UDTW) separately to the inter-utterance audio and visual similarity matrices. Finally, the audio paths and visual paths are integrated, and the reliable ones are retained as the discovered speech patterns. Experiments on an audio-visual corpus show for the first time that the performance of speech pattern discovery can be improved by the use of visual information when the speaker's facial information is available. Specifically, the proposed path fusion approach outperforms feature concatenation and similarity weighting. CCA-based audio-visual synchronization plays an important role in the performance improvement.
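As a rough illustration of what an inter-utterance similarity matrix looks like, the sketch below compares every frame of one utterance with every frame of another. It is not the authors' implementation: the abstract does not state which similarity measure or features are used, so cosine similarity, the toy two-dimensional frames, and the names `cosine` and `similarity_matrix` are all assumptions made for the example. In the approach described above, one such matrix would be built per modality and then searched for paths with UDTW, which is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors.

    Returns 0.0 if either vector is all zeros (an arbitrary convention
    chosen for this sketch).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_matrix(utt_x, utt_y):
    """Return S where S[i][j] compares frame i of utt_x with frame j of utt_y."""
    return [[cosine(fx, fy) for fy in utt_y] for fx in utt_x]


# Toy feature frames (hypothetical values, one list per frame).
utt_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
utt_b = [[0.0, 2.0], [3.0, 3.0]]

S = similarity_matrix(utt_a, utt_b)
for row in S:
    print(["%.3f" % value for value in row])
```

A run of high values along a roughly diagonal direction in such a matrix is what a warping-based search would pick up as a candidate path, that is, a stretch of the two utterances that sounds (or looks) alike.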

Index Terms: Speech pattern discovery, canonical correlation analysis, audio-visual speech processing, dynamic time warping


Bibliographic reference.  Xie, Lei / Xu, Yinqing / Zheng, Lilei / Huang, Qiang / Li, Bingfeng (2012): "Speech pattern discovery using audio-visual fusion and canonical correlation analysis", In INTERSPEECH-2012, 2374-2377.