Auditory-Visual Speech Processing (AVSP) 2011

Volterra, Italy
September 1-2, 2011

Bilingual Corpus for AVASR using Multiple Sensors and Depth Information

Georgios Galatas (1,2), Gerasimos Potamianos (2), Dimitrios Kosmopoulos (1,2), Chris McMurrough (1), Fillia Makedon (1)

(1) Heracleia Lab, Dept. of CSE, University of Texas at Arlington, USA
(2) Institute of Informatics and Telecommunications, NCSR “Demokritos”, Athens, Greece

In this paper we present the Bilingual Audio-Visual Corpus with Depth information (BAVCD). The database contains utterances of connected digits, spoken by 15 subjects in English and 6 subjects in Greek, collected using multiple audio-visual sensors. Of particular interest among these is the Microsoft Kinect device, which captures facial depth images via the structured light technique in addition to traditional RGB video. The database supports research on multiple aspects of small-vocabulary audio-visual automatic speech recognition, such as the use of visual depth information for speechreading, fusion of multiple video and audio streams, and language dependencies of the task. Preliminary recognition results on the corpus are also presented.

Index Terms. Audiovisual speech recognition, corpora, multisensory fusion, depth information, languages.

Bibliographic reference. Galatas, Georgios / Potamianos, Gerasimos / Kosmopoulos, Dimitrios / McMurrough, Chris / Makedon, Fillia (2011): "Bilingual corpus for AVASR using multiple sensors and depth information", in AVSP-2011, 103-106.