Auditory-Visual Speech Processing (AVSP) 2010
Hakone, Kanagawa, Japan
In this paper, an audio-visual speech corpus for noisy speech recognition, CENSREC-1-AV, is introduced. CENSREC-1-AV consists of an audio-visual database and a baseline bimodal speech recognition system that uses both audio and visual information. The database contains 3,234 training utterances by 42 speakers and 1,963 test utterances by 51 speakers. Each utterance consists of a speech signal as well as color and infrared images of the region around the speaker's mouth. The baseline system is provided so that users can evaluate their own bimodal speech recognizers; in it, multi-stream HMMs are trained on the training data. A preliminary experiment evaluated the baseline on acoustically noisy test data. The results show that roughly a 35% relative error reduction was achieved in low-SNR conditions compared with an audio-only ASR method.
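The multi-stream HMMs mentioned above combine the audio and visual streams at the state level, typically as a weighted sum of per-stream log-likelihoods. The following is a minimal illustrative sketch of that fusion rule; the function name and the example weight of 0.7 are assumptions for illustration, not values from the paper.

```python
def combined_log_likelihood(log_b_audio, log_b_visual, lambda_audio=0.7):
    """Frame-level state log-likelihood of a multi-stream HMM:
    a weighted sum of the per-stream log-likelihoods, with stream
    weights summing to 1 (lambda_visual = 1 - lambda_audio).
    The weight value here is illustrative, not from the paper."""
    lambda_visual = 1.0 - lambda_audio
    return lambda_audio * log_b_audio + lambda_visual * log_b_visual

# In clean audio a large audio weight dominates; under acoustic noise,
# shifting weight toward the visual stream improves robustness.
print(combined_log_likelihood(-2.0, -5.0, lambda_audio=0.7))  # -> -2.9
```

Adjusting the stream weights according to the acoustic SNR is what lets such a bimodal recognizer outperform audio-only ASR in low-SNR conditions, as reported above.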
Index Terms: audio-visual database, bimodal speech recognition, noise robustness, eigenface, optical flow.
Bibliographic reference. Tamura, Satoshi / Miyajima, Chiyomi / Kitaoka, Norihide / Yamada, Takeshi / Tsuge, Satoru / Takiguchi, Tetsuya / Yamamoto, Kazumasa / Nishiura, Takanobu / Nakayama, Masato / Denda, Yuki / Fujimoto, Masakiyo / Matsuda, Shigeki / Ogawa, Tetsuji / Kuroiwa, Shingo / Takeda, Kazuya / Nakamura, Satoshi (2010): "CENSREC-1-AV: an audio-visual corpus for noisy bimodal speech recognition", In AVSP-2010, paper P6.