In this paper, we propose using the recently introduced twin-HMM-based audio-visual speech enhancement algorithm as a front-end for audio-visual speech recognition systems. The algorithm estimates the clean speech statistics in the recognition domain from the audio-visual observations and transforms these statistics to the synthesis domain via the so-called twin HMMs. The adopted front-end is combined with back-end methods such as conventional maximum likelihood decoding or the newly introduced significance decoding. Applied to acoustically corrupted signals of the Grid audio-visual corpus, the proposed combination of front- and back-end yields statistically significant improvements in audio-visual recognition accuracy over the ETSI advanced front-end.
Bibliographic reference. Abdelaziz, Ahmed Hussen / Zeiler, Steffen / Kolossa, Dorothea (2013): "Using twin-HMM-based audio-visual speech enhancement as a front-end for robust audio-visual speech recognition", in Proc. INTERSPEECH 2013, pp. 867-871.