ESCA Workshop on Audio-Visual Speech Processing (AVSP'97)
September 26-27, 1997
In this paper we report on a systematic comparison of different neural architectures for the fusion of acoustic and optic information in speech recognition. Experiments were performed with MLPs for a noiseless and a noisy acoustic channel. Two kinds of input representations are investigated, resulting from low-level preprocessing and from a linear discriminant analysis, and cross-validation experiments were carried out. Our results suggest that, given the same architectural complexity, early and late integration models perform equally well, at least in the noiseless case. Pronounced differences in performance arise when the input representations are compared: the linear discriminant analysis yields highly distinguishable features and therefore better recognition performance. This is especially true in the case of a joint preprocessing of the acoustic and optic signals.
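The contrast between early and late integration can be sketched as follows: early integration concatenates the acoustic and optic feature vectors and feeds them to one joint MLP, while late integration runs a separate MLP per modality and fuses the class posteriors afterwards. This is a minimal forward-pass sketch with randomly initialized weights; the layer sizes, the single hidden layer, and the product-of-posteriors fusion rule are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, w1, w2):
    """One hidden layer with tanh activation, softmax output."""
    h = np.tanh(x @ w1)
    z = h @ w2
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical dimensions: acoustic / optic features, hidden units, classes.
n_ac, n_op, n_hid, n_cls = 12, 8, 16, 4

a = rng.normal(size=n_ac)  # acoustic feature vector
o = rng.normal(size=n_op)  # optic feature vector

# Early integration: concatenate both modalities, one joint MLP.
w1_e = rng.normal(size=(n_ac + n_op, n_hid))
w2_e = rng.normal(size=(n_hid, n_cls))
p_early = mlp_forward(np.concatenate([a, o]), w1_e, w2_e)

# Late integration: one MLP per modality; fuse the class posteriors
# afterwards (here via a product-of-posteriors rule, one common choice).
w1_a = rng.normal(size=(n_ac, n_hid)); w2_a = rng.normal(size=(n_hid, n_cls))
w1_o = rng.normal(size=(n_op, n_hid)); w2_o = rng.normal(size=(n_hid, n_cls))
p_late = mlp_forward(a, w1_a, w2_a) * mlp_forward(o, w1_o, w2_o)
p_late /= p_late.sum()  # renormalize the fused posterior
```

Both paths end in a posterior over the same word classes, so the two architectures can be compared at equal overall complexity, as done in the paper.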
Bibliographic reference. Krone, G. / Talk, B. / Wichert, A. / Palm, G. (1997): "Neural architectures for sensorfusion in speechrecognition", In AVSP-1997, 57-60.