4th International Conference on Spoken Language Processing
Philadelphia, PA, USA
This paper deals with the integration of visual data in automatic speech recognition systems. We first describe the framework of our research: the development of advanced multi-user, multi-modal interfaces. We then present audio-visual speech recognition problems in general, and those we are interested in in particular. After a very brief discussion of existing systems, the major part of the paper describes the systems we developed according to two different approaches to the integration of visual data in speech recognition systems. Section 3 presents the architecture of our audio-only reference and baseline systems. Our audio-visual systems are described in Section 4. We first describe a system developed according to the first approach (called the direct integration model) and show its limitations. Our own approach, which we call asynchronous integration, is then presented in Section 4.2. After the general guidelines, we detail the distributed architecture and the variant of the N-best algorithm we developed to implement this approach. In Section 6 the performance of these different systems is compared; we then conclude with a brief discussion of the performance improvements obtained and of future work.
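The asynchronous-integration idea named above can be illustrated with a minimal sketch of N-best rescoring (all function names, weights, and scores here are hypothetical illustrations, not taken from the paper): the audio recognizer decodes on its own and produces an N-best list, and a separately computed visual score then reranks the hypotheses, so neither modality constrains the other frame by frame.

```python
# Hypothetical sketch of N-best rescoring for audio-visual integration.
# The audio decoder produces hypotheses with log-scores; a visual model
# scores each hypothesis independently, and the two streams are combined
# with an interpolation weight (an assumed combination rule).

def rescore_nbest(nbest, visual_score, weight=0.3):
    """Combine audio and visual log-scores for each hypothesis.

    nbest        -- list of (hypothesis, audio_log_score) pairs
    visual_score -- function mapping a hypothesis to a visual log-score
    weight       -- interpolation weight for the visual stream (assumed)
    """
    rescored = [
        (hyp, (1 - weight) * audio + weight * visual_score(hyp))
        for hyp, audio in nbest
    ]
    # Highest combined log-score first.
    return sorted(rescored, key=lambda t: t[1], reverse=True)

# Toy example: visual evidence promotes the second audio hypothesis.
nbest = [("bat", -10.0), ("pat", -11.0)]
visual = {"bat": -20.0, "pat": -5.0}.get
best, _ = rescore_nbest(nbest, visual)[0]
print(best)  # prints "pat"
```

Because the visual scorer only sees whole hypotheses after audio decoding, the two recognizers can run as separate processes, which is consistent with the distributed architecture the abstract mentions.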
Bibliographic reference. Alissali, Mamoun / Deleglise, Paul / Rogozan, Alexandrina (1996): "Asynchronous integration of visual information in an automatic speech recognition system", In ICSLP-1996, 34-37.