First Workshop on Speech, Language and Audio in Multimedia (SLAM 2013)
In this paper, we create an open-domain conversational system by combining the power of internet browser interfaces with multi-modal inputs and data mined from web search and browser logs. The work focuses on two novel components: (1) dynamic contextual adaptation of speech recognition and understanding models using visual context, and (2) fusion of users' speech and gesture inputs to understand their intents and associated arguments. The system was evaluated in a living room setup with live test subjects on a real-time implementation of the multimodal dialog system. Users interacted with a television browser using gestures and speech. Gestures were captured by Microsoft Kinect skeleton tracking and speech was recorded by a Kinect microphone array. Results show a 16% error rate reduction (ERR) for contextual ASR adaptation to clickable web page content, and 7-10% ERR when using gestures with speech. Analysis of the results suggests a strategy for selection of multimodal intent when users clearly and persistently indicate pointing intent (e.g., eye gaze), giving a 54.7% ERR over lexical features.
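The abstract reports its results as error rate reductions (ERR) without defining the measure. The sketch below assumes ERR is the relative reduction with respect to a baseline error rate, which is the common convention but is not confirmed by the abstract itself. The function name and the numbers used are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def relative_error_rate_reduction(baseline_error: float, new_error: float) -> float:
    """Return the error rate reduction as a fraction of the baseline error rate.

    Assumption: ERR = (baseline - new) / baseline, i.e. a relative reduction.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates for illustration only (not results from the paper).
baseline = 0.25  # error rate of a system without contextual adaptation
adapted = 0.21   # error rate of the same system with adaptation

print(f"ERR = {relative_error_rate_reduction(baseline, adapted):.0%}")
```

Under this reading, a relative figure says nothing about absolute error rates: the same ERR can arise from very different baselines.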
Index Terms: spoken dialog systems, spoken language understanding, multi-modal fusion, conversational search, conversational browsing.
Bibliographic reference. Heck, Larry / Hakkani-Tür, Dilek / Chinthakunta, Madhu / Tur, Gokhan / Iyer, Rukmini / Parthasarathy, Partha / Stifelman, Lisa / Shriberg, Elizabeth / Fidler, Ashley (2013): "Multi-modal conversational search and browse", In SLAM-2013, 96-101.