6th SIGdial Workshop on Discourse and Dialogue

Lisbon, Portugal
September 2-3, 2005

Automatic Induction of Language Model Data for a Spoken Dialogue System

Grace Chung (1), Stephanie Seneff (2), Chao Wang (2)

(1) Corporation for National Research Initiatives, Reston, VA, USA
(2) Spoken Language Systems Group, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA

When building a new spoken dialogue application, large amounts of domain-specific data are required. This paper addresses the issue of generating in-domain training data when little or no real user data are available. The two-stage approach taken begins with a data induction phase whereby linguistic constructs from out-of-domain sentences are harvested and integrated with artificially constructed in-domain phrases. After some syntactic and semantic filtering, a large corpus of synthetically assembled user utterances is induced. The second stage involves sampling the synthetic corpus towards the goal of obtaining data that would be representative of the statistics of application-specific real user interactions. The sampling methods proposed employ an example-based generation framework, a simulated user model and information extracted from development data. Evaluation is conducted on recognition performance in a restaurant information domain. We show that word error rate can be reduced when limited amounts of real user training data are augmented with synthetic data derived by our methods.
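
The two stages described above can be illustrated with a toy sketch. This is not the authors' implementation: the templates, the `<ITEM>` slot marker, the phrase labels, the target distribution, and the function names `induce_corpus` and `sample_corpus` are all hypothetical, and the syntactic and semantic filtering step is omitted. It only shows the general idea of assembling synthetic utterances from out-of-domain carrier sentences and in-domain phrases, then sampling them towards an assumed target distribution.

```python
import random

# Hypothetical out-of-domain carrier sentences with a slot marker (assumption).
OUT_OF_DOMAIN_TEMPLATES = [
    "i am looking for <ITEM>",
    "can you tell me about <ITEM>",
    "show me <ITEM> please",
]

# Hypothetical in-domain phrases for a restaurant domain, grouped by label.
IN_DOMAIN_PHRASES = {
    "cuisine": ["a chinese restaurant", "an italian restaurant"],
    "location": ["a restaurant in cambridge", "a restaurant near the airport"],
}


def induce_corpus(templates, phrases):
    """Stage 1 (sketch): combine every template with every in-domain phrase."""
    corpus = []
    for template in templates:
        for label, options in phrases.items():
            for phrase in options:
                corpus.append((label, template.replace("<ITEM>", phrase)))
    return corpus


def sample_corpus(corpus, target_dist, n, seed=0):
    """Stage 2 (sketch): draw n utterances so that labels follow target_dist.

    target_dist maps each label to a non-negative weight; every label in it
    must occur in the corpus. In practice such weights might be estimated
    from development data or a simulated user; here they are simply given.
    """
    rng = random.Random(seed)
    by_label = {}
    for label, sentence in corpus:
        by_label.setdefault(label, []).append(sentence)
    labels = list(target_dist)
    weights = [target_dist[label] for label in labels]
    chosen = rng.choices(labels, weights=weights, k=n)
    return [rng.choice(by_label[label]) for label in chosen]


if __name__ == "__main__":
    synthetic = induce_corpus(OUT_OF_DOMAIN_TEMPLATES, IN_DOMAIN_PHRASES)
    # Assumed target statistics: cuisine queries three times as likely.
    for utterance in sample_corpus(synthetic, {"cuisine": 0.75, "location": 0.25}, 5):
        print(utterance)
```

The sampled utterances could then be added to whatever real user data are available when training a language model; how much synthetic data to add is left open here.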


Bibliographic reference. Chung, Grace / Seneff, Stephanie / Wang, Chao (2005): "Automatic induction of language model data for a spoken dialogue system", In SIGdial6-2005, 55-64.