Ninth International Conference on Spoken Language Processing

Pittsburgh, PA, USA
September 17-21, 2006

Language Model Adaptation with a Word List and a Raw Corpus

Shinsuke Mori

IBM Japan Ltd., Japan

This paper discusses language model adaptation when only a word list and a raw corpus are available. The usual approach in this setting is to segment the raw corpus automatically using the word list, correct the output sentences by hand, and build a model from the segmented corpus. With this sentence-by-sentence correction, however, the annotator runs into grammatically complicated positions, which lowers productivity. We propose instead to take the word as the correction unit and to concentrate on correcting the positions in which the words in the list appear. This avoids the problem and goes directly to capturing the statistical behavior of the specific words in the application. In the experiments, we prepared segmented corpora with a variety of methods and compared the resulting language models by their speech recognition accuracies. The results showed the advantages of our method.
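
To make the idea of a word-level correction unit concrete, the following is a minimal sketch, not the method as implemented in the paper. It assumes the raw corpus is a list of unsegmented strings and simply locates every surface occurrence of each listed word, so that an annotator could be shown only those positions. The names `Candidate`, `find_candidates`, and `show_in_context` are hypothetical and introduced only for illustration.

```python
from typing import Iterator, List, NamedTuple


class Candidate(NamedTuple):
    """One position an annotator would be asked to check (hypothetical type)."""
    sentence_index: int
    start: int  # character offset where the listed word begins
    end: int    # character offset just past the listed word
    word: str


def find_candidates(word_list: List[str], raw_sentences: List[str]) -> Iterator[Candidate]:
    """Yield every surface occurrence of a listed word in the raw corpus.

    Assumption: plain substring matching. Whether an occurrence really is
    the listed word (and not part of a longer word) is what the annotator
    decides; this sketch only enumerates the positions to check.
    """
    for i, sentence in enumerate(raw_sentences):
        for word in word_list:
            if not word:
                continue  # ignore empty entries in the word list
            start = sentence.find(word)
            while start != -1:
                yield Candidate(i, start, start + len(word), word)
                # Continue from the next character so overlapping hits are kept.
                start = sentence.find(word, start + 1)


def show_in_context(sentence: str, cand: Candidate, width: int = 5) -> str:
    """Render a candidate with a few characters of context on each side."""
    left = sentence[max(0, cand.start - width):cand.start]
    right = sentence[cand.end:cand.end + width]
    return f"{left}[{cand.word}]{right}"
```

In such a setup the annotator would judge each bracketed occurrence on its own rather than correcting the whole sentence, which is the contrast with sentence-by-sentence correction that the abstract draws.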

Bibliographic reference. Mori, Shinsuke (2006): "Language model adaptation with a word list and a raw corpus", in INTERSPEECH-2006, paper 1146-Wed2CaP.3.