EUROSPEECH 2001 Scandinavia
7th European Conference on Speech Communication and Technology

Aalborg, Denmark
September 3-7, 2001


Automatic N-Gram Language Model Creation from Web Resources

Ryuichi Nisimura (1), Kumiko Komatsu (2), Yuka Kuroda (3), Kentaro Nagatomo (1), Akinobu Lee (1), Hiroshi Saruwatari (1), Kiyohiro Shikano (1)

(1) Nara Institute of Science and Technology, Japan
(2) Laboratories of Image Information Science and Technology, Japan
(3) TIS Inc., Japan

This paper describes a method for automatically building N-gram language models from Web texts for large-vocabulary continuous speech recognition. Training such a model requires a huge amount of well-formed text, but collecting and organizing a text corpus by hand for every task is very labor-intensive. The language model also needs to be updated frequently to cover current topics. To address these problems, we propose an automatic language model creation method that collects Web texts via keyword-based Web search engines. A task-dependent language model can be built by selecting keywords suited to the task. A text filtering algorithm based on character perplexity is developed to extract proper Japanese text from the collected Web texts. A language model for a medical consulting task created by the proposed method achieves a word recognition rate 11.4% higher than that of a conventional newspaper language model.
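
As an illustration of the character-perplexity filtering idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a character bigram model with add-one smoothing trained on a small set of known well-formed sentences; the abstract does not specify the model order, the smoothing, or the threshold. The names CharBigramModel, filter_lines, BOS, and the threshold value are hypothetical.

```python
import math
from collections import Counter

BOS = "\x02"  # hypothetical start-of-line marker (assumption, not from the paper)


class CharBigramModel:
    """Character bigram model with add-one smoothing (an assumed choice)."""

    def __init__(self, reference_lines):
        self.bigrams = Counter()   # counts of (previous char, char)
        self.contexts = Counter()  # counts of previous char
        vocab = set()
        for line in reference_lines:
            prev = BOS
            for ch in line:
                self.bigrams[(prev, ch)] += 1
                self.contexts[prev] += 1
                vocab.add(ch)
                prev = ch
        # One extra slot reserves probability mass for unseen characters.
        self.vocab_size = len(vocab) + 1

    def perplexity(self, line):
        """Per-character perplexity of a line; empty lines are rejected."""
        if not line:
            return float("inf")
        log_prob = 0.0
        prev = BOS
        for ch in line:
            num = self.bigrams[(prev, ch)] + 1
            den = self.contexts[prev] + self.vocab_size
            log_prob += math.log(num / den)
            prev = ch
        return math.exp(-log_prob / len(line))


def filter_lines(model, candidates, threshold):
    """Keep candidate lines whose character perplexity is at or below threshold."""
    return [line for line in candidates if model.perplexity(line) <= threshold]


if __name__ == "__main__":
    reference = ["頭が痛いです。", "熱があります。", "咳が出ます。"]
    model = CharBigramModel(reference)
    web_lines = ["頭が痛いです。", "|||>>> click here <<<|||"]
    for line in web_lines:
        print(line, model.perplexity(line))
    # The threshold of 10.0 is an arbitrary value chosen for this toy data.
    print(filter_lines(model, web_lines, threshold=10.0))
```

In this sketch, lines that resemble the reference text receive a low perplexity and are kept, while menu fragments and similar Web debris score high and are dropped. In practice the reference corpus would be far larger, and the threshold would have to be tuned on held-out data.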


Bibliographic reference. Nisimura, Ryuichi / Komatsu, Kumiko / Kuroda, Yuka / Nagatomo, Kentaro / Lee, Akinobu / Saruwatari, Hiroshi / Shikano, Kiyohiro (2001): "Automatic n-gram language model creation from web resources", in EUROSPEECH-2001, 2127-2130.