Ninth International Conference on Spoken Language Processing

Pittsburgh, PA, USA
September 17-21, 2006

Unsupervised Language Model Adaptation Based on Automatic Text Collection from WWW

Motoyuki Suzuki, Yasutomo Kajiura, Akinori Ito, Shozo Makino

Tohoku University, Japan

An n-gram language model trained on a general corpus gives high performance. However, it is well known that a topic-specialized n-gram outperforms a general one. Several adaptation methods have been proposed for building a topic-specialized n-gram. These methods either use a given corpus that corresponds to the target topic or collect topic-related documents from a database. If neither such a corpus nor topic-related documents in the database are available, the general n-gram cannot be adapted. This paper proposes a new unsupervised adaptation method that collects topic-related documents from the World Wide Web. Several query terms are extracted from the recognized text, and the web pages returned by a search engine for those terms are used for adaptation. Experimental results showed that the proposed method gave word accuracy 7.2 points higher than that of the general n-gram.
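The pipeline described above (recognize, extract query terms, collect web text, adapt) can be illustrated with a minimal sketch. The abstract does not state how query terms are scored or how the collected text is combined with the general model, so everything below is an assumption made for illustration: the function names are hypothetical, the term score is a simple frequency-ratio heuristic, the models are unigrams rather than full n-grams, and adaptation is plain linear interpolation. The web search step is left out; the topic text is assumed to have been collected already.

```python
from collections import Counter
import math


def select_query_terms(recognized_words, general_counts, general_total, k=3):
    """Pick the k words that are most over-represented in the recognized
    text relative to the general corpus (assumed heuristic, not the
    paper's criterion)."""
    local = Counter(recognized_words)
    n = len(recognized_words)

    def score(word):
        p_local = local[word] / n
        # Add-one smoothing so unseen words get a small general probability.
        p_general = (general_counts.get(word, 0) + 1) / (
            general_total + len(general_counts) + 1
        )
        return local[word] * math.log(p_local / p_general)

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(local, key=lambda w: (-score(w), w))[:k]


def unigram_model(words):
    """Maximum-likelihood unigram probabilities for a list of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def interpolate(general, topic, lam=0.5):
    """Linearly interpolate two unigram models; lam weights the topic model."""
    vocab = set(general) | set(topic)
    return {
        w: lam * topic.get(w, 0.0) + (1 - lam) * general.get(w, 0.0)
        for w in vocab
    }


if __name__ == "__main__":
    general_words = "the cat sat on the mat the dog".split()
    recognized = "the stock market fell the stock price".split()

    terms = select_query_terms(
        recognized, Counter(general_words), len(general_words), k=3
    )
    print("query terms:", terms)

    # Stand-in for text that a search engine would return for those terms.
    collected = "the stock market and the stock price".split()
    adapted = interpolate(unigram_model(general_words), unigram_model(collected))
    print("P(stock) after adaptation:", adapted["stock"])
```

In an unsupervised setting the adapted model would then be used to recognize the same speech again; the interpolation weight `lam` is a free parameter in this sketch and would need tuning.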

Bibliographic reference. Suzuki, Motoyuki / Kajiura, Yasutomo / Ito, Akinori / Makino, Shozo (2006): "Unsupervised language model adaptation based on automatic text collection from WWW", in INTERSPEECH-2006, paper 1806Thu1A2O.1.