4th International Conference on Spoken Language Processing

Philadelphia, PA, USA
October 3-6, 1996

Combination of Word-based and Category-based Language Models

T. R. Niesler, P. C. Woodland

Cambridge University Engineering Department, Cambridge, UK

A language model combining word-based and category-based n-grams within a backoff framework is presented. Word n-grams conveniently capture sequential relations between particular words, while the category model, which is based on part-of-speech classifications and allows ambiguous category membership, is able to generalise to unseen word sequences and is therefore appropriate in backoff situations. Experiments on the LOB, Switchboard and WSJ0 corpora demonstrate that the technique greatly improves language model perplexities for sparse training sets, and offers significantly improved complexity versus performance tradeoffs when compared with standard trigram models.
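
The backoff idea summarised above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes word bigrams rather than general n-grams, absolute discounting for the seen word pairs, and a hand-written toy category model. All probabilities, the vocabulary, the discount value and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Toy, made-up parameters (assumptions for illustration, not from the paper).
CATEGORIES = ["DET", "NOUN", "VERB"]
VOCAB = ["the", "dog", "run", "runs"]

# P(word | category); "run" is both NOUN and VERB (ambiguous membership).
P_WORD_GIVEN_CAT = {
    "DET": {"the": 1.0},
    "NOUN": {"dog": 0.7, "run": 0.3},
    "VERB": {"run": 0.4, "runs": 0.6},
}
# P(category | previous category)
P_CAT_GIVEN_CAT = {
    "DET": {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.2, "NOUN": 0.2, "VERB": 0.6},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
P_CAT = {"DET": 0.3, "NOUN": 0.4, "VERB": 0.3}  # category prior
DISCOUNT = 0.5  # absolute discount taken from each seen bigram count


def category_posterior(word):
    """P(category | word), obtained from P(word | category) and the prior."""
    joint = {c: P_WORD_GIVEN_CAT[c].get(word, 0.0) * P_CAT[c] for c in CATEGORIES}
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}


def category_prob(word, prev):
    """Category-based bigram probability, summing over ambiguous categories."""
    post = category_posterior(prev)
    return sum(
        post[cp] * P_CAT_GIVEN_CAT[cp][c] * P_WORD_GIVEN_CAT[c].get(word, 0.0)
        for cp in CATEGORIES
        for c in CATEGORIES
    )


def train_bigrams(sentences):
    """Count word bigrams in a list of tokenised sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for prev, word in zip(sentence, sentence[1:]):
            counts[prev][word] += 1
    return counts


def backoff_prob(word, prev, counts):
    """Use the discounted word bigram if seen, else back off to categories."""
    seen = counts.get(prev)
    if not seen:
        # No word-level evidence for this history: rely on the category model.
        return category_prob(word, prev)
    total = sum(seen.values())
    if word in seen:
        return (seen[word] - DISCOUNT) / total
    # Probability mass freed by discounting, shared among unseen words in
    # proportion to their category-model probabilities.
    reserved = DISCOUNT * len(seen) / total
    unseen_mass = sum(category_prob(w, prev) for w in VOCAB if w not in seen)
    if unseen_mass == 0.0:
        return 0.0
    return reserved * category_prob(word, prev) / unseen_mass


counts = train_bigrams([["the", "dog", "runs"], ["the", "dog", "run"]])
print(backoff_prob("dog", "the", counts))  # seen bigram, word-based estimate
print(backoff_prob("run", "the", counts))  # unseen bigram, category backoff
```

In this sketch the pair ("the", "run") never occurs in the toy training data, yet it receives a non-zero probability because the category model knows that a determiner is often followed by a noun and that "run" can be a noun. This is the kind of generalisation to unseen word sequences that the abstract attributes to the category model.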

Bibliographic reference.  Niesler, T. R. / Woodland, P. C. (1996): "Combination of word-based and category-based language models", In ICSLP-1996, 220-223.