Third International Conference on Spoken Language Processing (ICSLP 94)

Yokohama, Japan
September 18-22, 1994

A Class Bigram Model for Very Large Corpus

Michele Jardino

LIMSI-CNRS, B.P.133, Orsay, France

As pointed out by Jelinek, the n-gram word model is very efficient but not well adapted to highly inflected languages such as French. We have therefore developed a class-based bigram model determined entirely automatically from written corpora. The classes are not predefined and the words are not tagged; the only assumption is the number of classes. The result is a robust model which ensures a more complete coverage of the succession probabilities (on the studied training text of 2 million French words, the class bigram model achieves a coverage rate of 50%, compared with 0.1% for the word bigram model).
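The class bigram idea can be illustrated with the usual factorisation P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i), where c_i is the class of w_i. Below is a minimal sketch on a toy corpus with a hand-fixed word-to-class mapping; the paper instead derives the classes automatically, fixing only their number, so the mapping and names here are purely hypothetical.

```python
from collections import Counter

# Toy corpus and a hand-fixed word -> class mapping (hypothetical;
# the paper's classes are found automatically, only their number is given).
corpus = "the cat sat the dog sat the cat ran".split()
word2class = {"the": "DET", "cat": "N", "dog": "N", "sat": "V", "ran": "V"}

# Count class bigrams, class occurrences, and word occurrences.
class_seq = [word2class[w] for w in corpus]
class_bigrams = Counter(zip(class_seq, class_seq[1:]))
class_counts = Counter(class_seq)
word_counts = Counter(corpus)

def p_class_bigram(prev_word, word):
    """P(word | prev_word) ~= P(class(word) | class(prev_word)) * P(word | class(word))."""
    c_prev, c = word2class[prev_word], word2class[word]
    # Class transition probability, normalised over bigrams starting with c_prev.
    p_trans = class_bigrams[(c_prev, c)] / sum(
        n for (a, _), n in class_bigrams.items() if a == c_prev)
    # Emission probability of the word within its class.
    p_emit = word_counts[word] / class_counts[c]
    return p_trans * p_emit

# The word bigram "cat ran" never occurs in the corpus, yet the class
# bigram model still assigns it non-zero probability -- this is the
# coverage advantage the abstract describes.
print(p_class_bigram("cat", "ran"))
```

Because probability mass is pooled over classes rather than individual word pairs, many word successions unseen in training still receive a non-zero estimate, which is why the class model's coverage (50%) far exceeds the word bigram's (0.1%) on the same corpus.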

Here we present results on new classifications of the text described above, obtained by allowing more than one possible class for each word, as well as optimised combinations of the word and class bigram models.
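A standard way to combine a word bigram model with a class bigram model is linear interpolation. The sketch below assumes a single interpolation weight lambda tuned on held-out data; the paper's "optimised combinations" are not specified in this abstract, so this is only one plausible scheme.

```python
def interpolate(p_word_bigram, p_class_bigram, lam=0.7):
    """Linear interpolation of word- and class-bigram probabilities.

    lam is a hypothetical weight; in practice it would be optimised
    on held-out data (e.g. to minimise perplexity).
    """
    return lam * p_word_bigram + (1.0 - lam) * p_class_bigram

# When the word bigram is unseen (probability 0), the class bigram
# still contributes mass, so the combined estimate stays non-zero.
print(interpolate(0.0, 1/3))
```

The combined model keeps the precision of word bigrams where they are well estimated, while the class component supplies probability mass for the many word pairs absent from the training text.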

Full Paper

Bibliographic reference.  Jardino, Michele (1994): "A class bigram model for very large corpus", In ICSLP-1994, 867-870.