We model the language of the courts using several statistical techniques and compare the resulting models. For word-phrase bigram and word-phrase trigram models, one issue is the choice of tokens that form a word phrase. We compare the model obtained by choosing the pair with maximal mutual information against the model obtained by assuming a binomial distribution of words and using a likelihood ratio test to choose pairs. The latter model gives a greater reduction in perplexity. We also compare the two selection methods on a corpus that is not based on spoken material, and find similar results.
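The two pair-selection criteria can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption: the mutual information criterion is taken to be pointwise mutual information over adjacent tokens, the likelihood ratio test is taken to be the common binomial log-likelihood ratio statistic for collocations, and the function names (`pmi`, `llr`, `best_pair`) and the toy token list are hypothetical.

```python
import math
from collections import Counter


def pmi(c1, c2, c12, n):
    """Pointwise mutual information (in bits) of an adjacent pair.

    c1, c2: counts of the first and second word; c12: count of the pair;
    n: corpus size in tokens (assumed to approximate the number of bigrams).
    """
    return math.log2(c12 * n / (c1 * c2))


def _log_l(k, n, p, eps=1e-12):
    """Binomial log-likelihood of k successes in n trials.

    The binomial coefficient is omitted because it cancels in the ratio.
    p is clamped away from 0 and 1 to avoid log(0).
    """
    p = min(max(p, eps), 1.0 - eps)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def llr(c1, c2, c12, n):
    """-2 log(lambda) for the binomial likelihood ratio test (assumed form).

    Null hypothesis: the second word occurs with the same probability
    whether or not the first word precedes it.
    """
    p = c2 / n                                   # pooled probability
    p1 = c12 / c1                                # P(w2 | w1)
    p2 = (c2 - c12) / (n - c1) if n > c1 else 0.0  # P(w2 | not w1)
    return 2.0 * (
        _log_l(c12, c1, p1) + _log_l(c2 - c12, n - c1, p2)
        - _log_l(c12, c1, p) - _log_l(c2 - c12, n - c1, p)
    )


def best_pair(tokens, score):
    """Return the adjacent pair with the highest score under `score`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return max(
        bigrams,
        key=lambda pair: score(unigrams[pair[0]], unigrams[pair[1]],
                               bigrams[pair], n),
    )


# Hypothetical toy corpus, not taken from the paper's data.
tokens = "your honour i object your honour the witness your honour".split()
pmi_choice = best_pair(tokens, pmi)
llr_choice = best_pair(tokens, llr)
```

On this toy input the two criteria disagree: pointwise mutual information prefers a pair whose words each occur only once, while the likelihood ratio statistic prefers the repeated pair `("your", "honour")`. This sensitivity of mutual information to rare events is a well-known property; whether it explains the perplexity difference reported above is not stated in the source.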
Bibliographic reference. Kenne, P. E. / O'Kane, M. / Pearcy, H. G. (1995): "Language modeling of spontaneous speech in a court context", In EUROSPEECH-1995, 1801-1804.