5th International Conference on Spoken Language Processing

Sydney, Australia
November 30 - December 4, 1998

A Thesaurus-Based Statistical Language Model for Broadcast News Transcription

Akio Ando, Akio Kobayashi, Toru Imai

NHK Sci. & Tech. Res. Labs., Japan

This paper describes a thesaurus-based class n-gram model for broadcast news transcription. The most important issue for class n-gram models is how to construct the word classification. We construct a word-class mapping based on a thesaurus so as to maximize the average mutual information on a training corpus. To examine the effectiveness of the new method, we compare it with two of our previous methods, which use the same thesaurus but determine the word-class mappings in different ways. The new method achieved substantially lower perplexity on 83 news transcription sentences broadcast on June 4, 1996.
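
The selection criterion can be illustrated with a minimal sketch. The abstract does not give the exact objective or search procedure, so this sketch assumes the standard class-bigram form of average mutual information, computed between adjacent classes in a token sequence. The toy corpus, the class labels, and the function name `average_mutual_information` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def average_mutual_information(tokens, word_to_class):
    """Average mutual information (in bits) between adjacent classes.

    Assumed definition: sum over class pairs (c1, c2) of
    p(c1, c2) * log2(p(c1, c2) / (p(c1) * p(c2))),
    with probabilities estimated from adjacent-pair counts.
    """
    classes = [word_to_class[w] for w in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())

    # Marginal counts of the left and right positions of each bigram.
    left, right = Counter(), Counter()
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n

    ami = 0.0
    for (c1, c2), n in bigrams.items():
        p12 = n / total
        ami += p12 * math.log2(p12 / ((left[c1] / total) * (right[c2] / total)))
    return ami


# Hypothetical toy corpus and three candidate word-class mappings.
tokens = "the minister visited tokyo the president visited osaka".split()
vocab = set(tokens)

one_class = {w: "ALL" for w in vocab}    # everything merged into one class
word_level = {w: w for w in vocab}       # every word is its own class
thesaurus_like = {                       # hypothetical thesaurus categories
    "the": "DET",
    "minister": "PERSON",
    "president": "PERSON",
    "visited": "VERB",
    "tokyo": "PLACE",
    "osaka": "PLACE",
}

ami_one = average_mutual_information(tokens, one_class)
ami_thesaurus = average_mutual_information(tokens, thesaurus_like)
ami_word = average_mutual_information(tokens, word_level)
```

Under this definition, merging words into classes can never increase the average mutual information, so the single-class mapping scores zero and the word-level mapping scores highest. A practical procedure therefore compares candidate mappings with the same number of classes, for example different cuts through the thesaurus hierarchy, and keeps the one with the highest score.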

Bibliographic reference. Ando, Akio / Kobayashi, Akio / Imai, Toru (1998): "A thesaurus-based statistical language model for broadcast news transcription", in ICSLP-1998, paper 0016.