Ninth International Conference on Spoken Language Processing

Pittsburgh, PA, USA
September 17-21, 2006

Compact N-Gram Models by Incremental Growing and Clustering of Histories

Sami Virpioja, Mikko Kurimo

Helsinki University of Technology, Finland

This work concerns building n-gram language models suitable for large-vocabulary speech recognition on devices with a restricted amount of memory and storage space. Our target language is Finnish, and to avoid the problems caused by its rich morphology, we use sub-word units, morphs, as the model units instead of words. In the proposed model, we apply incremental growing and clustering of the morph n-gram histories. By selecting the histories using maximum a posteriori estimation and clustering them with the information radius measure, we obtain a clustered varigram model. We show that for restricted model sizes this model gives better cross-entropy and speech recognition results than conventional n-gram models, and also better recognition results than non-clustered varigram models built with another recently introduced method.
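As an illustration of the clustering criterion named above, the following is a minimal sketch of the information radius between the next-morph distributions of two histories. It assumes the common equal-weight definition, IRad(p, q) = D(p || m) + D(q || m) with m = (p + q) / 2; the abstract does not state the exact weighting used in the paper. The function names and the toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    p and q map units to probabilities; q must be positive wherever p is.
    """
    return sum(pi * math.log2(pi / q[u]) for u, pi in p.items() if pi > 0)


def information_radius(p, q):
    """Equal-weight information radius between two distributions.

    Assumption: IRad(p, q) = D(p || m) + D(q || m), where m is the
    average of p and q. The paper may weight the terms differently.
    """
    units = set(p) | set(q)
    # Average distribution; it is positive wherever p or q is positive,
    # so both divergences below are finite.
    m = {u: 0.5 * (p.get(u, 0.0) + q.get(u, 0.0)) for u in units}
    return kl_divergence(p, m) + kl_divergence(q, m)


# Hypothetical next-morph distributions following two different histories.
history_a = {"ssa": 0.6, "lla": 0.3, "n": 0.1}
history_b = {"ssa": 0.5, "lla": 0.4, "n": 0.1}

distance = information_radius(history_a, history_b)
```

Under this definition the value is zero for identical distributions and grows as the distributions diverge, so two histories with a small information radius are natural candidates for sharing one cluster.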

Bibliographic reference.  Virpioja, Sami / Kurimo, Mikko (2006): "Compact n-gram models by incremental growing and clustering of histories", In INTERSPEECH-2006, paper 1231-Tue2A2O.5.