Fourth European Conference on Speech Communication and Technology

Madrid, Spain
September 18-21, 1995

Algorithms for Bigram and Trigram Word Clustering

Sven Martin, Jörg Liermann, Hermann Ney

Lehrstuhl für Informatik VI, RWTH Aachen, University of Technology, Aachen, Germany

This paper presents and analyzes improved algorithms for clustering words into bigram and trigram equivalence classes, and reports the corresponding results: 1) We give a detailed time complexity analysis of bigram clustering algorithms. 2) We present an improved implementation of bigram clustering so that large corpora (38 million words and more) can be clustered within a small number of days or even hours. 3) We extend the clustering approach from bigrams to trigrams. 4) We present experimental results on a 38 million word training corpus.
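For readers unfamiliar with bigram word clustering, the following is a minimal sketch of one common formulation, not the implementation described in the paper. It assumes an exchange-style procedure that moves one word at a time to the class that most increases the class-bigram log-likelihood of the training counts. The function names (class_bigram_objective, exchange_cluster), the round-robin initialization, and the toy counts are all hypothetical. The objective is recomputed from scratch for every tentative move, which is far too slow for a large corpus; the sketch only illustrates the criterion being optimized.

```python
import math
from collections import Counter


def xlogx(n):
    """n * log(n), with the usual convention 0 * log(0) = 0."""
    return n * math.log(n) if n > 0 else 0.0


def class_bigram_objective(bigrams, word2class):
    """Class-dependent part of the class-bigram log-likelihood.

    bigrams maps (predecessor, successor) word pairs to counts.
    The value is sum N(g,h) log N(g,h) minus the x*log(x) sums over
    predecessor-class and successor-class counts. The word unigram
    term is omitted because it does not depend on the clustering.
    """
    pair, pred, succ = Counter(), Counter(), Counter()
    for (v, w), n in bigrams.items():
        g, h = word2class[v], word2class[w]
        pair[(g, h)] += n
        pred[g] += n
        succ[h] += n
    return (sum(xlogx(n) for n in pair.values())
            - sum(xlogx(n) for n in pred.values())
            - sum(xlogx(n) for n in succ.values()))


def exchange_cluster(bigrams, num_classes, max_iters=10):
    """Naive exchange clustering: try every class for every word."""
    vocab = sorted({word for pair in bigrams for word in pair})
    # Hypothetical initialization: round-robin over the sorted vocabulary.
    word2class = {w: i % num_classes for i, w in enumerate(vocab)}
    best = class_bigram_objective(bigrams, word2class)
    for _ in range(max_iters):
        moved = False
        for w in vocab:
            home = word2class[w]
            for c in range(num_classes):
                if c == home:
                    continue
                word2class[w] = c  # tentative move
                score = class_bigram_objective(bigrams, word2class)
                if score > best + 1e-12:  # accept strict improvements only
                    best, home, moved = score, c, True
            word2class[w] = home  # keep the best class found for w
        if not moved:  # no word changed class: local optimum reached
            break
    return word2class, best


# Toy bigram counts, invented purely for illustration.
toy_bigrams = {
    ("the", "cat"): 3, ("the", "dog"): 3,
    ("a", "cat"): 2, ("a", "dog"): 2,
    ("cat", "the"): 1, ("dog", "a"): 1,
}
classes, score = exchange_cluster(toy_bigrams, num_classes=2)
```

Because each accepted move strictly increases the objective, the procedure stops at a local optimum that depends on the initialization and on the order in which words are visited. A practical implementation would update only the counts affected by a move rather than recomputing the whole objective.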


Bibliographic reference. Martin, Sven / Liermann, Jörg / Ney, Hermann (1995): "Algorithms for bigram and trigram word clustering", in EUROSPEECH-1995, 1253-1256.