5th International Conference on Spoken Language Processing

Sydney, Australia
November 30 - December 4, 1998

An Asymmetric Stochastic Language Model Based on Multi-Tagged Words

Julio Pastor, José Colas, Ruben San-Segundo, José Manuel Pardo

Grupo de Tecnologia del Habla - Departamento de Ingenieria Electronica - E. T. S. I. Telecomunicacion - Universidad Politecnica de Madrid, Spain

Tag definition in stochastic language models (n-grams and n-pos) is based on grouping together words with similar right and left context behavior. A modification of the n-gram model using multi-tagged words and unsupervised clustering has already been introduced for French, using a corpus of millions of untagged words. We present a variation of the bi-pos language model in which two tag sets are defined and each word is assigned tags from both sets (multi-tagged model) using grammatical information. Each tag set is based on a different context behavior. We use linguistic expert knowledge and a simple automatic clustering procedure to obtain groups of words with similar left context behavior (first set of tags) and with similar right context behavior (second set of tags). We propose a grammar-based model that is useful when no large text corpus is available. A performance increase has been observed when multi-tagged words are used, because the model adapts better to the language.
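As an illustration only, the idea of two tag sets per word can be sketched in Python. The abstract does not give the model's equations, so the factorization used here, P(w_i | w_{i-1}) ≈ P(L(w_i) | R(w_{i-1})) · P(w_i | L(w_i)), is an assumption, as are the function names, the tag names, and the toy English corpus. Here L(w) is the tag describing a word's left context behavior and R(w) the tag describing its right context behavior; no smoothing is applied.

```python
from collections import Counter

def train(sentences, left_tag, right_tag):
    """Collect counts for a hypothetical asymmetric bi-pos model."""
    trans = Counter()      # (right tag of previous word, left tag of current word)
    prev_ctx = Counter()   # right tag of previous word
    emit = Counter()       # (left tag, word)
    tag_total = Counter()  # left tag
    for sent in sentences:
        for w in sent:
            emit[(left_tag[w], w)] += 1
            tag_total[left_tag[w]] += 1
        for prev, cur in zip(sent, sent[1:]):
            trans[(right_tag[prev], left_tag[cur])] += 1
            prev_ctx[right_tag[prev]] += 1
    return trans, prev_ctx, emit, tag_total

def bigram_prob(prev, cur, left_tag, right_tag, counts):
    """Assumed factorization: P(L(cur) | R(prev)) * P(cur | L(cur))."""
    trans, prev_ctx, emit, tag_total = counts
    r, l = right_tag[prev], left_tag[cur]
    if prev_ctx[r] == 0 or tag_total[l] == 0:
        return 0.0  # unseen context; a real model would smooth here
    return (trans[(r, l)] / prev_ctx[r]) * (emit[(l, cur)] / tag_total[l])

# Toy data (hypothetical): every word carries one tag from each set.
sentences = [["the", "dog", "runs"], ["a", "cat", "sleeps"], ["the", "cat", "runs"]]
left_tag = {"the": "L_DET", "a": "L_DET", "dog": "L_NOUN", "cat": "L_NOUN",
            "runs": "L_VERB", "sleeps": "L_VERB"}
right_tag = {"the": "R_DET", "a": "R_DET", "dog": "R_NOUN", "cat": "R_NOUN",
             "runs": "R_VERB", "sleeps": "R_VERB"}

counts = train(sentences, left_tag, right_tag)
p_the_cat = bigram_prob("the", "cat", left_tag, right_tag, counts)
p_a_dog = bigram_prob("a", "dog", left_tag, right_tag, counts)
```

In this toy corpus the pair "a dog" never occurs, yet `p_a_dog` is non-zero because the probability is estimated through the tags rather than the word pair. This is the kind of generalization that makes tag-based models attractive when little text is available.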


Bibliographic reference. Pastor, Julio / Colas, José / San-Segundo, Ruben / Pardo, José Manuel (1998): "An asymmetric stochastic language model based on multi-tagged words", In ICSLP-1998, paper 1108.