A system that automatically tags arbitrary text with the part-of-speech of each word is described. The system is based on a probabilistic model in which the words of a given sequence are assumed to be the output symbols of a Hidden Markov Model whose states are pairs of parts-of-speech. Using a 17-tag set, the rate of correctly tagged words ranged from 96.2% to 97.2% on various texts. The system proved quite effective even with a small set of initial statistics. For words that never occurred in the training data, we employed a statistical technique based on word-ending frequencies. This technique resulted in a 22% decrease in tagging error rate with a 260,000-word reference vocabulary and a 49% decrease with a 20,000-word vocabulary.
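The abstract does not spell out how the word-ending statistics are computed, so the following is only a minimal sketch of one plausible reading: estimate a tag distribution for an unseen word from the relative tag frequencies of its longest ending observed in training. The function names, the maximum ending length of 3, the tag labels, and the toy training pairs are all assumptions made for illustration, not details from the paper.

```python
from collections import Counter, defaultdict


def build_ending_stats(tagged_words, max_len=3):
    """Count how often each word ending (1..max_len chars) co-occurs with each tag."""
    stats = defaultdict(Counter)
    for word, tag in tagged_words:
        w = word.lower()
        for n in range(1, min(max_len, len(w)) + 1):
            stats[w[-n:]][tag] += 1
    return stats


def guess_tag_distribution(word, stats, max_len=3):
    """Return relative tag frequencies for the longest known ending of `word`.

    Returns an empty dict when no ending of the word was seen in training.
    """
    w = word.lower()
    for n in range(min(max_len, len(w)), 0, -1):
        counts = stats.get(w[-n:])  # .get avoids creating empty entries
        if counts:
            total = sum(counts.values())
            return {tag: c / total for tag, c in counts.items()}
    return {}


# Toy, made-up training pairs (hypothetical tags, not the paper's 17-tag set).
training = [
    ("running", "VERB"),
    ("walking", "VERB"),
    ("building", "NOUN"),
    ("quickly", "ADV"),
    ("slowly", "ADV"),
]
stats = build_ending_stats(training)

# "jumping" is absent from training; its ending "ing" was seen with VERB and NOUN.
dist = guess_tag_distribution("jumping", stats)
```

In a tagger of the kind described, a distribution like `dist` could stand in for the missing word-emission statistics of an out-of-vocabulary word; how the paper actually combines it with the Hidden Markov Model is not stated in the abstract.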
Bibliographic reference. Maltese, Giulio / Mancini, Federico (1991): "A technique to automatically assign parts-of-speech to words taking into account word-ending information through a probabilistic model", In EUROSPEECH-1991, 753-756.