Ninth International Conference on Spoken Language Processing

Pittsburgh, PA, USA
September 17-21, 2006

Maximum Entropy Modeling for Diacritization of Arabic Text

Ruhi Sarikaya (1), Ossama Emam (2), Imed Zitouni (1), Yuqing Gao (1)

(1) IBM T.J. Watson Research Center, USA; (2) IBM Egypt, Egypt

We propose a novel modeling framework for automatic diacritization of Arabic text. The framework is based on Markov modeling, where each grapheme is modeled as a state emitting a diacritic (or none) from the diacritic space. This space is defined by 13 diacritics and a null diacritic, and covers all the diacritics used in any Arabic text. The state emission probabilities are estimated using maximum entropy (MaxEnt) models. The diacritization process is formulated as a search problem in which the most likely diacritization realization is assigned to a given sentence. We also propose a diacritization parse tree (DPT) for Arabic that allows joint representation of diacritics, graphemes, words, word contexts, morphologically analyzed units, syntactic and semantic parse trees, part-of-speech tags and possibly other information sources. The features used to train the MaxEnt models are obtained from the DPT. In our evaluation, the proposed framework obtained a 7.8% diacritization error rate (DER) and a 17.3% word diacritization error rate (WDER) on dialectal Arabic data.
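
The search formulation in the abstract can be illustrated with a small Viterbi-style decoder. This is only a sketch under stated assumptions, not the authors' implementation: the label set is cut down to three placeholder labels (the paper uses 13 diacritics plus a null diacritic), the scorer `toy_log_prob` is a hypothetical stand-in for a trained MaxEnt model, and conditioning each decision on only the previous label is an assumption made for brevity. The abstract does not specify the feature set or the search algorithm at this level of detail.

```python
import math
from typing import Callable, List, Sequence

# Placeholder label set (assumption): "" is the null diacritic; "a" and "i"
# stand in for real diacritics. The paper's space has 13 diacritics plus null.
LABELS = ["", "a", "i"]

# Hypothetical scorer signature: (graphemes, position, previous label, label)
# -> log score of emitting `label` at that position.
Scorer = Callable[[Sequence[str], int, str, str], float]


def decode(graphemes: Sequence[str], log_prob: Scorer,
           labels: Sequence[str] = LABELS) -> List[str]:
    """Return the highest-scoring label sequence using Viterbi search."""
    if not graphemes:
        return []
    # best[l] = (score of the best path ending in label l, that path)
    best = {l: (log_prob(graphemes, 0, "<s>", l), [l]) for l in labels}
    for i in range(1, len(graphemes)):
        new_best = {}
        for l in labels:
            # Pick the best predecessor label for emitting l at position i.
            score, path = max(
                ((best[p][0] + log_prob(graphemes, i, p, l), best[p][1])
                 for p in labels),
                key=lambda t: t[0],
            )
            new_best[l] = (score, path + [l])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]


def toy_log_prob(graphemes, i, prev, label):
    """Toy, unnormalized stand-in for a MaxEnt model (not trained on data)."""
    # Invented rule: the placeholder grapheme "b" prefers "a"; all other
    # graphemes prefer the null diacritic.
    preferred = "a" if graphemes[i] == "b" else ""
    p = 0.8 if label == preferred else 0.1
    # Invented rule: discourage repeating the same non-null diacritic.
    if label and label == prev:
        p *= 0.1
    return math.log(p)


result = decode(["b", "t", "b"], toy_log_prob)
```

In a real system the scorer would be a MaxEnt model whose features are drawn from the DPT; the decoder itself only needs a log score for each candidate diacritic at each grapheme position.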


Bibliographic reference. Sarikaya, Ruhi / Emam, Ossama / Zitouni, Imed / Gao, Yuqing (2006): "Maximum entropy modeling for diacritization of Arabic text", in INTERSPEECH-2006, paper 1418-Mon1BuP.11.