Ninth International Conference on Spoken Language Processing

Pittsburgh, PA, USA
September 17-21, 2006

Low-Resource Autodiacritization of Abjads for Speech Keyword Search

Patrick Schone

U.S. Department of Defense, USA

Keyword search in speech requires retrieval systems to know the pronunciation of keywords. Many languages of the world are either largely alphabetic or have pronouncing dictionaries, so deducing pronunciations at run-time is manageable. There are many under-resourced languages, though, with writing systems where only some of the vowels are represented in the orthography (i.e., "abjads"). The absence of vowels makes direct mapping of abjads to pronunciation non-trivial. We describe an automatic system for inferring pronunciations in abjadic languages which integrates seamlessly into an existing context-sensitive pronunciation generator that serves a language-universal keyword search system. We also identify Web resources and report system performance for each of five abjadic languages: Arabic, Farsi, Hebrew, Pashto, and Urdu. We show that, almost effortlessly, the system can learn new rules which increase pronunciation accuracies by as much as 31.2% relative.


Bibliographic reference.  Schone, Patrick (2006): "Low-resource autodiacritization of abjads for speech keyword search", In INTERSPEECH-2006, paper 1412-Mon3FoP.13.