Ninth International Conference on Spoken Language Processing

Pittsburgh, PA, USA
September 17-21, 2006

How to Handle Gender and Number Agreement in Statistical Language Models?

Caroline Lavecchia, Kamel Smaïli, Jean-Paul Haton

LORIA, France

Agreement in gender and number is a critical problem in statistical language modeling. One of the main difficulties in French speech recognition is the presence of misrecognized words caused by incorrect agreement (in gender and number) between words. Statistical language models do not treat this phenomenon directly. This paper focuses on how to handle this agreement. We introduce an original model called Features-Cache (FC) to estimate the gender and number of the word to predict. The cache is dynamic and of variable length: its size is determined automatically according to syntagm delimiters. The main advantage of this model is that it requires no syntactic parsing: it is used like any other statistical language model. Several models have been tested, and the best one achieves an improvement of approximately 9 points in terms of perplexity. This model has been integrated into a speech recognition system based on the JULIUS engine. Tests have been carried out on 280 sentences provided by AUPELF for the French automatic speech recognition evaluation campaign. The new model outperforms the baseline in terms of word error by 3%.
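The abstract does not give the model's formulas, so the following is only a minimal sketch of the general idea of a variable-length features cache, not the authors' implementation. It assumes that each word carries a (gender, number) tag, that the cache is emptied whenever a syntagm delimiter is read, and that the probability of a feature pair is an add-alpha smoothed relative frequency over the cache. The delimiter set, the `FeaturesCache` class, its method names, and the smoothing scheme are all hypothetical choices made for illustration.

```python
from collections import Counter

# The four (gender, number) classes considered in this sketch.
CLASSES = [("m", "sg"), ("m", "pl"), ("f", "sg"), ("f", "pl")]

# Hypothetical delimiter set: the abstract does not list the real one.
DELIMITERS = {",", ".", ";"}


class FeaturesCache:
    """Toy variable-length cache of (gender, number) features."""

    def __init__(self, delimiters=DELIMITERS, alpha=1.0):
        self.delimiters = set(delimiters)
        self.alpha = alpha  # add-alpha smoothing constant (assumption)
        self.cache = []     # features seen since the last delimiter

    def update(self, word, features=None):
        """Record the features of a word, or reset on a delimiter."""
        if word in self.delimiters:
            # A syntagm boundary ends the current cache window.
            self.cache.clear()
        elif features is not None:
            self.cache.append(features)

    def prob(self, features):
        """Smoothed estimate of P(gender, number) from the cache."""
        counts = Counter(self.cache)
        total = len(self.cache)
        return (counts[features] + self.alpha) / (
            total + self.alpha * len(CLASSES)
        )


fc = FeaturesCache()
fc.update("petites", ("f", "pl"))
fc.update("filles", ("f", "pl"))
p_agree = fc.prob(("f", "pl"))     # favoured by the cache contents
p_disagree = fc.prob(("m", "sg"))  # only the smoothing mass
fc.update(",")                     # delimiter: cache is emptied
p_reset = fc.prob(("f", "pl"))     # back to the uniform estimate
```

In a recognizer, such an estimate would typically be combined with a conventional n-gram probability; how the paper performs that combination is not stated in the abstract.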


Bibliographic reference. Lavecchia, Caroline / Smaïli, Kamel / Haton, Jean-Paul (2006): "How to handle gender and number agreement in statistical language models?", in INTERSPEECH-2006, paper 1362-Wed2CaP.1.