International Workshop on Spoken Language Translation (IWSLT) 2006

Keihanna Science City, Kyoto, Japan
November 27-28, 2006

Using Monolingual Source-Language Data to Improve MT Performance

Nicola Ueffing

Interactive Language Technologies Group, National Research Council Canada, Gatineau, Québec, Canada

Statistical machine translation systems are usually trained on large amounts of bilingual text and of monolingual text in the target language. In this paper, we present a self-training approach that additionally explores the use of monolingual source text, namely the documents to be translated, to improve system performance. An initial version of the translation system is used to translate the source text. Among the generated translations, low-quality target sentences are automatically identified and discarded. The reliable translations, together with their source sentences, are then used as a new bilingual corpus for training an additional phrase translation model. Thus, the translation system can be adapted to the new source data even if no bilingual data in this domain is available. Experimental evaluation was performed on a standard Chinese-to-English translation task. We focus on settings where the domain and/or the style of the test data differs from that of the training material. We show a significant improvement in translation quality through the use of the adaptive phrase translation model: the BLEU score rises by up to 1.1 points, and mWER is reduced by up to 3.1% absolute.
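
The translate-and-filter step of the self-training loop described above can be illustrated with a minimal Python sketch. It is not the system used in the paper: the function name `select_reliable_translations`, the `threshold` parameter, and the toy `toy_translate` and `toy_confidence` stand-ins are all hypothetical, and the abstract does not specify how translation quality is estimated. The sketch only shows how reliable source/translation pairs might be collected into a new bilingual corpus.

```python
from typing import Callable, List, Tuple


def select_reliable_translations(
    source_sentences: List[str],
    translate: Callable[[str], str],
    confidence: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Translate each source sentence and keep only the pairs whose
    confidence score reaches the threshold (hypothetical filter)."""
    reliable = []
    for src in source_sentences:
        hyp = translate(src)  # initial system translates the source text
        if confidence(src, hyp) >= threshold:
            reliable.append((src, hyp))  # keep the reliable pair
    return reliable


# Toy stand-ins (assumptions for illustration only): a lookup-table
# "translation system" and a confidence score that rejects unknown output.
TOY_LEXICON = {"ni hao": "hello", "xie xie": "thank you"}


def toy_translate(src: str) -> str:
    return TOY_LEXICON.get(src, "<unk>")


def toy_confidence(src: str, hyp: str) -> float:
    return 0.0 if hyp == "<unk>" else 1.0


adaptation_corpus = select_reliable_translations(
    ["ni hao", "zai jian", "xie xie"],
    toy_translate,
    toy_confidence,
    threshold=0.5,
)
```

In a full pipeline, the pairs in `adaptation_corpus` would serve as the training data for the additional phrase translation model; that training step is not shown here.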


Bibliographic reference. Ueffing, Nicola (2006): "Using monolingual source-language data to improve MT performance", In IWSLT-2006, 174-181.