International Workshop on Spoken Language Translation (IWSLT) 2010

Paris, France
December 2-3, 2010

Improved Vietnamese-French Parallel Corpus Mining Using English Language

Do Thi Ngoc Diep (1,2), Laurent Besacier (1), Eric Castelli (2)

(1) LIG Laboratory, CNRS/UMR-5217, Grenoble, France
(2) MICA Center, CNRS/UMI-2954, Hanoi, Vietnam

This paper improves on the unsupervised method for extracting parallel sentence pairs from a comparable corpus that we presented in [1]. In that earlier work, a translation system was used to mine a comparable corpus and detect French-Vietnamese parallel sentence pairs. An iterative process was implemented to increase the number of extracted parallel sentence pairs, which improved the overall translation quality.
This paper validates the unsupervised approach on a new under-resourced language pair (Vietnamese-English) and addresses the problem of using triangulation through a third language to improve the parallel data mining process. An extension of the unsupervised method is proposed to make use of triangulation, and two ways of including the additional data from triangulation are implemented. Experiments on Vietnamese-French show that triangulation through English can improve the quality of the extracted data and slightly improve the quality of the translation system, as measured by BLEU.
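
The iterative mining loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `mine_parallel_pairs`, `train_toy_translator`, and `overlap_score`, the word-overlap score, the threshold, and the toy word-for-word "translation system" are hypothetical and are not taken from the paper or from [1]. Triangulation through English is not modeled, since the abstract does not specify how the additional data is combined.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def overlap_score(hypothesis: str, candidate: str) -> float:
    """Toy similarity: Jaccard overlap of word sets (stand-in for a real MT metric)."""
    a, b = set(hypothesis.split()), set(candidate.split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def train_toy_translator(pairs: Iterable[Pair]) -> Callable[[str], str]:
    """Hypothetical 'translation system': a position-aligned word dictionary."""
    lexicon: Dict[str, str] = {}
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        if len(s) == len(t):  # only learn from pairs with equal token counts
            lexicon.update(zip(s, t))
    # Unknown words are copied through unchanged.
    return lambda sentence: " ".join(lexicon.get(w, w) for w in sentence.split())


def mine_parallel_pairs(
    candidates: List[Pair],
    seed_pairs: List[Pair],
    threshold: float = 0.5,   # assumed cut-off, not from the paper
    max_iters: int = 5,
) -> Set[Pair]:
    """Repeat: train, translate candidates, keep pairs scoring above the threshold."""
    extracted: Set[Pair] = set()
    training = list(seed_pairs)
    for _ in range(max_iters):
        translate = train_toy_translator(training)
        new_pairs = {
            (src, tgt)
            for src, tgt in candidates
            if (src, tgt) not in extracted
            and overlap_score(translate(src), tgt) >= threshold
        }
        if not new_pairs:  # nothing new was found, so stop iterating
            break
        extracted |= new_pairs
        training.extend(sorted(new_pairs))  # extracted pairs feed the next round
    return extracted
```

In this sketch, pairs accepted in one round extend the training data for the next, so a candidate rejected early can be accepted once the toy system has learned more vocabulary.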


  1. Do, T.N.D., L. Besacier, E. Castelli, “A Fully Unsupervised Approach for Mining Parallel Data from Comparable Corpora”, European Association for Machine Translation (EAMT 2010), Saint-Raphael (France), June 2010.

Bibliographic reference.  Diep, Do Thi Ngoc / Besacier, Laurent / Castelli, Eric (2010): "Improved Vietnamese-French parallel corpus mining using English language", In IWSLT-2010, 235-242.