13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

EuskoParl: a Speech and Text Spanish-Basque Parallel Corpus

Alicia Pérez (1), José M. Alcaide (2), María-Inés Torres (2)

(1) Dpto. Lenguajes y Sistemas Informáticos; (2) Dpto. Electricidad y Electrónica;
Universidad del País Vasco UPV/EHU, Bilbao, Spain

Advances in corpus-based approaches and machine learning techniques have promoted the development of minority languages. The aim of this work is to acquire a Spanish-Basque parallel corpus containing both text and speech data. To make the results comparable with those obtained for other languages, we took Europarl as a reference. The data were therefore acquired from the reports and speeches of the Basque Parliament. The acquisition process differs subtly from that of Europarl. The resulting corpus is described, and a few preliminary machine translation experiments with Moses are reported.

Index Terms: speech resources, statistical machine translation, under-resourced languages

Bibliographic reference. Pérez, Alicia / Alcaide, José M. / Torres, María-Inés (2012): "EuskoParl: a speech and text Spanish-Basque parallel corpus", in INTERSPEECH-2012, 2362-2365.