Making the most of comparable corpora in Neural Machine Translation: a case study

Date: 06.02.2022

Language Resources and Evaluation


Abstract

Comparable corpora can benefit the development of Neural Machine Translation models, in particular for under-resourced languages. We present a case study centred on the exploitation of a large comparable corpus for Basque-Spanish, created from independently produced news by the Basque public broadcaster EITB, where we evaluate the impact of different techniques to exploit the original data, in order to complement parallel datasets for this language pair in both translation directions. Two efficient methods for parallel sentence mining are explored, which identified a common core of approximately half of the total number of aligned sentences, with each method also identifying valid parallel sentences not captured by the other. Filtering the data via identification of length-difference outliers proved highly effective in improving the models, as was the use of tags to discriminate between comparable and parallel data in the training corpora. The use of backtranslated data is also evaluated in this work, with results indicating that alignment-based datasets remain the most beneficial, although complementary backtranslations should also be included to fully exploit the available comparable data. Overall, the results in this work demonstrate that this type of data needs to be carefully analysed prior to its use as training data for Neural Machine Translation, since issues such as information imbalance between source and target data can lead to suboptimal results for a given translation pair.
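
The abstract names two data-preparation steps, length-difference outlier filtering and tagging of comparable versus parallel data, without giving their exact form. The sketch below is only one plausible reading, not the authors' implementation: it assumes whitespace tokenisation, Tukey fences (1.5 times the interquartile range) as the outlier criterion, and a hypothetical `<comp>` tag prepended to the source side. The function names `filter_length_outliers` and `tag_source` are illustrative.

```python
from statistics import quantiles


def filter_length_outliers(pairs, k=1.5):
    """Drop sentence pairs whose token-length difference is an outlier.

    Assumption: outliers are defined with Tukey fences on the
    (source length - target length) distribution; the paper may use
    a different criterion.
    """
    if len(pairs) < 2:
        # quantiles() needs at least two data points.
        return list(pairs)
    diffs = [len(src.split()) - len(tgt.split()) for src, tgt in pairs]
    q1, _, q3 = quantiles(diffs, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [pair for pair, d in zip(pairs, diffs) if low <= d <= high]


def tag_source(pairs, tag):
    """Prepend a corpus-type tag (hypothetical, e.g. '<comp>') to each source."""
    return [(f"{tag} {src}", tgt) for src, tgt in pairs]
```

Under these assumptions, a training set could be assembled by filtering the mined pairs, tagging them with `tag_source(filtered, "<comp>")`, and concatenating the result with the parallel data carrying a different tag.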

BIB_text

@Article {
title = {Making the most of comparable corpora in Neural Machine Translation: a case study},
journal = {Language Resources and Evaluation},
keywds = {Comparable corpora, Basque, Spanish, Neural Machine Translation},
abstract = {Comparable corpora can benefit the development of Neural Machine Translation models, in particular for under-resourced languages. We present a case study centred on the exploitation of a large comparable corpus for Basque-Spanish, created from independently produced news by the Basque public broadcaster EITB, where we evaluate the impact of different techniques to exploit the original data, in order to complement parallel datasets for this language pair in both translation directions. Two efficient methods for parallel sentence mining are explored, which identified a common core of approximately half of the total number of aligned sentences, with each method also identifying valid parallel sentences not captured by the other. Filtering the data via identification of length-difference outliers proved highly effective in improving the models, as was the use of tags to discriminate between comparable and parallel data in the training corpora. The use of backtranslated data is also evaluated in this work, with results indicating that alignment-based datasets remain the most beneficial, although complementary backtranslations should also be included to fully exploit the available comparable data. Overall, the results in this work demonstrate that this type of data needs to be carefully analysed prior to its use as training data for Neural Machine Translation, since issues such as information imbalance between source and target data can lead to suboptimal results for a given translation pair.},
doi = {10.1007/s10579-021-09572-2},
date = {2022-02-06},
}