Set-Theoretic Alignment for Comparable Corpora

Authors: Thierry Etchegoyhen, Andoni Azpeitia Zaldua

Date: 2016-08-07


Abstract

We describe and evaluate a simple method to extract parallel sentences from comparable corpora. The approach, termed STACC, is based on expanded lexical sets and the Jaccard similarity coefficient. We evaluate our system against state-of-the-art methods on a large range of datasets in different domains, for ten language pairs, showing that it either matches or outperforms current methods across the board and gives significantly better results on the noisiest datasets. STACC is a portable method, requiring no particular adaptation for new domains or language pairs, thus enabling the efficient mining of parallel sentences in comparable corpora.
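As a rough illustration of the idea named in the abstract, the Python sketch below scores a sentence pair with the Jaccard coefficient over token sets, after expanding the source tokens through a small translation lexicon. It is a minimal sketch under stated assumptions, not the authors' implementation: the `expand` and `similarity` helpers, the toy lexicon, and the rule of keeping unknown tokens unchanged are all hypothetical and chosen only for the example.

```python
from typing import Dict, Iterable, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity coefficient |A & B| / |A | B| (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def expand(tokens: Iterable[str], lexicon: Dict[str, Set[str]]) -> Set[str]:
    """Hypothetical expansion step: map each token to its candidate translations.

    Tokens missing from the lexicon are kept unchanged. This is an assumption
    made for the sketch, so that items such as numbers or names can still match.
    """
    expanded: Set[str] = set()
    for tok in tokens:
        tok = tok.lower()
        expanded |= lexicon.get(tok, {tok})
    return expanded


def similarity(src: str, tgt: str, lexicon: Dict[str, Set[str]]) -> float:
    """Score a sentence pair: expanded source token set vs. target token set."""
    src_set = expand(src.split(), lexicon)
    tgt_set = {t.lower() for t in tgt.split()}
    return jaccard(src_set, tgt_set)


# Toy Spanish-to-English lexicon, invented for illustration only.
TOY_LEXICON: Dict[str, Set[str]] = {
    "el": {"the"},
    "gato": {"cat"},
    "negro": {"black", "dark"},
}

score = similarity("el gato negro", "the black cat", TOY_LEXICON)
print(score)
```

In this toy case the expanded source set is {the, cat, black, dark} and the target set is {the, black, cat}, so three of four distinct tokens are shared. A sentence-pair miner would typically keep pairs whose score exceeds some threshold; how that threshold is chosen is not stated in this record.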

BIB_text

@Article {
title = {Set-Theoretic Alignment for Comparable Corpora},
pages = {2009--2018},
volume = {1},
keywds = {Comparable Corpora, Alignment, Statistical Machine Translation},
abstract = {We describe and evaluate a simple method to extract parallel sentences from comparable corpora. The approach, termed STACC, is based on expanded lexical sets and the Jaccard similarity coefficient. We evaluate our system against state-of-the-art methods on a large range of datasets in different domains, for ten language pairs, showing that it either matches or outperforms current methods across the board and gives significantly better results on the noisiest datasets. STACC is a portable method, requiring no particular adaptation for new domains or language pairs, thus enabling the efficient mining of parallel sentences in comparable corpora.},
isbn = {978-1-945626-00-5},
doi = {10.18653/v1/P16-1189},
date = {2016-08-07},
year = {2016},
}