Lexical normalization of Spanish tweets with rule-based components and language models

Authors: Pablo Ruiz, Montse Cuadros, Thierry Etchegoyhen

Date: 03.03.2014

Procesamiento del Lenguaje Natural



Abstract

This paper presents a system to normalize Spanish tweets, which uses preprocessing rules, a domain-appropriate edit-distance model, and language models to select correction candidates based on context. The system is an improvement on the tool we submitted to the Tweet-Norm 2013 shared task, and results on the task’s test corpus are above average. Additionally, we provide a study of the impact on tweet normalization of the system's different components: rule-based, edit-distance-based, and statistical.

BibTeX

@Article{ruiz2014lexical,
author = {Pablo Ruiz and Montse Cuadros and Thierry Etchegoyhen},
title = {Lexical normalization of Spanish tweets with rule-based components and language models},
journal = {Procesamiento del Lenguaje Natural},
pages = {45--52},
volume = {52},
keywds = {Spanish microtext, lexical normalization, Twitter, edit distance, language model, microtexto, español, castellano, normalización léxica, Twitter, distancia de edición, modelo de lenguaje},
abstract = {This paper presents a system to normalize Spanish tweets, which uses preprocessing rules, a domain-appropriate edit-distance model, and language models to select correction candidates based on context. The system is an improvement on the tool we submitted to the Tweet-Norm 2013 shared task, and results on the task’s test corpus are above average. Additionally, we provide a study of the impact on tweet normalization of the system's different components: rule-based, edit-distance-based, and statistical.},
isi = {1},
date = {2014-03-03},
year = {2014},
}