Abstract

This paper presents a system for normalizing Spanish tweets that uses preprocessing rules, a domain-appropriate edit-distance model, and language models to select correction candidates based on context. The system improves on the tool we submitted to the Tweet-Norm 2013 shared task, and its results on the task’s test corpus are above average. Additionally, we study how each component of the system (rule-based, edit-distance-based, and statistical) affects tweet normalization.

BIB_text

@Article {
author = {Pablo Ruiz and Montse Cuadros and Thierry Etchegoyhen},
title = {Lexical normalization of Spanish tweets with rule-based components and language models},
journal = {Procesamiento del Lenguaje Natural},
pages = {45--52},
volume = {52},
keywds = {Spanish microtext, lexical normalization, Twitter, edit distance, language model, microtexto, español, castellano, normalización léxica, Twitter, distancia de edición, modelo de lenguaje},
abstract = {},
isi = {1},
date = {2014-03-03},
year = {2014},
}