Towards Similar User Utterance Augmentation for Out-of-Domain Detection

Authors: Andoni Azpeitia, Manex Serras Saenz, Laura García Sardiña, Mikel Fernández, Arantza del Pozo Echezarreta

Date: 01.01.2021


Abstract

Data scarcity is a common issue when developing Dialogue Systems from scratch, as dialogue data is difficult to find. This scenario is more likely when the system’s language is not English. This paper proposes a first text augmentation approach that selects samples similar to annotated user utterances from existing corpora, even if they differ in style, domain or content, in order to improve the detection of Out-of-Domain (OOD) user inputs. Three different sampling methods based on word vectors extracted from the BERT language representation model are compared. The evaluation is carried out using a Spanish chatbot corpus for OOD utterance detection, which has been artificially reduced to simulate various scenarios with different amounts of data. The presented approach is shown to enhance the detection of OOD user utterances, achieving greater improvements when less annotated data is available.
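The abstract does not describe the three sampling methods in detail, so the following is only a generic sketch of the underlying idea: ranking candidate sentences from an external corpus by their similarity to annotated utterances. It assumes each sentence has already been mapped to a fixed-length vector (for example, derived from BERT); the function names `cosine` and `select_similar`, the toy vectors, and the max-over-seeds scoring rule are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_similar(seed_vecs, pool_vecs, k):
    """Return indices of the k pool vectors closest to any seed vector.

    Assumption: a candidate's score is its highest cosine similarity
    to any annotated (seed) utterance. The paper may score differently.
    """
    scored = []
    for i, p in enumerate(pool_vecs):
        best = max(cosine(p, s) for s in seed_vecs)
        scored.append((best, i))
    # Highest similarity first; ties broken by original position.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:k]]


# Toy 2-d vectors standing in for sentence embeddings (made-up values).
seeds = [[1.0, 0.0], [0.9, 0.1]]
pool = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [0.8, 0.3]]
selected = select_similar(seeds, pool, k=2)
print(selected)
```

In a setting like the one the abstract outlines, the selected sentences would be added to the scarce annotated data before training the OOD detector; how many to add and from which corpora are choices this sketch leaves open.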

BIB_text

@Article {
title = {Towards Similar User Utterance Augmentation for Out-of-Domain Detection},
pages = {289-302},
keywds = {Dialogue, BERT, Data Augmentation, OOD detection},
abstract = {
Data scarcity is a common issue when developing Dialogue Systems from scratch, as dialogue data is difficult to find. This scenario is more likely when the system’s language is not English. This paper proposes a first text augmentation approach that selects samples similar to annotated user utterances from existing corpora, even if they differ in style, domain or content, in order to improve the detection of Out-of-Domain (OOD) user inputs. Three different sampling methods based on word vectors extracted from the BERT language representation model are compared. The evaluation is carried out using a Spanish chatbot corpus for OOD utterance detection, which has been artificially reduced to simulate various scenarios with different amounts of data. The presented approach is shown to enhance the detection of OOD user utterances, achieving greater improvements when less annotated data is available.
},
isbn = {978-981-15-8394-0},
date = {2021-01-01},
}
Vicomtech

Parque Científico y Tecnológico de Gipuzkoa,
Paseo Mikeletegi 57,
20009 Donostia / San Sebastián (Spain)

+(34) 943 309 230
