Knowledge Transfer for Active Learning in Textual Anonymisation

Abstract

Data privacy compliance has attracted considerable attention in recent years. Automating the de-identification process is challenging and often requires annotating in-domain data from scratch, as annotated resources for such scenarios are usually scarce. In this work, knowledge from a classifier trained on an annotated source dataset is transferred to speed up the training of a binary personal data identification classifier in a pool-based Active Learning setting, for a new, initially unlabelled target dataset that differs in language and domain. To this end, knowledge from the source classifier is used in both the seed selection and the uncertainty-based query selection strategies. In the experiments, multiple entropy-based criteria and input diversity measures are combined. Results show a significant improvement on the anonymisation label from the first batch, speeding up the classifier’s learning curve in the target domain and reaching top performance with less than 10% of the total training data, thus demonstrating the usefulness of the proposed approach even when the anonymisation domains diverge significantly.
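
The entropy-based query selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes a binary classifier that outputs, for each item in the unlabelled pool, the probability of it being personal data. The function names `binary_entropy` and `select_queries`, the batch size, and the example probabilities are hypothetical. The sketch shows plain uncertainty sampling only; it does not reproduce the paper's actual criteria, which also combine input diversity measures and knowledge from the source classifier.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli distribution with P(positive) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_queries(pool_probs, batch_size):
    """Return the indices of the `batch_size` most uncertain pool items.

    pool_probs: predicted probabilities of the positive (personal data) class,
    one per unlabelled item (hypothetical input format).
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: binary_entropy(pool_probs[i]),
        reverse=True,  # highest entropy first
    )
    return ranked[:batch_size]


# Hypothetical predictions over a pool of five unlabelled items.
pool_probs = [0.02, 0.48, 0.91, 0.55, 0.10]
print(select_queries(pool_probs, 2))  # → [1, 3]
```

Items whose predicted probability lies closest to 0.5 are sent to the annotator first, which is the intuition behind uncertainty-based query selection in pool-based Active Learning.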

BIB_text

@Article {
title = {Knowledge Transfer for Active Learning in Textual Anonymisation},
pages = {155--166},
keywords = {Knowledge Transfer, Active Learning, Seed Selection, Query Selection Strategy, Textual Anonymisation},
abstract = {Data privacy compliance has attracted considerable attention in recent years. Automating the de-identification process is challenging and often requires annotating in-domain data from scratch, as annotated resources for such scenarios are usually scarce. In this work, knowledge from a classifier trained on an annotated source dataset is transferred to speed up the training of a binary personal data identification classifier in a pool-based Active Learning setting, for a new, initially unlabelled target dataset that differs in language and domain. To this end, knowledge from the source classifier is used in both the seed selection and the uncertainty-based query selection strategies. In the experiments, multiple entropy-based criteria and input diversity measures are combined. Results show a significant improvement on the anonymisation label from the first batch, speeding up the classifier’s learning curve in the target domain and reaching top performance with less than 10\% of the total training data, thus demonstrating the usefulness of the proposed approach even when the anonymisation domains diverge significantly.},
isbn = {978-3-030-00810-9},
doi = {10.1007/978-3-030-00810-9_14},
}
