Siamese Neural Network and Machine Learning for DGA Classification

Authors: Lander Segurola Gil

Date: 08.07.2022


Abstract

Domain Generation Algorithms (DGAs) are used to rapidly generate many varying domain names. Such “artificial” domains can then be used to host command-and-control servers, which are in charge of recruiting or infecting devices and ultimately turning them into new resources to be exploited. Identifying DGA domain names can therefore be crucial for preventing cyberattacks such as phishing, spam sending, Bitcoin mining, and many others. Domain names generated by DGAs usually consist of illegible character strings, but newer “intelligent” DGAs tend to build names from combinations of dictionary words, which makes their detection a challenging task. For this reason, in this work we propose to address the problem with a combination of Machine Learning algorithms to improve the classification of DGA domains. In particular, we propose to combine Siamese Neural Networks and traditional supervised Machine Learning algorithms in order to map the input domain names into separable n-dimensional data points and then classify them. The proposed approach can be separated into three phases. In the first phase, domain names are encoded with a one-hot encoder and with a variation of it, named the probabilistic one-hot encoder; the two are implemented separately. In the second phase, Long Short-Term Memory (LSTM) and Convolutional Siamese embedders are tested and compared: the LSTM embedder is combined with the one-hot encoding, while the convolutional embedder is applied to the probabilistic one-hot encoded data. In the final phase, five Machine Learning algorithms are tested on the data embedded in both ways. Both embedder approaches reach high F1-score and accuracy (about 91%), depending on the classifier used.
The promising results obtained with the proposed method show that DGA domain classification can be performed using the domain names alone, without external information such as DNS packet features.
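
To make the first phase more concrete, here is a minimal sketch of character-level one-hot encoding of a domain name. The alphabet, the fixed length `MAX_LEN`, and the function name `one_hot_encode` are assumptions chosen for illustration; the abstract does not state the choices made in the paper, and the probabilistic variant is not sketched because it is not described here.

```python
import string

# Assumed alphabet: lowercase letters, digits, hyphen and dot.
# The alphabet actually used in the paper is not given in the abstract.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_LEN = 32  # hypothetical fixed length; longer names are truncated


def one_hot_encode(domain, max_len=MAX_LEN):
    """Return a max_len x len(ALPHABET) matrix of 0/1 values for a domain name."""
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, ch in enumerate(domain.lower()[:max_len]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:  # characters outside the alphabet stay all-zero
            matrix[pos][idx] = 1
    return matrix


encoded = one_hot_encode("example.com")
print(len(encoded), len(encoded[0]))  # 32 38
print(sum(map(sum, encoded)))         # 11, one active cell per character
```

Padding every name to the same length gives each domain a matrix of identical shape, which is what a sequence or convolutional embedder would typically expect as input.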

BibTeX

@Article{SegurolaGil2022,
author = {Lander Segurola Gil},
title = {Siamese Neural Network and Machine Learning for DGA Classification},
keywords = {Siamese Neural Network, DGA classification, Cybersecurity},
abstract = {
Domain Generation Algorithms (DGAs) are used to rapidly generate many varying domain names. Such “artificial” domains can then be used to host command-and-control servers, which are in charge of recruiting or infecting devices and ultimately turning them into new resources to be exploited. Identifying DGA domain names can therefore be crucial for preventing cyberattacks such as phishing, spam sending, Bitcoin mining, and many others. Domain names generated by DGAs usually consist of illegible character strings, but newer “intelligent” DGAs tend to build names from combinations of dictionary words, which makes their detection a challenging task. For this reason, in this work we propose to address the problem with a combination of Machine Learning algorithms to improve the classification of DGA domains. In particular, we propose to combine Siamese Neural Networks and traditional supervised Machine Learning algorithms in order to map the input domain names into separable n-dimensional data points and then classify them. The proposed approach can be separated into three phases. In the first phase, domain names are encoded with a one-hot encoder and with a variation of it, named the probabilistic one-hot encoder; the two are implemented separately. In the second phase, Long Short-Term Memory (LSTM) and Convolutional Siamese embedders are tested and compared: the LSTM embedder is combined with the one-hot encoding, while the convolutional embedder is applied to the probabilistic one-hot encoded data. In the final phase, five Machine Learning algorithms are tested on the data embedded in both ways. Both embedder approaches reach high F1-score and accuracy (about 91\%), depending on the classifier used.
The promising results obtained with the proposed method show that DGA domain classification can be performed using the domain names alone, without external information such as DNS packet features.
},
date = {2022-07-08},
}