CN116170168A

CN116170168A - DGA domain name detection method and system based on depth support vector data description

Info

Publication number: CN116170168A
Application number: CN202210253611.1A
Authority: CN
Inventors: 袁方方; 田腾; 刘燕兵; 曹聪; 张春燕; 谭建龙; 郭莉
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2022-03-15
Filing date: 2022-03-15
Publication date: 2023-05-26

Abstract

The invention discloses a DGA domain name detection method and system based on deep support vector data description. It relates to the field of network security and aims to solve the problems that existing work relies on a single type of method to detect unknown DGA families and achieves a low detection rate.

Description

DGA domain name detection method and system based on depth support vector data description

Technical Field

The invention relates to the field of network security, in particular to a DGA domain name detection method and system based on deep support vector data description.

Background

A botnet consists of bot hosts that are infected with malware and remotely controlled by a bot controller (botmaster). Bot controllers carry out malicious activities by issuing instructions to infected bot hosts through command and control (C&C) servers. To avoid detection, bot controllers can use fast-flux technology to rapidly generate large numbers of domain names. The domain generation algorithm (DGA) is an important means of implementing fast-flux technology, as it automates the generation of large numbers of domain names. A domain name generated by such an algorithm is referred to as a DGA domain name. DGA domain names are currently used in a variety of attack scenarios: as long as the command and control server uses the same DGA and seed as the infected machine, both sides generate the same domain name list and can therefore establish a connection. Domain names produced by the same domain generation algorithm are regarded as belonging to the same DGA family. Malicious activity that uses DGA technology typically generates a large number of DGA domain names, but only a small number of them resolve to a C&C server, so most DGA domain names are non-resolvable domain names (NXDomain). Analyzing non-resolvable domain names in traffic is therefore the main way to detect DGA activity and discover new DGA families.

DGA domain name detection methods can be divided into machine learning-based methods and deep learning-based methods. Because the character strings produced by a domain generation algorithm are highly random, machine learning-based methods can achieve good results provided the features are well designed, but they are less effective on some unknown DGA domain names. Deep learning-based methods automatically learn latent semantic features from domain name strings without manual feature extraction, and achieve good results in detecting DGA family domain names.

(I) Detection methods based on machine learning:

Schüppen et al. (see: Schüppen S, Teubert D, et al. FANCI: Feature-based Automated NXDomain Classification and Intelligence. 2018.) identify DGA domain names by monitoring non-resolvable domain names (NXDomain) in DNS traffic. They extract structural, linguistic and statistical features from the domain name string and build classifiers using a support vector machine (SVM) and a random forest (RF). They evaluate the method on a university campus network and a large corporate internal network, verifying that it achieves high classification accuracy at a low false alarm rate. Antonakakis et al. (see: Antonakakis M, Perdisci R, Nadji Y, et al. From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware [C]// USENIX Security Symposium. 2012.) designed the Pleiades system, which includes a DGA discovery module and a DGA classification module. The DGA discovery module aims to discover unknown DGA families: it first clusters domain names according to the similarity between the domain names and between the users querying them, and then classifies each cluster using a classifier based on alternating decision trees. The DGA classification module receives traffic of active domain names and is used to detect active DGA domain names and active C&C servers. Drichel et al. (see: Drichel A, Faerber N, Meyer U. First Step Towards EXPLAINable DGA Multiclass Classification [C]// ARES 2021: The 16th International Conference on Availability, Reliability and Security. 2021.) propose a highly interpretable multiclass classifier for real-time DGA detection. They collect 136 previously proposed domain name features and apply several feature selection algorithms for feature engineering, so that the classifier still achieves good results with fewer features.

(II) Detection methods based on deep learning:

Anderson et al. (see: Anderson H S, Woodbridge J, Filar B. DeepDGA: Adversarially-Tuned Domain Generation and Detection [J]. ACM, 2016.) build a deep learning-based DGA domain name detection method using a generative adversarial network, motivated by the fact that training data for certain DGA families is very limited and DGA variants are numerous. During learning, the generator learns to produce domain names that are increasingly difficult to detect, and the detector model updates its parameters to compensate for the adversarially generated domain names. Ren et al. (see: Ren F, Jiang Z, Wang X, et al. A DGA domain names detection modeling method based on integrating an attention mechanism and deep neural network [J]. Cybersecurity, 2018, 1(1): 13.) consider existing methods inadequate for wordlist-based DGA threats. They first use convolutional neural network (CNN) and bidirectional LSTM (BiLSTM) layers to extract features from the domain name sequence; next, an attention layer assigns weights to the deep information extracted from the domain name; finally, the weighted features are fed into an output layer to complete the detection and classification tasks. Ravi et al. (see: Ravi V, Alazab M, Srinivasan S, et al. Adversarial Defense: DGA-Based Botnets and DNS Homographs Detection Through Integrated Deep Learning [J]. IEEE Transactions on Engineering Management, 2021, PP(99): 1-18.) propose a two-stage DNS traffic analysis framework based on deep learning. The first stage uses character-level embedding and a Siamese neural network to measure the similarity between domain names; the second stage takes the domain names found to be dissimilar in the first stage as input and uses a cost-sensitive deep learning model to detect and classify DGA domain names.

The prior art detects domain names of known DGA families well, but most existing work relies on supervised learning and therefore cannot detect domain names of unknown DGA families when they appear. Existing methods for detecting unknown DGA family domain names first cluster the domain names and then filter the clusters. They require the threshold to be tuned for each network scenario, and when the unknown DGA family domain names are too few to form clusters, the detection rate becomes very low.

Disclosure of Invention

The invention aims to provide a DGA domain name detection method and system based on deep support vector data description, in order to solve the problems that existing work relies on a single type of method to detect unknown DGA families and achieves a low detection rate. The method first obtains non-resolvable domain names from real DNS traffic as the domain names to be examined, then extracts a feature vector from each non-resolvable domain name, and finally inputs the feature vectors into a deep support vector data description model to determine whether each non-resolvable domain name is a DGA domain name.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A DGA domain name detection method based on deep support vector data description comprises the following steps:

acquiring real DNS traffic, and obtaining the non-resolvable domain names from the real DNS traffic;

extracting linguistic features, structural features and statistical features from each non-resolvable domain name to form the feature vector of that domain name;

labeling the non-resolvable domain names using the public domain name white list Alexa to obtain the known benign domain names among them, and taking all other domain names as the domain names to be detected;

constructing a deep support vector data description classification model based on a convolutional neural mapping network, inputting all known benign domain names into the convolutional neural mapping network as training data, mapping all known benign domain names into a new feature space with the network, adjusting the network parameters over multiple rounds of training, and obtaining the trained convolutional neural mapping network together with the center and radius of the hypersphere that encloses all known benign domain names;

and inputting the feature vector of each non-resolvable domain name to be detected into the trained convolutional neural mapping network of the deep support vector data description classification model, and computing the distance between the mapped domain name and the center of the hypersphere; if the distance is within the radius of the hypersphere, the domain name is a benign domain name, and otherwise it is a DGA domain name.

Further, the non-resolvable domain names are obtained as follows: the resource records of each domain name in the real DNS traffic are queried, and if no resource record exists, the domain name is a non-resolvable domain name.
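The resource-record check can be sketched as follows. This is only a minimal illustration: the log format, a list of dictionaries with the hypothetical keys `qname` and `answers`, is an assumption, since the text does not specify how the passive traffic is stored.

```python
def unresolved_domains(dns_log):
    """Return queried names that never received any resource record.

    dns_log: iterable of dicts with hypothetical keys
      'qname'   - the queried domain name
      'answers' - list of resource records in the response (may be empty)
    """
    resolved = set()
    seen, seen_set = [], set()          # keep first-seen order of names
    for entry in dns_log:
        name = entry["qname"].lower().rstrip(".")
        if name not in seen_set:
            seen_set.add(name)
            seen.append(name)
        if entry.get("answers"):        # at least one resource record exists
            resolved.add(name)
    return [n for n in seen if n not in resolved]


log = [
    {"qname": "example.com.", "answers": ["93.184.216.34"]},
    {"qname": "xkqzjwpt.net", "answers": []},   # made-up name with no records
    {"qname": "xkqzjwpt.net", "answers": []},
]
print(unresolved_domains(log))  # ['xkqzjwpt.net']
```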

Further, the linguistic features include: whether numeric characters are present, the ratio of vowel characters, the ratio of numeric characters, the number of distinct letters, the ratio of repeated characters, the ratio of consecutive consonants, the ratio of consecutive digits, and the length of the longest meaningful consecutive substring.
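A possible computation of most of these linguistic features is sketched below. The exact definitions (for example, what counts as a "repeated" character, or that ratios are taken over the name with dots removed) are assumptions made for illustration; the longest-meaningful-substring feature is omitted because it would need a dictionary that the text does not specify.

```python
import re

VOWELS = set("aeiou")


def linguistic_features(domain):
    """Compute a few linguistic features; definitions are illustrative assumptions."""
    s = domain.lower().replace(".", "")
    n = len(s) or 1                      # avoid division by zero on empty input
    letters = [c for c in s if c.isalpha()]
    digits = [c for c in s if c.isdigit()]
    distinct = set(s)
    # Share of distinct characters that occur more than once.
    repeated = sum(1 for c in distinct if s.count(c) > 1)
    # Runs of at least two consecutive consonants or digits.
    consonant_runs = re.findall(r"[b-df-hj-np-tv-z]{2,}", s)
    digit_runs = re.findall(r"[0-9]{2,}", s)
    return {
        "has_digit": bool(digits),
        "vowel_ratio": sum(c in VOWELS for c in letters) / n,
        "digit_ratio": len(digits) / n,
        "distinct_letters": len(set(letters)),
        "repeated_char_ratio": repeated / (len(distinct) or 1),
        "consecutive_consonant_ratio": sum(map(len, consonant_runs)) / n,
        "consecutive_digit_ratio": sum(map(len, digit_runs)) / n,
    }


features = linguistic_features("ab12cd.com")
print(features["distinct_letters"])  # 6
```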

Further, the structural features include: the domain name length, the number of subdomains, the average subdomain length, whether there is a www prefix, whether the top-level domain is valid, whether there is a single-character subdomain, whether a top-level domain string appears as a subdomain, the proportion of subdomains that are purely numeric, the proportion of subdomains that consist of hexadecimal characters, the proportion of underscore characters, and whether an IP address is included.
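The structural features could be computed along the following lines. This is a sketch under stated assumptions: everything to the left of the last label is treated as a subdomain, `valid_tlds` is a tiny hypothetical stand-in for a real list of top-level domains, and only IPv4-like patterns are checked.

```python
import re

HEX_RE = re.compile(r"^[0-9a-f]+$")
IPV4_RE = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")


def structural_features(domain, valid_tlds=frozenset({"com", "net", "org"})):
    """valid_tlds is a hypothetical placeholder for a full TLD list."""
    name = domain.lower().rstrip(".")
    labels = name.split(".")
    tld, subs = labels[-1], labels[:-1]   # labels left of the TLD
    k = len(subs) or 1
    return {
        "length": len(name),
        "num_subdomains": len(subs),
        "mean_subdomain_length": sum(map(len, subs)) / k,
        "has_www_prefix": name.startswith("www."),
        "tld_valid": tld in valid_tlds,
        "has_single_char_subdomain": any(len(lab) == 1 for lab in subs),
        "has_tld_as_subdomain": any(lab in valid_tlds for lab in subs),
        "numeric_subdomain_ratio": sum(lab.isdigit() for lab in subs) / k,
        "hex_subdomain_ratio": sum(bool(HEX_RE.match(lab)) for lab in subs) / k,
        "underscore_ratio": name.count("_") / (len(name) or 1),
        "contains_ip": bool(IPV4_RE.search(name)),
    }


print(structural_features("www.a1b2.example.com")["num_subdomains"])  # 3
```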

Further, the statistical features include 1-Gram statistics, 2-Gram statistics, 3-Gram statistics, and character entropy values.
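As an illustration of the statistical features, the sketch below computes the Shannon entropy of the character distribution and two summary statistics of n-gram counts. Which n-gram statistics the method actually uses is not stated here, so the choice of mean and standard deviation is an assumption.

```python
import math
from collections import Counter


def char_entropy(s):
    """Shannon entropy (bits per character) of the character distribution."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def ngram_stats(s, n):
    """Mean and population standard deviation of n-gram occurrence counts."""
    grams = [s[i:i + n] for i in range(len(s) - n + 1)]
    if not grams:
        return {"mean": 0.0, "std": 0.0}
    counts = list(Counter(grams).values())
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return {"mean": mean, "std": math.sqrt(var)}


print(char_entropy("aabb"))      # 1.0
print(ngram_stats("abab", 2))    # {'mean': 1.5, 'std': 0.5}
```

A random-looking DGA string tends to have higher character entropy and flatter n-gram counts than a name built from natural-language words, which is why such statistics are useful inputs.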

Further, the step of training the depth support vector data description classification model includes:

initializing all network parameters of a convolutional neural mapping network;

mapping all the input known benign domain names to a new feature space using the convolutional neural mapping network, and computing the center of the hypersphere as the mean position of all known benign domain names in the new feature space;

shuffling all known benign domain names and dividing them into groups to obtain multiple batches of training data;

passing each batch of training data through the convolutional neural mapping network to obtain its representation in the new feature space, and computing the distance from each sample of the batch to the center of the hypersphere from that representation and the hypersphere center;

computing the loss arising from the distances between the samples of each batch and the center of the hypersphere, and updating all network parameters of the convolutional neural mapping network by gradient backpropagation;

and updating the center of the hypersphere and computing the radius of the hypersphere according to the updated network parameters, and outputting the trained convolutional neural mapping network together with the center and radius of the hypersphere.

Further, the structure of the convolutional neural mapping network is: input layer + fully connected layer + reshape layer + convolution layer + normalization layer + max pooling layer + reshape layer + fully connected layer.
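To make the role of the two reshape layers concrete, the following sketch traces per-sample tensor shapes through this layer sequence. Only the input dimension of 42 comes from the text; every other size, and the assumption of a one-dimensional convolution with stride 1 and no padding, is hypothetical.

```python
def trace_shapes(n_features=42, fc_out=64, channels=8, kernel=3, pool=2, embed=16):
    """Track per-sample shapes; all sizes except n_features are hypothetical."""
    shapes = [("input", (n_features,))]
    shapes.append(("fully connected", (fc_out,)))
    shapes.append(("reshape", (1, fc_out)))                  # (channels, length)
    conv_len = fc_out - kernel + 1                           # stride 1, no padding
    shapes.append(("convolution", (channels, conv_len)))
    shapes.append(("normalization", (channels, conv_len)))   # shape-preserving
    pool_len = conv_len // pool                              # non-overlapping pooling
    shapes.append(("max pooling", (channels, pool_len)))
    shapes.append(("reshape", (channels * pool_len,)))       # flatten for the last layer
    shapes.append(("fully connected", (embed,)))
    return shapes


for layer, shape in trace_shapes():
    print(f"{layer:16s} {shape}")
```

The first reshape turns the flat feature vector into a single-channel sequence that a convolution can slide over, and the second flattens the pooled feature maps so the final fully connected layer can produce the point in the new feature space.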

A DGA domain name detection system based on depth support vector data description includes a memory on which is stored a computer program, and a processor which, when executing the program, performs the steps of the above method.

The invention achieves the following technical effects:

1. The method of the invention needs no known DGA domain names at all to detect DGA domain names; it needs only a small number of known benign domain names. Its requirements are therefore lower than those of methods that need supervised training on DGA domain names, and it achieves a better detection effect.

2. When the non-resolvable domain names in the traffic have been filtered in advance against known DGA families, the DGA domain names subsequently detected by the method belong to unknown DGA families, so the method can be used to discover unknown DGA families.

3. The method of the invention detects unknown DGA domain names by clustering benign domain names. Because benign domain names are very easy to obtain from traffic, the method is easier to implement than detection methods that cluster DGA domain names. Moreover, when the traffic contains few potential DGA domain names, methods that cluster DGA domain names are severely limited, whereas the method of the invention still performs well.

Drawings

Fig. 1 is a flowchart of a DGA domain name detection method based on depth support vector data description according to the present invention.

Fig. 2 is a workflow diagram of training the deep support vector data description model and of the model itself.

Fig. 3 is a graph of F1 performance for different numbers of known benign domain names.

Detailed Description

In order to make the above features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

The embodiment of the invention provides a DGA domain name detection method based on depth support vector data description, and the workflow of the method is shown in figure 1 and is specifically described as follows.

1) Acquiring real DNS traffic: setting up a network probe in the network, acquiring real DNS traffic data for a plurality of days, and storing the data into a passive traffic database.

2) Obtaining non-resolvable domain names from real DNS traffic: the resource records of each domain name in the real DNS traffic of step 1) are queried; if no resource record exists, the domain name is non-resolvable.

3) Feature extraction: features are extracted from each non-resolvable domain name. The features to be extracted are shown in Table 1. After extraction, each non-resolvable domain name $d_i$ has a feature vector $v_i = \{f_{i1}, f_{i2}, \ldots, f_{i42}\}$.

Table 1 extracted domain name features

4) Labeling benign domain names: the non-resolvable domain names are labeled using the public domain name white list Alexa (see: Alexa. https://aws.amazon.com/alexa-top-sites [M]. Online Resources, 2021.) to obtain the known benign domain names among them; the other domain names are taken as the domain names to be detected.

5) Constructing the deep support vector data description classification model: the model uses the deep support vector data description algorithm, which aims to map all non-resolvable domain names to a new feature space such that the hypersphere enclosing all known benign domain names in that space has the smallest possible volume. The algorithm proceeds as follows:

Algorithm input: the number of training epochs $T$, the feature dimension $n$ of the non-resolvable domain names, the convolutional neural mapping network $Q$ used to map to the new feature space, the number of samples per training batch $m$, and the known benign domain name data set $D$.

Algorithm output: the convolutional neural mapping network $Q$, the hypersphere center $O$, and the hypersphere radius $R$.

a) Randomly initialize all parameters $w$ of the convolutional neural mapping network $Q$.

b) Map all benign domain names to the new feature space $Q(w, D)$ using $Q$, and compute the hypersphere center as their mean position, $O = \mathrm{MEAN}(Q(w, D))$, where MEAN computes the mean.

c) Initialize the training epoch counter $\text{episode} = 1$.

d) Shuffle the data set $D$ and split it into groups of $m$ samples, $d_1, d_2, \ldots, d_K$, giving $K$ batches of training data.

e) Let $i = 1$.

f) Pass batch $d_i$ through the convolutional neural mapping network $Q$ to obtain its representation in the new space, $nd_i = Q(w, d_i)$, and compute the distance of each sample of the batch from the hypersphere center, $p_i = \mathrm{DIST}(nd_i, O)$, where $\mathrm{DIST}(x, y)$ is the distance between $x$ and $y$.

g) Compute $\text{loss} = \mathrm{MEAN}(p_i)$ and update all parameters $w$ of the convolutional neural mapping network $Q$ by gradient backpropagation.

h) Let $i = i + 1$; if $i \le K$, go to step f).

i) Update the hypersphere center $O = \mathrm{MEAN}(Q(w, D))$.

j) Let $\text{episode} = \text{episode} + 1$; if $\text{episode} \le T$, return to step d).

k) Compute the hypersphere radius $R = \mathrm{MAX}(\mathrm{DIST}(Q(w, D), O))$, where MAX takes the maximum.

l) Output the convolutional neural mapping network $Q$, the hypersphere center $O$ and the hypersphere radius $R$.
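Steps a) to l) can be sketched in a few lines of dependency-free Python. To keep the example short, a bias-free linear map stands in for the convolutional neural mapping network Q, and DIST is taken to be the squared Euclidean distance inside the loss; both are simplifying assumptions, as are the learning rate, the output dimension and the toy data.

```python
import random


def matvec(W, x):
    """Bias-free linear map standing in for Q: returns W @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def train_deep_svdd(D, T=10, m=16, out_dim=2, lr=0.01, seed=0):
    rng = random.Random(seed)
    n = len(D[0])
    # a) random initialization of the parameters w
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(out_dim)]
    # b) center O = mean of the mapped benign samples
    O = mean_vec([matvec(W, x) for x in D])
    for _ in range(T):                                # c), j) loop over epochs
        data = D[:]
        rng.shuffle(data)                             # d) shuffle, then batch
        for start in range(0, len(data), m):          # e), h) loop over batches
            batch = data[start:start + m]
            grad = [[0.0] * n for _ in range(out_dim)]
            for x in batch:                           # f) map sample, distance to O
                diff = [z - o for z, o in zip(matvec(W, x), O)]
                for r in range(out_dim):              # g) gradient of the mean
                    for c in range(n):                #    squared distance w.r.t. W
                        grad[r][c] += 2.0 * diff[r] * x[c] / len(batch)
            for r in range(out_dim):
                for c in range(n):
                    W[r][c] -= lr * grad[r][c]
        O = mean_vec([matvec(W, x) for x in D])       # i) refresh the center
    # k) radius = largest distance of any benign sample from the center
    R = max(sq_dist(matvec(W, x), O) for x in D) ** 0.5
    return W, O, R                                    # l)


data_rng = random.Random(1)
benign = [[data_rng.gauss(1.0, 0.1) for _ in range(4)] for _ in range(64)]
W, O, R = train_deep_svdd(benign)
print(len(O), R >= 0)  # 2 True
```

Because R is computed as the maximum distance over the training set, every known benign sample lies inside or on the hypersphere by construction.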

During training, the convolutional neural mapping network Q may map all points to a single point, giving a hypersphere of radius zero; this is known as hypersphere collapse. The causes of hypersphere collapse and the corresponding remedies are as follows:

a) An all-zero weight solution of Q causes hypersphere collapse: the hypersphere center must not be a free variable independent of the normal samples; empirically, the center can be set to the mean position of the mapped benign domain names.

b) Hidden-layer bias terms can learn a constant mapping, which causes hypersphere collapse: the hidden layers use no bias terms.

c) Network units with bounded activation functions can emulate bias terms in subsequent layers, which causes hypersphere collapse: unbounded activation functions such as ReLU (or activation functions bounded only at 0) should be preferred.
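The bias-term case can be seen in a tiny worked example. The layer, the sample values and the center below are all made up for illustration: with zero weights and a bias equal to the center, every input lands exactly on the center, so the loss and the radius are both zero regardless of the data.

```python
def affine(W, b, x):
    """One layer with weights W and bias b: returns W @ x + b."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def radius(points, center):
    return max(sum((z - c) ** 2 for z, c in zip(p, center)) ** 0.5 for p in points)


samples = [[0.2, 1.5], [3.0, -1.0], [-4.0, 0.7]]   # arbitrary, clearly distinct inputs
center = [0.5, -0.5]                                # a freely chosen center
W_zero = [[0.0, 0.0], [0.0, 0.0]]

# Degenerate solution: zero weights, bias equal to the center.
collapsed = [affine(W_zero, center, x) for x in samples]
print(radius(collapsed, center))  # 0.0

# Without a bias term, zero weights send everything to the origin instead,
# which is not the center, so this trivial solution no longer has zero loss.
no_bias = [affine(W_zero, [0.0, 0.0], x) for x in samples]
print(radius(no_bias, center) > 0)  # True
```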

The workflow of training depth support vector data description classification models is shown in fig. 2.

6) Judging whether a non-resolvable domain name is a DGA domain name: once the deep support vector data description classification model has been built, the feature vector of each non-resolvable domain name to be detected is passed through the convolutional neural mapping network Q, and its distance to the hypersphere center O is computed. If the distance is within the hypersphere radius R, the domain name is benign; otherwise it is a DGA domain name. Specifically, the feature vectors of the domain names to be detected from step 4) are input into the classification model from step 5) to determine whether each non-resolvable domain name is a DGA domain name.
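The decision rule itself is a single distance comparison, as the sketch below shows. The identity mapping, the center at the origin and the radius of 1 are toy stand-ins chosen for illustration, not values from the method; the Euclidean distance is likewise an assumption.

```python
def is_dga(feature_vector, mapping, center, radius):
    """Return True if the mapped point lies outside the hypersphere."""
    z = mapping(feature_vector)
    dist = sum((a - b) ** 2 for a, b in zip(z, center)) ** 0.5
    return dist > radius


def identity(v):
    """Toy stand-in for the trained mapping network Q."""
    return v


print(is_dga([0.3, 0.4], identity, [0.0, 0.0], 1.0))  # False: distance is about 0.5
print(is_dga([3.0, 4.0], identity, [0.0, 0.0], 1.0))  # True: distance is 5.0
```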

A specific example follows, in which the DGA domain name detection method based on deep support vector data description provided by the invention is applied to the problems that existing work relies on a single type of method to detect unknown DGA families and achieves a low detection rate.

1) Acquiring real DNS traffic: setting up a network probe in a campus network, and acquiring real DNS traffic for about 7 days.

2) Establishing a black list and a white list: the lists are built from relatively authoritative sources, such as highly authoritative security websites or security companies, and widely accepted public Internet black and white lists. Here, the top 100,000 entries of the Alexa global website traffic ranking are used to construct the white list, because the higher a site's traffic ranking, the greater its public exposure and the less likely its domain name is to engage in malicious activity. The black list is a DGA domain name list constructed from the publicly available 360DGA data, which contains millions of DGA domain names from more than 50 DGA families and is continuously updated.

3) Obtaining non-resolvable domain names from real DNS traffic: for each domain name in the real DNS traffic of 1), it is checked whether a resource record can be found; if not, the domain name is non-resolvable. The non-resolvable domain names are then filtered with a trie built from the lists in 2) to identify the known benign domain names and the known DGA domain names.
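A trie-based lookup of this kind might look as follows. This is a minimal character-level sketch; the class name, the labels `"benign"` and `"dga"`, and the sample names are all hypothetical, and a production filter would load the full white list and black list.

```python
class DomainTrie:
    """Character trie over domain names with a label stored at each stored name."""

    END = ""   # empty-string key marks the end of a stored name

    def __init__(self):
        self.root = {}

    def insert(self, domain, label):
        node = self.root
        for ch in domain.lower():
            node = node.setdefault(ch, {})
        node[self.END] = label

    def lookup(self, domain):
        """Return the stored label, or None if the exact name is not present."""
        node = self.root
        for ch in domain.lower():
            node = node.get(ch)
            if node is None:
                return None
        return node.get(self.END)


trie = DomainTrie()
trie.insert("example.com", "benign")     # white list entry
trie.insert("qwhzxkpt.net", "dga")       # black list entry (made-up name)

unresolved = ["example.com", "qwhzxkpt.net", "mvtrplqa.org"]
unknown = [d for d in unresolved if trie.lookup(d) is None]
print(unknown)  # ['mvtrplqa.org']
```

A trie keeps lookup time proportional to the length of the domain name rather than to the size of the lists, which matters when the black list holds millions of entries.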

4) Feature extraction: extracting all the features in table 1 from the non-resolvable domain names obtained in the step 3), and obtaining a 42-dimensional feature vector for each non-resolvable domain name.

5) Obtaining a training set and a test set: 10% of the benign domain names are used as the training set, and the remaining 90% of the benign domain names together with all DGA domain names are used as the test set. Ten-fold cross validation is used so that the results do not depend on a particular split of the samples. The training set is used to train the model, and the test set simulates the domain names to be detected.
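One plausible reading of this split is sketched below: the benign names are divided into ten folds, and each fold in turn serves as the 10% training set while the other nine folds plus all DGA names form the test set. The function name, the label encoding (1 for DGA, 0 for benign) and the placeholder data are assumptions for illustration.

```python
import random


def inverted_tenfold_splits(benign, dga, seed=0):
    """Yield (train, test) pairs; one benign fold trains, the rest is tested."""
    items = benign[:]
    random.Random(seed).shuffle(items)
    folds = [items[i::10] for i in range(10)]
    for k in range(10):
        train = folds[k]
        test_benign = [d for j, fold in enumerate(folds) if j != k for d in fold]
        test = [(d, 0) for d in test_benign] + [(d, 1) for d in dga]  # 1 = DGA
        yield train, test


benign_names = [f"benign{i}.example" for i in range(100)]   # placeholder data
dga_names = [f"dga{i}.example" for i in range(20)]          # placeholder data
splits = list(inverted_tenfold_splits(benign_names, dga_names))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 10 110
```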

6) Constructing the deep support vector data description classification model: the number of training epochs T = 10, the feature dimension n = 42, the mini-batch size m = 128, and the structure of the convolutional neural mapping network Q is shown in Table 2.

Table 2 Structure of the convolutional neural mapping network Q in the deep support vector data description classification model

7) Judging whether the domain names in the test set are malicious: F1 is used as the evaluation metric. The mean F1 over the ten cross-validation folds is taken as the final performance of the model.
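For reference, F1 is the harmonic mean of precision and recall, and averaging it over folds is straightforward. The sketch below assumes DGA (label 1) is the positive class, which the text does not state explicitly; the helper names are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary metrics with label 1 (assumed to mean DGA) as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_f1(fold_results):
    """Average F1 over folds; fold_results is a list of (y_true, y_pred) pairs."""
    scores = [precision_recall_f1(t, p)[2] for t, p in fold_results]
    return sum(scores) / len(scores)


p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.6667 0.6667 0.6667
```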

The positive effects are as follows:

The positive effects of the method of the invention are demonstrated experimentally. The detection performance of different one-class classification algorithms is evaluated first, the performance of different DGA domain name detection methods is then analyzed, and finally the detection performance of different one-class classification algorithms is analyzed for different numbers of known benign domain names.

1) Performance of different one-class classifiers

The method of the invention is compared with one-class classification algorithms based on machine learning and deep learning; the results of each detection algorithm are shown in Table 3. Table 3 shows that deep support vector data description performs best on all three evaluation metrics: F1, recall and precision. The deep support vector data description variant trained with the soft-boundary objective limits the shrinkage of the hypersphere to some extent.

Table 3 Performance of different one-class classifiers

Algorithm                    F1      Recall  Precision
SVDD                         0.9521  0.9690  0.9358
Isolation Forest             0.9520  0.9523  0.9517
DCAE                         0.9754  0.9520  0.9999
DCGAN                        0.9819  0.9645  0.9999
Deep SVDD (Soft-boundary)    0.9911  0.9829  0.9994
Deep SVDD (One-class)        0.9945  0.9891  0.9999

2) Performance of different DGA domain name detection methods

The method of the invention is compared with several current DGA detection methods; the experimental results are shown in Table 4. Table 4 shows that deep support vector data description performs best on all three evaluation metrics (F1, recall and precision), outperforming the existing mainstream DGA domain name detection methods. Moreover, the method of the invention does not require any known DGA domain names for training, whereas the training of the other methods requires known DGA domain names.

Table 4 Performance of different DGA detection methods

Algorithm                    F1      Recall  Precision
ATT-CNN-BiLSTM               0.8994  0.8902  0.9087
M-LSTM.MI                    0.9875  0.9837  0.9913
SNN                          0.9879  0.9768  0.9993
GAN                          0.9901  0.9809  0.9996
FANCI                        0.9913  0.9830  0.9998
Deep SVDD (Soft-boundary)    0.9911  0.9829  0.9994
Deep SVDD (One-class)        0.9945  0.9891  0.9999

3) Performance for different numbers of known benign domain names

The performance of different one-class classification algorithms is studied as the number of known benign domain names is reduced. In the experiment, the number of known benign domain names is varied by adjusting the proportion of benign domain names placed in the training set; the results are shown in Fig. 3. When the proportion of benign domain names in the training set is reduced to one percent, one thousandth and one ten-thousandth, the performance of the other algorithms degrades severely, whereas deep support vector data description still performs well. This shows that the method of the invention achieves better performance even with a small number of known benign domain names.

Although the present invention has been described with reference to the above embodiments, it should be understood that the invention is not limited thereto, and that modifications and equivalents may be made thereto by those skilled in the art, which modifications and equivalents are intended to be included within the scope of the present invention as defined by the appended claims.

Claims

1. The DGA domain name detection method based on the depth support vector data description is characterized by comprising the following steps of:

2. The method of claim 1, wherein the non-resolvable domain names are obtained as follows: the resource records of each domain name in the real DNS traffic are queried, and if no resource record exists, the domain name is a non-resolvable domain name.

3. The method of claim 1, wherein the linguistic features include: whether numeric characters are present, the ratio of vowel characters, the ratio of numeric characters, the number of distinct letters, the ratio of repeated characters, the ratio of consecutive consonants, the ratio of consecutive digits, and the length of the longest meaningful consecutive substring.

4. The method of claim 1, wherein the structural features include: the domain name length, the number of subdomains, the average subdomain length, whether there is a www prefix, whether the top-level domain is valid, whether there is a single-character subdomain, whether a top-level domain string appears as a subdomain, the proportion of subdomains that are purely numeric, the proportion of subdomains that consist of hexadecimal characters, the proportion of underscore characters, and whether an IP address is included.

5. The method of claim 1, wherein the statistical features include 1-Gram statistics, 2-Gram statistics, 3-Gram statistics, and character entropy values.

6. The method of claim 1, wherein training the depth support vector data description classification model comprises:

initializing all network parameters of a convolutional neural mapping network;

7. The method of claim 1, wherein the structure of the convolutional neural mapping network is: input layer + fully connected layer + reshape layer + convolution layer + normalization layer + max pooling layer + reshape layer + fully connected layer.

8. A DGA domain name detection system based on deep support vector data description, comprising a memory on which a computer program is stored and a processor which, when executing the program, implements the steps of the method according to any one of claims 1-7.