CN112019651A - DGA domain name detection method using a deep residual network and a character-level sliding window


Info

Publication number: CN112019651A (application CN202010872894.9A; granted as CN112019651B)
Authority: CN (China)
Prior art keywords: domain name, DGA, model, sample, vector
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 刘小洋, 苗琛香, 刘加苗
Original assignee: Chongqing University of Technology (application filed by Chongqing University of Technology)
Current assignee: Tianyi Safety Technology Co Ltd

Classifications

    • H04L61/4511 Network directories; name-to-address mapping using standardised directories or directory access protocols, using the domain name system [DNS]
    • G06N3/045 Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N3/08 Computing arrangements based on biological models; neural networks; learning methods
    • H04L63/1416 Network security; detecting or protecting against malicious traffic by monitoring network traffic; event detection, e.g. attack signature detection

Abstract

The invention provides a DGA domain name detection method using a deep residual network and a character-level sliding window, comprising the following steps: S1, acquiring domain name data and preprocessing it; S2, performing original feature extraction on the data processed in step S1; S3, inputting the data processed in step S2 into a deep residual network layer for further feature extraction; S4, performing batch normalization on the data processed in step S3; S5, classifying the domain name. The invention is thereby able to classify domain names.

Description

DGA domain name detection method using a deep residual network and a character-level sliding window
Technical Field
The invention relates to the field of DGA domain name detection, and in particular to a DGA domain name detection method using a deep residual network and a character-level sliding window.
Background
A botnet is formed by infecting a large number of hosts with bot programs through one or more propagation means, creating a one-to-many controllable network between a controller and the infected hosts. Botnets are mainly built by exploiting vulnerabilities in software or hardware devices, by social engineering (exploiting human weaknesses to accomplish a target task), and by similar means, so that victim hosts are infected with malicious bots without being discovered; a one-to-many command and control (C&C) channel is then used to direct the bots to complete the attack behaviors specified by the controlling host. With the growth in internet users and users' lack of security awareness, botnets have become one of the main threats to the internet. Botnets developed gradually with the advent of the internet: the first large-scale botnet with malicious behavior, Pretty Park, was found in June 1999 and maintained communication with infected bot hosts through the IRC protocol. A controller in a botnet directs a very large number of bot hosts through a specific channel to perform various network attacks, such as the common distributed denial of service (DDoS) attacks, sending spam, stealing user privacy, ransomware, cryptomining, and the like. To better control the bots, attackers have designed various types of command and control channels, and most botnet communication is delivered to the bots by means of command and control servers (C&C servers). Research on detecting botnets followed. There are many methods for detecting botnets; at present they mainly target the network and the host: one approach detects botnet host communication, while the other depends on whether a host has been implanted with a bot program. Because current bot programs are increasingly hidden, and host security depends largely on user security awareness, it has become difficult to detect botnets by way of the host. The control host of a botnet, however, must communicate with every bot host. Exploiting this dependence on the command and control server, once the domain name corresponding to the C&C server a botnet relies on is found, the botnet can be disabled through blacklists, sinkholing, or directly capturing the control server. A botnet that uses a fixed IP address or domain name is very easily discovered by security personnel; for this reason, and in order to resist security researchers, botnets typically use domain generation algorithms (DGAs) to randomly generate large numbers of command control server domain names. Following the DGA, different domain names can be registered in different time periods to hide the command controller domain name actually in use, which makes discovering botnet command and control servers more difficult. The detection of DGA domain names has therefore become one of the important research topics for protection against botnets.
Disclosure of Invention
The invention aims to solve at least the above technical problems of the prior art, and in particular provides a DGA domain name detection method using a deep residual network and a character-level sliding window.
In order to achieve the above object, the present invention provides a DGA domain name detection method using a deep residual network and a character-level sliding window, comprising the following steps:
S1, acquiring domain name data and preprocessing it;
S2, performing original feature extraction on the data processed in step S1;
S3, inputting the data processed in step S2 into a deep residual network layer for further feature extraction;
S4, performing batch normalization on the data processed in step S3;
S5, classifying the domain name.
In a preferred embodiment of the present invention, the preprocessing in step S1 comprises the following steps:
S11, numericalizing the domain name: using a character-level dictionary, each character in the domain name is mapped to a one-hot encoding vector;
S12, mapping the $V_1$-dimensional one-hot encoding vector to d dimensions.
In a preferred embodiment of the present invention, step S2 includes:

Define $a_i \in \mathbb{R}^d$ as the character vector of the i-th character in a DGA domain name sample; then $a_{1:L} \in \mathbb{R}^{L \times d}$ represents an input DGA domain name.

Then define k as the length of the filter, and introduce $m \in \mathbb{R}^{k \times d}$ as the convolution filter with receptive field size k. For each position j in the sequence, there is a window vector $w_j$ of k consecutive character vectors, expressed as:

$$w_j = [a_j, a_{j+1}, \ldots, a_{j+k-1}],$$

The filter m is then convolved with the window vector at each position in 'VALID' mode to generate a feature map $A \in \mathbb{R}^{L-k+1}$. Each element $A_j$ of the feature map for window vector $w_j$ is generated as follows:

$$A_j = f(w_j \odot m + b),$$

For n filters of the same length, n feature maps can be generated, representing each window vector $w_j$ by its features:

$$W = [A_1, A_2, \ldots, A_n].$$
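As a hedged illustration of the window convolution above (the sizes below are assumptions), the computation can be sketched with a one-dimensional convolution in PyTorch:

```python
import torch
import torch.nn as nn

L, d = 40, 200          # domain name length and character-vector dimension (illustrative)
k, n = 3, 64            # filter length k and number of filters n (illustrative)

x = torch.randn(1, d, L)   # one embedded domain name, channels-first layout

# A Conv1d of kernel size k slides a window w_j = [a_j, ..., a_{j+k-1}] over the
# sequence; with no padding this is 'VALID' convolution, so each feature map has
# length L - k + 1. ReLU plays the role of f in A_j = f(w_j ⊙ m + b); bias b is included.
conv = nn.Conv1d(in_channels=d, out_channels=n, kernel_size=k)
W = torch.relu(conv(x))    # (1, n, L - k + 1): the n feature maps A_1, ..., A_n
print(W.shape)             # torch.Size([1, 64, 38])
```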
In a preferred embodiment of the present invention, the feature extraction performed by the deep residual network layer in step S3 is calculated as:

$$x_l = x'_{l-1} + H(x_{l-1}),$$

where $x'_{l-1}$ denotes the value obtained after down-sampling $x_{l-1}$; $x_{l-1}$ denotes the input of the (l-1)-th residual block; $H(x_{l-1})$ denotes the result of feature extraction by the two convolutional layers; and $x_l$ denotes the input of the l-th residual block.
In a preferred embodiment of the present invention, the batch normalization in step S4 is calculated as follows.

Calculate the mean of one mini-batch of samples:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i,$$

where m denotes the number of input samples; $x_i$ denotes the i-th input sample; and $\mu_B$ denotes the sample mean.

Calculate the mini-batch sample variance:

$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2,$$

where $\sigma_B^2$ denotes the sample variance.

Normalize the i-th input sample $x_i$:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},$$

where $\epsilon$ denotes the fitting parameter (a very small constant) and $\hat{x}_i$ denotes the normalized value. Finally:

$$y_i = \gamma \hat{x}_i + \beta,$$

where $\gamma$ denotes the first training parameter; $\beta$ denotes the second training parameter; and $y_i$ denotes the value obtained after Batch Normalization.
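The batch normalization above can be checked numerically; a minimal sketch, assuming illustrative batch and feature sizes, computes it by hand and compares with PyTorch's built-in layer:

```python
import torch

x = torch.randn(8, 64)     # a mini-batch of m = 8 samples with 64 features (illustrative)
eps = 1e-5                 # the small fitting parameter
gamma = torch.ones(64)     # first trainable parameter (initial value)
beta = torch.zeros(64)     # second trainable parameter (initial value)

mu_B = x.mean(dim=0)                          # mini-batch mean
var_B = x.var(dim=0, unbiased=False)          # mini-batch variance
x_hat = (x - mu_B) / torch.sqrt(var_B + eps)  # normalized value
y = gamma * x_hat + beta                      # value after Batch Normalization

# The built-in layer in training mode reproduces the same computation.
bn = torch.nn.BatchNorm1d(64, eps=eps)
assert torch.allclose(y, bn(x), atol=1e-5)
```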
In a preferred embodiment of the present invention, the method for classifying domain names in step S5 is:

$$P(y = j \mid x) = \frac{e^{w_j^T x}}{\sum_{k=1}^{K} e^{w_k^T x}},$$

where T denotes the matrix transpose; $w_j$ denotes the weights of the softmax function; K denotes the number of categories in the multi-classification; and $P(y = j \mid x)$ denotes the probability that the sample vector x belongs to the j-th DGA family.

The probability values $P(y = 1 \mid x), P(y = 2 \mid x), P(y = 3 \mid x), \ldots, P(y = K \mid x)$ are compared: if $P(y = 1 \mid x)$ is the greatest, x is judged to belong to the category-1 DGA family; if $P(y = 2 \mid x)$ is the greatest, x belongs to the category-2 DGA family; if $P(y = 3 \mid x)$ is the greatest, x belongs to the category-3 DGA family; and so on, up to: if $P(y = K \mid x)$ is the greatest, x belongs to the category-K DGA family.
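A minimal sketch of this decision rule, with invented logits standing in for the K scores $w_j^T x$:

```python
import torch

K = 5                                               # number of DGA families (illustrative)
logits = torch.tensor([1.2, 0.3, 2.7, -0.5, 0.9])   # hypothetical scores w_j^T x, j = 1..K

probs = torch.softmax(logits, dim=0)     # P(y = j | x): K values in (0,1) summing to 1
family = torch.argmax(probs).item() + 1  # choose the j with the greatest probability
print(probs, f"sample assigned to DGA family {family}")
```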
In a preferred embodiment of the present invention, the method for classifying domain names in step S5 may alternatively be:

$$S(x) = \frac{1}{1 + e^{-x}};$$

if $S(x) \geq 0.5$, x is judged to be a DGA domain name; if $S(x) < 0.5$, x is judged to be a legitimate domain name.
In a preferred embodiment of the present invention, the method further comprises calculating a loss value, computed as:

$$J(\theta) = -\frac{1}{B} \sum_{i=1}^{B} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right],$$

where B denotes the number of training samples in one mini-batch; $y^{(i)}$ denotes the label value of the i-th sample; $h_\theta(x^{(i)})$ denotes the value predicted by the model; and $x^{(i)}$ denotes the value of the i-th sample.
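A sketch of this mini-batch cross-entropy for the binary case, with invented labels and predictions, written out elementwise and checked against PyTorch's binary_cross_entropy:

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0, 0.0, 1.0, 1.0])   # labels y^(i) of one mini-batch (B = 4)
h = torch.tensor([0.9, 0.2, 0.7, 0.6])   # model predictions h_theta(x^(i)) in (0,1)

B = y.numel()
loss = -(y * torch.log(h) + (1 - y) * torch.log(1 - h)).sum() / B
assert torch.isclose(loss, F.binary_cross_entropy(h, y))   # same value
print(loss.item())
```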
In a preferred embodiment of the present invention, the method further comprises step S6:

S6, measuring the classification result of step S5 with one or any combination of accuracy, precision, and recall.

Accuracy is calculated as:

$$acc = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP denotes an actual DGA classified as DGA; TN denotes an actual legitimate record classified as legitimate; FP denotes an actual legitimate record classified as DGA; FN denotes an actual DGA classified as legitimate; and acc denotes the accuracy.

Precision is calculated as:

$$precision = \frac{TP}{TP + FP}.$$

Recall is calculated as:

$$recall = \frac{TP}{TP + FN}.$$
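A small sketch computing the three indexes from invented confusion counts:

```python
# Hypothetical confusion counts from a binary DGA / legitimate test run.
TP, TN, FP, FN = 950, 930, 70, 50

acc = (TP + TN) / (TP + TN + FP + FN)   # accuracy
precision = TP / (TP + FP)              # precision
recall = TP / (TP + FN)                 # recall (detection rate)
print(f"acc={acc:.3f} precision={precision:.3f} recall={recall:.3f}")
```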
In summary, by adopting the above technical scheme, the present invention can classify domain names.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic diagram of PRNG-based DGA domain name generation according to the present invention.
FIG. 2 is a schematic diagram of the SW-DRN model of the present invention.
Fig. 3 is a schematic diagram of the residual block of the present invention.
FIG. 4 is a schematic diagram of the DGA domain of Dataset 02 of the present invention.
FIG. 5 is a schematic representation of different depths of a model of the present invention.
Fig. 6 is a schematic of the training loss of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
1. Introduction
1.1 background
The bot programs used to build a botnet are pre-programmed with a designed DGA algorithm, which generates a large number of DGA domain names and periodically produces a domain name list. When certain conditions are met (within a certain time period), the botnet controller registers certain of these domain names as command and control servers for the bots to access. The bot program on an infected host then tries the domain names in the list in order; if a domain name resolves successfully and the infected host receives a response conforming to the botnet's protocol, the bot host can at that point communicate with the botnet's command and control server and complete the commands issued by the attacker. If a domain name currently used for botnet communication is discovered by network security researchers and blocked by a network operator, the attacker registers the preset next domain name in the DGA domain name list, thereby ensuring that a domain name corresponding to the command and control server can still be resolved and that communication between the command and control server and the bot hosts is maintained. In this way the botnet's resistance to detection and blocking, and hence its concealment, is improved. The technique of keeping a malicious botnet running by constantly changing the domain name of the botnet control server is called domain flux. Because of these good properties, domain flux using DGA domain names has become very popular in botnets.
The common approach to DGA domain names is to use a deterministic pseudo-random number generator (PRNG) to generate a list of candidate domain names. The seed of the PRNG may be the current time, some magic number, an exchange rate, and so on. The generator may produce the character sequence used as a domain name from a single uniformly distributed generator, for example using a combination of shift, XOR, division, multiplication, and modulo operations (e.g., Conficker, Ramnit); it may also be a rule-based generator that selects from some knowledge base (e.g., Suppobox). As shown in fig. 1, the DGA algorithm uses the current date as a seed and uses the PRNG to generate a character sequence as the DGA domain name.
Domain flux technology is widely used by botnets, so DGA domain name detection is one of the important ways to discover botnets. Early DGA domain name detection relied on blacklists, regular-expression matching, and the like, whose ability to detect complex and changing DGA domain names is limited. Later, with the rise of machine learning, detection performance gradually improved by combining massive domain name data with manual feature engineering. By applying machine learning techniques, the DGA domain names used for botnet communication can be automatically discovered in massive domain name service request data.
In recent years, deep learning has developed rapidly, continuously breaking existing performance records in tasks such as natural language processing, computer vision, and speech processing, and has been applied in concrete industrial scenarios. Deep-learning-based DGA domain name detection has likewise drawn the attention of network security researchers. The approach mainly exploits an important advantage of deep learning: it can automatically discover effective features in the data and classify on them, thereby judging whether a domain name is a DGA domain name and completing DGA domain name detection in this way. In addition, the DGA domain names generated by different botnet families can be multi-classified by family: each DGA domain name is labeled with its corresponding family, and a supervised learning procedure is then used to train a classification detector for DGA domain names.
1.2 motivation for research
DGA domain name training data come from real network environments, and the numbers of DGA domain names collected for different botnet families differ greatly, so the DGA domain name data suffer from imbalance. As a result, botnet families with few samples score lower on multiple performance metrics in the multi-classification problem. Meanwhile, existing research methods are found to have difficulty identifying domain names whose character strings are highly random and whose botnet families are similar to each other. To solve these problems, the invention designs a deep learning model, a deep residual neural network based on a character-level sliding window, and applies it to a network security scenario of practical significance: DGA domain name detection and classification. The method represents the textual DGA domain name with character-level word vectors, extracts features from receptive fields of different sizes using multi-size sliding windows, and then inputs a carefully designed deep residual network for complex feature extraction. Comparative experiments show that the method further improves DGA domain name detection and achieves the most advanced results to date.
1.3 major contributions
(1) The invention proposes a deep residual network model based on a character-level sliding window (SW-DRN) for detecting DGA domain names. The model adopts character-level vector representation, enlarges the receptive field of the convolution kernels by means of region convolution, and then carefully designs a variable-depth residual neural network to extract features. The proposed SW-DRN can perform high-dimensional, complex, and deep abstraction of the original features and can extract the effective information hidden in the data.
(2) The validity of the proposed SW-DRN model is evaluated objectively. The invention establishes two data sets: DGA domain names collected in a real network environment form Dataset 01, while DGA domain names synthesized by means of domain name generation algorithms form the comparison set Dataset 02.
(3) The SW-DRN model proposed in this patent achieves leading results on the binary and multi-classification tasks of the two data sets. Its evaluation index (macro F-score) on the multi-classification tasks of the two data sets is respectively 4.8% and 3.34% higher than that of current DGA domain name detection models.
2. Related work
The detection of DGA domain names plays a very important role in the defense against botnets. It has therefore become a very important research point in network security, with many research results presented at top information security conferences. In 2010, Antonakakis et al. proposed the domain name reputation system Notos, which establishes various manual features by analyzing Passive Domain Name System (PDNS) data, ascertains associations between known malicious domain names and IP addresses by means of two-level clustering, and finally calculates the reputation score of a domain name to judge the probability that the domain name is a DGA domain name. The Notos system relies on a large amount of malicious domain name data and cannot detect a single domain name in real time. Bilge et al. proposed another DGA domain name discovery system, EXPOSURE, which was the first to use features based on the DNS (domain name system) protocol, such as DNS resolution time features, DNS-response-based features, and features of the domain name resolution request string; these manually extracted features are then used as input to a decision-tree-based classifier for supervised training. Yadav et al. performed 1-gram and 2-gram feature statistics on domain name strings, calculated the 1-gram and 2-gram distributions of the DGA and non-DGA domain name sets, used the KL distance to measure the difference between distributions, and used the Jaccard distance to measure the difference between the clustering result and the true label values. Antonakakis et al. proposed finding potential DGA domain name families based on hidden Markov model (HMM) clustering; however, when domain name families containing a large number of domain names are classified, real-time detection of each domain name cannot be achieved, and the classification model requires manually designed features.
These DGA domain name detection methods are all based on machine learning algorithms and all require manually defined features. The development of deep learning has brought new opportunities to DGA domain name detection. In 2016, Woodbridge et al. applied deep learning to DGA domain name detection for the first time. Their method uses only the domain name string as data input and exploits deep learning to automatically extract the hidden features in the string, a leap forward for DGA domain name detection. Achieving such performance relies on the long short-term memory (LSTM) model, a recurrent neural network that can extract the effective sequential features of the domain name string. Yu et al. and Vinayakumar et al. then performed DGA domain name detection comparison experiments on different deep learning architectures, comparing various classical convolutional neural networks (CNNs) with recurrent neural networks (RNNs) and treating DGA domain name detection as a short-text classification task in Natural Language Processing (NLP). Lison et al. were the first to use a bidirectional multi-layer recurrent neural network structure, trained on large-scale DGA data, finally reaching a 93% DGA domain name detection rate. Yang et al. proposed a detection method for word-based DGAs: they first analyzed word-based DGA methods along three types of characteristics (word features, part-of-speech features, and word-correlation features), derived 16 features from this analysis, selected two typical word-based DGA methods, Matsnu and Suppobox, as test objects, and finally used a random forest classifier for detection; comparative experimental results show that the method performs better than prior methods. Tran et al. proposed a novel LSTM.MI algorithm that combines binary and multi-class classification models and takes the importance of class identification into account; LSTM.MI has been shown to significantly improve macro-averaged recall and precision compared with the original LSTM and other recent cost-sensitive methods, while maintaining an F-score of 98.49% on the non-DGA domain name class. Sivaguru et al. proposed comprehensively using information beyond the domain name string to determine whether a domain name is a DGA domain name. Since a domain name detector is usually deployed in the network, it must process domain name requests in the network in real time; Catania et al. and Highnam et al. proposed a novel hybrid neural network that analyzes domains and scores the likelihood that such algorithms generated them. That method is the first parallel use of a convolutional neural network (CNN) and a long short-term memory (LSTM) network for DGA detection, improving the running speed of domain name detection and enhancing real-time operation. Du Peng et al. proposed a DGA domain name detection model with mixed word vectors and compared experiments using the mixed-word-vector CNN-LSTM and CNN-MWE models. From the experimental results of these references, methods based on deep learning architectures are generally superior to machine learning methods based on manual features. However, deep-learning-based DGA domain name detection still has much room for improvement on the binary and multi-class tasks over DGA domain name families.
3. Proposed method
In convolutional neural networks, as the network depth increases, more complex and more abstract features can be extracted from the data, which can improve model performance. To enable the DGA domain name detector to automatically discover the features hidden in domain name strings, the invention proposes a deep residual neural network based on a character-level sliding window for classifying and identifying DGA domain names. Compared with previously proposed DGA domain name detection models, the proposed model performs well, with higher accuracy and F-score, and a higher identification rate for DGA families with few samples. Fig. 2 is the architecture diagram of the proposed SW-DRN model for DGA domain name detection.
The proposed model for DGA domain name detection with a deep residual network based on a character-level sliding window can be divided into 7 levels: a raw data input layer, a one-hot encoding layer, a character embedding layer, a region convolution layer, a deep residual network layer (the convolution repetition module), a K-max pooling and fully connected layer, and a final output layer. In the architecture of Fig. 2, data flows from the input layer at the bottom to the specific class of the output sample at the top. The proposed SW-DRN model is briefly described below.
First is the SW-DRN model input layer, which accepts the domain name in character form, with the domain name fixed at a length of L characters. The domain name is then numericalized, and each character in the domain name is mapped to a one-hot encoding vector using a character-level dictionary (details in the data processing part of the experiments). Since directly using one-hot encoding vectors would increase the computation of the model, a character embedding layer is introduced, which maps the $V_1$-dimensional one-hot vector to d dimensions, where $d \ll V_1$.
The invention adopts multi-size sliding windows, selecting three one-dimensional convolution kernels of different sizes, with lengths 1, 3, and 5 respectively, and then applies a concat operation to the extracted feature maps. The result is input into the deep residual network layer for further feature extraction. The number of layers of the deep residual network layer can be changed according to the repetition count N of the convolution repetition module in Fig. 2. As N grows, the number of feature maps generated by each convolution increases, the filter count doubling at each repetition starting from 64. Meanwhile, a max pooling layer is added at the tail of the convolution repetition module, so that the feature map length is halved from the original L each time it passes through one convolution repetition module. The purpose is to enlarge the receptive field of the feature map by reducing the length L while the convolution kernel length in the residual network layer stays unchanged; this helps the subsequent extraction of high-level, abstract global features, and enables extraction of relational features between characters at different positions in the DGA domain name.
After the deep residual network layer features are extracted in the previous step, K-max pooling is applied to the obtained feature map to extract the salient features, mitigate overfitting of the model, and improve its generalization ability. The result is then input into three fully connected layers of 2048 neural units each for feature combination, and finally the input DGA samples are classified according to the classification task type (binary or multi-class). The remainder of this section further explains parts of the proposed SW-DRN model.
3.1 embedding layer
The embedding layer further expresses the encoded information of the DGA domain name as numerical features, mapping the high-dimensional sparse one-hot matrix into a low-dimensional dense matrix. The invention performs numerical encoding at the character level; the embedding layer embeds a high-dimensional space, whose dimension is the number of all characters, into a continuous vector space of much lower dimension, mapping each character to a vector over the real number field. After the characters of a DGA domain name are embedded, similarity between characters can be expressed: for example, the vector of the character '1' is closer in the vector space to the vector of the character '6' than to that of the character 'b', because the distribution of '1' is more similar to that of '6'.
3.2 convolutional layers
Each convolutional layer in a convolutional neural network consists of several convolution units, and the parameters of each unit are optimized by the back-propagation algorithm. DGA domain name samples are all textual, so one-dimensional convolution is used, sliding across the text sequence and detecting features at different positions.

Define $a_i \in \mathbb{R}^d$ as the character vector of the i-th character in a DGA domain name sample, where $\mathbb{R}$ denotes the real number field and d denotes the dimension of a character vector, so that $a_i$ is real-valued. Then $a_{1:L} \in \mathbb{R}^{L \times d}$ represents the input DGA domain name, where a denotes the numericalized domain name and L denotes the length of the DGA domain name.

Then define k as the length of the filter, and introduce $m \in \mathbb{R}^{k \times d}$ as the convolution filter with receptive field size k. For each position j in the sequence, there is a window vector $w_j$ of k consecutive character vectors, expressed as:

$$w_j = [a_j, a_{j+1}, \ldots, a_{j+k-1}] \qquad (1)$$

The filter m is then convolved with the window vector at each position in 'VALID' mode to generate a feature map $A \in \mathbb{R}^{L-k+1}$. Each element $A_j$ of the feature map for window vector $w_j$ is generated as follows:

$$A_j = f(w_j \odot m + b) \qquad (2)$$

where $f(\cdot)$ is a non-linear activation function; $\odot$ denotes element-wise multiplication; and b is a bias term. The activation function may be a sigmoid, a hyperbolic tangent, etc.; this patent chooses ReLU as the non-linear activation function after convolution. The proposed SW-DRN model uses multiple filters to generate multiple feature maps. For n filters of the same length, n feature maps can be generated, representing each window vector $w_j$ by its features:

$$W = [A_1, A_2, \ldots, A_n] \qquad (3)$$

where $A_i$ denotes the feature map generated by the i-th filter, and each row of $W \in \mathbb{R}^{(L-k+1) \times n}$ represents the features generated from the window vector at position j by the n filters.
In the proposed SW-DRN model, the region convolution layer uses several filters of different sizes, a difference similar in spirit to n-grams: the invention performs feature extraction on the embedding layer with multi-size sliding windows, with window sizes (1, 3, 5), and fuses the features extracted by the different-size filters. With multi-size sliding windows the model can capture information of different granularities, making the feature maps richer and improving model performance (see the sketch below).

The convolution kernel length in the convolution repetition module of the invention is fixed at 3. The repetition count N of the convolution repetition module is 4; each time N increases by 1, the number of filters n of the next convolution doubles, so the filter counts are respectively 64, 128, 256, and 512.
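A minimal sketch of the multi-size region convolution, assuming 64 filters per window size and 'same'-style padding so that the three feature maps align for concatenation (the padding choice is an assumption of this sketch, not taken from the patent):

```python
import torch
import torch.nn as nn

class MultiSizeSlidingWindow(nn.Module):
    """Region convolution with window sizes (1, 3, 5); features are concatenated."""
    def __init__(self, d: int = 200, n_per_size: int = 64):
        super().__init__()
        # padding = k // 2 keeps the sequence length L so the three maps align.
        self.branches = nn.ModuleList(
            nn.Conv1d(d, n_per_size, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d, L)
        return torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)

x = torch.randn(2, 200, 40)        # two embedded domain names of length 40 (illustrative)
out = MultiSizeSlidingWindow()(x)
print(out.shape)                   # torch.Size([2, 192, 40]): 3 x 64 fused feature maps
```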
3.3 residual Block
The residual block is the basic unit of a residual network. The residual network is designed to prevent model training from becoming difficult as the number of network layers grows, when gradient explosion and gradient vanishing occur during training. Introducing the residual structure into the network model effectively alleviates these problems. Fig. 3 shows the internal structure of the residual block in the proposed SW-DRN model.
The internal structure of the residual block has two data paths. The first is the normal data flow of the model, in which feature extraction through two convolutional layers produces the result $H(x_{l-1})$. The second is the model's shortcut connection, which adds the input data $x'_{l-1}$ to $H(x_{l-1})$; here $x'_{l-1}$ is obtained by down-sampling $x_{l-1}$ so that its dimensions match those of $H(x_{l-1})$. With the residual blocks stacked as in the architecture diagram of the proposed SW-DRN model, the data flow is as shown in equation (4):

$$x_l = x'_{l-1} + H(x_{l-1}) \qquad (4)$$

where $x'_{l-1}$ denotes the value obtained after down-sampling $x_{l-1}$; $H(x_{l-1})$ denotes the result of feature extraction by the two convolutional layers; $x_{l-1}$ denotes the input of the (l-1)-th residual block; and $x_l$ denotes the input of the l-th residual block (i.e., the output of layer l-1).
In the SW-DRN architecture of Fig. 2, the depth of the SW-DRN model changes with the repetition count N of the convolution repetition module. This patent sets N = 4, following the model depths designed in the literature. To explore the performance of the SW-DRN model at different depths, it is trained and tested at depths of 9, 17, 29, and 49 layers; the comparison results are presented in the experimental section, where the optimal number of layers for the SW-DRN model is selected.

Finally, how the number of convolutional layers is increased or decreased: in a deep convolutional neural network, the model depth increases by 1 each time the data stream passes through one convolutional layer. As shown in Fig. 2, one residual block contains two convolutional layers and one convolution repetition module contains two residual blocks, so when N = 4 the deep residual network layer has 16 layers (2 × 2 × 4); adding the initial region convolution layer gives a 17-layer SW-DRN model. To obtain a 9-layer SW-DRN model, only the number of convolutional layers in the residual block needs to be reduced; similarly, increasing the number of layers of the model means increasing the number of convolutional layers in the residual block.
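A sketch of one residual block under these descriptions: two convolutional layers form H(·), and a 1×1 convolution on the shortcut produces the down-sampled x' when the channel count or sequence length changes; the exact layer sizes and the batch-normalization placement follow common residual-network practice and are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Implements x_l = x'_{l-1} + H(x_{l-1}) with a down-sampling shortcut."""
    def __init__(self, c_in: int, c_out: int, stride: int = 1):
        super().__init__()
        self.H = nn.Sequential(               # two convolutional layers
            nn.Conv1d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm1d(c_out),
            nn.ReLU(),
            nn.Conv1d(c_out, c_out, kernel_size=3, padding=1),
            nn.BatchNorm1d(c_out),
        )
        # Shortcut: identity when shapes match, otherwise down-sample x to x'.
        if stride != 1 or c_in != c_out:
            self.shortcut = nn.Conv1d(c_in, c_out, kernel_size=1, stride=stride)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.H(x) + self.shortcut(x))

block = ResidualBlock(64, 128, stride=2)     # doubles channels, halves sequence length
print(block(torch.randn(2, 64, 40)).shape)   # torch.Size([2, 128, 20])
```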
3.4 pooling layer
The invention uses two pooling modes: max pooling is used inside the convolution repetition module in Fig. 2, and K-max pooling is used for global information sampling at the end of the model.

Max pooling extracts the salient features in the feature map: it takes the maximum value within a filter window, reduces the model's computation, and prevents overfitting, increasing the model's generalization ability. K-max pooling differs from max pooling in that it extracts the top K maximum values within one filter, so the model can capture richer feature information, improving model performance.
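K-max pooling is not a built-in PyTorch layer; a common sketch implements it with torch.topk over the sequence axis, keeping the K largest activations per feature map (K and the tensor sizes below are illustrative):

```python
import torch

def k_max_pool(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest values of each feature map; x is (batch, channels, length)."""
    values, _ = x.topk(k, dim=-1)   # note: values come sorted by magnitude, not position
    return values

x = torch.randn(2, 512, 10)         # feature maps at the end of the model (illustrative)
print(k_max_pool(x, k=3).shape)     # torch.Size([2, 512, 3])
print(k_max_pool(x, k=1).squeeze(-1).equal(x.max(dim=-1).values))  # max pooling is k = 1
```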
3.5 batch normalization
Batch Normalization is a technique used to improve the speed, performance, and stability of artificial neural networks. When samples $x_i$ are input into the SW-DRN model, the distribution of the samples gradually shifts or fluctuates as the network depth grows during training, causing gradient vanishing in the neural network's back-propagation. The technique therefore normalizes its input data by re-centering and re-scaling. As shown in equations (5)-(8), the input to the Batch Normalization layer is one mini-batch, defined as $B = \{x_1, x_2, x_3, \ldots, x_m\}$, where m denotes the number of input samples.
Calculate the mean $\mu_B$ of a mini-batch of samples:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i \qquad (5)$$

Calculate the mini-batch sample variance $\sigma_B^2$:

$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \qquad (6)$$

The mean and variance computed by equations (5) and (6) are then used to normalize the data x:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad (7)$$

where $\hat{x}_i$ represents the normalized value and $\epsilon$ represents the fitting parameter, a very small number whose purpose is to prevent the denominator of equation (7) from being 0.

Finally, the generalization performance of the model is improved, and gradient explosion and gradient vanishing after the network deepens are prevented, by applying two trainable parameters $\gamma$ and $\beta$ in Batch Normalization to $\hat{x}_i$, obtaining the value $y_i$ after Batch Normalization; $\gamma$ denotes the first training parameter and $\beta$ denotes the second training parameter:

$$y_i = \gamma \hat{x}_i + \beta = \mathrm{BN}_{\gamma,\beta}(x_i) \qquad (8)$$

where BN is the abbreviation of Batch Normalization.
3.6 Classification
The proposed SW-DRN model can perform both binary and multi-class classification tasks. Judging whether a domain name is a DGA domain name is a binary task; classifying a DGA domain name into a specific family is a multi-class task. The classification function finally used at the output differs with the task type.
In the binary task, the output layer adopts the Sigmoid function, an activation function that maps an input value into (0, 1). The threshold separating positive and negative samples in this invention is 0.5: when the value after the Sigmoid function is $\geq$ 0.5, the sample is positive, i.e., a DGA domain name; when it is < 0.5, the sample is negative, i.e., a legitimate domain name. The Sigmoid function is shown in equation (9):

$$S(x) = \frac{1}{1 + e^{-x}} \qquad (9)$$

where x represents the sample vector or confidence.
In the multi-class task, the output layer uses the Softmax function, another activation function, which maps the input values into a one-dimensional vector of K values, where K denotes the number of classes in the multi-classification. Each of the K values lies in (0, 1), but they always sum to 1; the model finally outputs the category corresponding to the maximum value as the prediction for the sample. Equation (10) expresses the confidence (probability) that the sample vector x belongs to the j-th category DGA family:

$$P(y = j \mid x) = \frac{e^{w_j^T x}}{\sum_{k=1}^{K} e^{w_k^T x}} \qquad (10)$$

where T denotes the matrix transpose and $w_j$ denotes the weights of the softmax function. The probabilities $P(y = 1 \mid x), P(y = 2 \mid x), \ldots, P(y = K \mid x)$ are compared, and if $P(y = j \mid x)$ is the greatest, x is judged to belong to the j-th category DGA family.
3.7 loss function
The method uses a cross-entropy loss function, also called the objective function of the model, to calculate the loss between the ground truth and the model's predicted value; back-propagation is then performed, and all parameters θ of the SW-DRN model are updated by gradient descent, finally obtaining an approximately optimal solution of the model.

Equation (11) is the cross-entropy loss expression, suitable for the computation of binary or multi-class loss functions.
$$J(\theta) = -\frac{1}{B} \sum_{i=1}^{B} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] \qquad (11)$$

where B denotes the number of training samples in one mini-batch; $y^{(i)}$ denotes the label value of the i-th sample; $h_\theta(x^{(i)})$ denotes the value predicted by the model; $x^{(i)}$ denotes the value of the i-th sample; $\theta$ denotes the weights trained in the model; and $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(B)}, y^{(B)})\}$ is a mini-batch of data samples with their label values.
4. Experiment and result analysis
This section presents the experiments of SW-DRN on DGA domain name detection. The model is implemented with PyTorch(1), a framework for high-level deep learning programming that integrates implementations of various neural networks and allows the proposed SW-DRN model to be implemented quickly and conveniently. The PyTorch framework is based on the Python programming language and supports accelerated model training on the GPU.
4.1 Experimental Environment
4.1.1 software and hardware Environment
TABLE 1 Basic experimental environment

Development environment | Parameter(s)
Processor | Intel(R) Core(TM) i7-4790K CPU @ 4.0GHz
Memory | 32GB
Accelerator | NVIDIA GeForce RTX 2080 Ti
Operating system | Ubuntu 18.04.4 LTS
IDE | PyCharm
Development language | Python
Third-party libraries | numpy, matplotlib, pandas, sklearn, seaborn, tqdm, tensorboardX
4.1.2 Experimental hyperparameters
Hyper-parameters of the model: initial learning rate 0.01; the learning rate is halved every 50 epochs; gradient optimizer Adam, with parameters β = (0.9, 0.99); 200 epochs in total; model weights initialized with Kaiming normal initialization; embedding layer dimension d = 200.
4.2 data set
The invention demonstrates the effectiveness of SW-DRN on different evaluation indexes through experiments. The experimental data fall into two types: one contains DGA domain names from the real network world, defined as Dataset 01; the other contains illegal domain names generated by means of DGA domain name generation algorithms, defined as Dataset 02. The basic composition of these two data sets is further described separately below.
4.2.1 Dataset 01
The Dataset 01 data set consists of two parts. The first part contains legitimate domain name samples, taken from the website domain names ranked in the global top 1,000,000 by Alexa(2) traffic; the other part uses the 360Netlab DGA public data(3) as DGA domain name samples.
As of January 2020, the 360Netlab DGA data contains 44 DGA domain name families, and each domain name record contains the family class of the domain name together with start and end verification times. Because the data imbalance factor would bias model training toward some categories, the invention removes all DGA families with fewer than 1,000 samples and down-samples the DGA families with more than 20,000 samples to 20,000, forming the positive samples of the Dataset 01 data set. On the other hand, given their huge daily traffic, the domain names in the Alexa data are regarded as legitimate non-DGA domain names in various DGA domain name recognition studies; to match the number of positive samples and facilitate model training, the Alexa domain names are also sampled, and the sampled domains serve as the negative samples of Dataset 01. As shown in Table 2, the invention performs a statistical analysis of the Dataset 01 data set, including the number of samples of each class, the maximum and minimum domain name lengths, and the number of distinct characters used by each class of domain name; these statistics support the subsequent data processing of the invention.
TABLE 2 Dataset 01 Dataset statistics
4.2.2 Dataset 02
The invention not only collects DGA domain name samples from the real network environment but also generates DGA domain name samples through domain name generation algorithms, which together with the Alexa domain names as legitimate domain names form the data set Dataset 02. The invention collects mainstream domain name generation algorithms(4) from the internet and then generates DGA domain names of 33 different families according to the different algorithms and their corresponding conditions. The treemap of Fig. 4 shows the number in each family. Likewise, to match the number of DGA samples, 600,000 domain names are sampled from Alexa.
The obtained Dataset 01 and Dataset 02 each require further partitioning into a training set and a test set. The training set is the data used for model learning, and the test set is used to verify the performance of the learned model. The training and test sets are split at a ratio of 8:2.
4.3 data processing
The proposed SW-DRN model can only process numericalized tensors; it cannot directly process data in character form. In the field of natural language processing, word vector techniques are commonly employed for this problem. First a sentence is segmented into independent words; the words are then vectorized, each encoded with one-hot coding into a vector of length V, where V is the size of the lexicon, its value depending on the number of distinct words appearing in the corpus. After one-hot encoding, every word in the dictionary is represented by a vector $(o_1, o_2, \ldots, o_V)$ in which only one of the V elements is 1 and the remaining elements are 0. Since the one-hot vector dimension is the dictionary size V, directly inputting one-hot encoded vectors into the deep learning model would give the model a huge number of parameters and sparse data. Meanwhile, one-hot coding ignores the semantic relations that exist between words, which leads to poor model training.
Considering these effects of directly inputting one-hot encoded vectors into the deep learning model, the one-hot vectors need further processing. A word embedding layer is therefore introduced, which generates a word vector for each word or phrase. A word vector maps each word to a vector over the real number field whose dimension is much lower than V, with different dimensions carrying different semantic information. In effect, the word embedding layer also compresses the dimensions of the one-hot vectors, compressing high-dimensional sparse vectors into low-dimensional dense vectors and reducing the number of computed parameters. In common natural language processing tasks, word vectors are often generated with methods such as Word2Vec.
In the DGA domain name detection problem, although legitimate domain names have certain semantics and can be represented by word vectors, a large number of DGA domain names consist of seemingly irregular, disordered characters, and most are not words that exist in natural language, so it is difficult to express them directly with a word embedding method such as Word2Vec. To solve this problem, the invention adopts a char-level approach: each character in the domain name is treated as an independent 'word', and a character vector is generated for each character. A complete domain name is denoted X; formula (12) gives the composition of X, where c denotes a character:

$$X = (c_1, c_2, c_3, \ldots, c_L) \qquad (12)$$

where $c_i$ denotes the i-th character in the domain name, and L is defined as the length of the domain name. The data processing in the invention fixes the domain name length L, namely the maximum domain name length: when a DGA domain name sample is shorter than L, it is padded with 0 from the end of the string; when a DGA domain name sample exceeds L, the tail is truncated.

From the naming rules of domain names, domain names are found to be case-insensitive, so uppercase characters in all domain names are converted to lowercase. The input character set consists of the lowercase letters 'a'-'z', the digits '0'-'9', and special characters such as '-', 38 characters in total; $V_1$ denotes this character set dictionary. The character set dictionary is then used to establish a one-hot encoding mapping of dimension $V_1$ = 38, as shown in equation (13): each character $c_i$ is mapped to a 38-dimensional vector whose element at the dictionary index of $c_i$ is 1, with all other elements 0. Finally, the one-hot codes are input into the embedding layer for dimension compression.
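A sketch of this preprocessing with hypothetical names: lowercase the domain name, encode each character against the dictionary, then 0-pad from the end or truncate the tail to the fixed length L:

```python
import torch

# Illustrative stand-in for the 38-character dictionary V1 (index 0 is the padding code).
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789-"
char2idx = {c: i + 1 for i, c in enumerate(CHARSET)}

def encode_domain(domain: str, L: int = 40) -> torch.Tensor:
    """Lowercase, index-encode, then 0-pad or tail-truncate to the fixed length L."""
    idx = [char2idx.get(c, 0) for c in domain.lower()]
    idx = idx[:L] + [0] * max(0, L - len(idx))
    return torch.tensor(idx)

print(encode_domain("ExampleDGA123"))   # shorter than L: 0-padded at the end
print(encode_domain("x" * 60).shape)    # longer than L: tail-truncated to 40
```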
4.4 measurement index
The structure of the proposed SW-DRN is shown in Fig. 2. The model performs both binary and multi-class tasks: the binary task judges whether a sample is a DGA domain name, and the multi-class task further classifies DGA domain names by family. A multi-classification problem can usually be decomposed into several binary classification problems. Table 3 shows the binary confusion matrix.
TABLE 3 Classification result confusion matrix
                    Predicted DGA    Predicted legitimate
Actual DGA               TP                  FN
Actual legitimate        FP                  TN
TP: an actual DGA is classified as DGA;
TN: an actual legitimate record is classified as a legitimate record;
FP: an actual legitimate record is classified as DGA, a condition also known as a false alarm;
FN: an actual DGA is classified as a legitimate record.
The following criteria are used to evaluate the performance of the proposed SW-DRN model.
Accuracy:

$$acc = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision:

$$precision = \frac{TP}{TP + FP}$$

Recall (detection rate):

$$recall = \frac{TP}{TP + FN}$$
in the multi-classification task, one multi-classification task is divided into K two-classification tasks, and then K two-classification confusion matrixes exist, so that two different modes, namely 'macro' and 'micro', are adopted when precision, recall and F-score are calculated.
The 'macro' method first computes precision and recall on each confusion matrix, giving $(precision_i, recall_i)$, and then takes their averages, finally computing macro_P, macro_R, and macro_F-score as follows:

$$macro\_P = \frac{1}{K} \sum_{i=1}^{K} precision_i$$

$$macro\_R = \frac{1}{K} \sum_{i=1}^{K} recall_i$$

$$macro\_F\text{-}score = \frac{(1 + \beta^2) \times macro\_P \times macro\_R}{\beta^2 \times macro\_P + macro\_R}$$
In the 'micro' mode, by contrast, the corresponding elements of the K confusion matrices are averaged to obtain the mean values of TP, FP, TN and FN, denoted TP_avg, FP_avg, TN_avg and FN_avg respectively.
micro_P, micro_R and micro_F-score are then computed from these average values as follows:
micro_P = TP_avg / (TP_avg + FP_avg)
micro_R = TP_avg / (TP_avg + FN_avg)
micro_F-score = ((1 + β²) × micro_P × micro_R) / (β² × micro_P + micro_R)
Both macro_F-score and micro_F-score balance the two factors of precision and detection rate into a single comprehensive measure, and are therefore effective metrics for evaluating the overall detection ability of a model. Here β is a weighting factor, set to 1 throughout this patent.
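The following Python sketch illustrates the two averaging modes over K per-class confusion matrices; the (TP, FP, TN, FN) tuple representation and the sample values are assumptions made for illustration.

# 'macro' and 'micro' averaging over K binary confusion matrices (TP, FP, TN, FN).
def macro_micro(matrices, beta=1.0):
    # macro: compute precision_i / recall_i per matrix, then average
    precs = [tp / (tp + fp) for tp, fp, tn, fn in matrices]
    recs = [tp / (tp + fn) for tp, fp, tn, fn in matrices]
    macro_p = sum(precs) / len(precs)
    macro_r = sum(recs) / len(recs)
    macro_f = (1 + beta ** 2) * macro_p * macro_r / (beta ** 2 * macro_p + macro_r)
    # micro: average the matrix elements first, then compute P/R/F once
    k = len(matrices)
    tp_avg, fp_avg, tn_avg, fn_avg = (sum(m[i] for m in matrices) / k for i in range(4))
    micro_p = tp_avg / (tp_avg + fp_avg)
    micro_r = tp_avg / (tp_avg + fn_avg)
    micro_f = (1 + beta ** 2) * micro_p * micro_r / (beta ** 2 * micro_p + micro_r)
    return (macro_p, macro_r, macro_f), (micro_p, micro_r, micro_f)

print(macro_micro([(90, 5, 100, 10), (40, 8, 150, 6)]))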
4.5 Comparative Model Analysis
The invention proposes a deep residual network model based on a character-level sliding window to detect DGA domain names. To verify that the proposed model outperforms previous research methods, DGA domain names collected in a real network environment form the data set Dataset 01, and DGA domain names generated by domain generation algorithms form Dataset 02. Meanwhile, because some classes contain very few samples, the data sets are severely imbalanced; the invention therefore undersamples the data, i.e., the classes with few samples are kept unchanged while the classes with many samples are down-sampled.
The same comparison models are used for the binary and multi-class experiments on Dataset 01 and Dataset 02. Five models are adopted for comparison: LSTM, GRU, CNN-LSTM, Shallow-CNN and Attention-LSTM. The classical recurrent network LSTM and its streamlined variant GRU have achieved good results in natural language processing tasks such as text classification, machine translation and text summarization. Researchers later augmented LSTM with an attention mechanism, which focuses on important features and ignores those unimportant to the task. Subsequently it was found that one-dimensional convolutional neural networks outperform recurrent networks on several natural language processing tasks, and that convolutional networks train faster under GPU acceleration than sequential recurrent networks. The second part of the related work shows that traditional machine learning models are clearly inferior to deep learning models for binary and multi-class DGA domain name detection: machine learning models require manual feature engineering, which increases operational complexity and difficulty and places high demands on security researchers.
A deep learning model, by contrast, extracts features from the data automatically, greatly improving efficiency. Therefore, in the experiments of this patent, the five deep learning models above are selected as the comparison group for the proposed SW-DRN model.
All experimental results reported in this patent are the performance of the models on the test set. This section mainly presents, in tabular form, the binary and multi-class results of the proposed SW-DRN and the comparison models on Dataset 01 and Dataset 02. The binary results are shown in Tables 4 and 5, where AUC is added beside Acc, precision, recall and F-score so that the SW-DRN model can be evaluated from multiple dimensions. The multi-class results are shown in Tables 6 and 7; owing to space limits not all metrics can be shown, so three representative metrics are selected: precision (abbreviated prec), recall (also called Detection Rate, DR, in DGA domain name detection) and F-score. Whether for the binary or the multi-class model, each metric lies between 0 and 1, and higher values indicate better model performance; however, no single metric can serve as the sole evaluation standard, and they must be considered together. For intuitive display, this patent reports the metrics in percent.
The binary results on Dataset 01 are shown in Table 4. Both the proposed SW-DRN model and the comparison models achieve good results on the five metrics, demonstrating the excellent performance of deep learning in DGA domain name detection. Because the DGA and legitimate domain names in Dataset 01 are relatively easy to distinguish and every metric lies in the 99% range, the proposed SW-DRN leads the comparison models only slightly on this task. This does not mean SW-DRN has no advantage: the Dataset 01 test set contains many samples, and on the comprehensive F-score metric SW-DRN is 0.06% higher than Attention-LSTM, the best comparison model; the additionally recognized samples are precisely those that other models find hard to distinguish. The SW-DRN model not only uses windows of multiple sizes to collect raw feature information but also employs a deeper convolutional network with residual connections, so it can extract highly abstract, high-level features and clearly distinguish DGA domain names that appear highly random.
TABLE 4 Binary classification results on Dataset 01, in percent (%)
Model Acc precision recall F-score AUC
LSTM 99.13 99.13 99.12 99.12 99.12
GRU 98.81 98.80 98.77 98.79 98.77
CNN-LSTM 99.08 99.07 99.07 99.06 99.06
Shallow-CNN 98.65 98.65 98.61 98.63 98.61
Attention-LSTM 99.14 99.13 99.13 99.13 99.13
SW-DRN 99.20 99.19 99.19 99.19 99.19
Table 5 shows the evaluation of each model on Dataset 02. The proposed SW-DRN is superior to the comparison models on all five performance metrics, though it should be noted that its margin over the other models is not large: LSTM and CNN-LSTM are second only to SW-DRN. SW-DRN does not reach its Dataset 01 performance on Dataset 02, mainly because Dataset 02 contains more DGA families and covers a more comprehensive range of classes.
TABLE 5 Binary classification results on Dataset 02, in percent (%)
Model Acc precision recall F-Score AUC
LSTM 97.35 97.36 97.36 97.35 97.36
GRU 96.14 96.19 96.17 96.14 96.17
CNN-LSTM 97.21 97.21 97.22 97.21 97.23
Shallow-CNN 96.92 96.93 96.93 96.92 96.93
Attention-LSTM 92.42 92.45 92.44 92.42 92.44
SW-DRN 97.71 97.72 97.72 97.71 97.73
TABLE 6 Multi-classification results on Dataset 01, in percent (%)
(Per-family prec, DR and F-score values are reproduced as an image in the original publication.)
TABLE 7 Multi-classification results on Dataset 02, in percent (%)
(Per-family prec, DR and F-score values are reproduced as an image in the original publication.)
In the multi-class DGA experiments, the classification model must perform fine-grained detection of the domain names of each DGA family. Table 6 shows the results of SW-DRN and the comparison models on the three metrics over the 21 DGA families of Dataset 01. According to these results, SW-DRN exceeds the comparison models by 4.8% and 1.14% on the overall multi-class F-score (the comprehensive index balancing precision and recall) under the macro and micro modes respectively. Analysis of the per-family data shows that the proposed SW-DRN achieves leading performance on most DGA families, and even where it fails to surpass the comparison models it follows closely. Inspecting the per-family detection performance in Table 6 further reveals that the LSTM, GRU, Attention-LSTM and CNN-LSTM models exhibit zero detection on small-sample domain name families, i.e., these four comparison models score 0 on those DGA families and cannot identify them at all. This zero-detection phenomenon arises mainly because those families have too few samples in the data set, creating a data-imbalance problem. Although this patent undersamples Dataset 01, the imbalance problem that has always accompanied deep learning cannot be solved completely, because data for some classes are simply hard to collect. The proposed SW-DRN model handles small-sample recognition well: it outperforms existing deep-learning-based DGA detection methods on small samples, and its F-score exceeds 98% on four small-sample DGA families, shifu, qadars, chinad and dyre, achieving a good detection effect. The excellent performance of SW-DRN stems mainly from applying convolution kernels with multi-size sliding windows to the character-level embedding layer: kernels of different sizes perceive the data under different views and thus capture very rich raw feature information; the deep residual network then fits the raw features in a higher-dimensional, more complex way; finally, abstract high-level features are extracted to decide the DGA family of each sample.
The multi-class performance of the proposed SW-DRN and the comparison models on Dataset 02 is analyzed next; the results are shown in Table 7. Since Dataset 02 is constructed with domain generation algorithms and is not constrained by data-collection difficulty, 33 different DGA algorithms are selected to generate the corresponding domain names. From the data in Table 7 it is easy to see that SW-DRN again exceeds the comparison models on the overall multi-class F-score, by 3.34% under macro and 3.49% under micro. The experimental data also show that the LSTM, GRU, Attention-LSTM and CNN-LSTM comparison models exhibit zero detection on some DGA families. For LSTM, GRU and CNN-LSTM, zero detection appears not only on small-sample DGA families but also on families with sufficient samples such as dircrypt, nymaim and bubble. The reason is that current deep-learning-based DGA detection models struggle with the high randomness of dircrypt, nymaim and fobber, and similar DGA families are hard to tell apart. The proposed SW-DRN shows no zero detection on any of the 33 DGA families and detects most classes well; only a few classes fail to reach ideal scores on the evaluation metrics.
4.6 Model Depth and Loss
The depth of the proposed SW-DRN model depends on the number of convolution layers inside a residual block and the number of times the block is repeated. The invention fixes the number of repetitions at N = 4, so the model depth is determined directly by the number of convolutions inside the residual block. By increasing or decreasing the convolution layers inside the block, SW-DRN models of 9, 17, 29 and 49 layers are built, and binary and multi-class experiments are run on Dataset 01 and Dataset 02. The results are shown in fig. 5, which displays the 9-, 17-, 29- and 49-layer SW-DRN models on the Acc, precision, recall and F-score metrics. As fig. 5 shows, the metrics rise steadily as the depth grows from 9 to 17 to 29 layers, but the improvement almost stops at 49 layers; increasing the depth further would cause overfitting and directly reduce performance. Therefore this patent finally selects 29 layers as the optimal model depth.
The method uses the cross-entropy loss function of formula (11) to compute the loss between the model's predictions and the ground truth, updates the model parameters by gradient back-propagation, and trains the model iteratively. In fig. 6, the left plot shows the loss value at each epoch while training the comparison models, and the right plot shows the loss while training SW-DRN models of different depths. The first-epoch losses of the comparison models are all high, above 2.7; as training iterates the loss gradually declines and performance improves, but the decline slows after epoch 50, and after epoch 100 the loss hardly falls at all, reaching a plateau, with the final losses of the comparison models all above 2.4. Combined with the binary and multi-class results on the data sets, this shows that the lower a model's training loss, the better its classification results. The SW-DRN training-loss plot on the right confirms this: the first-epoch loss of SW-DRN is below 0.5, the loss keeps decreasing over epochs 1-100 as the model continues to learn, and epochs 101-200 form a plateau. The very small final training loss of SW-DRN shows that it learns the training data well, which is why it performs so well on the test set. Further analysis of the SW-DRN loss curves shows that different depths descend at different speeds: over the first 50 epochs the losses of the shallower models fall faster, because shallower models have fewer parameters and larger update gradients; as the epochs increase, the deeper SW-DRN models overtake the shallower ones thanks to their stronger fitting ability.
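A minimal PyTorch-style sketch of this cross-entropy training loop is given below; the tiny linear model, the random data and the three epochs are placeholders standing in for SW-DRN, the domain name data sets and the 200-epoch schedule, not the actual experimental setup.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(64, 2))        # dummy binary classifier
loader = [(torch.randn(8, 1, 64), torch.randint(0, 2, (8,))) for _ in range(10)]
criterion = nn.CrossEntropyLoss()                            # cross-entropy loss, cf. formula (11)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    total = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)    # loss between prediction and ground truth
        loss.backward()                  # gradient back-propagation
        optimizer.step()                 # parameter update
        total += loss.item()
    print(f"epoch {epoch + 1}: mean loss {total / len(loader):.4f}")  # per-epoch loss as in fig. 6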
Analysis of the comparison experiments shows that the proposed SW-DRN model is not only superior to the other models on the binary task but also achieves remarkable performance on the multi-class task, with especially clear advantages in recognizing small-sample DGA families, domain names with higher character randomness, and highly similar DGA domain names. By convolving over regions, SW-DRN obtains receptive fields at multiple scales, greatly enriching the extracted features; these features are then fed into a variable-depth residual neural network for further complex, nonlinear transformation.
In the experimental data of this patent, some categories of the DGA domain name families in Dataset 01 are difficult to collect, and the extremely small amount of data obtained for them is insufficient for model training, so not all of the current mainstream DGA family domain names could be fully covered. For this reason the invention synthesizes Dataset 02 with DGA domain name algorithms for comparison with Dataset 01. The Dataset 02 experiments show that the proposed SW-DRN is superior to the other comparison models in recognizing more DGA family categories, demonstrating its strong classification performance. Although SW-DRN and the other models retain some advantage on small-sample DGA families, SW-DRN cannot raise the recognition rate of every small-sample family to the level of the other families. Whether DGA detection is based on machine learning or deep learning, the problems caused by class imbalance cannot be solved completely: a model acquires knowledge from training data, and during training it is biased toward the classes with many samples, so its recognition of small-sample classes is poor. Secondly, the randomness of the domain name strings is high; for DGA families such as dircrypt, proslikefan, qakbot, nymaim and pykspa, the string information alone is insufficient for judgment, and the similarity between these DGA domain names is high, making the classes hard to distinguish.
In an increasingly complex network world, blocking the communication established through domain generation algorithms (DGA) between command-and-control servers and zombie hosts in a botnet is an important subject, and the invention proposes a deep residual neural network model based on a character-level sliding window (SW-DRN). Experiments prove that SW-DRN outperforms the comparison models on the binary task, with F-scores of 99.19% and 97.71% on Dataset 01 and Dataset 02 respectively. SW-DRN also achieves the best performance on the multi-class task, further improving over existing DGA domain name classification models on small-sample DGA families and on highly random, easily confused DGA domain names, raising the macro F-score by 4.8% and 3.34% on the two data sets. The invention further examines the proposed SW-DRN by exploring different depths through the variable-length residual module, and fig. 5 shows that the model performs best at a depth of 29 layers. Finally, the training loss curves in fig. 6 show that the proposed SW-DRN has strong learning ability and strong generalization on the data. Although the proposed model performs well on every evaluation metric, it is not perfect and leaves room for further improvement; in particular, for DGA domain name detection, whether data can be processed in real time is crucial for today's network nodes, and that is precisely the next step of this work.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (6)

1. A DGA domain name detection method using a depth residual error network and a character-level sliding window is characterized by comprising the following steps:
S1, acquiring domain name data, and preprocessing the acquired domain name data;
S2, performing raw feature extraction on the data processed in step S1;
S3, inputting the data processed in step S2 into a deep residual network layer for further feature extraction;
S4, performing batch normalization on the data processed in step S3;
S5, classifying the domain name.
2. The DGA domain name detection method using a depth residual error network and a character-level sliding window according to claim 1, wherein the preprocessing of the data in step S1 comprises the following steps:
S11, numericalizing the domain name: using a character-level dictionary to map each character in the domain name to a one-hot encoded vector;
S12, mapping the V1-dimensional one-hot vector to d dimensions.
3. The DGA domain name detection method using a deep residual error network and a character level sliding window according to claim 1, wherein step S2 comprises:
define ai ∈ R^d as the character vector of the i-th character in a DGA domain name sample; then use
a1:L = [a1, a2, ..., aL]
to represent the input DGA domain name;
then define k as the length of the filter and introduce
m ∈ R^(k×d)
as the convolution filter of field size k; for each position j in the domain name there is a window vector wj of k consecutive character vectors, expressed as:
wj = [aj, aj+1, ..., aj+k-1],
the filter m is then convolved with the window vector at each position in a 'VALID' manner to generate a feature map
A ∈ R^(L-k+1);
each element Aj of the feature map of window vector wj is generated as follows:
Aj = f(wj ⊙ m + b),
for n filters of the same length, n feature maps can be generated, and the feature of each window vector wj is W = [A1, A2, ..., An].
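As a non-authoritative illustration of the sliding-window convolution in claim 3 (not part of the claim itself), the following PyTorch sketch applies 'VALID' one-dimensional convolutions with several window sizes k over character embeddings; the embedding dimension, the number of filters and the window sizes are assumed values.

import torch
import torch.nn as nn

d, L, n = 32, 64, 64                                 # assumed embedding dim, length, filter count
embed = nn.Embedding(39, d, padding_idx=0)           # 38 dictionary characters + padding id
convs = nn.ModuleList(nn.Conv1d(d, n, kernel_size=k) for k in (2, 3, 4))  # multi-size windows

x = torch.randint(0, 39, (8, L))                     # a batch of 8 encoded domain names
a = embed(x).transpose(1, 2)                         # shape (batch, d, L) as Conv1d expects
# No padding, i.e. 'VALID': each output has length L - k + 1, and each element
# corresponds to Aj = f(wj ⊙ m + b), with f taken here as ReLU.
feature_maps = [torch.relu(conv(a)) for conv in convs]
print([fm.shape for fm in feature_maps])             # [(8,64,63), (8,64,62), (8,64,61)]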
4. The DGA domain name detection method using a depth residual error network and a character-level sliding window according to claim 1, wherein the calculation method for extracting features with the deep residual network layer in step S3 is:
xl = x'l-1 + H(xl-1),
wherein x'l-1 denotes the value obtained after down-sampling xl-1;
xl-1 denotes the input of the (l-1)-th residual block;
H(xl-1) denotes the result of feature extraction by the two convolution layers;
xl denotes the input of the l-th residual block.
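The residual computation of claim 4 can be sketched in PyTorch as follows, purely for illustration; the channel counts, kernel sizes and the strided 1×1 convolution used as the down-sampling shortcut are assumptions.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Sketch of xl = x'l-1 + H(xl-1): two convolution layers form H(.), and a
    # strided 1x1 convolution produces the down-sampled shortcut x'l-1.
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.h = nn.Sequential(
            nn.Conv1d(c_in, c_out, 3, stride=stride, padding=1),
            nn.BatchNorm1d(c_out), nn.ReLU(),
            nn.Conv1d(c_out, c_out, 3, padding=1),
            nn.BatchNorm1d(c_out),
        )
        self.down = nn.Conv1d(c_in, c_out, 1, stride=stride)

    def forward(self, x):
        return torch.relu(self.down(x) + self.h(x))   # x'l-1 + H(xl-1)

block = ResidualBlock(64, 128)
print(block(torch.randn(8, 64, 63)).shape)            # torch.Size([8, 128, 32])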
5. The DGA domain name detection method using the deep residual error network and the character level sliding window according to claim 1, wherein the calculation method of batch normalization in step S4 is:
calculating the mean of one mini-batch of samples:
μB = (1/m) Σ(i=1..m) xi
wherein m denotes the number of input samples;
xi denotes the i-th input sample;
μB denotes the sample mean;
calculating the variance of one mini-batch of samples:
σB² = (1/m) Σ(i=1..m) (xi − μB)²
wherein m denotes the number of input samples;
xi denotes the i-th input sample;
μB denotes the sample mean;
σB² denotes the sample variance;
normalizing the i-th input sample xi:
x̂i = (xi − μB) / √(σB² + ε)
wherein xi denotes the i-th input sample;
μB denotes the sample mean;
σB² denotes the sample variance;
ε denotes the fitting parameter;
x̂i denotes the normalized value;
Figure FDA0002651692170000032
wherein γ represents a first training parameter;
Figure FDA0002651692170000033
represents a normalized value;
β represents a second training parameter;
yiindicates the value obtained after Batch Normalization.
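A NumPy sketch reproducing the four batch-normalization formulas of claim 5 follows, purely for illustration; the value of ε and the default γ and β are assumptions.

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean()                           # mini-batch mean, muB
    var = ((x - mu) ** 2).mean()            # mini-batch variance, sigmaB^2
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized value
    return gamma * x_hat + beta             # scale and shift: yi

x = np.array([0.5, 1.5, 2.0, 4.0])          # one mini-batch of m = 4 samples
print(batch_norm(x))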
6. The method for detecting domain name DGA using deep residual error network and character level sliding window according to claim 1, wherein in step S5, the domain name classification method is:
Figure FDA0002651692170000034
wherein T represents a transpose of the matrix;
wjrepresents the weight of the softmax function;
k represents the number of categories of the multi-classification;
p (y ═ j | x) represents the probability that the sample vector x belongs to the j-th DGA family;
comparing the probability values P(y = 1 | x), P(y = 2 | x), P(y = 3 | x), ..., P(y = K | x), where P(y = j | x) denotes the probability that the sample vector x belongs to the j-th category DGA family:
if P(y = 1 | x) is the greatest, x is judged to belong to the 1st category DGA family;
if P(y = 2 | x) is the greatest, x is judged to belong to the 2nd category DGA family;
if P(y = 3 | x) is the greatest, x is judged to belong to the 3rd category DGA family;
......;
if P(y = K | x) is the greatest, x is judged to belong to the K-th category DGA family.
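The softmax family assignment of claim 6 can be illustrated with the short NumPy sketch below; the class count K, the feature dimension and the random weights are placeholder values, not the trained model.

import numpy as np

rng = np.random.default_rng(0)
K, d = 5, 16                                # assumed number of DGA families and feature dim
W = rng.normal(size=(K, d))                 # softmax weight vectors wj
x = rng.normal(size=d)                      # sample feature vector

logits = W @ x                              # wj^T x for every class j
p = np.exp(logits - logits.max())           # subtract the max for numerical stability
p /= p.sum()                                # P(y = j | x) for j = 1..K
print(p, "-> predicted family:", int(p.argmax()) + 1)   # class with the greatest probability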
CN202010872894.9A 2020-08-26 2020-08-26 DGA domain name detection method using depth residual error network and character-level sliding window Active CN112019651B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010872894.9A CN112019651B (en) 2020-08-26 2020-08-26 DGA domain name detection method using depth residual error network and character-level sliding window

Publications (2)

Publication Number Publication Date
CN112019651A (en) 2020-12-01
CN112019651B CN112019651B (en) 2021-11-23

Family

ID=73503514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010872894.9A Active CN112019651B (en) 2020-08-26 2020-08-26 DGA domain name detection method using depth residual error network and character-level sliding window

Country Status (1)

Country Link
CN (1) CN112019651B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109391602A (en) * 2017-08-11 2019-02-26 北京金睛云华科技有限公司 A kind of zombie host detection method
CN110225030A (en) * 2019-06-10 2019-09-10 福州大学 Malice domain name detection method and system based on RCNN-SPP network
CN111209497A (en) * 2020-01-05 2020-05-29 西安电子科技大学 DGA domain name detection method based on GAN and Char-CNN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BIN YU, JIE PAN, JIAMING HU, ANDERSON NASCIMENTO, MARTINE DE COCK: "Character Level based Detection of DGA Domain Names", 2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) *
XU Congyuan: "Research on Network Intrusion Detection Methods Based on Deep Learning", China Doctoral Dissertations Full-text Database, Information Science and Technology *
TAN Yue, ZOU Futai: "Botnet Detection Method Based on ResNet and BiLSTM", Communications Technology *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114666077A (en) * 2020-12-08 2022-06-24 北京中科网威信息技术有限公司 DGA domain name detection method and system, electronic device and storage medium
CN114666077B (en) * 2020-12-08 2022-11-15 北京中科网威信息技术有限公司 DGA domain name detection method and system, electronic device and storage medium
CN112948578A (en) * 2021-01-29 2021-06-11 浙江大学 DGA domain name open set classification method, device, electronic equipment and medium
CN112926647A (en) * 2021-02-23 2021-06-08 亚信科技(成都)有限公司 Model training method, domain name detection method and device
CN112926647B (en) * 2021-02-23 2023-10-17 亚信科技(成都)有限公司 Model training method, domain name detection method and domain name detection device
CN112949768A (en) * 2021-04-07 2021-06-11 苏州瑞立思科技有限公司 Traffic classification method based on LSTM
CN113595749A (en) * 2021-09-26 2021-11-02 北京天维信通科技有限公司 Communication method and device for realizing multi-party network node intercommunication based on IRC platform
CN113595749B (en) * 2021-09-26 2022-01-07 北京天维信通科技有限公司 Communication method and device for realizing multi-party network node intercommunication based on IRC platform
CN114298277A (en) * 2021-12-28 2022-04-08 四川大学 Distributed deep learning training method and system based on layer sparsization
CN114298277B (en) * 2021-12-28 2023-09-12 四川大学 Distributed deep learning training method and system based on layer sparsification
CN114866246A (en) * 2022-04-12 2022-08-05 东莞职业技术学院 Computer network security intrusion detection method based on big data

Also Published As

Publication number Publication date
CN112019651B (en) 2021-11-23


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230530

Address after: Room 801, 85 Kefeng Road, Huangpu District, Guangzhou City, Guangdong Province

Patentee after: Yami Technology (Guangzhou) Co.,Ltd.

Address before: No.69 Hongguang Avenue, Banan District, Chongqing

Patentee before: Chongqing University of Technology

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20231205

Address after: Chinatelecom tower, No. 19, Chaoyangmen North Street, Dongcheng District, Beijing 100010

Patentee after: Tianyi Safety Technology Co.,Ltd.

Address before: Room 801, 85 Kefeng Road, Huangpu District, Guangzhou City, Guangdong Province

Patentee before: Yami Technology (Guangzhou) Co.,Ltd.

TR01 Transfer of patent right