CN112492059A

CN112492059A - DGA domain name detection model training method, DGA domain name detection device and storage medium

Info

Publication number: CN112492059A
Application number: CN202011288625.4A
Authority: CN
Inventors: 马莉雅; 雷君; 龙泉; 何能强; 李鹏超; 金红; 陈晓光; 杨满智; 蔡琳; 尚程; 王利丽
Original assignee: National Computer Network and Information Security Management Center; Eversec Beijing Technology Co Ltd
Current assignee: National Computer Network and Information Security Management Center; Eversec Beijing Technology Co Ltd
Priority date: 2020-11-17
Filing date: 2020-11-17
Publication date: 2021-03-12

Abstract

The invention relates to a DGA domain name detection model training method, a DGA domain name detection device and a storage medium, wherein the model training method comprises the following steps: step S1, obtaining domain name information of a plurality of domain name samples; step S2, for DGA domain names and non-DGA domain names in the domain name samples, respectively training a recurrent neural network for feature extraction according to at least part of the domain name information; calculating the ratio of the output of the recurrent neural network corresponding to the DGA domain name to the output of the recurrent neural network corresponding to the non-DGA domain name as the characteristic of at least part of the domain name information; and step S3, inputting the characteristics into a classifier for training to obtain a DGA domain name classifier, so as to judge whether the domain name to be detected is from the domain name generation algorithm DGA by using the trained DGA domain name classifier.

Description

DGA domain name detection model training method, DGA domain name detection device and storage medium

Technical Field

The invention relates to the technical field of network security, in particular to a DGA domain name detection model training method, a DGA domain name detection device and a storage medium.

Background

The statements herein merely provide background information related to the present disclosure and may not necessarily constitute prior art.

In a large-scale network system, especially in a cross-regional internet system, if a certain node or a certain network element in the node encounters a security threat attack, the security threat attack may spread to other network elements in neighboring nodes and nodes, thereby causing a security monitoring device deployed in the network to generate a large amount of repeated, useful or useless security threat alarm information. For network security management personnel, it is very important how to quickly and accurately locate specific network element devices which are attacked by the security threat from the massive, repeated, useful or useless security threat alarm information, and analyze and solve the problem.

The invention relates to the field of network attack identification and monitoring, and in particular to the domain name generation algorithm (DGA), a technique that generates command and control (C&C) domain names from random characters in order to evade domain name blacklist detection.

In a network attack, a one-to-many controllable network is formed between the controller and the infected hosts, and the infected hosts receive the attacker's instructions through a C&C server. To contact the C&C server while evading intrusion detection systems, an attacker commonly generates a large number of domain names with a random algorithm and tries to connect to them one by one; the attacker also selects some of these domain names and registers them in the DNS (domain name system), and once a connection to one of the domain names succeeds, contact with the C&C server is established. Such malicious domain names are therefore also used as website domain names for spreading botnet viruses, trojans and worms, or for illegal activities such as fraud and pornography. The algorithm that produces these malicious domain names is called a Domain Generation Algorithm (DGA). The input of the algorithm is called the seed, which may be a date, trending social network search terms, random numbers or a dictionary. The algorithm generates a string of characters as a prefix (such as gvev 44 bvtey) and appends a TLD (top-level domain) to obtain the final domain name, which is called an AGD (Algorithmically Generated Domain). To shut down the botnet completely, network security defenders need to block all AGDs, which is very costly.

Detecting DGA domain names is an important step in bot detection. Because DGA rules are random, the generated malicious domain names sometimes appear non-malicious or normal. Traditional DGA detection methods therefore lack automation and fall into two types: judgment by manual experience, and judgment by a program combined with manually extracted features. However, both types have a low detection rate and a high false alarm rate, and are of little practical use.

A more advanced current approach uses machine learning: a model is trained on a large number of normal and DGA domain names and is then used to detect and identify DGA domain names. However, the machine learning methods in the prior art require extensive feature engineering. Other prior art trains on DGA samples with a convolutional neural network (CNN), but a CNN is poorly suited to extracting features from character strings that have sequential dependencies.

The existing DGA domain name detection method has the following problems and disadvantages:

(1) Manual feature extraction combined with judgment by manual experience involves a huge workload and cannot adapt to an attack mode in which massive numbers of DGA domain names change continuously.

(2) DGA domain name detection methods based on traditional machine learning have been proposed, but existing machine learning methods require a large amount of feature engineering, so the workload and computational cost are very high; meanwhile, detection accuracy depends entirely on feature processing, and manual feature research consumes a great deal of time.

(3) Existing methods suffer from a high false alarm rate and low accuracy.

Disclosure of Invention

The invention aims to provide a novel DGA domain name detection model training method, a DGA domain name detection device and a storage medium.

The purpose of the invention is realized by adopting the following technical scheme. The DGA domain name detection method provided by the invention comprises the following steps: step S1, obtaining domain name information of a plurality of domain name samples; step S2, for DGA domain names and non-DGA domain names in the domain name samples, respectively training a recurrent neural network for feature extraction according to at least part of the domain name information; calculating the ratio of the output of the recurrent neural network corresponding to the DGA domain name to the output of the recurrent neural network corresponding to the non-DGA domain name as the characteristic of at least part of the domain name information; step S3, inputting the features into a classifier for training to obtain a DGA domain name classifier, and determining whether a domain name comes from a domain name generation algorithm by using the DGA domain name classifier.

The object of the invention can be further achieved by the following technical measures.

In the DGA domain name detection model training method, the output of the recurrent neural network corresponding to the DGA domain name and the output of the recurrent neural network corresponding to the non-DGA domain name in step S2 are both a fit of a likelihood function p(x | θ), wherein x is at least a portion of the domain name information, θ is a parameter of the recurrent neural network, and p(x | θ) represents the probability of x given θ.

In the DGA domain name detection model training method, the domain name information includes a character string of a domain name; in step S2, training, for the DGA domain names and the non-DGA domain names in the domain name samples respectively, a recurrent neural network for feature extraction according to at least a part of the domain name information includes: for one or more of the DGA domain name and the non-DGA domain name, extracting features with a character-level recurrent neural network, and performing back propagation with character-level cross entropy as the loss function.

In the DGA domain name detection model training method, extracting features with the character-level recurrent neural network includes: the output of each time step of the recurrent neural network, except the last, is the probability of the next character occurring; the output of the last time step is a fit of the likelihood function, representing the probability that the character string of the domain name is DGA or non-DGA. Performing back propagation with character-level cross entropy as the loss function includes: the recurrent neural network back-propagates the loss at each time step, specifically by calculating, at each time step, the cross entropy between the output of the recurrent unit and the true value of the next time step in the character string of the domain name, and back-propagating it.

In the DGA domain name detection model training method, the output of the recurrent neural network corresponding to the DGA domain name or the output of the recurrent neural network corresponding to the non-DGA domain name is:

$$p(x \mid \theta) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}, \theta)$$

wherein x is a character string of the at least part of the domain name information, x_i is the i-th character in the character string, n is the length of the character string, and θ is a parameter of the recurrent neural network.

In the DGA domain name detection model training method, before step S2, the method further includes: one-hot encoding each character in the character string of the domain name; training the recurrent neural network for feature extraction according to at least a part of the domain name information in step S2 includes: training the recurrent neural network according to the one-hot encoded character string of the domain name.

In the DGA domain name detection model training method, the step S1 specifically includes: acquiring a main domain name and a sub domain name of the domain name sample; the step S2 specifically includes: for the main domain name and the sub domain name, respectively training the recurrent neural network corresponding to the DGA domain name and the recurrent neural network corresponding to the non-DGA domain name, and respectively determining the ratio corresponding to the main domain name and the ratio corresponding to the sub domain name as the characteristics of the main domain name and the sub domain name; the step S3 specifically includes: inputting the characteristics of the main domain name and the characteristics of the sub domain name into a classifier for training.

In the DGA domain name detection model training method, the step S1 further includes: extracting a top-level domain name of the domain name sample; prior to the step S3, the method further includes: one-hot encoding the top-level domain name to obtain a top-level domain name one-hot vector as the feature of the top-level domain name; the step S3 specifically includes: inputting the features of the main domain name, the features of the sub domain name and the features of the top-level domain name of the domain name samples together into a classifier for training to obtain the DGA domain name classifier.

In the DGA domain name detection model training method, the step S3 specifically includes: training a logistic regression classifier for binary classification, using the features as inputs and using the labels indicating whether each domain name is a DGA domain name as supervision information.

In the DGA domain name detection model training method, the recurrent neural network used in step S2 is a recurrent neural network based on long short-term memory (LSTM) or a recurrent neural network based on gated recurrent units (GRU).

The purpose of the invention is realized by adopting the following technical scheme. The DGA domain name detection method provided according to the disclosure is characterized by comprising the following steps: acquiring domain name information of a domain name to be detected in the same way as the step S1 of the DGA domain name detection model training method; inputting at least a part of the domain name information of the domain name to be detected into the recurrent neural network corresponding to the DGA domain name and the recurrent neural network corresponding to the non-DGA domain name obtained in step S2 of the DGA domain name detection model training method as any one of the above, and calculating a ratio of outputs of the two recurrent neural networks as a feature of at least a part of the domain name information of the domain name to be detected; inputting the characteristics of the domain name to be detected into the DGA domain name classifier obtained in step S3 of the DGA domain name detection model training method as described in any one of the above embodiments, so as to determine whether the domain name to be detected is from a domain name generation algorithm.

The purpose of the invention is realized by adopting the following technical scheme. According to this disclosure, a DGA domain name detection model training device includes: a memory for storing non-transitory computer readable instructions; and a processor for executing the computer readable instructions, such that the processor, when executing them, implements any one of the aforementioned DGA domain name detection model training methods.

The purpose of the invention is realized by adopting the following technical scheme. According to this disclosure, a DGA domain name detection device includes: a memory for storing non-transitory computer readable instructions; and a processor for executing the computer readable instructions, such that the processor when executing implements any of the aforementioned DGA domain name detection methods.

The purpose of the invention is realized by adopting the following technical scheme. A computer-readable storage medium according to the present disclosure is provided for storing non-transitory computer-readable instructions which, when executed by a computer, cause the computer to perform any one of the aforementioned DGA domain name detection model training methods or DGA domain name detection methods.

Compared with the prior art, the invention has obvious advantages and beneficial effects. By the technical scheme, the DGA domain name detection model training method, the DGA domain name detection device and the storage medium provided by the invention at least have the following advantages and beneficial effects:

1. The invention eliminates the tedious process of feature engineering. With existing methods, a long list of features (e.g., length, vowels, consonants, and n-gram models) must be generated and used to distinguish DGA-generated domain names from non-DGA domain names, an extremely laborious process that requires security personnel to update and create new feature libraries in real time.

2. Manually extracted features do not easily cope with diverse DGA generation techniques: once an attacker learns a fixed filtering rule based on manually extracted features, the attacker can easily evade detection by updating the DGA. The method of the invention relies on the automatic representation learning capability of the recurrent neural network (RNN for short) and can adapt more quickly to constantly changing adversaries.

3. Domain names generated by current DGAs are increasingly deceptive; they are often related to natural language and have sequential characteristics. Compared with CNN-based DGA detection methods, modeling DGA domain names with sequential characteristics using an RNN is more practical in application.

4. The invention determines DGA domain names automatically and provides them as threat intelligence for security operation and maintenance personnel, greatly reducing the manpower and material resources required for manual detection of DGA-generated domain names.

5. Another advantage of the invention is that it identifies DGA domain names from the domain name alone, without any contextual features such as NXDomain responses, whose collection often requires additional expensive infrastructure (such as network sensors and third-party reputation systems).

6. Compared with a common RNN method, the method can shorten the training time and improve training efficiency. Splitting the domain name into parts also better reflects the fact that different parts of a domain name play different roles.

The foregoing is only an overview of the technical solutions of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention are more readily apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.

Drawings

FIG. 1 is a schematic flow chart of a DGA domain name detection model training method according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a DGA domain name model training method and a DGA domain name detection method according to another embodiment of the present invention;

FIG. 3 is a schematic flow chart of feature extraction provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of a DGA domain name detection model training apparatus according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a DGA domain name detection apparatus according to an embodiment of the present invention.

Detailed Description

To further illustrate the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description will be given to the specific embodiments, structures, features and effects of the DGA domain name detection model training method, the DGA domain name detection device and the storage medium according to the present invention with reference to the accompanying drawings and the preferred embodiments.

Note that the present invention relates to domain names generated by a Domain Generation Algorithm (DGA), referred to as DGA domain names or DGA-generated domain names.

FIG. 1 is a schematic flow chart of an embodiment of the DGA domain name detection model training method of the present invention. FIG. 2 is a schematic flow chart of an embodiment of the DGA domain name model training method and detection method of the present invention. It should be noted that although FIG. 2 illustrates a recurrent neural network (RNN for short) based on long short-term memory (LSTM for short), the process is also applicable to a general recurrent neural network. Referring to FIG. 1 and FIG. 2, the DGA domain name detection model training method of the present invention mainly includes the following steps:

Step S1, domain name information of a plurality of domain name samples is acquired. Optionally, the domain name samples are domain names that have been labeled as DGA or non-DGA domain names, that is, data of known category, so that the labels can serve as supervision information for supervised learning.

Step S2, for the DGA domain names and the non-DGA domain names in the domain name samples, a recurrent neural network (RNN for short) for feature extraction is trained for each class according to at least a part of the domain name information, so that the recurrent neural networks corresponding to the DGA domain name and the non-DGA domain name extract the invariant features of DGA and non-DGA domain names from that part of the domain name information; the ratio of the output of the recurrent neural network corresponding to the DGA domain name to the output of the recurrent neural network corresponding to the non-DGA domain name is then calculated as the feature of the at least part of the domain name information.

The feature of at least a part of the domain name information may be referred to as a detection feature or a DGA feature. Note that, in some examples, the ratio of the outputs of the foregoing two recurrent neural networks may be obtained as the detection feature for one part of the domain name information, while the features of another part of the domain name information are extracted in other ways and used as the detection features of that other part.

Optionally, recurrent neural networks of the same type and structure are trained for the DGA domain names and the non-DGA domain names; the two trained networks are of course not identical, since they are trained on different samples.

Step S3, inputting the characteristics of at least a part of the domain name information of the domain name samples into a classifier for training to obtain a DGA domain name classifier, which is used to determine whether a domain name to be detected is from a domain name generation algorithm (DGA) by using the trained DGA domain name classifier. Specifically, the step S3 includes inputting the ratio of the output of the recurrent neural network corresponding to the DGA domain name obtained in the step S2 to the output of the recurrent neural network corresponding to the non-DGA domain name into a classifier for training.

It is noted that, since binary classification with neural networks generally employs a logistic regression classifier, in some embodiments the step S3 specifically includes training a logistic regression classifier.

In some embodiments of the present invention, the training of the recurrent neural network for feature extraction according to at least a part of the domain name information in the foregoing step S2 to obtain the features of the at least a part of the information specifically includes: and respectively training a first part of domain name information and a second part of domain name information in the domain name information to corresponding recurrent neural networks for feature extraction so as to obtain the features of the first part of domain name information and the features of the second part of domain name information. Note that the number of divided portions is not limited, and may be, for example, three portions, four portions, or more.

In some embodiments of the invention, the output of the recurrent neural network corresponding to the DGA domain name and the output of the recurrent neural network corresponding to the non-DGA domain name in step S2 are both a fit to a likelihood function. Note that the features extracted by the recurrent neural networks corresponding to the DGA domain name and the non-DGA domain name are invariant features of DGA and non-DGA domain names, and these features can be fitted to the likelihood function p(x | θ), wherein x is at least a part of the aforementioned domain name information, θ is a parameter of the recurrent neural network, and p(x | θ) represents the probability of x given the parameter θ. Alternatively, p(x | θ) can be regarded as the probability, given the parameter θ of the recurrent neural network, that the part x of the domain name information is DGA. In some examples, since the probability that a domain name is DGA and the probability that it is non-DGA satisfy p(x = DGA) = 1 - p(x = non-DGA), p(x | θ) can also be regarded as the probability of non-DGA.

Thus, the ratio Λ(x) of the output p(x | θ_dga) of the recurrent neural network corresponding to the DGA domain name to the output p(x | θ_non-dga) of the recurrent neural network corresponding to the non-DGA domain name may be regarded as a generalized likelihood ratio, and the aforementioned step S3 in effect performs a generalized likelihood ratio test. In general, a likelihood ratio is defined as the ratio of the maximum value of the likelihood function under a constraint to the maximum value of the likelihood function without the constraint. A statistic that follows a chi-squared distribution may be constructed from the likelihood ratio.
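The ratio of the two network outputs can be illustrated with a small sketch. It assumes that each trained network is reduced to a function returning log p(x | θ). The two scorers below are hypothetical stand-ins built from fixed per-character probabilities, not the recurrent networks of the invention, and working in log space (so that the ratio becomes a difference) is an implementation choice that the text does not state.

```python
import math

def make_char_scorer(char_probs, floor=1e-6):
    """Build a stand-in scorer returning log p(text) under fixed per-character probabilities."""
    def log_likelihood(text):
        # Characters missing from the table get a small floor probability.
        return sum(math.log(char_probs.get(ch, floor)) for ch in text)
    return log_likelihood

def log_likelihood_ratio(text, log_p_dga, log_p_non_dga):
    """log of the ratio: log p(x | theta_dga) - log p(x | theta_non_dga)."""
    return log_p_dga(text) - log_p_non_dga(text)

# Hypothetical probability tables, chosen only for illustration.
dga_scorer = make_char_scorer({"x": 0.3, "q": 0.3, "z": 0.3, "a": 0.1})
benign_scorer = make_char_scorer({"a": 0.4, "e": 0.3, "o": 0.2, "x": 0.1})

ratio_random_like = log_likelihood_ratio("xqz", dga_scorer, benign_scorer)
ratio_word_like = log_likelihood_ratio("aeo", dga_scorer, benign_scorer)
```

A positive log ratio means the DGA-side model explains the string better than the non-DGA-side model; a negative value means the opposite.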

In some embodiments of the invention, the domain name information of the domain name sample comprises a character string of the domain name, the character string being an ordered arrangement of multiple characters. In the foregoing step S2, training, for the DGA domain names and the non-DGA domain names in the domain name samples, the recurrent neural network for feature extraction according to at least a part of the domain name information includes: for one or more of the DGA domain name and the non-DGA domain name, extracting features with a character-level recurrent neural network, and performing back propagation with character-level cross entropy as the loss function.

In some embodiments of the invention, the character level refers to the individual characters in the character string of the domain name. In some specific examples of the invention, the character-level RNN is as follows: each time step of the RNN corresponds to one character of the input string; the loss function in the training phase is the cross entropy between the output of each time step and the true value of the next time step, which is back-propagated as the error; if the length of the string is n, back propagation is performed n times.

In an alternative embodiment, the aforementioned feature extraction using the character-level recurrent neural network includes: the output of the recurrent unit at each time step (also referred to as a time beat) of the recurrent neural network, except the last, is the probability of the next character occurring; the output of the last time step is used as a fit to the aforementioned likelihood function, representing the probability that the string input to the recurrent neural network is DGA or non-DGA.
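The per-step outputs described above can be sketched with a toy forward pass. This is a minimal illustration rather than the network of the invention: it uses a plain tanh recurrent unit instead of an LSTM or GRU, its weights are random and untrained, and the class name TinyCharRNN, the alphabet and the hidden size are assumptions made for the example.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class TinyCharRNN:
    """Hypothetical character-level RNN with a tanh unit (untrained)."""

    def __init__(self, alphabet, hidden_size=8, seed=0):
        rng = random.Random(seed)
        self.alphabet = alphabet
        self.hidden_size = hidden_size
        self.w_in = random_matrix(rng, hidden_size, len(alphabet))
        self.w_rec = random_matrix(rng, hidden_size, hidden_size)
        self.w_out = random_matrix(rng, len(alphabet), hidden_size)

    def forward(self, text):
        """Return one next-character distribution per input character."""
        hidden = [0.0] * self.hidden_size
        outputs = []
        for ch in text:
            one_hot = [1.0 if c == ch else 0.0 for c in self.alphabet]
            from_input = matvec(self.w_in, one_hot)
            from_state = matvec(self.w_rec, hidden)
            hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
            outputs.append(softmax(matvec(self.w_out, hidden)))
        return outputs

rnn = TinyCharRNN("abcdefghijklmnopqrstuvwxyz")
step_outputs = rnn.forward("example")
```

Each entry of step_outputs is a distribution over the alphabet produced after reading one more character of the input.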

Note that the output of the recurrent neural network in the aforementioned step S2 is distinguished from the output of the classifier in step S3. In some alternative examples, the output of the recurrent neural network in step S2 can be regarded as the probability that a segment of a string in the domain name is DGA or non-DGA, and the output of the classifier in step S3 can be regarded as the probability that the entire domain name is DGA or non-DGA.

In an alternative embodiment, the foregoing back propagation using character-level cross entropy as the loss function comprises: the recurrent neural network back-propagates the loss function at each time step, specifically by calculating, at each time step, the cross entropy between the output of the recurrent unit and the true value of the next time step in the character string of the domain name and back-propagating it, thereby speeding up training.
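The character-level loss can be sketched as follows. The example assumes the predicted distributions are already available as dictionaries mapping characters to probabilities (the values below are invented), and it only computes the per-step cross entropy; the gradient computation and the treatment of the last time step are omitted. With a one-hot true value, the cross entropy reduces to the negative log probability assigned to the true next character.

```python
import math

def char_cross_entropy(predicted, target_char, eps=1e-12):
    """Cross entropy against a one-hot target: -log p(target)."""
    return -math.log(max(predicted.get(target_char, 0.0), eps))

def per_step_losses(text, step_outputs):
    """step_outputs[i] is the predicted next-character distribution after reading text[i]."""
    return [
        char_cross_entropy(step_outputs[i], text[i + 1])
        for i in range(len(text) - 1)
    ]

text = "abc"
# Hypothetical predictions after reading "a" and after reading "b".
step_outputs = [
    {"a": 0.1, "b": 0.8, "c": 0.1},
    {"a": 0.2, "b": 0.3, "c": 0.5},
]
losses = per_step_losses(text, step_outputs)
total_loss = sum(losses)
```

A confident correct prediction (0.8 for "b") yields a smaller loss than a hesitant one (0.5 for "c").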

In an alternative embodiment, the output p(x | θ_dga) of the recurrent neural network corresponding to the DGA domain name or the output p(x | θ_non-dga) of the recurrent neural network corresponding to the non-DGA domain name is:

$$p(x \mid \theta) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}, \theta)$$

wherein x is a character string of at least a part of the domain name information, x_i is the i-th character in the character string, n is the length of the character string, and θ is a parameter of the recurrent neural network (θ_dga or θ_non-dga respectively).
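Reading the network output as a product of next-character probabilities over the string (an interpretation consistent with the per-step outputs described earlier), the likelihood can be accumulated character by character. In the sketch below, a smoothed bigram count model stands in for each recurrent neural network and looks only at the previous character; the sample lists, the START marker and the function names are hypothetical.

```python
import math
from collections import defaultdict

START = "^"  # hypothetical marker for the beginning of a string
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"

def train_bigram_model(samples, alphabet=ALPHABET):
    """Count character transitions and return p(next char | prefix) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in samples:
        prev = START
        for ch in text:
            counts[prev][ch] += 1
            prev = ch

    def next_char_prob(prefix, ch):
        # Stand-in simplification: only the last character of the prefix is used.
        prev = prefix[-1] if prefix else START
        total = sum(counts[prev].values())
        return (counts[prev][ch] + 1) / (total + len(alphabet))

    return next_char_prob

def sequence_log_likelihood(text, next_char_prob):
    """Sum of log p(x_i | x_1..x_{i-1}), i.e. the log of the product over characters."""
    return sum(
        math.log(next_char_prob(text[:i], text[i]))
        for i in range(len(text))
    )

# One model per class, same structure, trained on different (invented) samples.
dga_model = train_bigram_model(["xkqzvj", "qzxkvw", "zvqxkj"])
benign_model = train_bigram_model(["google", "example", "weather"])

ll_dga = sequence_log_likelihood("xkqz", dga_model)
ll_benign = sequence_log_likelihood("xkqz", benign_model)
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long strings.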

The method provided by the invention can improve the training effect and accelerate the training speed by adopting the character-level recurrent neural network to extract the characteristics and adopting the character-level cross entropy as a loss function to carry out back propagation.

In some embodiments of the present invention, before the step S2, the method further includes: one-hot encoding each character in the character string of the domain name. Training the recurrent neural network for feature extraction according to at least a part of the domain name information in the aforementioned step S2 then includes: training the recurrent neural network according to the one-hot encoded character string of the domain name.
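One-hot encoding of a character string can be sketched as below. The alphabet (lowercase letters, digits and the hyphen) is an assumption for the example; the text does not specify which character set is used.

```python
import string

# Assumed alphabet: lowercase letters, digits and hyphen.
ALPHABET = string.ascii_lowercase + string.digits + "-"

def one_hot_encode(text, alphabet=ALPHABET):
    """Return one vector per character, with a single 1 at that character's index."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    vectors = []
    for ch in text.lower():
        vector = [0] * len(alphabet)
        vector[index[ch]] = 1  # raises KeyError for characters outside the alphabet
        vectors.append(vector)
    return vectors

encoded = one_hot_encode("ab1")
```

Each character becomes a vector whose length equals the alphabet size, so a string of n characters yields n such vectors, one per time step of the recurrent neural network.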

In some embodiments of the present invention, the aforementioned step S1 includes: acquiring a main domain name and a sub domain name of the domain name sample. Optionally, the main domain name is the second-level domain name in the domain name system hierarchy; optionally, the sub domain name is a sub domain name of the second-level domain name, and may be a third-level domain name.

Specifically, the main domain name and the sub domain name of the obtained domain name sample can be a main domain name character string and a sub domain name character string respectively, and a character-level recurrent neural network can be adopted to perform feature extraction in the subsequent steps, and character-level cross entropy is adopted as a loss function to perform back propagation. In fact, in some embodiments, the domain name information utilized by the method of the present invention includes a character string of a domain name, and the first partial domain name information and the second partial domain name information are a main domain name character string and a sub domain name character string, respectively.
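Obtaining the main domain name string and the sub domain name string can be sketched with a naive split on dots. This is an assumption-laden simplification: it treats the last label as the top-level domain name and the label before it as the main domain name, so multi-label suffixes such as "co.uk" are not handled; a production system would more likely consult the Public Suffix List. The function name split_domain is hypothetical.

```python
def split_domain(domain):
    """Split a domain name into (sub domain, main domain, top-level domain) by labels."""
    labels = domain.lower().rstrip(".").split(".")
    if len(labels) < 2 or not all(labels):
        raise ValueError(f"not a usable domain name: {domain!r}")
    top_level = labels[-1]
    main = labels[-2]
    sub = ".".join(labels[:-2])  # empty string when there is no sub domain
    return sub, main, top_level

parts = split_domain("mail.example.com")
```

The sub domain and main domain strings would then each be scored by their own pair of recurrent neural networks.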

Further, the step S2 specifically includes: for the main domain name and the sub domain name, the steps in the foregoing example are respectively utilized to train the recurrent neural network for feature extraction corresponding to the DGA domain name and the recurrent neural network for feature extraction corresponding to the non-DGA domain name, and the ratio of the foregoing two network outputs corresponding to the main domain name and the ratio of the foregoing two network outputs corresponding to the sub domain name are respectively obtained and respectively used as the features of the main domain name and the features of the sub domain name. The step S3 specifically includes: the features of the main domain name and the features of the sub-domain name are input into a classifier for training. In one specific example, this step is performed based on a string of the main domain name and a string of the sub domain name.

In some embodiments of the present invention, the aforementioned step S1 further includes: extracting the top-level domain name of the domain name sample. Before the foregoing step S3, the method further includes: one-hot encoding the top-level domain name to obtain a top-level domain name one-hot vector as the feature of the top-level domain name. The aforementioned step S3 then includes: inputting the features of the main domain name, the features of the sub domain name and the features of the top-level domain name of the plurality of domain name samples together into a classifier for training, so as to obtain a trained DGA domain name classifier. In one specific example, this step is performed based on a string of the main domain name and a string of the sub domain name.
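Assembling the classifier input from the three kinds of features can be sketched as follows. The list of known top-level domain names and the ratio values are invented for the example, and encoding an unseen top-level domain name as an all-zero vector is an assumption, not something the text specifies.

```python
def one_hot_tld(tld, known_tlds):
    """One-hot vector over a fixed list of top-level domain names (all zeros if unseen)."""
    return [1.0 if tld == known else 0.0 for known in known_tlds]

def build_feature_vector(main_ratio, sub_ratio, tld, known_tlds):
    """Concatenate the main-domain ratio, sub-domain ratio and TLD one-hot vector."""
    return [main_ratio, sub_ratio] + one_hot_tld(tld, known_tlds)

KNOWN_TLDS = ["com", "net", "org"]  # hypothetical vocabulary
features = build_feature_vector(2.5, -0.3, "net", KNOWN_TLDS)
```

The resulting vector has two ratio entries followed by one entry per known top-level domain name.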

It should be noted that, in some examples, the foregoing one-hot encoding procedure may also be based on machine learning; for example, the encoding result is obtained by training.

In some embodiments of the present invention, the aforementioned step S3 includes: training a logistic regression classifier, taking the features of at least part of the domain name information of the plurality of domain name samples as input and using the label of whether each domain name is a DGA domain name as supervision information. Optionally, the logistic regression classifier is a binary classifier.
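As an illustrative, non-limiting sketch of the supervised training described above, the following Python fits a binary logistic regression by batch gradient descent. The feature values, the learning rate, the number of epochs and the function names `train_logistic_regression` and `predict_proba` are assumptions introduced for illustration and are not taken from the embodiments.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # derivative of the log loss with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def predict_proba(weights, bias, x):
    """Return the estimated probability that feature vector x is a DGA domain."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical features: [main-domain log-ratio, sub-domain log-ratio].
# Label 1 marks a DGA domain name, label 0 a non-DGA domain name.
X = [[2.1, 1.5], [1.8, 0.9], [2.5, 2.0], [-1.9, -1.2], [-2.2, -0.7], [-1.5, -1.8]]
y = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic_regression(X, y)
```

In practice an existing library implementation of logistic regression would normally be used; the hand-written loop is shown only to make the role of the supervision information explicit.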

In some embodiments of the present invention, as shown in fig. 2, the recurrent neural network utilized in step S2 in the foregoing embodiments is a recurrent neural network based on long short-term memory (LSTM), and the recurrent units in the recurrent neural network are LSTM units. It should be noted that the present invention does not limit the selected recurrent neural network to LSTM; for example, in other embodiments, the recurrent neural network is based on gated recurrent units (GRUs), or another recurrent neural network is used.

In some embodiments of the present invention, the present invention further exemplifies a DGA domain name detection method, which mainly includes the following steps:

acquiring domain name information of a domain name to be detected in the same way as the step S1 of the DGA domain name detection model training method in any one of the embodiments;

determining the characteristics of at least part of the domain name information of the domain name to be detected by using the trained recurrent neural network obtained in step S2 of the DGA domain name detection model training method according to any one of the embodiments; specifically, at least a part of the domain name information of the domain name to be detected is input to the recurrent neural network corresponding to the DGA domain name and the recurrent neural network corresponding to the non-DGA domain name obtained in step S2 of the DGA domain name detection model training method according to any of the foregoing embodiments, and the ratio of the outputs of the two recurrent neural networks is calculated as the characteristic of at least a part of the domain name information of the domain name to be detected;

the characteristics of at least a part of the domain name information of the domain name to be detected are input into the DGA domain name classifier obtained in step S3 of the DGA domain name detection model training method according to any of the embodiments described above, so as to determine whether the domain name to be detected is from a domain name generation algorithm (DGA).

In some embodiments of the DGA domain name detection method of the present invention, the obtained domain name information of the domain name to be detected includes a main domain name and a sub-domain name, which may specifically be a main domain name character string and a sub-domain name character string. When extracting the features, the main domain name and the sub domain name are each input into the trained recurrent neural network corresponding to the DGA domain name and the trained recurrent neural network corresponding to the non-DGA domain name, and the ratio of the outputs of the two recurrent neural networks is calculated to obtain the features of the main domain name and of the sub domain name of the domain name to be detected. The features of the main domain name and the features of the sub-domain name are then input into the trained classifier, and whether the domain name to be detected comes from a domain name generation algorithm is judged according to the output of the classifier.

Further, in some embodiments of the DGA domain name detection method of the present invention, the method further includes: acquiring a top-level domain name of the domain name to be detected; encoding the top-level domain name in a one-hot manner to obtain a top-level domain name one-hot vector as the feature of the top-level domain name; and inputting the features of the main domain name, the features of the sub-domain name and the features of the top-level domain name of the domain name to be detected together into the trained classifier, so as to judge whether the domain name to be detected comes from a domain name generation algorithm according to the output of the classifier.

For detailed description and technical effects of embodiments of the DGA domain name detection method according to the present invention, reference may be made to corresponding descriptions in the embodiments of the DGA domain name detection model training method, which are not described herein again.

In some examples, the present invention uses a combination of character-level RNN and logistic regression to detect whether a domain name is from a DGA.

In some examples, instead of directly using an RNN to predict whether a domain name is a DGA domain name, RNNs are built that take the main domain name and the sub-domain name separately as inputs. For each such input, two RNNs are trained, one on domain names generated by a DGA and one on domain names not generated by a DGA, and a generalized likelihood ratio test (GLRT) is then performed for DGA detection.

Specifically, given a character string (in the present invention, the character string of a main domain name or of a sub-domain name), the output of each RNN for a given character of the string is in fact the probability of occurrence of the next character. The output of the last time beat is used as a fit of the likelihood function, and a generalized likelihood ratio test is then performed to determine whether the string belongs to a DGA domain name. The outputs of the LSTM generalized likelihood ratio test models for the sub-domain name and the main domain name are combined with the one-hot vector of the top-level domain name, and the three kinds of outputs are input together into a logistic regression model, which finally produces the DGA classification prediction.

In some examples, as shown in fig. 2, the model training phase includes:

(1) Each domain name is divided into a main domain name and a sub domain name, and the top-level domain name is extracted.

(2) Each character in each domain name string is one-hot encoded.

(3) For the main domain name and the sub-domain name, and for each of the two classes of domain names (DGA and non-DGA), an LSTM-based RNN is trained, giving four LSTM models in total: two for the main domain name and two for the sub-domain name. All four models are character-based RNNs whose output is a prediction of the next character, and the output of the last time beat can be regarded as the DGA/non-DGA likelihood function p(x | θ).

(4) Two generalized likelihood ratios are calculated from the outputs of the four LSTM models and, together with the one-hot code of the top-level domain name, are passed to a logistic regression classifier, which is trained using the label of whether each domain name is a DGA domain name as supervision information.
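Steps (1) and (2) of the training phase may be sketched as follows. This is a minimal illustration only: it assumes that the last label of the domain name is the top-level domain name and the label before it is the main domain name, which does not hold for multi-label suffixes such as "co.uk". The alphabet and the names `split_domain` and `one_hot_chars` are likewise assumptions and are not taken from the embodiments.

```python
import string

# Assumed character alphabet; the embodiments do not specify one.
ALPHABET = string.ascii_lowercase + string.digits + "-_."
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def split_domain(domain):
    """Split a domain into (sub_domain, main_domain, top_level_domain).

    Assumes the last label is the top-level domain name and the label
    before it is the main domain name.
    """
    labels = domain.lower().strip(".").split(".")
    if len(labels) < 2:
        raise ValueError("expected at least a main domain name and a TLD")
    tld = labels[-1]
    main = labels[-2]
    sub = ".".join(labels[:-2])  # empty string when there is no sub domain
    return sub, main, tld


def one_hot_chars(text):
    """Encode each character as a one-hot vector over ALPHABET.

    Raises KeyError for characters outside the assumed alphabet.
    """
    vectors = []
    for ch in text:
        vec = [0] * len(ALPHABET)
        vec[CHAR_INDEX[ch]] = 1
        vectors.append(vec)
    return vectors


sub, main, tld = split_domain("mail.example.com")
encoded_main = one_hot_chars(main)  # one vector per character of "example"
```

A production system would more likely resolve the top-level domain name against a published suffix list rather than by position.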

In some examples, as shown in fig. 2, the model prediction phase includes:

(2) Each character in each domain name string is one-hot encoded.

(3) For the domain name, the encoded main domain name and the encoded sub domain name are each input into the corresponding pair of RNNs obtained in the model training stage (one pair for the main domain name and one pair for the sub domain name), and the likelihood ratio is calculated for each. Note that the ratio is not taken between the RNN outputs of the main domain name and those of the sub-domain name; rather, a generalized likelihood ratio is calculated separately for the main domain name and for the sub-domain name.

(4) For the domain name, the generalized likelihood ratio of the main domain name and the generalized likelihood ratio of the sub-domain name obtained in step (3), together with the one-hot code of the top-level domain name, are input into the logistic regression classifier. The output value is the probability that the input domain name is a DGA domain name, which is also the classification result. For example, the classification result may be: P(input domain name is DGA) = 0.9 and P(input domain name is not DGA) = 0.1.
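Step (4) of the prediction phase may be illustrated by the following sketch, which concatenates the two ratio features with the one-hot code of the top-level domain name and applies a logistic function. The weights, the bias, the three-entry one-hot vector and the function name `classify` are hypothetical values chosen for illustration; in the embodiments the weights would come from the trained classifier.

```python
import math


def classify(main_ratio, sub_ratio, tld_one_hot, weights, bias):
    """Return P(DGA) and P(non-DGA) for one domain name."""
    features = [main_ratio, sub_ratio] + list(tld_one_hot)
    if len(features) != len(weights):
        raise ValueError("feature vector and weight vector differ in length")
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    # Numerically stable logistic function.
    if logit >= 0:
        p_dga = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        p_dga = e / (1.0 + e)
    return {"dga": p_dga, "non_dga": 1.0 - p_dga}


# Hypothetical trained parameters: two ratio weights, then one weight per TLD slot.
example_weights = [1.0, 0.5, 0.0, 0.8, -0.3]
example_bias = -0.2
result = classify(2.0, 1.0, [0, 1, 0], example_weights, example_bias)
```

The two returned values always sum to one, matching the form of the example classification result given above.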

Note that in some examples, after prediction, the model may be further optimized using the prediction results.

In some DGA detection methods, the RNN may be trained directly to classify a string. This has the problem that the back propagation requires that each string be completely entered, which slows down the training.

In some embodiments of the invention, character-level LSTM is used as the RNN model, cross entropy is used as the loss function for back propagation, and cross entropy between LSTM output and the true value of the next beat of the string is calculated for back propagation in each beat, which speeds up training. The cross-entropy loss can be expressed as:

where c represents the sample number in the set, y_c represents the sample label, x represents the features of the sample (a one-hot vector), and θ_LSTM represents the parameters of the LSTM. For a general RNN, θ_LSTM may be written simply as θ.
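The loss formula itself is not reproduced in this text. As a hedged illustration of the per-beat computation described above, the following sketch sums a generic next-character cross entropy over the time beats of one string. It is a generic formulation, not necessarily identical to the formula of the embodiments; the dictionary representation of the predicted distributions and the name `next_char_cross_entropy` are assumptions.

```python
import math


def next_char_cross_entropy(predicted, targets):
    """Sum of per-beat cross entropies for one string.

    predicted[t] maps each candidate character to its predicted probability
    at time beat t; targets[t] is the true next character at that beat.
    With a one-hot true value, the cross entropy at a beat reduces to
    -log(probability assigned to the true next character).
    """
    if len(predicted) != len(targets):
        raise ValueError("one prediction is required per target character")
    total = 0.0
    for dist, true_char in zip(predicted, targets):
        total += -math.log(dist[true_char])
    return total


# Hypothetical predictions for the string "abc": inputs "a", "b"; targets "b", "c".
predicted = [
    {"a": 0.1, "b": 0.8, "c": 0.1},
    {"a": 0.25, "b": 0.25, "c": 0.5},
]
loss = next_char_cross_entropy(predicted, ["b", "c"])
```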

FIG. 3 is a block flow diagram of feature extraction provided by an embodiment of the present invention, which schematically illustrates the RNN-based generalized likelihood ratio test model proposed by the present invention. Note that although fig. 3 illustrates an LSTM-based RNN, the process is also applicable to general RNNs. In some embodiments of the invention, two LSTM models are trained for each of the main domain name and the sub-domain name: one LSTM model is trained on DGA samples and the other on non-DGA samples, and the output of each trained model may be considered a likelihood function p(x | θ). A domain name sample is input into both LSTMs simultaneously; each output is a fit of a likelihood function, so the ratio of the outputs can be regarded as a generalized likelihood ratio.
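The two-model arrangement can be illustrated with a deliberately simplified stand-in. The sketch below replaces each LSTM with a smoothed character bigram model, which is a different and much weaker technique, solely to show how one model fitted on DGA samples and one fitted on non-DGA samples yield a likelihood ratio feature. The training strings, the boundary markers and all names are assumptions introduced for illustration.

```python
import math
import string
from collections import defaultdict

START, END = "^", "$"  # hypothetical boundary markers


class BigramCharModel:
    """Stand-in for a character-level RNN: estimates p(next char | current char)."""

    def __init__(self, alphabet):
        self.vocab_size = len(alphabet) + 1  # +1 for the END marker
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, strings):
        for s in strings:
            chars = [START] + list(s) + [END]
            for cur, nxt in zip(chars, chars[1:]):
                self.counts[cur][nxt] += 1
        return self

    def log_likelihood(self, s):
        """Log of the product of next-character probabilities along s."""
        chars = [START] + list(s) + [END]
        total = 0.0
        for cur, nxt in zip(chars, chars[1:]):
            row = self.counts.get(cur, {})
            # Add-one smoothing keeps unseen transitions at non-zero probability.
            prob = (row.get(nxt, 0) + 1) / (sum(row.values()) + self.vocab_size)
            total += math.log(prob)
        return total


def log_likelihood_ratio(s, dga_model, benign_model):
    """Log of p(s | DGA model) / p(s | non-DGA model)."""
    return dga_model.log_likelihood(s) - benign_model.log_likelihood(s)


# Hypothetical training strings, for illustration only.
dga_samples = ["xkqzjvw", "qzxvkjw", "zqxjvkq", "vkxqzjw"]
benign_samples = ["google", "mail", "example", "weather"]
dga_model = BigramCharModel(string.ascii_lowercase).fit(dga_samples)
benign_model = BigramCharModel(string.ascii_lowercase).fit(benign_samples)

score_random = log_likelihood_ratio("qzxkvj", dga_model, benign_model)
score_word = log_likelihood_ratio("male", dga_model, benign_model)
```

Working in the log domain turns the ratio into a difference and avoids underflow on long strings; a positive score indicates that the DGA-fitted model explains the string better than the non-DGA model.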

The two LSTMs are called LSTM DGA model and LSTM Non-DGA model, respectively, and the output of both models can be expressed as the following formula:

where x is a domain name string sample and x_i is the i-th character in the sample string; θ is the parameter of a given LSTM and is a unified notation for the θ_LSTM above. Applying the formula to the LSTM DGA and LSTM Non-DGA models respectively gives the likelihood functions, denoted p(x | θ_dga) and p(x | θ_non-dga). The generalized likelihood ratio is then calculated:
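The formulas referred to in this passage are not reproduced in this text. As a hedged sketch only, assuming the usual chain-rule factorization of a character-level sequence model (an assumption, not the original formula), and taking the ratio as the output of the DGA network over the output of the non-DGA network as stated in the abstract, the likelihood and the ratio could be written as follows. The symbols n (string length) and Λ are introduced here for illustration.

```latex
p(x \mid \theta) = \prod_{i=1}^{n} p\left(x_i \mid x_1, \dots, x_{i-1};\, \theta\right),
\qquad
\Lambda(x) = \frac{p\left(x \mid \theta_{dga}\right)}{p\left(x \mid \theta_{non\text{-}dga}\right)}
```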

Because top-level domain names are very short (usually 2 to 3 characters), it is not necessary to train an RNN for feature extraction. The invention directly encodes the top-level domain name as a one-hot vector, matching against the 249 most frequently used top-level domain names; if there is no match, the top-level domain name is encoded as "others", giving 250 binary features in total. The invention adopts a top-level domain name list published by http:// publicicsuffix. The most common top-level domain names are .com, .org, .ru, .net, and .info, among others. We find that the top-level domain names .ru, .info, .biz, and .cc contain more DGA domain names, essentially 3 times the number of non-DGA domain names. Because the top-level domain name feature vectors are independent, the model of the invention can also be used to learn which top-level domain names domain generation algorithms are more likely to adopt.
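The one-hot encoding with an "others" slot may be sketched as follows. The five-entry list of known top-level domain names is a shortened, hypothetical stand-in for the 249-entry list, and the name `build_tld_encoder` is an assumption.

```python
def build_tld_encoder(known_tlds):
    """Return a function mapping a TLD to a one-hot vector.

    The vector has one slot per known TLD plus a final "others" slot,
    so a list of 249 known TLDs yields 250 binary features.
    """
    index = {tld: i for i, tld in enumerate(known_tlds)}
    others = len(known_tlds)

    def encode(tld):
        vec = [0] * (len(known_tlds) + 1)
        key = tld.lower().lstrip(".")
        vec[index.get(key, others)] = 1  # unmatched TLDs fall into "others"
        return vec

    return encode


# Hypothetical shortened list; the embodiments use the 249 most frequent TLDs.
encode_tld = build_tld_encoder(["com", "org", "ru", "net", "info"])
ru_vector = encode_tld("ru")
unknown_vector = encode_tld(".xyz")
```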

In summary, the invention provides an RNN-based method for identifying domain names generated by a DGA. The domain name feature extraction part first divides the domain name into a main domain name, a sub-domain name and a top-level domain name, performs feature extraction on the main domain name and the sub-domain name using a character-level recurrent neural network, and performs model training using the character-level prediction cross entropy as the loss function. The generalized likelihood ratio of the DGA-LSTM and the non-DGA-LSTM predictions is calculated and used as the basis for the DGA judgment. The ratios are then input, together with the one-hot vector of the top-level domain name, into a logistic regression model for binary classification, which finally outputs the probability that a character string is a DGA domain name.

The invention provides a robust monitoring method for network systems facing security threats and attacks.

1. A domain name feature extraction method based on a character-level RNN, characterized in that back propagation of the loss function is performed at each time beat, rather than using as the loss function the cross entropy between the output for the whole character string and the label of whether the string is a DGA domain name.

That is, in the character-level RNN employed by some embodiments of the invention, the loss function in the training stage is the cross entropy between the output of each beat and the true value of the next beat, and the error is back-propagated accordingly; if the length of the character string is n, the error is back-propagated n times. In other examples, the loss function of the RNN model training stage is the cross entropy between the output value after the last character of the string is input into the RNN and the classification truth value of the string, so that each string is back-propagated once.
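The difference between the two training signals can be made concrete by listing the targets each produces for one string. In the sketch below, the end-of-string marker and both function names are assumptions; with the marker appended, a string of length n yields n next-character targets, whereas whole-string classification yields a single target.

```python
END = "$"  # hypothetical end-of-string marker


def next_char_targets(s):
    """Per-beat supervision: one (input character, true next character) pair per beat."""
    chars = list(s) + [END]
    return list(zip(chars[:-1], chars[1:]))


def final_label_target(s, is_dga):
    """Whole-string supervision: a single (string, class label) pair."""
    return [(s, int(is_dga))]


per_beat = next_char_targets("abc")
per_string = final_label_target("abc", True)
```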

2. Two LSTM models (instead of one) are trained, and the generalized likelihood ratio of the two models is calculated as the main feature for determining whether a domain name is a DGA domain name.

3. Logistic regression training is performed by combining the generalized likelihood ratios of the main domain name and the sub-domain name with the one-hot vector of the top-level domain name, for monitoring DGA domain names.

It should be noted that the DGA domain name detection method provided by the present invention can be used to classify DGA homologous attacks and to determine whether a DGA homologous attack is present. Here, a homologous attack means that an attacker uses a DGA to generate domain names serially: if the attacker fails to connect to the resolved IP, another domain name is generated using the DGA and the attacker tries again, until a connection succeeds.

Fig. 4 is a schematic block diagram illustrating a DGA domain name detection model training apparatus according to one embodiment of the present invention. As shown in fig. 4, the DGA domain name detection model training apparatus 100 according to the embodiment of the present disclosure includes a memory 101 and a processor 102.

The memory 101 is used to store non-transitory computer readable instructions. In particular, memory 101 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc.

The processor 102 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the DGA domain name detection model training apparatus 100 to perform desired functions. In an embodiment of the present disclosure, the processor 102 is configured to execute the computer readable instructions stored in the memory 101, so that the DGA domain name detection model training apparatus 100 performs all or part of the aforementioned steps of the DGA domain name detection model training method according to the embodiments of the present disclosure.

Fig. 5 is a schematic block diagram illustrating a DGA domain name detection apparatus according to one embodiment of the present invention. As shown in fig. 5, the DGA domain name detection apparatus 200 according to the embodiment of the present disclosure includes a memory 201 and a processor 202.

The memory 201 is used to store non-transitory computer readable instructions. In particular, memory 201 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc.

The processor 202 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the DGA domain name detection apparatus 200 to perform desired functions. In one embodiment of the present disclosure, the processor 202 is configured to execute the computer readable instructions stored in the memory 201, so that the DGA domain name detection apparatus 200 performs all or part of the aforementioned steps of the DGA domain name detection method according to the embodiments of the present disclosure.

Those skilled in the art should understand that, in order to solve the technical problem of how to obtain a good user experience, the present embodiment may also include well-known structures such as a communication bus, an interface, and the like, and these well-known structures should also be included in the protection scope of the present invention.

For the detailed description and the technical effects of the present embodiment, reference may be made to the corresponding descriptions in the foregoing embodiments, which are not repeated herein.

An embodiment of the present invention further provides a computer storage medium, where a computer instruction is stored in the computer storage medium, and when the computer instruction runs on a device, the device executes the above related method steps to implement the DGA domain name detection model training method or the DGA domain name detection method in the above embodiments.

Embodiments of the present invention further provide a computer program product, which when running on a computer, causes the computer to execute the above-mentioned related steps to implement the DGA domain name detection model training method or the DGA domain name detection method in the above-mentioned embodiments.

In addition, the embodiment of the present invention further provides an apparatus, which may specifically be a chip, a component or a module, and the apparatus may include a processor and a memory connected to each other; when the device runs, the processor can execute the computer execution instructions stored in the memory, so that the chip can execute the DGA domain name detection model training method or the DGA domain name detection method in the above method embodiments.

The apparatus, the computer storage medium, the computer program product, or the chip provided by the present invention are all configured to execute the corresponding methods provided above, and therefore, the beneficial effects achieved by the apparatus, the computer storage medium, the computer program product, or the chip may refer to the beneficial effects in the corresponding methods provided above, and are not described herein again.

Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A DGA domain name detection model training method is characterized by comprising the following steps:

step S1, obtaining domain name information of a plurality of domain name samples;

step S2, for DGA domain names and non-DGA domain names in the domain name samples, respectively training a recurrent neural network for feature extraction according to at least part of the domain name information; calculating the ratio of the output of the recurrent neural network corresponding to the DGA domain name to the output of the recurrent neural network corresponding to the non-DGA domain name as the characteristic of at least part of the domain name information;

step S3, inputting the features into a classifier for training to obtain a DGA domain name classifier, and determining whether a domain name comes from a domain name generation algorithm by using the DGA domain name classifier.

2. The DGA domain name detection model training method of claim 1, wherein: the output of the recurrent neural network corresponding to the DGA domain name and the output of the recurrent neural network corresponding to the non-DGA domain name in the step S2 are both a fit of a likelihood function p (x | θ); wherein x is at least a portion of the domain name information, θ is a parameter of the recurrent neural network, and p (x | θ) represents a probability of x given θ.

3. The DGA domain name detection model training method of claim 2, wherein:

the domain name information comprises a character string of a domain name;

the training, in the step S2, of the DGA domain name and the non-DGA domain name in the domain name samples, according to at least a part of the domain name information, respectively, for a recurrent neural network for feature extraction includes:

and for one or more of the DGA domain name and the non-DGA domain name, adopting a character-level recurrent neural network to extract features, and adopting character-level cross entropy as a loss function to perform back propagation.

4. The DGA domain name detection model training method of claim 3, wherein:

the character-level recurrent neural network is adopted for feature extraction, and the method comprises the following steps: the output of each time beat of the recurrent neural network, except for the last time beat, is the probability of the next character occurring; the output of the last time beat of the recurrent neural network is a fit of the likelihood function representing the probability that the string of the domain name is a DGA or a non-DGA;

the backward propagation by using the character-level cross entropy as the loss function comprises the following steps: the recurrent neural network performs back propagation of a loss function at each time beat, specifically including calculating a cross entropy between an output of a recurrent unit and a true value of a next beat in the character string of the domain name at each time beat and performing back propagation.

5. The DGA domain name detection model training method of claim 4, wherein:

the output of the recurrent neural network corresponding to the DGA domain name or the output of the recurrent neural network corresponding to the non-DGA domain name is:

6. The DGA domain name detection model training method of claim 3,

prior to the step S2, the method further includes: encoding each character in the character string of the domain name in a one-hot manner;

training the recurrent neural network for feature extraction according to at least a part of the domain name information in the step S2 includes: training the recurrent neural network according to the character string of the domain name subjected to the one-hot encoding.

7. The DGA domain name detection model training method of claim 2,

the step S1 specifically includes: acquiring a main domain name and a sub domain name of the domain name sample;

the step S2 specifically includes: for the main domain name and the sub domain name, respectively training the recurrent neural network corresponding to the DGA domain name and the recurrent neural network corresponding to the non-DGA domain name, and respectively determining the ratio corresponding to the main domain name and the ratio corresponding to the sub domain name as the characteristics of the main domain name and the sub domain name;

the step S3 specifically includes: inputting the characteristics of the main domain name and the characteristics of the sub domain name into a classifier for training.

8. The DGA domain name detection model training method of claim 7,

the step S1 further includes: extracting a top-level domain name of the domain name sample;

prior to the step S3, the method further includes: encoding the top-level domain name in a one-hot manner to obtain a top-level domain name one-hot vector as the feature of the top-level domain name;

the step S3 specifically includes: and inputting the characteristics of the main domain name, the characteristics of the sub domain name and the characteristics of the top-level domain name of the domain name samples into a classifier together for training to obtain the DGA domain name classifier.

9. The DGA domain name detection model training method of claim 1, wherein the step S3 specifically comprises:

training a binary logistic regression classifier using the features as inputs and using the label of whether the domain name is a DGA domain name as supervisory information.

10. The DGA domain name detection model training method of claim 1, wherein:

the recurrent neural network utilized in the step S2 is a recurrent neural network based on long-short term memory or a recurrent neural network based on gated recurrent units.

11. A DGA domain name detection method is characterized by comprising the following steps:

acquiring domain name information of a domain name to be detected in the same way as in step S1 of the DGA domain name detection model training method according to any one of claims 1 to 10;

inputting at least a part of the domain name information of the domain name to be detected into the recurrent neural network corresponding to the DGA domain name and the recurrent neural network corresponding to the non-DGA domain name obtained in step S2 of the DGA domain name detection model training method according to any one of claims 1 to 10, and calculating a ratio of outputs of the two recurrent neural networks as a feature of at least a part of the domain name information of the domain name to be detected;

inputting the characteristics of the domain name to be detected into the DGA domain name classifier obtained in step S3 of the DGA domain name detection model training method according to any one of claims 1 to 10, so as to determine whether the domain name to be detected is from a domain name generation algorithm.

12. A DGA domain name detection model training device comprises:

a memory for storing non-transitory computer readable instructions; and

a processor for executing the computer readable instructions such that the computer readable instructions, when executed by the processor, implement the DGA domain name detection model training method of any one of claims 1 to 10.

13. A DGA domain name detection apparatus comprising:

a memory for storing non-transitory computer readable instructions; and

a processor for executing the computer readable instructions such that the computer readable instructions, when executed by the processor, implement the DGA domain name detection method of claim 11.

14. A computer storage medium comprising computer instructions which, when run on a device, cause the device to perform a DGA domain name detection model training method as claimed in any one of claims 1 to 10 or to perform a DGA domain name detection method as claimed in claim 11.