CN107645503B

CN107645503B - Rule-based method for detecting DGA family to which malicious domain name belongs

Info

Publication number: CN107645503B
Application number: CN201710855704.0A
Authority: CN
Inventors: 程华才; 范渊; 李凯
Original assignee: Hangzhou Dbappsecurity Technology Co Ltd
Current assignee: Hangzhou Dbappsecurity Technology Co Ltd
Priority date: 2017-09-20
Filing date: 2017-09-20
Publication date: 2020-01-24
Anticipated expiration: 2037-09-20
Also published as: CN107645503A

Abstract

The invention relates to the field of network security APT detection and aims to provide a rule-based method for detecting the DGA family to which a malicious domain name belongs. The method is used for analyzing and detecting malicious domain names and for identifying the DGA family of the virus that has infected an attacked computer device in the network. The method computes features over the large number of abnormal domain names that a bot program requests within a short time, matches the result against domain name feature rules derived from known DGA algorithms, and thereby rapidly identifies the DGA family type associated with the bot program infecting a given computer device in the current network. This facilitates subsequent tracing of the network attack and the cleanup and remediation of the bot program.

Description

Rule-based method for detecting DGA family to which malicious domain name belongs

Technical Field

The invention relates to the field of network security APT (Advanced Persistent Threat) detection, in particular to a rule-based method for detecting a DGA family to which a malicious domain name belongs.

Background

The Domain Name System (DNS), one of the important infrastructures for internet services, serves as a distributed database that maps domain names and IP addresses to each other, so that users can access the internet conveniently without having to remember the numeric IP addresses that machines read directly. Most current internet applications rely on the domain name system to resolve a domain name to an IP address before their specific services can be carried out.

A botnet, a research topic in the field of network security, is a network in which a large number of hosts are infected, through one or more propagation means, with a bot program (a malicious program such as a worm or a trojan), so that a one-to-many control network is formed between the controller and the infected hosts. Most botnets still use DNS to acquire resources, locate servers, receive instructions, and so on. To improve their survivability, hide better, remain flexible and prolong their lifetime, botnets use detection-evasion techniques, among them the Domain Generation Algorithm (DGA, also commonly called the DGA algorithm), which bot programs commonly use to communicate with the C&C (Command & Control) server of their controller. A DGA is generally a private random string generation algorithm that takes the date or other variable parameters as its random seed (i.e., the input parameters of the algorithm) and generates a batch of random strings every period (e.g., every day, every week, or every 10 days). The attacker registers some of these random strings as domain names of the C&C server (called malicious domain names or DGA domain names); the malicious program generates the same random domain names with the same algorithm and then tries to initiate DNS resolution requests for them. When a resolution request succeeds, the malicious program tries to communicate with the IP address returned by the resolution; if the communication succeeds, the malicious program has found the C&C server that controls it and goes on to perform other operations, such as receiving follow-up task instructions from the C&C server or uploading collected internal network information to the C&C server.
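The date-seeded generation scheme described above can be illustrated with a deliberately simplified sketch. It is a toy and not the algorithm of any real malware family: the function name `toy_dga`, the use of SHA-256, the letter mapping and the fixed `.com` TLD are all assumptions made for illustration. It only shows why two parties that share the algorithm and the date seed derive the same candidate list.

```python
import hashlib
from datetime import date
from typing import List


def toy_dga(seed_date: date, count: int = 5, length: int = 10) -> List[str]:
    """Derive `count` pseudo-random domain names from a date seed (illustrative only)."""
    domains = []
    for index in range(count):
        # Hash the date together with a counter so that every index yields a different name.
        digest = hashlib.sha256(f"{seed_date.isoformat()}-{index}".encode()).hexdigest()
        # Map each hex digit onto one of the letters a-p to obtain an alphabetic SLD.
        sld = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:length])
        domains.append(sld + ".com")
    return domains


# The same seed always yields the same list, so both sides agree without communicating.
controller_view = toy_dga(date(2017, 9, 20))
bot_view = toy_dga(date(2017, 9, 20))
print(controller_view == bot_view)
```

Because the names depend only on the seed, a defender who knows the algorithm can generate the same sets in advance, which is what makes the sample collection in step one possible.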

Since there are various botnets (a botnet that communicates through domain names generated by a DGA algorithm is also called a DGA family; a DGA family generally corresponds to one DGA algorithm or a group of similar DGA algorithms), there are correspondingly various DGA algorithms, and the characteristics of the domain names they generate differ. Domain names generated by the same DGA algorithm, however, show the same or similar lexical characteristics: the length of the SLD (Second Level Domain), for example a fixed length or an interval range; the character set used by the SLD (i.e., which alphabetic and numeric characters it may contain); the range of the TLD (Top Level Domain), for example one fixed top-level domain, several selectable top-level domains, or an undetermined one that depends on the input parameters of the algorithm; and so on.
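The lexical characteristics named above are all computed on the SLD and TLD parts of a name. A minimal sketch of that split follows; the function name `split_domain` is hypothetical, and the sketch assumes that the TLD is simply the last label, so multi-label public suffixes such as `co.uk` are not handled.

```python
from typing import Tuple


def split_domain(domain: str) -> Tuple[str, str]:
    """Return (sld, tld) for a name such as 'example.com'.

    Simplification: the TLD is taken to be the last label and the SLD the
    label before it; any further labels to the left are ignored.
    """
    labels = domain.strip().lower().rstrip(".").split(".")
    if len(labels) < 2:
        raise ValueError(f"not a registrable domain name: {domain!r}")
    return labels[-2], labels[-1]


sld, tld = split_domain("hrxzdi.net")
print(sld, len(sld), tld)  # SLD, its length, and the TLD
```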

After detecting a possible worm or trojan in the network, technicians can trace the source of the network attack, finally identify and remove the virus and the botnet possibly related to it, and take appropriate remedial measures. Cleanup methods and remedial measures may differ from virus to virus; identifying the DGA family of the virus that has infected a computer device in the network is therefore a very important step in the overall attack-tracing process.

Disclosure of Invention

The main aim of the invention is to overcome the defects of the prior art and to provide a rule-based method for detecting the DGA family to which a malicious domain name belongs. To solve this technical problem, the solution of the invention is as follows: a rule-based method for detecting the DGA family to which a malicious domain name belongs is provided, which is used for analyzing and detecting malicious domain names and for identifying the DGA family type of the virus that has infected an attacked computer device in the network (a botnet that communicates through domain names generated by a DGA algorithm is also called a DGA family; one DGA family generally corresponds to one DGA algorithm or a group of similar DGA algorithms). The method comprises the following steps:

step one: collecting data related to DGA algorithms on the Internet, and obtaining domain name samples generated by the DGA algorithms;

step two: for each DGA algorithm, selecting domain name set features to be analyzed to finally form a total feature list, selecting at least one training sample for each DGA algorithm, calculating and summarizing feature values corresponding to the training samples for each feature in the feature list, and finally forming a feature matrix, wherein the method specifically comprises the following substeps:

step (2A): taking the domain name set that each DGA algorithm from step one generates within one time period, such as one day or one week, as a sample (for example, the 100 domain names that the proslikefan algorithm generates every day), i.e., one domain name set is one sample, and listing the domain name features to be analyzed;

the domain name features to be analyzed are one or more features that can distinguish the domain names in the domain name set from ordinary domain names and from domain names generated by other DGA algorithms, such as: the SLD length, whether the SLD character string contains vowels, the minimum and maximum ratio of vowels in the SLD, whether the SLD contains digits, the number of domain names generated by the algorithm in one time period, the TLD list, and the like;

step (2B): obtaining the domain name features to be analyzed for each sample through step (2A) (that is, for each DGA algorithm, one sample is selected and its domain name features to be analyzed are listed; if 50 DGA algorithms were collected in step one, 50 samples are selected and the features of each are listed), then taking the union of these features to obtain the final feature list;

for example, for the domain name sample generated by the DGA algorithm pykspa, the domain name features to be analyzed are:

TLD list: TLD (Top Level Domain) denotes the top-level domain name; the TLD list indicates which top-level domains the domain names generated by the DGA algorithm may use;

SLD length range: SLD (Second Level Domain) denotes the second-level domain name; the SLD length range is the range of lengths of the second-level domain part of the domain names generated by the DGA algorithm;

SLD character value range: i.e., which characters a second-level domain name may be composed of;

time period in which the algorithm generates domain names: the algorithm generates a different domain name set at regular intervals, and within one time period all domain names that a malicious program using the algorithm requests to resolve belong to that set;

number of domain names generated per time period: i.e., the number of domain names the algorithm generates within one generation time period;

minimum ratio of SLDs containing vowels: i.e., at least this percentage of the domain names generated by the DGA algorithm pykspa contain vowels in the SLD;

for example, for the domain name sample generated by the DGA algorithm qadars, the domain name features to be analyzed are:

a TLD list;

SLD length range;

the value range of the SLD characters;

the algorithm generates a time period of the domain name;

the number of domain names generated per time period (the five features above are the same as the corresponding features to be analyzed for the algorithm pykspa);

whether the SLD switches between alphabetic and numeric characters: i.e., whether alphabetic characters and numeric characters alternate within the second-level domain name, as in the domain name 05qj09mf4d2b.com;

taking the union of the domain name features to be analyzed for the samples generated by these two DGA algorithms yields a feature list containing seven features:

TLD list, SLD length range, SLD character value range, time period in which the algorithm generates domain names, number of domain names generated per time period, minimum ratio of SLDs containing vowels, and whether the SLD switches between alphabetic and numeric characters;
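The union in step (2B) can be sketched as an order-preserving merge of the per-algorithm feature lists. The helper name `union_features` and the short English feature labels are illustrative choices, not identifiers taken from the patent.

```python
PYKSPA_FEATURES = [
    "TLD list",
    "SLD length range",
    "SLD character value range",
    "generation time period",
    "domain names per time period",
    "minimum ratio of SLDs containing vowels",
]
QADARS_FEATURES = [
    "TLD list",
    "SLD length range",
    "SLD character value range",
    "generation time period",
    "domain names per time period",
    "SLD switches between letters and digits",
]


def union_features(*feature_lists):
    """Order-preserving union of several per-algorithm feature lists."""
    merged = []
    for features in feature_lists:
        for name in features:
            if name not in merged:
                merged.append(name)
    return merged


feature_list = union_features(PYKSPA_FEATURES, QADARS_FEATURES)
print(len(feature_list))  # 7: five shared features plus one specific to each algorithm
```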

step (2C): selecting at least two domain name sets (i.e., at least two samples) generated by each DGA algorithm and dividing them into two parts: one part serves as training samples used for creating rules, the number of training samples being M; the other part serves as test samples used for testing the identification accuracy of the rules;
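The division into training and test samples could look like the sketch below. The patent does not specify a split ratio or a selection procedure, so the 50% default, the seeded shuffle and the name `split_samples` are assumptions.

```python
import random


def split_samples(samples, train_fraction=0.5, seed=0):
    """Shuffle one family's domain name sets and split them into (training, test)."""
    if len(samples) < 2:
        raise ValueError("at least two samples are needed per DGA algorithm")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one sample on each side of the split.
    cut = min(max(1, round(len(shuffled) * train_fraction)), len(shuffled) - 1)
    return shuffled[:cut], shuffled[cut:]


# Four daily domain name sets of one family (contents abbreviated to one name each).
daily_sets = [["a.com"], ["b.com"], ["c.com"], ["d.com"]]
training, test = split_samples(daily_sets)
print(len(training), len(test))  # 2 2
```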

for each of the M training samples, calculating or summarizing, from the domain names in the sample, the value of each feature in the feature list obtained in step (2B); with N features in the feature list, N feature values are obtained for each sample, and the DGA family type to which the sample belongs is appended; after all M training samples have been processed, a feature matrix with M rows and N+1 columns is formed, where M and N are natural numbers greater than zero;

for example, assume that the feature list obtained in step (2B) is as follows (in a specific embodiment, the feature list may contain tens of features): TLD list, SLD length range, SLD character value range, time period in which the algorithm generates domain names, number of domain names generated per time period, minimum ratio of SLDs containing vowels, and whether the SLD switches between alphabetic and numeric characters;

the following example calculates and generalizes the values of the individual features of the algorithms pykspa and qadars from the feature list:

the algorithm pykspa generates 800 domain names per day, for example these four:

hrxzdi.net,llwfnz.info,tknutifsxwh.com,kqcmxjplngd.org

for the 800 domain names, the following feature values are obtained through calculation and summarization (the text to the left of the character ':' is the feature name, and the text to the right is the feature value):

TLD list: com, net, org, info; i.e., the domain names generated by the algorithm may use four top-level domains;

SLD length range: 6-12; i.e., the length of the second-level domain name of the generated domain names ranges from 6 to 12;

SLD character value range: a-z; i.e., the characters of the second-level domain name are drawn from the 26 letters a-z;

time period in which the algorithm generates domain names: day; i.e., the algorithm generates one domain name set every day;

number of domain names generated per time period: 800; i.e., the domain name set generated each day contains 800 domain names;

minimum ratio of SLDs containing vowels: 70%; i.e., at least 70% of the domain names generated by the algorithm contain vowels in the second-level domain name;

whether the SLD switches between alphabetic and numeric characters: no; i.e., the domain names generated by the algorithm do not switch between alphabetic and numeric characters;

DGA family type: pykspa

the algorithm qadars generates 1800 domain names per week, for example these four:

7wpyj01ijol2.org,k9ijkhiz8hy7.org,jkhu7w123whu.net,1if05u3gtevs.com

for the 1800 domain names, the following feature values are obtained through calculation and summarization:

TLD list: com, net, org; i.e., the domain names generated by the algorithm may use three top-level domains;

SLD length range: 12; i.e., the length of the second-level domain name is fixed at 12 characters;

SLD character value range: a-z, 0-9; i.e., the characters of the second-level domain name are drawn from a-z and 0-9;

time period in which the algorithm generates domain names: week; i.e., the algorithm generates one domain name set every week;

number of domain names generated per time period: 1800; i.e., the domain name set generated each week contains 1800 domain names;

minimum ratio of SLDs containing vowels: 95%; i.e., at least 95% of the domain names generated by the algorithm contain vowels in the second-level domain name;

whether the SLD switches between alphabetic and numeric characters: yes; i.e., the domain names generated by the algorithm switch between alphabetic and numeric characters;

DGA family type: qadars

For the above example, a feature matrix with two rows (i.e., two DGA algorithms, one training sample each) and eight columns (i.e., seven feature values plus one column for the DGA family type) is obtained; the feature matrix can be regarded as a data representation of the samples;
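A minimal sketch of how one feature-matrix row might be computed from a domain name set is given below, using the four example names of each algorithm quoted above. The function names and the row layout are assumptions. The generation period cannot be derived from the names themselves, so it is passed in, and the vowel feature is reported as the observed ratio for the sample rather than as the rule threshold, so the values for four names differ from those summarized over 800 or 1800 names.

```python
VOWELS = set("aeiou")


def has_alpha_digit_switch(sld):
    """True if a letter is directly followed by a digit, or a digit by a letter."""
    return any(
        (a.isalpha() and b.isdigit()) or (a.isdigit() and b.isalpha())
        for a, b in zip(sld, sld[1:])
    )


def sample_features(domains, period, family):
    """Summarize one domain name set as a row: seven feature values plus the family."""
    slds = [d.rsplit(".", 2)[-2] for d in domains]        # label left of the TLD
    tlds = sorted({d.rsplit(".", 1)[-1] for d in domains})
    lengths = [len(s) for s in slds]
    chars = set("".join(slds))
    classes = []
    if any(c.isalpha() for c in chars):
        classes.append("a-z")
    if any(c.isdigit() for c in chars):
        classes.append("0-9")
    vowel_ratio = sum(1 for s in slds if VOWELS & set(s)) / len(slds)
    return [
        tlds,                                   # TLD list
        (min(lengths), max(lengths)),           # SLD length range
        ",".join(classes),                      # SLD character value range
        period,                                 # generation time period (supplied)
        len(domains),                           # number of names in this sample
        vowel_ratio,                            # observed share of SLDs with a vowel
        any(has_alpha_digit_switch(s) for s in slds),
        family,                                 # DGA family type column
    ]


PYKSPA_SAMPLE = ["hrxzdi.net", "llwfnz.info", "tknutifsxwh.com", "kqcmxjplngd.org"]
QADARS_SAMPLE = ["7wpyj01ijol2.org", "k9ijkhiz8hy7.org", "jkhu7w123whu.net", "1if05u3gtevs.com"]

feature_matrix = [
    sample_features(PYKSPA_SAMPLE, "day", "pykspa"),
    sample_features(QADARS_SAMPLE, "week", "qadars"),
]
for row in feature_matrix:
    print(row)
```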

step three: the rule creating function creates DGA family type classification rules from the feature matrix of the training samples; a DGA family type classification rule is a classification rule created by processing the output of step (2C) (the feature matrix with M rows and N+1 columns) with a machine learning algorithm such as the C4.5 decision tree algorithm or the K-nearest neighbor algorithm, and it can identify the DGA family type to which a given domain name set belongs;

after the DGA family type classification rules are created, they are stored in a configuration file or a relational database and are loaded and used by a detection module; the detection module is used for detecting the DGA family type of a specific domain name set sample;

for example, the domain names accessed by a given computer device within a time period such as 30 minutes are regarded as one domain name set, i.e., one sample; the values of the sample's features (i.e., each feature in the feature list finally obtained in step (2B)) are calculated and summarized and used as the input of the detection module;

the detection module is a program that detects the DGA family type of a specific domain name set sample; given the feature values of a domain name set sample as input, it determines the DGA family type of the sample according to the created DGA family type classification rules, or reaches another conclusion, for example that the domain names are normal, i.e., that the domain names in the set were not generated by a DGA algorithm (see the description of step eight);

for the DGA family type classification rules, two creation modes are set out further below (modes (3A) and (3B)); fig. 3 of the specification is an exemplary model in which DGA family classification rules are created with a decision tree; the model includes only 7 DGA family types, namely pykspa, madmax, shifu, qadars, mirai, rovnix and murofet, and uses only 3 features, namely the SLD length range, the TLD list and the SLD character value range; in specific embodiments, there may be dozens of DGA family types and more than 10 features; given the features of an input domain name set sample, the detection module descends from the root of the tree level by level, following at each level the branch that matches the specific feature value, until a specific DGA family type is found or an unknown result is obtained (indicating that the search failed, i.e., no DGA family type with matching features was found; see step eight); the other way, using configured rules, is described in mode (3B) below;
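The root-to-leaf lookup described above can be sketched as follows. The tree below is not the model of fig. 3; it is a hypothetical two-family tree built only from the pykspa and qadars feature values quoted in step (2C), and the node layout (a dict with a feature name and a list of predicate/child pairs) is an assumption.

```python
def classify(tree, features):
    """Walk from the root to a leaf; return the family name or 'unknown'."""
    node = tree
    while isinstance(node, dict):
        value = features[node["feature"]]
        for predicate, child in node["branches"]:
            if predicate(value):
                node = child          # descend one level along the matching branch
                break
        else:
            return "unknown"          # no branch matches this feature value
    return node                       # a leaf is simply the family name


TOY_TREE = {
    "feature": "charset",
    "branches": [
        (lambda v: v == "a-z", {
            "feature": "sld_length",
            "branches": [
                (lambda v: 6 <= v[0] and v[1] <= 12, {
                    "feature": "tlds",
                    "branches": [
                        (lambda v: set(v) <= {"com", "net", "org", "info"}, "pykspa"),
                    ],
                }),
            ],
        }),
        (lambda v: v == "a-z,0-9", {
            "feature": "sld_length",
            "branches": [(lambda v: v == (12, 12), "qadars")],
        }),
    ],
}

sample = {"charset": "a-z,0-9", "sld_length": (12, 12), "tlds": ["com", "org"]}
print(classify(TOY_TREE, sample))  # qadars
```

A sample whose features match no branch yields `unknown`, which corresponds to the failed search handled in step eight.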

step four: the acquisition module acquires DNS protocol flow and HTTP protocol flow to obtain original flow data;

the acquisition module is used for acquiring network traffic; it can capture data directly from a network card or receive traffic data sent by other systems;

DNS protocol traffic consists of the requests that a computer device sends to a DNS server to resolve a domain name to its IP address, together with the resolution results returned by the DNS server; by collecting the DNS protocol traffic of the computer devices to be protected (i.e., the domain-name-resolution traffic they send and receive) and detecting whether the domain names they request to resolve were generated by some DGA algorithm, it is determined whether a computer device is infected with a malicious program such as a virus or trojan, and which DGA family type the malicious program is related to;

HTTP protocol traffic is used for recording the HTTP operations that the malicious program may go on to request after its resolution request for a malicious domain name has returned successfully (for example, downloading an updated version of the malicious program or uploading collected sensitive information to the C&C server), which facilitates subsequent risk detection;

step five: the protocol analysis module parses the DNS protocol traffic and the HTTP protocol traffic according to the protocol specifications and restores the original network behavior information, obtaining traffic data that the subsequent functional modules can process, namely source IP, destination IP, source port, destination port, domain name requested to be resolved, domain name resolution result, request time, HTTP request operation, returned information, and the like;

the protocol analysis module can parse the information of both communicating parties from the network traffic data according to the protocol specification, including source IP, destination IP, source port, destination port, request content and response information;
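As an illustration of parsing according to the protocol specification, the sketch below extracts the requested domain name from the question section of a raw DNS message, following the RFC 1035 layout (a 12-byte header, then length-prefixed labels ending in a zero byte). It is a simplified stand-in for the protocol analysis module: the function names are hypothetical, name compression is not handled, and IP addresses and ports are not covered because they come from the IP and UDP/TCP headers rather than from the DNS payload.

```python
import struct


def parse_dns_question(message: bytes):
    """Return (query_name, qtype) of the first question in a raw DNS message."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    _id, _flags, qdcount, _an, _ns, _ar = struct.unpack("!6H", message[:12])
    if qdcount < 1:
        raise ValueError("message carries no question")
    labels, offset = [], 12
    while True:
        length = message[offset]
        offset += 1
        if length == 0:               # zero-length label terminates the name
            break
        if length & 0xC0:
            raise ValueError("compressed names are not handled in this sketch")
        labels.append(message[offset:offset + length].decode("ascii"))
        offset += length
    qtype, _qclass = struct.unpack("!2H", message[offset:offset + 4])
    return ".".join(labels), qtype


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal A-record query, used here only to exercise the parser."""
    header = struct.pack("!6H", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in name.split(".")) + b"\x00"
    return header + qname + struct.pack("!2H", 1, 1)   # QTYPE=A, QCLASS=IN


print(parse_dns_question(build_query("hrxzdi.net")))
```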

step six: the protocol analysis module filters the domain names requested to be resolved by the computer devices to be protected (i.e., the computer devices in the enterprise internal network protected by the detection method) against a domain name whitelist library; if a domain name is found in the whitelist library, it is considered a normal, common domain name, its DGA family type is not detected, and processing continues with the next domain name; if the domain name is not found in the whitelist library, the domain name and the IP address of the computer that requested its resolution are sent to the detection module for the processing of step seven;

the domain name whitelist library stores, in a file or a relational database, common domain names that are known to pose no threat and to show no malicious behavior; if a domain name requested to be resolved by a computer device belongs to the whitelist library, the domain name is considered normal;
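The whitelist filtering of step six amounts to a set lookup. The sketch below assumes exact matching on the full, case-normalized domain name; whether subdomains of a whitelisted name should also pass is not stated in the text and is left out. The function name and the `(client_ip, domain)` record shape are illustrative.

```python
def filter_whitelisted(requests, whitelist):
    """Keep only (client_ip, domain) pairs whose domain is not in the whitelist."""
    allowed = {name.lower().rstrip(".") for name in whitelist}
    return [
        (client_ip, domain)
        for client_ip, domain in requests
        if domain.lower().rstrip(".") not in allowed
    ]


whitelist = ["example.com", "example.org"]
requests = [
    ("10.0.0.5", "example.com"),
    ("10.0.0.5", "hrxzdi.net"),
    ("10.0.0.6", "Example.ORG."),
]
# Only the non-whitelisted request is forwarded to the detection module.
print(filter_whitelisted(requests, whitelist))
```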

step seven: the detection module detects the DGA family type according to the rule, and specifically comprises the following sub-steps:

step (7A): the detection module loads the DGA family type classification rules created in step three and receives the computer IP (i.e., that of a computer device in the enterprise internal network protected by the detection method) and the domain names requested to be resolved by that computer, as sent in step six; the domain names that a computer requests to resolve within a period such as 10 minutes or half an hour are taken as one domain name set sample (if there are several computers in the enterprise internal network, the domain names requested by each computer within the period form a separate sample, and the samples are processed in turn); using the feature list obtained in step (2B), the detection module then calculates and summarizes the features of the domain name set sample, i.e., the features in the feature list: what the TLD list is, what the SLD length range is, what the SLD character value range is, and so on.
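The grouping of requests into per-computer samples can be sketched as below. Fixed, non-overlapping windows aligned to multiples of the window length are an assumption (the text only says "a period such as 10 minutes or half an hour"), as are the function name and the `(timestamp, client_ip, domain)` record shape.

```python
from collections import defaultdict


def group_into_samples(requests, window_seconds=600):
    """Group (timestamp, client_ip, domain) records into per-host, per-window domain sets."""
    samples = defaultdict(set)
    for timestamp, client_ip, domain in requests:
        # Align each request to the start of its fixed window.
        window_start = int(timestamp) // window_seconds * window_seconds
        samples[(client_ip, window_start)].add(domain)
    return dict(samples)


records = [
    (0, "10.0.0.5", "hrxzdi.net"),
    (599, "10.0.0.5", "llwfnz.info"),
    (600, "10.0.0.5", "tknutifsxwh.com"),
    (10, "10.0.0.6", "hrxzdi.net"),
]
for key, domains in sorted(group_into_samples(records).items()):
    print(key, sorted(domains))
```

Each resulting set is one domain name set sample whose features are then computed against the feature list.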

Step (7B): matching the sample features calculated in step (7A) against the DGA family type classification rules loaded by the detection module;

if the matching succeeds, the domain names in the domain name set sample were requested by a malicious program related to that DGA family, i.e., the malicious program requested their resolution in order to communicate with the C&C server, which in turn shows that the computer (i.e., the computer device in the enterprise internal network protected by the detection method) is infected with the malicious program; if the matching fails, step eight is executed;

step eight: if no DGA family type with matching features is found for the domain names requested to be resolved, there are two cases (whether a domain name is normal or malicious can be judged as follows: look the domain name up on the Internet with a search engine; check its registration information, record information and information change history; query the domain name on security websites for information related to network attacks or viruses; or check whether the domain name is safe with a third-party security detection tool):

(1) the domain name requested to be resolved is a normal domain name: the domain name whitelist library is supplemented and updated;

(2) the domain name requested to be resolved is a malicious domain name; there are three situations:

A. it belongs to an existing virus program, but the features of the domain names it requests are not covered by the rules created so far;

B. it belongs to a new variant of an existing virus program, or the virus program supplies different input parameters to an existing DGA algorithm, so that the features of the output domain name set change considerably;

C. it belongs to a completely new virus program, and no complete information about the features of the domain names it requests to resolve is yet available on the network.

In the above three situations, network security researchers can determine through reverse analysis which situation applies to the malicious program on the infected computer device, but there is no simple method for determining this directly; it is necessary to follow network security news and the updates published on the websites of network security companies, such as information on worms, trojans and botnets. In situation C, i.e., when a new type of virus or trojan appears that generates malicious domain names with a DGA algorithm different from the existing ones, steps one to three need to be repeated for that DGA algorithm: search for the related algorithm or virus samples; run the algorithm, or submit the virus sample to a sandbox for simulated execution and capture the network traffic to obtain the domain names it requests to resolve; thereby obtain the domain name set generated by the DGA algorithm that the virus sample calls; take the domain name set as a sample, select sample features, and create the rules anew.

In the present invention, the step one specifically includes the following substeps:

step (1A): searching the network for data related to a DGA algorithm, including program code (code of a specific computer programming language) for generating domain names, and pseudo code for describing the algorithm;

if the code related to the DGA algorithm is found on the network, then:

for the obtained program code used for generating the domain name, after the program code is operated, the domain name output by the program is obtained;

for the obtained pseudo code describing the algorithm, a specific computer programming language is used for writing the pseudo code into executable code, and then the executable code is operated to obtain an output domain name, or the domain name output when the algorithm is actually executed is estimated according to the description of the algorithm;

then jumping to the step (1D) for execution;

if no code related to the DGA algorithm is found on the network, continuing to execute the step (1B);

step (1B): searching a domain name sample generated by the DGA algorithm, wherein the domain name sample can show the main characteristics of the domain name generated by the DGA algorithm;

if the domain name sample generated by the DGA algorithm is found, jumping to the step (1D) for execution;

if the domain name sample generated by the DGA algorithm is not found, continuing to execute the step (1C);

step (1C): downloading a malicious program sample related to the DGA algorithm from a virus detection website or a network security website, running it in a sandbox, and capturing the network traffic during the run with a network sniffer (a traffic capture tool such as tcpdump or wireshark) to obtain the domain names that the sample requests to resolve, thereby acquiring the domain name sample; the malicious program sample is run in the sandbox several times so that the main characteristics of the domain names it requests to resolve are obtained;

a sandbox is a virtual system program that allows a browser or other program to be run in an isolated environment so that the changes made during the run can be deleted afterwards; it creates an independent operating environment in which a running program cannot permanently affect the hard disk; being an independent virtual environment, it can be used for testing untrusted applications or web browsing behavior;

step (1D): checking whether any known DGA algorithm still lacks a domain name set sample; if so, repeating from step (1A) for that algorithm; the loop stops once domain name set samples have been obtained for all currently known DGA algorithms.

In the present invention, in step one, the currently typical DGA family types include, but are not limited to, the following (in alphabetical order; the DGA family type names come from the internet): bamital, banjori, blackhole, chip, conficker, cryptolocker, dircrypt, dyre, emotet, bubble, gameover, gps, locker, madmax, matsnu, mirai, murofet, cures, nymaim, promaim, proslikefan, pykspa, qadars, ramnit, ranbyus, rovnix, shifu, simda, suppobox, symmi, tempedreve, tinba, tofsee, wvak, vidor virus.

In the present invention, in the third step, the rule creation specifically includes the following two ways:

mode (3A): selecting a classification algorithm, for example the C4.5 decision tree algorithm. A decision tree is a predictive model representing a mapping between object attributes and object values: each node in the tree represents an object, each branch represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root node to that leaf node. A decision tree has a single output; if several outputs are required, separate decision trees can be built to handle the different outputs. Decision trees are a frequently used technique in data mining and can be used both for analyzing data and for prediction; the machine learning technique that generates decision trees from data is called decision tree learning, or simply decision trees; decision tree algorithms come in several versions, such as ID3, C4.5 and CART;
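What distinguishes C4.5 from ID3 when choosing the attribute to split on is the gain ratio, i.e., information gain divided by the split information of the attribute. The sketch below computes it for categorical columns; it shows only this split criterion, not a full C4.5 implementation (no thresholds for continuous attributes, no pruning), and the four toy rows are invented purely to exercise the function.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, column):
    """C4.5 split criterion for a categorical column: information gain / split information."""
    total = len(rows)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[column], []).append(label)
    # Expected entropy of the class labels after splitting on this column.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy([row[column] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Invented toy rows: (SLD character value range, generation time period).
rows = [("a-z", "day"), ("a-z", "day"), ("a-z,0-9", "week"), ("a-z,0-9", "day")]
labels = ["pykspa", "pykspa", "qadars", "qadars"]
best = max(range(2), key=lambda col: gain_ratio(rows, labels, col))
print(best)  # index of the column with the highest gain ratio
```

In these toy rows the character value range separates the two families perfectly, so it would be chosen for the root split.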

learning the feature matrix calculated in the step (2C), and outputting a classification model which is a DGA family type classification rule finally required; for example, a decision tree model created by a decision tree algorithm according to the feature matrix;

in order to make the model (i.e., the classification rule) more accurate, several groups of training samples (i.e., the training samples prepared in step (2C)) are prepared and training is performed several times (i.e., the classification algorithm learns from different training samples each time) to obtain several models; each model is then evaluated on the test samples (i.e., the test samples prepared in step (2C)), and the model with the highest DGA family type identification accuracy on the test samples is finally selected;
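Selecting the most accurate of several trained models can be sketched as follows. Models are represented here as plain callables that map a feature dict to a family name; this representation, the function names and the two stand-in models are assumptions for illustration.

```python
def accuracy(model, test_samples):
    """Fraction of (features, family) test samples that the model labels correctly."""
    correct = sum(1 for features, family in test_samples if model(features) == family)
    return correct / len(test_samples)


def select_best_model(models, test_samples):
    """Return (best_model, best_accuracy); ties go to the earliest model."""
    scored = [(accuracy(model, test_samples), index) for index, model in enumerate(models)]
    best_accuracy, best_index = max(scored, key=lambda pair: (pair[0], -pair[1]))
    return models[best_index], best_accuracy


# Two stand-in "models": one looks at the character value range, one always answers pykspa.
by_charset = lambda f: "qadars" if "0-9" in f["charset"] else "pykspa"
always_pykspa = lambda f: "pykspa"

test_samples = [
    ({"charset": "a-z"}, "pykspa"),
    ({"charset": "a-z,0-9"}, "qadars"),
    ({"charset": "a-z,0-9"}, "qadars"),
]
best_model, best_accuracy = select_best_model([always_pykspa, by_charset], test_samples)
print(best_accuracy)
```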

mode (3B): converting the feature matrix obtained in step (2C) into a configurable form, that is, treating each feature in the feature list finally obtained in step (2B) as a configurable attribute and configuring one rule for each DGA algorithm type; these rules serve as the DGA family type classification rules;

for example, for the qadars algorithm exemplified in step (2C), the following rule may be configured (in the rule below, the character '#' indicates that the text after it on the current line is a comment; the text to the left of the character ':' is a feature name, i.e., a configurable attribute, and the text to the right of the character ':' is the value of that feature):

TLD: com, net, org # corresponding feature: TLD list

SLD SIZE: 12 # corresponding feature: SLD length range

VALUE SOURCE: a-z, 0-9 # corresponding feature: source of the character values in the SLD

ALPHA DIGIT SWITCH: 1 # corresponding feature: whether alphabetic and numeric characters switch within the SLD;

# a value of 1 here indicates that letter-to-digit switching occurs

MIN VOWEL RATIO: 95 # minimum ratio of SLDs containing vowel letters

COUNT: 1800/W # this type of algorithm generates 1800 different domain names per week; W indicates that the time period for generating domain names is one week

... # in embodiments, other possible features are configured as needed

DGA: qadars # DGA family type corresponding to the rule
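A minimal sketch of how a detection module might load such a rule, assuming every rule line has the form `feature name: value # comment` as described above. The function name `parse_rule` and the exact spelling of the keys are assumptions made for illustration.

```python
def parse_rule(text):
    """Parse 'NAME: value # comment' lines into a dict of feature name -> value."""
    rule = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the comment part
        if not line or ":" not in line:
            continue  # skip blank lines, comment-only lines and '...'
        name, value = line.split(":", 1)
        rule[name.strip()] = value.strip()
    return rule

rule_text = """
TLD: com, net, org        # TLD list
SLD SIZE: 12              # SLD length
VALUE SOURCE: a-z, 0-9    # character source
ALPHA DIGIT SWITCH: 1
# a comment-only line
COUNT: 1800/W
...                       # other features as needed
DGA: qadars
"""
rule = parse_rule(rule_text)
tlds = [t.strip() for t in rule["TLD"].split(",")]
count, period = rule["COUNT"].split("/")
print(rule["DGA"], tlds, int(rule["SLD SIZE"]), int(count), period)
```

Values are kept as strings at parse time so that each feature can be interpreted by its own logic, for example splitting `COUNT` into a number and a period code.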

After the rules are created, they are tested against the test samples prepared in step (2C) to measure how accurately the created rules classify actual samples. If the accuracy does not reach the preset threshold, the cause is analyzed and handled accordingly; depending on the specific cause, it is determined whether features need to be reselected, whether training samples need to be reselected, and so on. If the accuracy reaches the preset threshold, for example, the accuracy of identifying the DGA family type for the test samples reaches 95%, the rule creation is successful.

In the present invention, the matching method in the step (7B) is specifically as follows:

if the DGA family type classification rule is a decision tree, the sample features are matched against the tree; if the matching succeeds, an output value is obtained, and this output value is the DGA family type; otherwise, the matching fails;

if the DGA family type classification rule does not use a decision tree model, the detection module matches the sample features against all the rules (i.e., the rules configured in mode (3B)), calculates a matching degree during the matching process (for example, a score is used to represent the matching degree), and eliminates one by one the rules whose features do not match: if the matching degree of at least one rule exceeds the matching degree threshold, the rule with the highest matching degree is selected, and its DGA family type is the one that best matches the features of the domain name set (i.e., the sample features calculated in step (7A)); if no rule has a matching degree that exceeds the threshold, the matching fails; the matching degree threshold is set according to testing experience, for example, even the rule with the highest matching degree is required to have at least 90% of the domain names in the domain name set under detection match the features described by the rule.
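The matching procedure for configured rules can be sketched as follows. This is only an illustration under stated assumptions: the matching degree is taken to be the fraction of domain names in the set that satisfy a rule, the rule dictionaries and their field names are hypothetical, and only the TLD list, SLD length and character range are checked.

```python
import re

# Hypothetical rules in the spirit of mode (3B); field names are assumptions.
RULES = [
    {"family": "bamital", "tlds": ["co.cc", "cz.cc", "info", "org"],
     "sld_len": (32, 32), "charset": re.compile(r"[0-9a-f]+")},
    {"family": "qadars", "tlds": ["com", "net", "org"],
     "sld_len": (12, 12), "charset": re.compile(r"[a-z0-9]+")},
]

def domain_matches(domain, rule):
    """True if one domain satisfies every feature checked for the rule."""
    # Try longer TLDs first so that multi-label suffixes such as "co.cc" work.
    for tld in sorted(rule["tlds"], key=len, reverse=True):
        suffix = "." + tld
        if domain.endswith(suffix):
            sld = domain[:-len(suffix)]
            low, high = rule["sld_len"]
            return low <= len(sld) <= high and bool(rule["charset"].fullmatch(sld))
    return False

def match_family(domains, rules, threshold=0.9):
    """Return (family, degree) for the best rule, or (None, degree) below the threshold.

    Assumes `domains` is a non-empty list.
    """
    best_family, best_degree = None, 0.0
    for rule in rules:
        degree = sum(domain_matches(d, rule) for d in domains) / len(domains)
        if degree > best_degree:
            best_family, best_degree = rule["family"], degree
    if best_degree >= threshold:
        return best_family, best_degree
    return None, best_degree

sample = [
    "51bdc61022f0108b7053c5518ae87761.cz.cc",
    "b7422ac536814a6bc6af0cf574e5d60d.info",
    "00c58006323de055d35ef57ff97f8036.co.cc",
]
print(match_family(sample, RULES))
print(match_family(["example.com", "www.test.net"], RULES))
```

A set in which fewer than 90% of the names satisfy the best rule returns no family, which corresponds to the failed-match branch handled in step eight.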

The working principle of the invention is as follows: each typical botnet has its own DGA algorithm for generating the domain names through which the control end (C&C server) communicates with the bots on infected hosts, and different DGA algorithms produce domain name sets with different characteristics. Before communicating with the control end, a bot usually requests, within a short time, the resolution of many malicious domain names that differ markedly from ordinary domain names. If the domain names that a given computer device (identified by its IP address) requests to resolve are collected at the network ingress and egress and treated as a domain name set, the features of this set can be calculated and matched against the features of the domain names generated by known DGA algorithms. The DGA family with the highest matching degree is most likely the family of the bot that has infected the computer device, which facilitates subsequent attack tracing and bot removal.

Compared with the prior art, the invention has the beneficial effects that:

the method calculates the features of the large number of abnormal domain names that a bot requests within a short time and matches the result against the domain name feature rules generated from known DGA algorithms, so that the DGA family type of the bot infecting a given computer device in the current network is identified rapidly, which facilitates the subsequent tracing of network attacks as well as the cleanup of the bot and the adoption of remedial measures.

Drawings

FIG. 1 is a flow chart of the present invention for creating DGA family type classification rules.

FIG. 2 is a flow chart of the present invention for detecting the DGA family type to which a suspicious malicious domain name belongs.

FIG. 3 is an example of a DGA family classification rule model created using a decision tree, as described in the present invention.

Detailed Description

It should be noted that the method for detecting the DGA family to which a malicious domain name belongs is an application of computer technology in the field of information security. Implementing the invention involves several software functional modules. The applicant believes that a person skilled in the art, after reading the application documents carefully and accurately understanding the principle and purpose of the invention, can fully implement the invention by applying his or her own software programming skills in combination with the prior art; there is no risk that the invention cannot be understood or reproduced. The aforementioned software functional modules include but are not limited to the network traffic collection module, the network traffic protocol analysis module, the DGA family type classification rule creation module and the DGA family type detection module. These modules can be implemented in many specific ways, all of which fall within the scope of the invention, so they are not listed one by one.

The domain name whitelist library used in the invention can be stored in a text file or in a relational database management system (RDBMS) such as MySQL or Oracle.

The DGA family type classification rule created in the invention can be stored by using a text file, and can also be stored by using a relational database management system such as MySQL, Oracle and the like.

The decision tree C4.5 classification algorithm is an optional classification algorithm, and when the classification algorithm is used for creating rules in specific implementation, other classification algorithms can be selected according to actual conditions.

The results of protocol parsing (original network behavior information, source IP address, source port, destination IP address, destination port, the domain name requested to be resolved and the request time, the domain name resolution result, HTTP request operations, request times, return information, etc.) and information such as the DGA family type to which a malicious domain name belongs can be stored in a relational database management system such as MySQL or Oracle, or in a NoSQL (non-relational) database built on a distributed computing framework.

The invention is described in further detail below with reference to the following detailed description and accompanying drawings:

The rule-based method for detecting the DGA family to which a malicious domain name belongs, shown in FIG. 1 and FIG. 2, is used to quickly detect the DGA family type of the bot infecting a given computer device in the network, which facilitates the subsequent tracing of network attacks as well as the cleanup of the bot and the adoption of remedial measures. The detection method specifically comprises the following steps:

Step one: collect data related to DGA algorithms from the internet:

(1) Search the network for material related to a DGA algorithm, such as program code that generates domain names or pseudocode describing the algorithm, in order to obtain domain name samples generated by the algorithm.

(2) Run the program code obtained in step (1) (adding input parameters if necessary) and collect the domain names that the program outputs. If only pseudocode of the algorithm was obtained in step (1), executable code must first be written in a specific programming language; alternatively, the domain names that the algorithm would output when actually executed can be derived from the description of the algorithm.

(3) If no code related to the DGA algorithm is found, search for domain name samples generated by the DGA algorithm (the samples should represent the main features of the domain names generated by the algorithm as fully as possible).

(4) If neither step (2) nor step (3) yields domain name samples generated by the DGA algorithm, try to download a malicious program sample related to the DGA algorithm from a virus detection website or a network security website and run it in a sandbox. While it runs, use a network sniffer (e.g., a network traffic collection tool such as tcpdump or wireshark) to capture network traffic and thereby obtain the domain names that the sample requests to resolve. Because a DGA algorithm may generate different domain names on different dates, domain names collected from only a few sandbox runs may not be representative; the sample therefore needs to be run in the sandbox many times so that the collected domain names reflect the main features of the domain names that the malicious program requests to resolve.
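To illustrate why samples should be collected on several dates, the sketch below uses a deliberately simplified, hypothetical date-seeded generator. It does not reproduce any real DGA family; the hashing scheme, the function name `toy_dga` and the reserved `.example` TLD are assumptions chosen only to show that the output changes from day to day while staying reproducible for a given day.

```python
import hashlib
from datetime import date

def toy_dga(day, count=5, tld="example"):
    """Generate `count` pseudo-random names seeded by the date (toy scheme only)."""
    names = []
    for i in range(count):
        seed = f"{day.isoformat()}-{i}".encode()
        sld = hashlib.md5(seed).hexdigest()[:12]  # 12 hexadecimal characters
        names.append(f"{sld}.{tld}")
    return names

day1 = toy_dga(date(2017, 9, 20))
day2 = toy_dga(date(2017, 9, 21))
print(day1)
print(len(set(day1) & set(day2)))  # overlap between the two days
```

A sample set collected on a single day would therefore cover only part of what such a generator can output, which is the reason for repeating the collection.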

(5) Repeat the above steps to obtain domain name samples for DGA family types including but not limited to the following (in alphabetical order; the family names are taken from the internet): bamital, banjori, blackhole, chip, conficker, cryptolocker, dircrypt, dyre, emotet, bubble, gameover, gps, locker, madmax, matsnu, mirai, murofet, necurs, nymaim, promaim, proslikefan, pykspa, qadars, ramnit, ranbyus, rovnix, shifu, simda, suppobox, symmi, tempedreve, tinba, tofsee, vawtrak, vidor virus.

Step two: selecting domain name set characteristics to be analyzed:

(6) For each DGA algorithm in step (5), take the domain name set it generates within a time period (e.g., the 100 domain names generated per day by the proslikefan algorithm) as one sample, i.e., one domain name set is one sample. List the domain name set features that may need to be analyzed, i.e., the features that distinguish the domain names in the sample from ordinary domain names and from domain names generated by other DGA algorithms. These may include but are not limited to the following features:

the range of SLD lengths; for example, a single fixed length, several possible fixed lengths, or an interval;

the range of character values in the SLD string; for example, the 26 letters; or certain fixed letters combined in random order together with any of the numeric characters '0' to '9'; or only the hexadecimal characters '0' to '9' and 'a' to 'f';

whether the SLD contains vowel letters;

the minimum and maximum proportions of vowel letters in the SLD;

whether the SLD contains numeric characters;

the minimum and maximum proportions of numeric characters in the SLD;

whether the SLDs share a common character string; if so, whether the value of the common string is constant or determined by the input parameters of the algorithm, and whether its position within the SLD is fixed;

whether alphabetic and numeric characters switch frequently within the SLD; if so, the minimum and maximum numbers of switches, or the minimum and maximum values of the ratio of the number of switches to the SLD length;

the minimum ratio of the number of SLDs containing at least one vowel to the total number of domain names in the sample set;

the value of the TLD: whether it is a fixed value, is selected at random from several fixed TLDs, or is not fixed and is determined by the input parameters of the algorithm;

the number of domain names generated by the algorithm in one time period;

the time period over which the algorithm generates domain names.
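Several of the features listed above can be computed directly from a domain name set. The following is a minimal sketch, assuming each name has the form `<SLD>.<TLD>` with a single-label TLD and a non-empty SLD; the dictionary keys are hypothetical names, and features such as the common substring or the generation period are left out.

```python
VOWELS = set("aeiou")

def count_switches(sld):
    """Number of adjacent positions where a letter meets a digit or vice versa."""
    return sum(1 for a, b in zip(sld, sld[1:])
               if (a.isalpha() and b.isdigit()) or (a.isdigit() and b.isalpha()))

def domain_set_features(domains):
    """Compute a few of the listed features for one domain name set."""
    slds, tlds = [], set()
    for d in domains:
        sld, _, tld = d.lower().partition(".")  # split at the first dot
        slds.append(sld)
        tlds.add(tld)
    lengths = [len(s) for s in slds]
    vowel_ratios = [sum(c in VOWELS for c in s) / len(s) for s in slds]
    digit_ratios = [sum(c.isdigit() for c in s) / len(s) for s in slds]
    switches = [count_switches(s) for s in slds]
    return {
        "sld_len_range": (min(lengths), max(lengths)),
        "charset": "".join(sorted(set("".join(slds)))),
        "vowel_ratio_range": (min(vowel_ratios), max(vowel_ratios)),
        "digit_ratio_range": (min(digit_ratios), max(digit_ratios)),
        "switch_range": (min(switches), max(switches)),
        # Share of SLDs in the set that contain at least one vowel.
        "share_with_vowel": sum(r > 0 for r in vowel_ratios) / len(slds),
        "tlds": sorted(tlds),
    }

features = domain_set_features(["a1b2c3.com", "abcdef.net", "123456.org"])
for name, value in features.items():
    print(name, value)
```

Each value summarizes the whole set rather than a single domain, which matches the idea of treating one domain name set as one sample.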

(7) Repeat step (6) for each DGA algorithm and take the set union of the features listed for all the algorithms to form the final feature list.

(8) Using the feature list obtained in step (7), divide the domain name sets generated by each DGA algorithm (several samples are selected for each DGA algorithm, and at least two) into two parts: one part serves as training samples for creating the rules, and the other part serves as test samples for testing the identification accuracy of the rules. For each feature in the feature list finally obtained in step (7), calculate or summarize the value of the feature for the training samples; if the feature list contains N features, the values of N features are calculated or summarized for each domain name set. After the DGA family type to which each domain name set belongs is added, M training samples yield a feature matrix with M rows and N+1 columns.
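The split into training and test samples and the construction of the M x (N+1) feature matrix can be sketched as below. The half-and-half split, the stand-in feature function `simple_features` and the family labels are assumptions for illustration; any function that maps a domain name set to the N feature values could be plugged in.

```python
def split_samples(samples):
    """Split one family's domain name sets into (training, test) parts.

    Requires at least two samples so that both parts are non-empty.
    """
    if len(samples) < 2:
        raise ValueError("need at least two domain name sets per DGA algorithm")
    middle = (len(samples) + 1) // 2
    return samples[:middle], samples[middle:]

def build_feature_matrix(training, feature_names, compute_features):
    """Return M rows of N feature values plus the family label (N+1 columns).

    training: list of (family, domain_set) pairs
    compute_features: callable mapping a domain set to a dict of feature values
    """
    matrix = []
    for family, domain_set in training:
        values = compute_features(domain_set)
        matrix.append([values[name] for name in feature_names] + [family])
    return matrix

def simple_features(domains):
    """Minimal stand-in feature function, for illustration only."""
    lengths = [len(d.split(".", 1)[0]) for d in domains]
    return {"min_len": min(lengths), "max_len": max(lengths)}

samples = {
    "family_a": [["aaaa.com", "bbbbbb.com"], ["cccc.com"], ["dd.com"]],
    "family_b": [["1234567890ab.net"], ["abcdefabcdef.net"]],
}
training, test = [], []
for family, sets in samples.items():
    train_part, test_part = split_samples(sets)
    training += [(family, s) for s in train_part]
    test += [(family, s) for s in test_part]

matrix = build_feature_matrix(training, ["min_len", "max_len"], simple_features)
for row in matrix:
    print(row)
```

Here M is 3 and N is 2, so every row has three columns, the last of which is the family label used as the training target.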

Step three: the rule creation function creates DGA family type classification rules according to the feature matrix of the training samples:

(9) Select a classification algorithm, for example the decision tree C4.5 algorithm, and let it learn the feature data calculated in step (8); the decision tree model that the algorithm creates from the feature data is the DGA family type classification rule finally required. To make the model more accurate, multiple groups of samples can be prepared and training and testing carried out multiple times; the rule with the highest identification accuracy in testing is finally selected.

If a classification algorithm such as a decision tree is not used, the feature matrix obtained in step (8) can be converted into a configurable form: each feature serves as a configurable attribute, and one rule is configured for each DGA algorithm type. When rules are created in this way, the test samples are used to test the accuracy of the created rules, and the thresholds of some attributes are adjusted accordingly. In a specific implementation, a rule may be described using a syntax similar to the following (in the rule below, the character '#' indicates that the text after it on the current line is a comment; the text to the left of the character ':' is a feature name, i.e., a configurable attribute, and the text to the right of the character ':' is the value of that feature):

DGA: bamital # DGA family type

TLD: co.cc, cz.cc, info, org # value range of the TLD; the TLD of each domain name output by the algorithm is one of these four TLDs

SLD SIZE: 32 # the length of the SLD part of the domain name is fixed at 32 characters

COUNT: 104/D # this type of algorithm generates 104 different domain names per day; D indicates that the time period for generating domain names is one day

ALPHA DIGIT SWITCH: 1 # alphabetic and numeric characters switch frequently within the SLD

VALUE SOURCE: HASH # the SLD value consists of hash characters (i.e., '0'-'9' and 'a'-'f')

HASH POS: 0 # start position of the hash string within the SLD; 0 indicates that it starts at the first character of the SLD

HASH LENGTH: 32 # length of the hash string in the SLD

... # in an embodiment, other possible features are configured as needed

EXP: 51bdc61022f0108b7053c5518ae87761.cz.cc, b7422ac536814a6bc6af0cf574e5d60d.info, 00c58006323de055d35ef57ff97f8036.co.cc, 9 bbf4817211f069d3befe28af3e0ebf.org # domain name samples

After the rules are created, the rules are stored in a configuration file or a relational database and are loaded and used by a detection module.

(10) After the rules are created, they are tested with the test samples prepared in step (8) to measure how accurately the created rules classify actual samples. If the accuracy does not meet expectations, the cause is analyzed, and depending on the specific cause it is determined whether features need to be reselected, whether training samples need to be reselected, and so on. If the accuracy meets expectations, for example, the accuracy of identifying the DGA family type for the test samples reaches 95%, the rule creation is successful.
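The acceptance test in step (10) amounts to comparing the identification accuracy on the test samples with an expected threshold. A minimal sketch follows, in which the classifier `classify_by_tld` and the sample data are hypothetical and serve only to exercise the two functions.

```python
def accuracy(classify, test_samples):
    """Fraction of test samples whose predicted family equals the true family."""
    correct = sum(1 for family, domains in test_samples
                  if classify(domains) == family)
    return correct / len(test_samples)

def rules_accepted(classify, test_samples, threshold=0.95):
    """True when the identification accuracy reaches the expected threshold."""
    return accuracy(classify, test_samples) >= threshold

def classify_by_tld(domains):
    """Hypothetical classifier used only for this illustration."""
    return "family_a" if all(d.endswith(".com") for d in domains) else "family_b"

test_samples = [
    ("family_a", ["x1.com", "y2.com"]),
    ("family_b", ["z3.net"]),
    ("family_b", ["q4.com"]),  # misclassified by the toy classifier
    ("family_a", ["w5.com"]),
]
print(accuracy(classify_by_tld, test_samples))        # 0.75
print(rules_accepted(classify_by_tld, test_samples))  # False
```

With an accuracy of 0.75 the rules would be rejected, and the features or training samples would need to be revisited as the step describes.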

Step four: the acquisition module acquires DNS protocol and HTTP protocol traffic to obtain original traffic data.

DNS traffic is collected to capture the domain name resolution traffic sent and received by the computer devices protected by the detection method; by examining the domain names that these computers request to resolve and determining whether they were generated by some DGA algorithm, it can be judged whether a computer device is infected with a malicious program (i.e., a virus or trojan) and which DGA family type the malicious program belongs to. HTTP traffic is collected to record the HTTP operations that a malicious program may go on to request after it has been detected requesting resolution of a malicious domain name and the resolution has succeeded (for example, downloading an updated version of the malicious program or uploading collected sensitive information to the C&C server), which facilitates subsequent risk detection.

Step five: the protocol analysis module parses the DNS protocol traffic and the HTTP protocol traffic according to the protocol specifications, generates the original network behavior information, and obtains traffic data that the subsequent functional modules can process, namely the source IP, destination IP, source port, destination port, domain name requested to be resolved, domain name resolution result, request time, HTTP request operation, return information, and so on.
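As one concrete piece of such protocol parsing, the sketch below extracts the queried domain name from the payload of a raw DNS query, following the header and question layout of RFC 1035. It is a simplified illustration: it reads only the first question, does not handle name compression pointers, and the function name `parse_dns_question` is an assumption.

```python
import struct

def parse_dns_question(packet):
    """Extract (transaction id, queried name, query type) from a raw DNS query.

    Handles the uncompressed name encoding normally found in the question
    section of a query; compression pointers are not supported.
    """
    txid, flags, qdcount = struct.unpack("!HHH", packet[:6])
    if qdcount < 1:
        raise ValueError("no question in packet")
    offset, labels = 12, []  # the DNS header is 12 bytes long
    while True:
        length = packet[offset]
        offset += 1
        if length == 0:  # a zero-length label ends the name
            break
        if length & 0xC0:
            raise ValueError("compressed names are not handled in this sketch")
        labels.append(packet[offset:offset + length].decode("ascii"))
        offset += length
    qtype, qclass = struct.unpack("!HH", packet[offset:offset + 4])
    return txid, ".".join(labels), qtype

# Hand-built query for "example.com", type A (1), class IN (1).
query = (struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
         + b"\x07example\x03com\x00"
         + struct.pack("!HH", 1, 1))
print(parse_dns_question(query))  # (4660, 'example.com', 1)
```

The source IP, ports and request time mentioned in the step come from the IP and UDP layers and the capture timestamp, which lie outside this payload-level sketch.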

Step six: the protocol analysis module filters the domain names that computer devices in the enterprise request to resolve against the domain name whitelist library; that is, if a domain name that a computer device in the enterprise requests to resolve is found in the domain name whitelist library, it is considered a normal, common domain name, and the DGA family type to which it belongs is not detected.
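The whitelist filter can be sketched as a simple set lookup. Treating subdomains of a whitelisted name as whitelisted is an assumption of this sketch, not something the text specifies, and the function name `filter_whitelisted` is hypothetical.

```python
def filter_whitelisted(queries, whitelist):
    """Keep only (source_ip, domain) pairs whose domain is not whitelisted.

    A domain counts as whitelisted when it equals a whitelist entry or is a
    subdomain of one; this suffix rule is an assumption of the sketch.
    """
    def is_whitelisted(domain):
        labels = domain.lower().rstrip(".").split(".")
        # Check "a.b.c", then "b.c", then "c" against the whitelist.
        return any(".".join(labels[i:]) in whitelist for i in range(len(labels)))
    return [(ip, d) for ip, d in queries if not is_whitelisted(d)]

whitelist = {"example.com", "example.org"}
queries = [
    ("10.0.0.5", "www.example.com"),
    ("10.0.0.5", "51bdc61022f0108b7053c5518ae87761.cz.cc"),
    ("10.0.0.6", "example.org"),
]
print(filter_whitelisted(queries, whitelist))
```

Only the pairs that survive the filter are passed on, together with the requesting IP address, to the detection module.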

Step seven: the detection module detects the DGA family type according to the rule:

(11) After the rules are created, the detection module loads and uses them. When the detection module finds in actual traffic that a computer frequently requests resolution of suspicious domain names, the detection program takes the domain names that the computer requests to resolve within a period of time, such as 10 minutes or half an hour, as a sample and calculates the features of the sample using the feature list obtained in step (7).
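Forming a domain name set sample per computer and per time window can be sketched as follows. The fixed, non-overlapping 10-minute windows and the `min_domains` cut-off used to decide what counts as "frequent" are assumptions for illustration; the text leaves both to the deployment.

```python
from collections import defaultdict

def group_by_ip_and_window(records, window_seconds=600):
    """Group queried domains by source IP and fixed time window.

    records: iterable of (timestamp_seconds, source_ip, domain)
    Returns {(source_ip, window_index): set_of_domains}.
    """
    groups = defaultdict(set)
    for ts, ip, domain in records:
        groups[(ip, int(ts // window_seconds))].add(domain)
    return dict(groups)

def suspicious_samples(groups, min_domains=3):
    """Keep windows in which one host requested many distinct domains."""
    return {key: doms for key, doms in groups.items() if len(doms) >= min_domains}

records = [
    (0, "10.0.0.5", "a.com"),
    (100, "10.0.0.5", "b.com"),
    (599, "10.0.0.5", "c.com"),
    (600, "10.0.0.5", "d.com"),  # falls into the next 10-minute window
    (50, "10.0.0.6", "a.com"),
]
groups = group_by_ip_and_window(records)
print(suspicious_samples(groups))
```

Each retained set is one domain name set sample whose features are then calculated and matched against the loaded rules.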

(12) Match the sample features calculated in step (11) against the loaded rules.

If the classification rule is a decision tree, matching the rule yields an output value, which is the DGA family type; if the matching fails, proceed to step eight.

If the classification rule does not use a decision tree model, the detection program matches the sample features against all the rules and calculates a matching degree during the matching process (for example, a score is used to represent the matching degree), eliminates one by one the rules whose features do not match, and finally selects the rule with the highest matching degree, i.e., the DGA family type that best matches the features of the domain name set. The matching degree threshold can be set according to testing experience; for example, even the rule with the highest matching degree is required to have at least 90% of the domain names in the domain name set under detection match the features described by the rule.

Step eight: handling the case in which no DGA family type is detected.

If no DGA family type with matching features is found for the domain names requested to be resolved, there are two possible cases:

(13) The domain names requested to be resolved are normal domain names; in this case, the domain name whitelist library is supplemented and updated.

(14) The domain names requested to be resolved are malicious domain names; this is divided into three cases:

A. This is an existing virus program, but the features of the domain names it requests are not covered by the rules that have been created.

B. This is a new variant of an existing virus program, or a virus program that passes different input parameters to an existing DGA algorithm, so that the features of the output domain name set change considerably.

For the above three cases, it is necessary to regularly follow network security news and the updates published by network security companies (information related to worms, trojans and botnets). If a new type of virus or trojan appears that uses a DGA algorithm different from the existing ones to generate malicious domain names, that DGA algorithm must be processed according to steps (1) to (10): search for the relevant algorithm or virus samples; run the algorithm, or submit the virus samples to a sandbox for simulated execution (capturing network traffic and obtaining the domain names that the virus samples request to resolve); obtain the domain name set generated by the DGA algorithm that the virus samples call; take the domain name set as samples; select sample features; and recreate the rules.

In the above detection method, steps one to three require the participation of a data analyst: collecting data related to the samples (domain name sets), obtaining the samples, selecting features, training with the training samples, and testing with the test samples. Steps one and eight require the participation of network security researchers, since they involve tracing and analyzing network attacks and analyzing malicious program samples.

The method for detecting the DGA family type of a malicious domain name can be deployed as a functional module or subsystem of a network security detection system, such as an APT intrusion detection system, and is generally deployed at the ingress and egress of an enterprise network to monitor and analyze the network traffic of the whole enterprise.

Finally, it should be noted that the above is only a specific embodiment of the present invention. Obviously, the present invention is not limited to the above embodiment, and many variations are possible. All variations that a person skilled in the art can derive directly from, or conceive on the basis of, the disclosure of the present invention shall be considered within the scope of protection of the invention.

Claims

1. A rule-based method for detecting the DGA family to which a malicious domain name belongs, used for analyzing and detecting malicious domain names and identifying the DGA family of the virus infecting an attacked computer device in a network, characterized in that the method comprises the following steps:

step (2A): taking the domain name set generated by each DGA algorithm in the step one in a time period as a sample, and listing the domain name features to be analyzed;

step (2B): obtaining domain name features to be analyzed of each sample through the step (2A), and then performing union set operation on the domain name features to finally obtain a feature list;

step (2C): dividing at least two domain name sets generated by each DGA algorithm into two parts: one part of the training samples is used as training samples, and the number of the training samples is set as M; the other part is used as a test sample;

respectively calculating the values of the features aiming at M training samples, and if N features exist in the feature list obtained in the step (2B), calculating the values of the N features for each sample, and adding the DGA family type to which each sample belongs, wherein M training samples form a feature matrix with M rows and N +1 columns after calculation; wherein M, N is a natural number greater than zero;

step three: the rule creating function creates a DGA family type classification rule according to the feature matrix of the training sample;

the detection module is a program with a DGA family type detection function, and is a program capable of detecting the DGA family type of a specific domain name set sample; for the input feature value of a domain name set sample, the detection module judges the DGA family type of the domain name set sample according to the established DGA family type classification rule or other conditions; the other conditions are that the domain name requested to be analyzed does not find a DGA family type with matched features, and the domain name is a normal domain name or a malicious domain name;

the DNS protocol traffic refers to the requests that a computer device sends to a DNS server in order to obtain the IP address corresponding to a domain name, together with the domain name resolution results returned by the DNS server; by collecting the DNS protocol traffic of the computer devices that need protection and detecting whether the domain names that these computers request to resolve are generated by a DGA algorithm, it is judged whether a computer device is infected with a malicious program and which DGA family type the malicious program belongs to;

the HTTP protocol flow is used for recording HTTP operations which are possibly requested after the malicious program is detected to request to analyze the malicious domain name and return successfully, so that subsequent risk detection is facilitated;

step five: the protocol analysis module analyzes DNS protocol flow and HTTP protocol flow according to the protocol specification, original network behavior information is restored, and flow data which can be processed by the subsequent functional module is obtained;

step six: the protocol analysis module filters the domain names that the computer devices needing protection request to resolve against a domain name whitelist library; if a domain name is found in the domain name whitelist library, it is considered a normal, common domain name, the DGA family type to which it belongs is not detected, and processing continues with the next domain name; if the domain name is not found in the domain name whitelist library, the domain name and the IP address of the computer requesting its resolution are sent to the detection module, and the processing of step seven is carried out;

step (7A): the detection module loads and uses the DGA family type classification rule established in the third step, receives the computer IP sent in the sixth step and the domain name requested to be analyzed by the computer, and takes a plurality of domain names requested to be analyzed by the computer in a period of time as a domain name set sample, and calculates and summarizes the characteristics of the domain name set sample by combining the characteristic list obtained in the step (2B); the period of time for forming the domain name set sample is predetermined according to conditions and requirements inside the network;

if the matching is successful, the domain name included in the domain name set sample is requested to be resolved by the malicious program related to the DGA family, and further the computer is infected with the malicious program; if the matching fails, continuing to execute the step eight;

step eight: if no DGA family type with matched features is found according to the domain name requested to be resolved, the following two cases are divided:

2. The method for detecting the DGA family to which the malicious domain name belongs based on the rules of claim 1, wherein the step one comprises the following sub-steps:

step (1A): searching data related to a certain DGA algorithm on a network, wherein the data comprises program codes used for generating domain names and pseudo codes describing the algorithm;

if the code related to the DGA algorithm is found on the network, then:

then jumping to the step (1D) for execution;

step (1B): searching a domain name sample generated by the DGA algorithm;

step (1C): downloading a malicious program sample related to the DGA algorithm from a virus detection website or a network security website, then putting the malicious program sample into a sandbox for operation, capturing network traffic by using a network sniffer in the operation process to obtain a domain name which is requested to be analyzed by the malicious program sample in the operation process, acquiring a domain name sample, and operating the malicious program sample in the sandbox for multiple times to obtain the main characteristics of the domain name which is requested to be analyzed by the malicious program sample;

3. The method according to claim 1, wherein in step one, the current typical DGA family types include: bamital, banjori, blackhole, chip, conficker, cryptolocker, dircrypt, dyre, emotet, bubble, gameover, gps, locker, madmax, matsnu, mirai, murofet, necurs, nymaim, promaim, proslikefan, pykspa, qadars, ramnit, ranbyus, rovnix, shifu, simda, suppobox, symmi, tempedreve, tinba, tofsee, vawtrak, vidor virus.

4. The method according to claim 1, wherein in the third step, the rule creation specifically includes the following two ways:

mode (3A): selecting a classification algorithm, learning the feature matrix calculated in the step (2C), and outputting a classification model which is a finally required DGA family type classification rule;

in order to make the model more accurate, a plurality of groups of training samples can be prepared and trained for a plurality of times to obtain a plurality of models, then the models are respectively used for testing the test samples, and finally the model with the highest DGA family type identification accuracy rate after testing is selected;

after the rules are established, testing the test samples prepared in the step (2C), and testing the accuracy of classifying the actual samples by the established rules; if the accuracy rate does not reach the expected set effect threshold value, analyzing the reason and correspondingly processing; if the accuracy reaches the expected set effect threshold, the rule creation is successful.

5. The method for detecting the DGA family to which the malicious domain name belongs based on the rule of claim 1, wherein the matching method in the step (7B) is specifically as follows:

if the DGA family type classification rule does not use a decision tree model, the detection module uses sample characteristics to match all rules, and in the matching process, the matching degree is calculated, and the rules with unmatched characteristics are eliminated one by one: if the matching degree of at least one rule exceeds the threshold value of the matching degree, selecting the rule with the highest matching degree, namely the DGA family type which is most matched with the domain name set characteristics; and if the matching degree of none of the rules exceeds the threshold value of the matching degree, the matching is failed.