CN110177123B - Botnet detection method based on DNS mapping association graph

Info

Publication number: CN110177123B
Application number: CN201910534665.3A
Authority: CN
Inventors: 张小松; 牛伟纳; 熊智鹏; 谢鑫; 蒋天宇; 葛洪麟
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2019-06-20
Filing date: 2019-06-20
Publication date: 2020-09-18
Anticipated expiration: 2039-06-20
Also published as: CN110177123A

Abstract

The invention relates to a botnet detection method based on a DNS mapping association graph, comprising the following steps: A. filtering DNS traffic to retain the response packet traffic containing A records, and preprocessing the filtered response packet traffic; B. extracting the association mappings from the preprocessed response packet traffic using the full domain name and the IP respectively as keys, constructing bipartite graph component sets with the full domain name and the IP respectively as central nodes, and merging the graph components within each bipartite graph component set; C. analyzing the elements of the bipartite graph component sets and extracting graph feature vectors; D. taking published Fast-flux and Domain-flux botnet datasets as input, executing steps A to C, dividing the data into a training set and a test set according to the extracted graph feature vectors, and obtaining a classification model using the LightGBM algorithm; E. applying the classification model to detect botnets in the traffic under test. The method can detect Fast-flux and Domain-flux botnets simultaneously and has high detection accuracy.

Description

Botnet detection method based on DNS mapping association graph

Technical Field

The invention relates to a network security detection method, in particular to a botnet detection method based on a DNS mapping association graph.

Background

A botnet is a comprehensive attack platform developed on the basis of traditional malicious code such as computer viruses, Trojan horses, network worms and spyware. In recent years, new botnet programs have also used techniques such as zero-day vulnerabilities, phishing and P2P to propagate, so that traditional hosts, mobile devices, and even cloud devices and routers are infected and become hosts controlled by the botnet, commonly called 'broilers' (bots). Botnets remain one of the major threats to the Internet and continue to evolve. Detecting botnets has become a significant challenge for security practitioners.

DNS stands for Domain Name System, an important service of the Internet. It is a distributed database that maps domain names and IP addresses to each other; like a phone book, it records the correspondence between domain names and IP addresses.

Fast-flux is a technique in which the association between a domain name and its IP addresses changes constantly. A network deployed with this technique is called a Fast-flux Service Network (FFSN). By constantly changing DNS records, an FFSN can assign many (even thousands of) IP addresses to a legitimate domain name, thereby guaranteeing high availability of that domain name.

Domain-flux refers to the botnet controller dynamically changing domain names to avoid detection. Its key component is the domain generation algorithm (DGA), which randomly generates a large number of domain names from a seed; zombie hosts then issue DNS requests for them one by one to attempt a connection, and only some of these requests receive a response. Because the communication node between the attacker and the controlled hosts also changes dynamically, detection can be evaded effectively.

Disclosure of Invention

The invention provides a botnet detection method based on a DNS mapping association graph, which aims at the detection of Fast-flux and Domain-flux botnets and has higher detection accuracy.

The invention relates to a botnet detection method based on a DNS mapping association graph, which comprises the following steps:

A. DNS traffic filtering and preprocessing: from a traffic mirror of the equipment at the egress of the network under test, DNS traffic is filtered according to preset rules to retain the response packet traffic containing A records (an A (Address) record gives the IP address corresponding to a specified host name or domain name), and the filtered response packet traffic is then preprocessed;

B. Graph mapping association processing: from the preprocessed response packet traffic, the association mappings in the DNS query responses are extracted using the full domain name (FQDN, Fully Qualified Domain Name) and the IP respectively as keys; bipartite graph component sets are constructed with the full domain name and the IP respectively as central nodes, and the graph components within each bipartite graph component set are merged;
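As a non-limiting illustration of step B, the following minimal sketch builds the two bipartite graph component sets from (full domain name, IP) pairs. The function name `build_components`, the star-shaped dictionary representation and the sample records are assumptions made for this example; the merging of similar components is not shown.

```python
from collections import defaultdict


def build_components(records):
    """Build two sets of star-shaped bipartite components from (fqdn, ip) pairs.

    Hypothetical helper: each component is stored as central node -> set of
    neighbour nodes, once keyed by full domain name and once keyed by IP.
    """
    by_fqdn = defaultdict(set)  # central node: full domain name, neighbours: IPs
    by_ip = defaultdict(set)    # central node: IP, neighbours: full domain names
    for fqdn, ip in records:
        fqdn = fqdn.rstrip(".").lower()  # normalise case and trailing dot
        by_fqdn[fqdn].add(ip)
        by_ip[ip].add(fqdn)
    return dict(by_fqdn), dict(by_ip)


# Sample DNS query responses (documentation address range, invented for the example).
records = [
    ("a.example.com", "192.0.2.1"),
    ("a.example.com", "192.0.2.2"),
    ("B.example.com.", "192.0.2.1"),
]
fqdn_components, ip_components = build_components(records)
```

Keeping both views side by side means that a domain resolving to many IPs and an IP serving many domains each appear as a high-degree central node in one of the two sets.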

C. Graph component feature analysis and extraction: the elements of the bipartite graph component sets are analyzed, and graph feature vectors are extracted in combination with the information obtained by preprocessing;

D. Graph component classification: published Fast-flux and Domain-flux botnet datasets are taken as input and steps A to C are executed; the data are standardized according to the extracted graph feature vectors, the standardized data are divided into a training set and a test set, and a classification model is obtained using the LightGBM algorithm. LightGBM is a fast, high-performance, distributed gradient boosting framework open-sourced by Microsoft in 2017 that can be used for machine learning tasks such as ranking, classification and regression. It is based on decision tree algorithms and splits leaf nodes using a leaf-wise (best-first) growth strategy; without reducing accuracy it is about 10 times faster than mainstream classification algorithms, and it uses about one third of the memory.

E. The information of the traffic under test is input into the classification model, which determines whether the traffic is malicious; if it is, the model further determines the category of the malicious traffic (Fast-flux or Domain-flux botnet).

Through tests, the method can cover the detection of Fast-flux and Domain-flux two types of botnets simultaneously, and has higher detection accuracy.

Further, the preprocessing in step A includes performing secondary filtering on the A-record response packet traffic according to whitelists of full domain names and IPs, and, using the traffic timestamp as the ID, extracting several fields from each A record, including the timestamp, source MAC address, destination MAC address, source IP, destination IP, TTL value, source port, destination port and the like.
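By way of example only, the secondary filtering and field extraction can be sketched as follows, assuming the A-record responses have already been parsed into dictionaries. The whitelist contents, the key names (`fqdn`, `answer_ip`, `src_mac` and so on) and the helper `make_record` are hypothetical and chosen for this sketch.

```python
# Hypothetical whitelists; a real deployment would load these from a maintained list.
FQDN_WHITELIST = {"www.example.com"}
IP_WHITELIST = {"192.0.2.53"}

# Fields named in the text, under assumed dictionary keys.
FIELDS = ("timestamp", "src_mac", "dst_mac", "src_ip", "dst_ip",
          "ttl", "src_port", "dst_port")


def preprocess(a_records):
    """Drop whitelisted records and index the remaining ones by timestamp."""
    kept = {}
    for rec in a_records:
        if rec["fqdn"] in FQDN_WHITELIST or rec["answer_ip"] in IP_WHITELIST:
            continue  # secondary filtering: known-good domain or address
        row = {name: rec[name] for name in FIELDS}
        row["fqdn"] = rec["fqdn"]
        row["answer_ip"] = rec["answer_ip"]
        kept[rec["timestamp"]] = row  # the timestamp serves as the record ID
    return kept


def make_record(timestamp, fqdn, answer_ip, ttl):
    """Build one parsed A-record response with placeholder link-layer fields."""
    return {
        "timestamp": timestamp, "fqdn": fqdn, "answer_ip": answer_ip, "ttl": ttl,
        "src_mac": "00:00:5e:00:53:01", "dst_mac": "00:00:5e:00:53:02",
        "src_ip": "198.51.100.53", "dst_ip": "203.0.113.9",
        "src_port": 53, "dst_port": 40000,
    }


parsed = [
    make_record(1560988800.1, "www.example.com", "192.0.2.10", 300),
    make_record(1560988800.2, "xj3k.example.net", "198.51.100.7", 60),
]
kept = preprocess(parsed)
```

If two packets can share a timestamp, the ID would need a tie-breaker such as a sequence counter; the sketch ignores this case.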

Furthermore, when the graph components in the bipartite graph component sets are merged in step B, the set centered on the full domain name and the set centered on the IP are each merged in its own corresponding manner.

Specifically, when graph components with full domain names as central nodes are merged, the difference DD between similar domain names is first calculated according to the hierarchical structure of the full domain names, and similar graph components are then merged using the k-means clustering algorithm, wherein the difference DD between similar domain names is calculated as follows:

where ω_λ is an intermediate value of the domain name difference calculation; λ is the level of the domain name; X and Y each denote a full domain name; X_λ denotes the λ-th level of the full domain name X and Y_λ the λ-th level of the full domain name Y (for example, for the full domain name www.baidu.com the first level is com); |X_λ| and |Y_λ| denote the lengths of X_λ and Y_λ; |X| and |Y| denote the number of levels of X and Y; α is a preset parameter serving as a balancing weight, initialized to the empirical value 2 and adjustable afterwards for optimization; and dd_λ and Ω are intermediate values of the calculation.
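The notation above can be made concrete with a short sketch. It only illustrates how X_λ, |X_λ| and |X| are obtained from a full domain name; it does not compute DD, since the formula is not reproduced in this text. The helper names `levels` and `level` are assumptions.

```python
def levels(fqdn):
    """Split a full domain name into levels, the first level being the TLD."""
    return fqdn.rstrip(".").lower().split(".")[::-1]


def level(parts, lam):
    """Return the lam-th level (1-based), matching the X_lambda notation."""
    return parts[lam - 1]


x = levels("www.baidu.com")
first_level = level(x, 1)          # X_1
second_level_length = len(level(x, 2))  # |X_2|
number_of_levels = len(x)          # |X|
```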

Specifically, when graph components with an IP as central node are merged, the premise is that IP addresses adjacent to the central node provide similar services; the similarity IS of two IPs is calculated provided that a specific time span is satisfied, and similar graph components are merged when the similarity reaches a threshold. The time span is the data processing interval used in the actual implementation, typically with an initial value of 12 hours. The similarity IS of two IPs is calculated as:

In the above formula, X denotes the IP address of the central node of a graph component and Y an adjacent IP address; X_m and Y_m denote the values of X and Y; X_t and Y_t denote the timestamps of X and Y; α and β are preset parameters with initial values 1.8 and 0.2 respectively; and λ denotes the class difference of the two IP addresses, for example 1 between a class A and a class B address and 2 between a class A and a class C address.
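The inputs to the similarity calculation can be sketched as follows; the similarity formula itself is not reproduced in this text and is therefore not implemented. The sketch assumes classful IPv4 addressing for λ, interprets the "value" of an address as its 32-bit integer, and interprets the time span condition as an absolute timestamp difference of at most 12 hours. All three interpretations, and the function names, are assumptions.

```python
import ipaddress


def ip_class(ip):
    """Return the classful index of an IPv4 address: A=0, B=1, C=2, D=3, E=4."""
    first_octet = int(ipaddress.IPv4Address(ip)) >> 24
    if first_octet < 128:
        return 0
    if first_octet < 192:
        return 1
    if first_octet < 224:
        return 2
    if first_octet < 240:
        return 3
    return 4


def class_difference(x, y):
    """Class difference (lambda) of two addresses, e.g. 1 between class A and B."""
    return abs(ip_class(x) - ip_class(y))


def ip_value(ip):
    """Assumed meaning of X_m: the address as a 32-bit integer."""
    return int(ipaddress.IPv4Address(ip))


def within_time_span(x_t, y_t, span_seconds=12 * 3600):
    """Assumed time span condition: timestamps (seconds) at most 12 hours apart."""
    return abs(x_t - y_t) <= span_seconds
```

For instance, `class_difference("10.0.0.1", "192.168.1.1")` compares a class A with a class C address and yields 2, in line with the example given in the text.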

On this basis, the graph component feature analysis in step C comprises:

C1. Analyzing the structural features of the graph component: calculating the number of nodes in the graph component, including full domain name nodes and IP nodes, and calculating the maximum degree and the average degree over all central nodes;

C2. Analyzing the full domain name node features: using the information from the traffic preprocessing of step A, deriving the Whois information of the full domain names from the public data of the Whois database; Whois information is public information about domain names and IPs that gives their basic registration details;

C3. Analyzing the IP node features: using the information from the traffic preprocessing of step A, deriving the Whois information of the IP nodes from the public data of the Whois database;

C4. Analyzing the edge features: the nodes in a graph component are connected by edges, each edge representing one DNS query response; TTL information (Time To Live, a field specifying the maximum number of network segments an IP packet may traverse before it is discarded by a router), including its mean and variance over the edges, is selected as the edge feature;

C5. Calculating the blacklist features: the blacklists comprise a full domain name blacklist and an IP blacklist. When the blacklist features of a graph component are analyzed, the following are calculated against published blacklist databases: the number of marked full domain names in the graph component, the number of marked second-level domain + top-level domain combinations (2-LD + TLD), the maximum degree of the marked full domain name nodes, the number of marked IP nodes, the maximum degree of the marked IP nodes, and the ratio of marked nodes to total nodes.
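As a non-limiting illustration of items C1 and C4, the sketch below computes the node count, maximum degree and average degree of a graph component, and the mean and variance of the TTL values on its edges. The component representation (central node mapped to its set of neighbours), the function names and the use of the population variance are assumptions made for this example.

```python
from statistics import mean, pvariance


def structural_features(component):
    """C1 sketch: component maps each central node to its set of neighbours."""
    centers = set(component)
    neighbours = set().union(*component.values())
    degrees = [len(adjacent) for adjacent in component.values()]
    return {
        "n_nodes": len(centers) + len(neighbours),
        "max_degree": max(degrees, default=0),
        "avg_degree": mean(degrees) if degrees else 0.0,
    }


def ttl_features(ttls):
    """C4 sketch: mean and (population) variance of the TTLs, one per edge."""
    return {"ttl_mean": mean(ttls), "ttl_var": pvariance(ttls)}


component = {
    "a.example.com": {"192.0.2.1", "192.0.2.2"},
    "b.example.com": {"192.0.2.1"},
}
structure = structural_features(component)
edges = ttl_features([300, 300, 60, 60])
```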

Further, the Whois information of the full domain name in step C2 includes the creation time, the number of updates and the completeness; the maximum number of levels and the average number of levels of the full domain names; the number of distinct top-level domains (TLD) and the number of distinct second-level domains (2-LD); and the maximum length, average length, number of words contained and degree of character repetition of the second-level domain (2-LD) strings.
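The level-based and length-based items in this list can be computed from the domain names alone, as in the following sketch. It assumes that the second-level domain is simply the label to the left of the last label (no public suffix list is consulted), and it leaves out the word count and character repetition degree, whose exact definitions are not given in the text. The function name and feature keys are hypothetical.

```python
from statistics import mean


def domain_level_features(fqdns):
    """Lexical features of the full domain names in one graph component."""
    split = [name.rstrip(".").lower().split(".") for name in fqdns]
    depths = [len(parts) for parts in split]
    tlds = {parts[-1] for parts in split}
    # Assumption: the 2-LD is the label next to the TLD (no public suffix list).
    slds = {parts[-2] for parts in split if len(parts) >= 2}
    sld_lengths = [len(label) for label in slds]
    return {
        "max_levels": max(depths, default=0),
        "avg_levels": mean(depths) if depths else 0.0,
        "n_tld": len(tlds),
        "n_2ld": len(slds),
        "max_2ld_len": max(sld_lengths, default=0),
        "avg_2ld_len": mean(sld_lengths) if sld_lengths else 0.0,
    }


features = domain_level_features(
    ["www.example.com", "mail.example.com", "a.b.test.org"]
)
```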

Further, the Whois information of the IP node in step C3 includes the status of the IP node, the update time, the country to which it belongs, the number of Autonomous System Numbers (ASN) of the IP nodes, and the ratio of the number of ASNs to the number of IPs.

The botnet detection method based on the DNS mapping association graph has the advantages that:

1. The method covers the detection of both Fast-flux and Domain-flux botnets.

2. Filtering the DNS traffic down to response packets containing A records greatly reduces the volume of data to be processed subsequently.

3. A new DNS traffic processing idea is provided by constructing a bipartite graph set taking a full domain name and an IP as central nodes.

4. Different merging algorithms are applied to the full domain name and the IP respectively, which greatly reduces the set of graph components and better fits the technical characteristics of Fast-flux and Domain-flux.

5. By analyzing the characteristics of the DNS mapping association graph, the accuracy of botnet detection is greatly improved, and the method is also suitable for processing mass data of a high-speed network.

The present invention will be described in further detail with reference to the following examples. This should not be understood as limiting the scope of the above-described subject matter of the present invention to the following examples. Various substitutions and alterations according to the general knowledge and conventional practice in the art are intended to be included within the scope of the present invention without departing from the technical spirit of the present invention as described above.

Drawings

Fig. 1 is a flowchart of the botnet detection method based on a DNS mapping association graph according to the present invention.

Detailed Description

The present embodiment uses CentOS, a Linux distribution, version 7.6.1810.

As shown in Fig. 1, the botnet detection method based on a DNS mapping association graph of the present invention includes:

A. DNS traffic filtering and preprocessing: the equipment at the egress of the network under test includes switches, routers and the like. Traffic is directed to a specific server network port by configuring port mirroring, and the PF_RING package is installed on the server; if the data volume is large, traffic capture at the 10 Gbps level can be achieved with PF_RING + Zero Copy. DNS traffic is filtered according to BPF (Berkeley Packet Filter) rules to retain the response packet traffic containing A records (an A (Address) record gives the IP address corresponding to a specified host name or domain name).
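By way of illustration, the A-record check applied to a captured DNS payload can be sketched with the standard library alone. The BPF expression shown is only one plausible rule, since the text does not state the rule actually used; the function names are hypothetical, and the parser assumes a well-formed, untruncated DNS message (UDP payload, header layout as in RFC 1035).

```python
import socket
import struct

# One plausible capture filter for DNS responses; the actual rule is not given.
BPF_FILTER = "udp and src port 53"


def skip_name(msg, off):
    """Return the offset just past a (possibly compressed) DNS name."""
    while True:
        length = msg[off]
        if length & 0xC0 == 0xC0:  # compression pointer: 2 bytes, ends the name
            return off + 2
        if length == 0:            # root label ends the name
            return off + 1
        off += 1 + length


def a_records(msg):
    """Return the IPv4 addresses of the A records in a DNS response payload."""
    if len(msg) < 12:
        return []
    _id, flags, qdcount, ancount, _ns, _ar = struct.unpack("!6H", msg[:12])
    if not flags & 0x8000:         # QR bit clear: this is a query, not a response
        return []
    off = 12
    for _ in range(qdcount):
        off = skip_name(msg, off) + 4  # skip QTYPE and QCLASS
    ips = []
    for _ in range(ancount):
        off = skip_name(msg, off)
        rtype, rclass, _ttl, rdlen = struct.unpack("!HHIH", msg[off:off + 10])
        off += 10
        if rtype == 1 and rclass == 1 and rdlen == 4:  # type A, class IN
            ips.append(socket.inet_ntoa(msg[off:off + 4]))
        off += rdlen
    return ips


# Hand-built response: one question and one A answer using a name pointer.
question = b"\x07example\x03com\x00" + struct.pack("!HH", 1, 1)
answer = b"\xc0\x0c" + struct.pack("!HHIH", 1, 1, 300, 4) + bytes([192, 0, 2, 1])
response = struct.pack("!6H", 0x1234, 0x8180, 1, 1, 0, 0) + question + answer
query = struct.pack("!6H", 0x1234, 0x0100, 1, 0, 0, 0) + question
```

A packet would be kept for later steps only when `a_records` returns a non-empty list.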

The filtered response packet traffic is then preprocessed: the A-record response packet traffic undergoes secondary filtering according to whitelists of full domain names and IPs, and, using the traffic timestamp as the ID, several fields are extracted from each A record, including the timestamp, source MAC address, destination MAC address, source IP, destination IP, TTL value, source port, destination port and the like.

B. Graph mapping association processing: from the preprocessed response packet traffic, the association mappings in the DNS query responses are extracted using the full domain name (FQDN, Fully Qualified Domain Name) and the IP respectively as keys; bipartite graph component sets are constructed with the full domain name and the IP respectively as central nodes; and the graph components are merged, using a corresponding manner for the set centered on the full domain name and for the set centered on the IP.

When graph components with full domain names as central nodes are merged, the difference DD between similar domain names is first calculated according to the hierarchical structure of the full domain names, and similar graph components are then merged using the k-means clustering algorithm, wherein the difference DD between similar domain names is calculated as follows:

where ω_λ is an intermediate value of the domain name difference calculation; λ is the level of the domain name; X and Y each denote a full domain name; X_λ denotes the λ-th level of the full domain name X and Y_λ the λ-th level of the full domain name Y; |X_λ| and |Y_λ| denote the lengths of X_λ and Y_λ; |X| and |Y| denote the number of levels of X and Y; α is a preset parameter serving as a balancing weight, initialized to the empirical value 2 and adjustable afterwards for optimization; and dd_λ and Ω are intermediate values of the calculation.

When graph components with an IP as central node are merged, the premise is that IP addresses adjacent to the central node provide similar services; the similarity IS of two IPs is calculated provided that a specific time span is satisfied, and similar graph components are merged when the similarity reaches a threshold. The time span is the data processing interval used in the actual implementation, typically with an initial value of 12 hours. The similarity IS of two IPs is calculated as:

In the above formula, X denotes the IP address of the central node of a graph component and Y an adjacent IP address; X_m and Y_m denote the values of X and Y; X_t and Y_t denote the timestamps of X and Y; α and β are preset parameters with initial values 1.8 and 0.2 respectively; and λ denotes the class difference of the two IP addresses, for example 1 between a class A and a class B address.

C. Graph component feature analysis and extraction: the elements of the bipartite graph component sets are analyzed, and graph feature vectors are extracted in combination with the information obtained by preprocessing. The graph component feature analysis comprises:

C2. Analyzing the full domain name node features: using the information from the traffic preprocessing of step A, deriving the Whois information of the full domain names from the public data of the Whois database, including the creation time, the number of updates and the completeness; the maximum number of levels and the average number of levels of the full domain names; the number of distinct TLDs (top-level domains) and of distinct 2-LDs (second-level domains); and the maximum length, average length, number of words and degree of character repetition of the 2-LD strings, and the like;

C3. Analyzing the IP node features: using the information from the traffic preprocessing of step A, deriving the Whois information of the IP nodes from the public data of the Whois database, including the status, the update time, the country, individual and region, the number of ASNs (autonomous system numbers) of the IP nodes, the ratio of the number of ASNs to the number of IPs, and the like;

C5. Calculating the blacklist features: the blacklists comprise a full domain name blacklist and an IP blacklist. When the blacklist features of a graph component are analyzed, the following are calculated against published blacklist databases: the number of marked full domain names in the graph component, the number of marked 2-LD + TLD (second-level domain + top-level domain) combinations, the maximum degree of the marked full domain name nodes, the number of marked IP nodes, the maximum degree of the marked IP nodes, and the ratio of marked nodes to total nodes.
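As a non-limiting illustration of item C5, the six blacklist features can be computed for a component centered on full domain names as sketched below. The function name, the feature keys and the sample blacklists are hypothetical, and the 2-LD + TLD combination is approximated by the last two labels of the name, which is an assumption.

```python
def blacklist_features(fqdn_component, fqdn_blacklist, ip_blacklist):
    """C5 sketch: fqdn_component maps each full domain name to its set of IPs."""
    fqdns = set(fqdn_component)
    ips = set().union(*fqdn_component.values())
    marked_fqdns = fqdns & fqdn_blacklist
    marked_ips = ips & ip_blacklist
    # Assumption: 2-LD + TLD is approximated by the last two labels of the name.
    marked_2ld_tld = {".".join(name.split(".")[-2:]) for name in marked_fqdns}
    # Degree of an IP node: number of full domain names that resolve to it.
    ip_degree = {
        ip: sum(ip in adjacent for adjacent in fqdn_component.values())
        for ip in marked_ips
    }
    total = len(fqdns) + len(ips)
    marked = len(marked_fqdns) + len(marked_ips)
    return {
        "n_marked_fqdn": len(marked_fqdns),
        "n_marked_2ld_tld": len(marked_2ld_tld),
        "max_degree_marked_fqdn": max(
            (len(fqdn_component[name]) for name in marked_fqdns), default=0
        ),
        "n_marked_ip": len(marked_ips),
        "max_degree_marked_ip": max(ip_degree.values(), default=0),
        "marked_ratio": marked / total if total else 0.0,
    }


component = {
    "a.evil.example": {"192.0.2.1", "192.0.2.2"},
    "b.evil.example": {"192.0.2.1"},
    "ok.example.org": {"192.0.2.3"},
}
result = blacklist_features(
    component,
    fqdn_blacklist={"a.evil.example", "b.evil.example"},
    ip_blacklist={"192.0.2.1"},
)
```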

D. Graph component classification: published Fast-flux and Domain-flux botnet datasets are taken as input, and a mixed dataset containing real traffic is constructed in a laboratory environment by replaying traffic with TCPReplay. The Fast-flux public dataset consists of the pure Fast-flux malicious traffic in CTU-13 and the sample traffic of the Storm, Waledac and Zeus botnets in ISOT. The Domain-flux public dataset is the ISOT HTTP Botnet dataset constructed by Alenazi A. et al. Steps A to C are executed, the data are standardized according to the extracted graph feature vectors, the standardized data are divided into a training set and a test set, and a classification model is obtained using the LightGBM algorithm.
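The standardization and train/test division can be sketched as follows, using only the standard library. The z-score standardization, the 80/20 split ratio, the fixed seed and the function names are assumptions; the text does not specify them. The resulting lists could then be passed to a gradient boosting classifier such as LightGBM's `LGBMClassifier`, which is not shown here.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each feature column; constant columns are mapped to 0.0."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


def split(rows, labels, test_ratio=0.2, seed=0):
    """Shuffle indices reproducibly and divide into training and test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(ids):
        return [rows[i] for i in ids], [labels[i] for i in ids]

    return pick(train_idx), pick(test_idx)


# Invented feature vectors (two features) with labels 0 = benign, 1 = malicious.
vectors = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 10.0], [5.0, 10.0]]
labels = [0, 0, 1, 1, 1]
scaled = standardize(vectors)
(train_x, train_y), (test_x, test_y) = split(scaled, labels)
```

The sketch standardizes before splitting, following the order given in the text; fitting the scaling statistics on the training set only is a common alternative that avoids leaking test-set information.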

E. The information of the traffic under test is input into the classification model, which determines whether the traffic is malicious; if it is, the model further determines the category of the malicious traffic, namely Fast-flux or Domain-flux botnet.

Claims

1. A botnet detection method based on a DNS mapping association graph, characterized by comprising the following steps:

A. DNS traffic filtering and preprocessing: from a traffic mirror of the equipment at the egress of the network under test, DNS traffic is filtered according to preset rules to retain the response packet traffic containing A records, and the filtered response packet traffic is then preprocessed;

B. Graph mapping association processing: from the preprocessed response packet traffic, the association mappings in the DNS query responses are extracted using the full domain name and the IP respectively as keys; bipartite graph component sets are constructed with the full domain name and the IP respectively as central nodes, and the graph components within each bipartite graph component set are merged;

C. Graph component feature analysis and extraction: the elements of the bipartite graph component sets are analyzed, and graph feature vectors are extracted in combination with the information obtained by preprocessing; the analysis comprises:

C2. Analyzing the full domain name node features: using the information from the traffic preprocessing of step A, deriving the Whois information of the full domain names from the public data of the Whois database;

C4. Analyzing the edge features: the nodes in a graph component are connected by edges, each edge representing one DNS query response; TTL information, including its mean and variance over the edges, is selected as the edge feature;

C5. Calculating the blacklist features: the blacklists comprise a full domain name blacklist and an IP blacklist; when the blacklist features of a graph component are analyzed, the following are calculated against published blacklist databases: the number of marked full domain names in the graph component, the number of marked second-level domain + top-level domain combinations, the maximum degree of the marked full domain name nodes, the number of marked IP nodes, the maximum degree of the marked IP nodes, and the ratio of marked nodes to total nodes;

D. Graph component classification: published Fast-flux and Domain-flux botnet datasets are taken as input and steps A to C are executed; the data are standardized according to the extracted graph feature vectors and divided into a training set and a test set, and a classification model is obtained using the LightGBM algorithm;

E. The information of the traffic under test is input into the classification model, which determines whether the traffic is malicious; if it is, the model further determines the category of the malicious traffic.

2. The botnet detection method based on a DNS mapping association graph according to claim 1, characterized in that: the preprocessing of step A comprises performing secondary filtering on the A-record response packet traffic according to whitelists of full domain names and IPs, and extracting several fields from each A record using the traffic timestamp as the ID.

3. The botnet detection method based on a DNS mapping association graph according to claim 1, characterized in that: when the graph components in the bipartite graph component sets are merged in step B, the set centered on the full domain name and the set centered on the IP are each merged in its own corresponding manner.

4. The botnet detection method based on a DNS mapping association graph according to claim 3, characterized in that: when graph components with full domain names as central nodes are merged, the difference DD between similar domain names is first calculated according to the hierarchical structure of the full domain names, and similar graph components are then merged using the k-means clustering algorithm, wherein the difference DD between similar domain names is calculated as follows:

where ω_λ is an intermediate value of the domain name difference calculation; λ is the level of the domain name; X and Y each denote a full domain name; X_λ denotes the λ-th level of the full domain name X and Y_λ the λ-th level of the full domain name Y; |X_λ| and |Y_λ| denote the lengths of X_λ and Y_λ; |X| and |Y| denote the number of levels of X and Y; α is a preset parameter initialized to 2; and dd_λ and Ω are intermediate values of the calculation.

5. The botnet detection method based on a DNS mapping association graph according to claim 3, characterized in that: when graph components with an IP as central node are merged, the premise is that IP addresses adjacent to the central node provide similar services; the similarity IS of two IPs is calculated provided that a specific time span is satisfied, and similar graph components are merged when the similarity reaches a threshold; wherein the similarity IS of two IPs is calculated as:

In the above formula, X denotes the IP address of the central node of a graph component and Y an adjacent IP address; X_m and Y_m denote the values of X and Y; X_t and Y_t denote the timestamps of X and Y; α and β are preset parameters; and λ denotes the class difference of the two IP addresses.

6. The botnet detection method based on a DNS mapping association graph according to claim 1, characterized in that: the Whois information of the full domain name in step C2 includes the creation time, the number of updates and the completeness; the maximum number of levels and the average number of levels of the full domain names; the number of distinct top-level domains and the number of distinct second-level domains; and the maximum length, average length, number of words contained and degree of character repetition of the second-level domain strings.

7. The botnet detection method based on a DNS mapping association graph according to claim 1, characterized in that: the Whois information of the IP node in step C3 includes the status of the IP node, the update time, the country to which it belongs, the number of autonomous system numbers of the IP nodes, and the ratio of the number of autonomous system numbers to the number of IPs.