Disclosure of Invention
The invention provides a botnet detection method based on a DNS mapping association graph, which targets both Fast-flux and Domain-flux botnets and achieves higher detection accuracy.
The botnet detection method based on a DNS mapping association graph according to the invention comprises the following steps:
A. DNS traffic filtering and preprocessing: from a traffic mirror of the devices at the egress of the network under test, DNS traffic is filtered according to preset rules so as to retain the response packet traffic containing A records (an A (Address) record is the IP address record corresponding to a specified host name or domain name), and the retained response packet traffic is then preprocessed;
B. Graph mapping association processing: for the preprocessed response packet traffic, the full domain name (Fully Qualified Domain Name, FQDN) and the IP (Internet Protocol) address of each DNS query response are respectively taken as keys, the mapping relations between them are extracted, bipartite graph component sets with full domain names and with IPs as central nodes are respectively constructed, and the graph components in each bipartite graph component set are respectively merged;
C. Graph component feature analysis and extraction: the elements in the bipartite graph component sets are analyzed, and graph feature vectors are extracted in combination with the information obtained by preprocessing;
D. Graph component classification: published Fast-flux and Domain-flux botnet datasets are taken as data input, steps A to C are executed, the data are standardized according to the extracted graph feature vectors, the standardized data are divided into a training set and a test set, and a classification model is obtained using the LightGBM algorithm. LightGBM is a fast, high-performance, distributed gradient boosting framework open-sourced by Microsoft in 2017 that can be used for machine learning tasks such as ranking, classification, and regression. It is based on decision tree algorithms and splits leaf nodes with a leaf-wise (best-first) growth strategy; without reducing accuracy, it is about 10 times faster than mainstream classification algorithms and uses about one third of the memory;
E. The information of the traffic to be detected is input into the classification model, which determines whether the traffic is malicious and, if it is, determines the category of the malicious traffic (Fast-flux or Domain-flux botnet).
Tests show that the method covers the detection of both Fast-flux and Domain-flux botnets simultaneously and achieves higher detection accuracy.
Further, the preprocessing in step A includes performing secondary filtering on the A-record response packet traffic according to whitelists of full domain names and IPs, and, using the traffic timestamp as the ID, extracting several fields from each A record, including the timestamp, source MAC address, destination MAC address, source IP, destination IP, TTL value, source port, and destination port.
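By way of non-limiting illustration, the secondary filtering and field extraction described above might be sketched as follows. The record type `ARecord`, the function `secondary_filter`, and the rule that a record is dropped when either its full domain name or its answer IP is whitelisted are assumptions made for this sketch; the description does not fix a data layout or the exact matching rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ARecord:
    """One A-record response; field names are illustrative assumptions."""
    timestamp: float   # capture time, used as the record ID
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    ttl: int
    fqdn: str          # queried full domain name
    answer_ip: str     # IP address carried in the A record


def secondary_filter(records, fqdn_whitelist, ip_whitelist):
    """Keep only records whose FQDN and answer IP are both absent from the whitelists."""
    kept = []
    for record in records:
        # Normalise case and the optional trailing dot before the lookup.
        name = record.fqdn.rstrip(".").lower()
        if name in fqdn_whitelist or record.answer_ip in ip_whitelist:
            continue
        kept.append(record)
    return kept
```

Whitelist entries are assumed to be stored in lower case without a trailing dot.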
Furthermore, when the graph components in the bipartite graph component sets are merged in step B, the component set centered on full domain names and the component set centered on IPs are each merged with its own corresponding method.
Specifically, when graph components with full domain names as central nodes are merged, the difference DD between similar domain names is first calculated from the hierarchical structure of the full domain names, and similar graph components are then merged using a k-means clustering algorithm, wherein the difference DD between similar domain names is calculated as follows:
wherein ω_λ is an intermediate value in the domain name difference calculation; λ is the level of the domain name; X and Y each denote a full domain name; X_λ denotes the λ-th level of the full domain name X and Y_λ denotes the λ-th level of the full domain name Y (for example, for the full domain name www.baidu.com, the first level is com); |X_λ| denotes the length of X_λ and |Y_λ| denotes the length of Y_λ; |X| denotes the number of levels of X and |Y| denotes the number of levels of Y; α is a preset parameter used as a balancing weight and is initialized to 2, an empirical value that can subsequently be tuned; and dd_λ and Ω are intermediate values of the calculation.
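As a non-limiting illustration of the notation only, the quantities λ, X_λ, |X_λ| and |X| can be obtained by splitting a full domain name into levels counted from the top-level domain. The helper `domain_levels` is a hypothetical name, and the sketch does not implement the difference DD itself.

```python
def domain_levels(fqdn):
    """Return the labels of a full domain name, ordered from level 1 (the TLD) downward."""
    labels = fqdn.rstrip(".").lower().split(".")
    return labels[::-1]


x = domain_levels("www.baidu.com")
level_count = len(x)        # |X|, the number of levels of X
first_level = x[0]          # X_1, the first level of X
first_length = len(x[0])    # |X_1|, the length of the first level
```

With this ordering, index `k - 1` of the returned list corresponds to level λ = k in the text.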
Specifically, when graph components with IPs as central nodes are merged, it is assumed that IP addresses adjacent to the central node provide similar services; the similarity IS of the two IPs is calculated provided that a specific time span is satisfied, and similar graph components are merged when the similarity reaches a threshold. The time span refers to the data processing interval in the actual implementation, with an initial value of typically 12 hours, wherein the similarity IS of two IPs is calculated as:
In the above formula, X denotes the IP address of the central node of a graph component and Y denotes the adjacent IP address; X_m denotes the value of X and Y_m denotes the value of Y; X_t denotes the timestamp of X and Y_t denotes the timestamp of Y; α and β are preset parameters with initial values of 1.8 and 0.2, respectively; and λ denotes the class difference between the two IP addresses: for example, the class difference between a class A IP address and a class B IP address is 1, and the class difference between a class A IP address and a class C IP address is 2.
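The class difference λ can be illustrated with a minimal sketch based on classful IPv4 addressing. The function names are hypothetical, and extending the difference to classes D and E by the same letter distance is an assumption; the description gives only the A/B and A/C examples. The sketch does not compute the similarity IS.

```python
import ipaddress


def ipv4_class(ip):
    """Return the classful address class (A to E) of an IPv4 address."""
    first_octet = int(ipaddress.IPv4Address(ip)) >> 24
    if first_octet < 128:
        return "A"
    if first_octet < 192:
        return "B"
    if first_octet < 224:
        return "C"
    if first_octet < 240:
        return "D"
    return "E"


def class_difference(x, y):
    """Class difference of two IPv4 addresses: A vs B gives 1, A vs C gives 2."""
    return abs(ord(ipv4_class(x)) - ord(ipv4_class(y)))
```

For example, `class_difference("10.0.0.1", "172.16.0.1")` compares a class A address with a class B address.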
On this basis, the graph component feature analysis in step C comprises:
C1. Analyzing the structural features of the graph component: counting the nodes in the graph component, including full domain name nodes and IP nodes, and calculating the maximum degree and the average degree of all central nodes;
C2. Analyzing the full domain name node features: using the information obtained from the traffic preprocessing in step A, obtaining the Whois information of the full domain names from the public data of the Whois database; Whois information is public information about domain names and IPs that describes their basic registration details;
C3. Analyzing the IP node features: using the information obtained from the traffic preprocessing in step A, obtaining the Whois information of the IP nodes from the public data of the Whois database;
C4. Analyzing the connecting edge features: the nodes in a graph component are connected by connecting edges, each connecting edge representing one DNS query response, and the TTL information of the connecting edges (Time To Live, a field specifying the maximum number of network segments an IP packet may pass through before being discarded by a router), including its mean and variance, is selected as the connecting edge feature;
C5. Calculating the blacklist features: the blacklist comprises a full domain name blacklist and an IP blacklist; when the blacklist features of a graph component are analyzed, the number of marked full domain names, the number of marked second-level domain + top-level domain combinations (2-LD + TLD), the maximum degree of the marked full domain name nodes, the number of marked IP nodes, the maximum degree of the marked IP nodes, and the ratio of marked nodes to total nodes are calculated with reference to published blacklist libraries.
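The structural features of C1 and the connecting edge features of C4 might be computed as in the following sketch. It assumes that a graph component is supplied as a list of `(fqdn, ip, ttl)` tuples, one per DNS query response, that the degree of a node is its number of distinct neighbours, and that the population variance is intended; the function name and the `central` parameter are hypothetical.

```python
from statistics import mean, pvariance


def structure_and_edge_features(edges, central="fqdn"):
    """Compute node counts, central-node degrees and TTL statistics.

    edges: non-empty list of (fqdn, ip, ttl) tuples.
    central: "fqdn" or "ip", selecting which side holds the central nodes.
    """
    neighbours = {"fqdn": {}, "ip": {}}
    for fqdn, ip, _ttl in edges:
        neighbours["fqdn"].setdefault(fqdn, set()).add(ip)
        neighbours["ip"].setdefault(ip, set()).add(fqdn)
    degrees = [len(n) for n in neighbours[central].values()]
    ttls = [ttl for _fqdn, _ip, ttl in edges]
    return {
        "fqdn_nodes": len(neighbours["fqdn"]),
        "ip_nodes": len(neighbours["ip"]),
        "max_degree": max(degrees),
        "avg_degree": mean(degrees),
        "ttl_mean": mean(ttls),
        "ttl_variance": pvariance(ttls),
    }
```

An empty edge list is not handled in this sketch and would raise an error.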
Further, the Whois information of the full domain name in step C2 includes the creation time, the number of updates, the completeness, the maximum number of levels of the full domain name, the average number of levels, the number of distinct top-level domains (TLD), the number of distinct second-level domains (2-LD), and the maximum length, average length, number of words contained, and degree of character repetition of the second-level domain (2-LD) strings.
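Some of the lexical features listed above can be derived from the domain name strings alone, as in the following non-limiting sketch. It assumes that the second-level domain is simply the label immediately left of the top-level domain (multi-label public suffixes are not handled) and that the degree of character repetition is the fraction of characters that repeat an earlier character; both choices, and all function names, are assumptions, since the description does not define them.

```python
def second_level_label(fqdn):
    """Return the label immediately left of the TLD, or "" if there is none."""
    labels = fqdn.rstrip(".").lower().split(".")
    return labels[-2] if len(labels) >= 2 else ""


def char_repetition(label):
    """Fraction of characters in the label that repeat an earlier character."""
    return 1 - len(set(label)) / len(label) if label else 0.0


def lexical_features(fqdns):
    """Compute 2-LD and TLD string features for a non-empty list of full domain names."""
    labels = [second_level_label(f) for f in fqdns]
    lengths = [len(label) for label in labels]
    tlds = {f.rstrip(".").lower().split(".")[-1] for f in fqdns}
    return {
        "max_2ld_length": max(lengths),
        "avg_2ld_length": sum(lengths) / len(lengths),
        "distinct_2ld": len(set(labels)),
        "distinct_tld": len(tlds),
        "avg_char_repetition": sum(char_repetition(l) for l in labels) / len(labels),
    }
```

The word count feature is omitted here because it would require a dictionary that the description does not specify.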
Further, the Whois information of the IP node in step C3 includes the status of the IP node, the update time, the country to which the node belongs, the number of Autonomous System Numbers (ASN) of the IP nodes, and the ratio of the number of ASNs to the number of IPs.
The botnet detection method based on the DNS mapping association graph has the advantages that:
1. The method covers the detection of both Fast-flux and Domain-flux botnets simultaneously.
2. Filtering the DNS traffic down to the response packet traffic containing A records greatly reduces the volume of data for subsequent processing.
3. Constructing bipartite graph sets with full domain names and IPs as central nodes provides a new approach to processing DNS traffic.
4. Different merging algorithms are applied to full domain names and IPs respectively, which greatly reduces the set of graph components and better matches the technical characteristics of Fast-flux and Domain-flux.
5. Analyzing the features of the DNS mapping association graph greatly improves the accuracy of botnet detection, and the method is also suitable for processing massive data from high-speed networks.
The present invention will be described in further detail with reference to the following examples. This should not be understood as limiting the scope of the above-described subject matter of the present invention to the following examples. Various substitutions and alterations according to the general knowledge and conventional practice in the art are intended to be included within the scope of the present invention without departing from the technical spirit of the present invention as described above.
Detailed Description
The present embodiment uses CentOS, a Linux distribution, version 7.6.1810.
As shown in Fig. 1, the botnet detection method based on the DNS mapping association graph of the present invention includes:
A. DNS traffic filtering and preprocessing: the devices at the egress of the network under test include switches, routers, and the like. Traffic is directed to a specific network port of a server by configuring port mirroring, and the PF_RING package is installed on the server; if the data volume is large, traffic capture at the 10 Gbps level can be achieved using PF_RING with Zero Copy. DNS traffic is filtered according to BPF (Berkeley Packet Filter) rules so as to retain the response packet traffic containing A records (an A (Address) record is the IP address record corresponding to a specified host name or domain name).
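By way of non-limiting illustration, the retention of A-record responses can be sketched in two stages: a BPF expression that selects UDP packets sent from port 53, followed by a check of the DNS payload for the response flag and for answers of type A. The filter string, the function names, and the restriction to UDP are assumptions of this sketch; the parser follows the standard DNS message layout and performs no bounds checking on malformed packets.

```python
import struct

# Assumed capture filter in pcap-filter syntax: DNS responses over UDP.
BPF_DNS_RESPONSES = "udp and src port 53"


def _skip_name(msg, off):
    """Return the offset just past an encoded (possibly compressed) domain name."""
    while True:
        length = msg[off]
        if length == 0:
            return off + 1
        if length & 0xC0 == 0xC0:      # compression pointer occupies two bytes
            return off + 2
        off += 1 + length


def a_record_ips(msg):
    """Return the IPv4 addresses of all A records in a DNS response payload.

    Returns an empty list for queries and for responses without A records.
    """
    if len(msg) < 12:
        return []
    _id, flags, qdcount, ancount, _ns, _ar = struct.unpack("!6H", msg[:12])
    if not flags & 0x8000:             # QR bit clear: this is a query
        return []
    off = 12
    for _ in range(qdcount):
        off = _skip_name(msg, off) + 4  # skip QTYPE and QCLASS
    ips = []
    for _ in range(ancount):
        off = _skip_name(msg, off)
        rtype, rclass, _ttl, rdlen = struct.unpack("!HHIH", msg[off:off + 10])
        off += 10
        if rtype == 1 and rclass == 1 and rdlen == 4:   # type A, class IN
            ips.append(".".join(str(b) for b in msg[off:off + 4]))
        off += rdlen
    return ips
```

A response packet would be retained when `a_record_ips` returns a non-empty list for its UDP payload.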
The retained response packet traffic is then preprocessed, which includes performing secondary filtering on the A-record response packet traffic according to whitelists of full domain names and IPs, and, using the traffic timestamp as the ID, extracting several fields from each A record, including the timestamp, source MAC address, destination MAC address, source IP, destination IP, TTL value, source port, and destination port.
B. Graph mapping association processing: for the preprocessed response packet traffic, the full domain name (Fully Qualified Domain Name, FQDN) and the IP of each DNS query response are respectively taken as keys, the mapping relations between them are extracted, and bipartite graph component sets with full domain names and with IPs as central nodes are respectively constructed; the graph components of the set centered on full domain names and of the set centered on IPs are then each merged with its own corresponding method.
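The construction of the two bipartite graph component sets might be sketched as follows, assuming that each preprocessed A-record response has been reduced to an `(fqdn, ip)` pair. The function name and the dictionary-of-sets representation are illustrative choices, not part of the described method.

```python
from collections import defaultdict


def build_component_sets(pairs):
    """Build FQDN-centred and IP-centred graph components from (fqdn, ip) pairs.

    Returns two dicts: central FQDN -> set of IPs, and central IP -> set of FQDNs.
    """
    by_fqdn = defaultdict(set)
    by_ip = defaultdict(set)
    for fqdn, ip in pairs:
        by_fqdn[fqdn].add(ip)
        by_ip[ip].add(fqdn)
    return dict(by_fqdn), dict(by_ip)


pairs = [
    ("a.example.com", "192.0.2.1"),
    ("a.example.com", "192.0.2.2"),
    ("b.example.com", "192.0.2.1"),
]
fqdn_components, ip_components = build_component_sets(pairs)
```

Each dictionary entry is one graph component before merging: a central node together with the nodes it maps to.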
When graph components with full domain names as central nodes are merged, the difference DD between similar domain names is first calculated from the hierarchical structure of the full domain names, and similar graph components are then merged using a k-means clustering algorithm, wherein the difference DD between similar domain names is calculated as follows:
wherein ω_λ is an intermediate value in the domain name difference calculation; λ is the level of the domain name; X and Y each denote a full domain name; X_λ denotes the λ-th level of the full domain name X and Y_λ denotes the λ-th level of the full domain name Y; |X_λ| denotes the length of X_λ and |Y_λ| denotes the length of Y_λ; |X| denotes the number of levels of X and |Y| denotes the number of levels of Y; α is a preset parameter used as a balancing weight and is initialized to 2, an empirical value that can subsequently be tuned; and dd_λ and Ω are intermediate values of the calculation.
When graph components with IPs as central nodes are merged, it is assumed that IP addresses adjacent to the central node provide similar services; the similarity IS of the two IPs is calculated provided that a specific time span is satisfied, and similar graph components are merged when the similarity reaches a threshold. The time span refers to the data processing interval in the actual implementation, with an initial value of typically 12 hours, wherein the similarity IS of two IPs is calculated as:
In the above formula, X denotes the IP address of the central node of a graph component and Y denotes the adjacent IP address; X_m denotes the value of X and Y_m denotes the value of Y; X_t denotes the timestamp of X and Y_t denotes the timestamp of Y; α and β are preset parameters with initial values of 1.8 and 0.2, respectively; and λ denotes the class difference between the two IP addresses: for example, the difference between class A and class B is 1.
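The preconditions for comparing two IP-centred graph components, namely adjacency of the addresses and the time span, can be illustrated as below. Interpreting the value X_m as the integer value of the IPv4 address, the parameter `max_distance`, and the function names are assumptions of this sketch; it checks only the preconditions and does not compute the similarity IS.

```python
import ipaddress

TIME_SPAN_SECONDS = 12 * 3600  # initial time span of 12 hours


def ip_value(ip):
    """Integer value of an IPv4 address (assumed meaning of X_m and Y_m)."""
    return int(ipaddress.IPv4Address(ip))


def is_merge_candidate(x_ip, x_ts, y_ip, y_ts, max_distance=1):
    """Return True if Y is adjacent to X and both were observed within the time span.

    max_distance is a hypothetical bound on the numeric distance between addresses.
    """
    within_span = abs(x_ts - y_ts) <= TIME_SPAN_SECONDS
    adjacent = abs(ip_value(x_ip) - ip_value(y_ip)) <= max_distance
    return within_span and adjacent
```

Timestamps are assumed to be in seconds, as produced by the traffic preprocessing.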
C. Graph component feature analysis and extraction: the elements in the bipartite graph component sets are analyzed, and graph feature vectors are extracted in combination with the information obtained by preprocessing. The graph component feature analysis comprises:
C1. Analyzing the structural features of the graph component: counting the nodes in the graph component, including full domain name nodes and IP nodes, and calculating the maximum degree and the average degree of all central nodes;
C2. Analyzing the full domain name node features: using the information obtained from the traffic preprocessing in step A, obtaining the Whois information of the full domain names from the public data of the Whois database, including the creation time, the number of updates, the completeness, the maximum number of levels of the full domain name, the average number of levels, the number of distinct TLDs (top-level domains), the number of distinct 2-LDs (second-level domains), and the maximum length, average length, number of words contained, and degree of character repetition of the 2-LD strings;
C3. Analyzing the IP node features: using the information obtained from the traffic preprocessing in step A, obtaining the Whois information of the IP nodes from the public data of the Whois database, including the completeness status, the update time, the country, individual, and region to which the node belongs, the number of ASNs (Autonomous System Numbers) of the IP nodes, and the ratio of the number of ASNs to the number of IPs;
C4. Analyzing the connecting edge features: the nodes in a graph component are connected by connecting edges, each connecting edge representing one DNS query response, and the TTL information of the connecting edges (Time To Live, a field specifying the maximum number of network segments an IP packet may pass through before being discarded by a router), including its mean and variance, is selected as the connecting edge feature;
C5. Calculating the blacklist features: the blacklist comprises a full domain name blacklist and an IP blacklist; when the blacklist features of a graph component are analyzed, the number of marked full domain names, the number of marked 2-LD + TLD (second-level domain + top-level domain) combinations, the maximum degree of the marked full domain name nodes, the number of marked IP nodes, the maximum degree of the marked IP nodes, and the ratio of marked nodes to total nodes are calculated with reference to published blacklist libraries.
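The blacklist features of C5 might be computed as in the following sketch, assuming that a graph component is given as a mapping from each full domain name node to the set of IPs it resolved to, and that the 2-LD + TLD of a name is formed by its last two labels. The function name, the input layout, and the blacklist sets are hypothetical.

```python
def blacklist_features(fqdn_to_ips, fqdn_blacklist, ip_blacklist):
    """Compute blacklist features for one graph component.

    fqdn_to_ips: dict mapping each FQDN node to the set of IP nodes it connects to.
    """
    ip_to_fqdns = {}
    for fqdn, ips in fqdn_to_ips.items():
        for ip in ips:
            ip_to_fqdns.setdefault(ip, set()).add(fqdn)
    marked_fqdns = [f for f in fqdn_to_ips if f in fqdn_blacklist]
    marked_ips = [ip for ip in ip_to_fqdns if ip in ip_blacklist]
    # Assumed 2-LD + TLD: the last two labels of the name.
    marked_2ld_tld = {".".join(f.split(".")[-2:]) for f in marked_fqdns}
    total_nodes = len(fqdn_to_ips) + len(ip_to_fqdns)
    marked_nodes = len(marked_fqdns) + len(marked_ips)
    return {
        "marked_fqdns": len(marked_fqdns),
        "marked_2ld_tld": len(marked_2ld_tld),
        "max_degree_marked_fqdn": max((len(fqdn_to_ips[f]) for f in marked_fqdns), default=0),
        "marked_ips": len(marked_ips),
        "max_degree_marked_ip": max((len(ip_to_fqdns[i]) for i in marked_ips), default=0),
        "marked_ratio": marked_nodes / total_nodes if total_nodes else 0.0,
    }
```

The returned values can be appended to the graph feature vector of the component alongside the structural, Whois, and connecting edge features.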
D. Graph component classification: published Fast-flux and Domain-flux botnet datasets are taken as data input, and a mixed dataset containing real traffic is constructed in a laboratory environment by replaying traffic with Tcpreplay. The Fast-flux public data consist of the pure Fast-flux malicious traffic in CTU-13 and the sample traffic of the Storm, Waledac, and Zeus botnets in ISOT. The Domain-flux public dataset is the ISOT HTTP Botnet dataset constructed by Alenazi A et al. Steps A to C are executed, the data are standardized according to the extracted graph feature vectors, the standardized data are divided into a training set and a test set, and a classification model is obtained using the LightGBM algorithm.
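The standardization and the division into a training set and a test set might be sketched as follows. The sketch assumes z-score standardization per feature and a 70/30 random split; neither the scaling method nor the split ratio is stated in the description, and the function names are hypothetical. Training the classifier itself is not shown.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each feature column; constant columns are mapped to 0."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stdevs = [pstdev(c) or 1.0 for c in columns]   # avoid division by zero
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]


def split_train_test(rows, labels, test_fraction=0.3, seed=0):
    """Shuffle the samples and split them into training and test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = len(indices) - round(len(indices) * test_fraction)
    train, test = indices[:cut], indices[cut:]
    return (
        [rows[i] for i in train],
        [rows[i] for i in test],
        [labels[i] for i in train],
        [labels[i] for i in test],
    )
```

The resulting training lists could then be passed to a LightGBM classifier, for example `lightgbm.LGBMClassifier`, and the test lists used to evaluate the resulting model.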
E. The information of the traffic to be detected is input into the classification model, which determines whether the traffic is malicious; if it is, the classification model further determines the category of the malicious traffic, the category being Fast-flux or Domain-flux botnet.
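As a non-limiting illustration of step E, the output of a trained classifier can be mapped to the two-part result described above. The sketch assumes a three-class model whose output is a probability per label in the order given by `LABELS`; the label set, its order, and the function name are assumptions, since the description does not specify the form of the model output.

```python
LABELS = ("benign", "fast-flux", "domain-flux")  # assumed label order


def interpret(probabilities):
    """Return (is_malicious, category) for one vector of class probabilities.

    category is None when the traffic is classified as benign.
    """
    best = max(range(len(LABELS)), key=lambda i: probabilities[i])
    label = LABELS[best]
    is_malicious = label != "benign"
    return is_malicious, (label if is_malicious else None)
```

For example, `interpret([0.1, 0.7, 0.2])` reports malicious traffic of the Fast-flux category under these assumptions.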