Summary of the invention
The present invention provides a botnet detection method based on a DNS mapping association graph. It detects
both Fast-flux and Domain-flux botnets and achieves a higher detection accuracy.
The botnet detection method based on a DNS mapping association graph according to the present invention comprises:
A. DNS traffic filtering and pre-processing: from the mirrored traffic of the egress device of the network under
test, filter the DNS traffic according to predetermined rules, then keep only the response packet traffic that
contains A records (an A (Address) record specifies the IP address corresponding to a host name or domain name),
and then pre-process the filtered response packet traffic;
B. Graph mapping association processing: for the pre-processed response packet traffic, according to the DNS query
responses, take the fully qualified domain name (FQDN) and the IP respectively as the key, extract the association
mappings between them, construct two sets of bipartite graph components, one with FQDNs as center nodes and one
with IPs as center nodes, and merge the graph components within each set separately;
C. Graph component feature analysis and extraction: analyze the elements of the bipartite graph component sets and,
combined with the information obtained in pre-processing, extract graph feature vectors;
D. Graph component classification: take published Fast-flux and Domain-flux botnet datasets as input data, execute
steps A to C, standardize the data according to the extracted graph feature vectors, divide the standardized data
into a training set and a test set, and obtain a classification model using the LightGBM algorithm. LightGBM is a
fast, high-performance, distributed gradient boosting framework that Microsoft open-sourced in 2017; it can be used
for machine learning tasks such as ranking, classification and regression. It is based on decision tree algorithms
and splits leaf nodes with a leaf-wise (best-leaf-first) strategy. Without reducing accuracy, it runs about 10
times faster than mainstream algorithms, while its memory usage is about 3 times lower.
E. Input the information of the traffic to be tested into the classification model, which determines whether the
traffic is malicious; if it is, the classification model further determines the class of the malicious traffic
(Fast-flux or Domain-flux botnet).
Tests show that the method of the invention covers the detection of both Fast-flux and Domain-flux botnets and
achieves a higher detection accuracy.
Further, the pre-processing in step A includes performing a secondary filtering of the A-record response packet
traffic according to whitelists of FQDNs and IPs and, with the timestamp of the traffic as the ID, extracting
multiple fields from every A record, including the timestamp, source MAC address, destination MAC address,
source IP, destination IP, TTL value, source port and destination port.
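The secondary filtering and field extraction described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `ARecord` type, the `preprocess` function and the two extra fields `fqdn` and `resolved_ip` (the name and address carried in the A record) are assumptions introduced here, and the records are assumed to be already parsed from captured packets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ARecord:
    """One parsed A-record response (hypothetical container for the listed fields)."""
    timestamp: float
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str
    ttl: int
    src_port: int
    dst_port: int
    fqdn: str          # assumption: queried name carried in the response
    resolved_ip: str   # assumption: address carried in the A record


def preprocess(records, fqdn_whitelist, ip_whitelist):
    """Drop whitelisted records and index the remainder by timestamp (the ID)."""
    kept = {}
    for rec in records:
        name = rec.fqdn.lower().rstrip(".")  # normalize case and trailing dot
        if name in fqdn_whitelist or rec.resolved_ip in ip_whitelist:
            continue  # secondary filtering: known-good name or address
        kept[rec.timestamp] = rec
    return kept
```

Using the timestamp as the key follows the text; in practice two packets could share a timestamp, so a real system would likely need a tie-breaker.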
Further, when the graph components in the bipartite graph component sets are merged in step B, the component set
centered on FQDNs and the component set centered on IPs are each merged with its own corresponding method.
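Before any merging, the two component sets of step B can be pictured as two mappings built from the same (FQDN, IP) pairs. The sketch below is only an illustration under that assumption; the function name `build_component_sets` and the pair-list input are hypothetical.

```python
from collections import defaultdict


def build_component_sets(pairs):
    """Build FQDN-centered and IP-centered components from (fqdn, ip) pairs.

    Each key is a center node; its value is the set of neighbor nodes on the
    other side of the bipartite graph.
    """
    by_fqdn = defaultdict(set)
    by_ip = defaultdict(set)
    for fqdn, ip in pairs:
        by_fqdn[fqdn].add(ip)   # FQDN as center node, IPs as neighbors
        by_ip[ip].add(fqdn)     # IP as center node, FQDNs as neighbors
    return dict(by_fqdn), dict(by_ip)
```

A Fast-flux name would tend to show up as an FQDN-centered component with many IP neighbors, while many Domain-flux names resolving to one server would show up as an IP-centered component with many FQDN neighbors.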
Specifically, when graph components whose center nodes are FQDNs are merged, the difference degree DD between
similar domain names is first calculated according to the hierarchical structure of the FQDNs, and similar graph
components are then merged using the k-means clustering algorithm, where the difference degree DD between similar
domain names is calculated as:
where ω_λ is an intermediate value of the domain name difference calculation, λ is the level of the domain name,
X and Y each denote an FQDN, X_λ denotes level λ of FQDN X and Y_λ denotes level λ of FQDN Y (for example, the
first level of the FQDN www.baidu.com is com), |X_λ| denotes the length of X_λ, |Y_λ| denotes the length of Y_λ,
|X| denotes the number of levels of X, |Y| denotes the number of levels of Y, and α is a preset parameter
initialized to 2. The role of α is to balance the weights; its initial value is empirical and can be tuned later.
dd_λ and Ω are intermediate values of the calculation.
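The DD formula itself is not reproduced in this text, so the sketch below illustrates only the notation it relies on: the level X_λ of an FQDN, its length |X_λ| and the number of levels |X|. The helper names `domain_levels` and `level` are hypothetical.

```python
def domain_levels(fqdn):
    """Return the labels of an FQDN ordered from the top level down.

    Level 1 is the top-level domain, matching the www.baidu.com example
    in which the first level is "com".
    """
    return fqdn.lower().rstrip(".").split(".")[::-1]


def level(fqdn, lam):
    """Return level lam (1-based) of the FQDN, or "" if it has fewer levels."""
    levels = domain_levels(fqdn)
    return levels[lam - 1] if 1 <= lam <= len(levels) else ""
```

For www.baidu.com this gives three levels, with `level(x, 1)` equal to `"com"` and `len(level(x, 2))` equal to 5.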
Specifically, when graph components whose center nodes are IPs are merged, on the condition that an IP address
adjacent to the center node provides a similar service and that a specific time span is satisfied, the similarity
IS of the two IPs is calculated; if it reaches the threshold, the similar graph components are merged. The time
span refers to the time interval of data processing in actual implementation, with a typical initial value of
12 hours. The similarity IS of two IPs is calculated as:
In the above formula, X denotes the IP address of the center node of the graph component, Y denotes the adjacent
IP address, X_m denotes the numeric value of X, Y_m denotes the numeric value of Y, X_t denotes the timestamp of X,
Y_t denotes the timestamp of Y, α and β are preset parameters with initial values 1.8 and 0.2 respectively, and
λ denotes the class difference between the two IP addresses; for example, the class difference between a Class A
and a Class B address is 1, and between a Class A and a Class C address is 2.
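The IS formula is not reproduced in this text, but two of its inputs can be illustrated. The sketch below assumes that "class" means the classful IPv4 address classes A to E and that the "numeric value" of an address is its 32-bit integer form; both readings, and all function names, are assumptions made here.

```python
import ipaddress


def ip_class(ip):
    """Return the classful IPv4 class letter (A-E) from the first octet."""
    first_octet = int(ipaddress.IPv4Address(ip)) >> 24
    if first_octet < 128:
        return "A"
    if first_octet < 192:
        return "B"
    if first_octet < 224:
        return "C"
    if first_octet < 240:
        return "D"
    return "E"


def class_difference(x, y):
    """Class difference between two addresses: A vs B is 1, A vs C is 2."""
    return abs(ord(ip_class(x)) - ord(ip_class(y)))


def numeric_value(ip):
    """Assumed meaning of X_m and Y_m: the address as a 32-bit integer."""
    return int(ipaddress.IPv4Address(ip))
```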
On this basis, the graph component feature analysis in step C includes:
C1. Analyzing the structural features of the graph component: count the nodes in the graph component, including
FQDN nodes and IP nodes, and calculate the maximum degree and the average degree of all center nodes;
C2. Analyzing FQDN node features: using the information from the traffic pre-processing of step A and the public
data of the Whois database, calculate the Whois information of the FQDNs; Whois information is public information
about a domain name or IP that shows its basic related information;
C3. Analyzing IP node features: using the information from the traffic pre-processing of step A and the public
data of the Whois database, calculate the Whois information of the IP nodes;
C4. Analyzing edge features: nodes within a graph component are connected by edges, and each edge is one DNS query
response; the TTL information on the edges (Time To Live, the field that specifies the maximum number of network
segments an IP packet may pass through before a router discards it), including its mean and variance, is selected
as the edge feature;
C5. Calculating blacklist features: the blacklists include an FQDN blacklist and an IP blacklist. When the
blacklist features of a graph component are analyzed, published blacklist libraries are used to calculate the
number of flagged FQDNs in the graph component, the number of flagged second-level domain + top-level domain
(2-LD+TLD) names, the maximum degree of the flagged FQDN nodes, the number of flagged IP nodes, the maximum degree
of the flagged IP nodes, and the ratio of flagged nodes to total nodes.
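The structural features of C1 and the edge features of C4 can be sketched as follows. This is an illustration only: it assumes a component is given as a mapping from center nodes to their neighbor sets, that the TTL values of its edges are given as a list, and that "variance" means population variance; the function names are hypothetical.

```python
from statistics import mean, pvariance


def structure_features(center_to_neighbors):
    """C1 sketch: node counts plus maximum and average degree of center nodes."""
    neighbors = set()
    for nbrs in center_to_neighbors.values():
        neighbors |= nbrs
    degrees = [len(nbrs) for nbrs in center_to_neighbors.values()]
    return {
        "center_nodes": len(center_to_neighbors),
        "neighbor_nodes": len(neighbors),
        "total_nodes": len(center_to_neighbors) + len(neighbors),
        "max_degree": max(degrees, default=0),
        "avg_degree": mean(degrees) if degrees else 0.0,
    }


def ttl_features(ttls):
    """C4 sketch: mean and (population) variance of the TTLs on the edges."""
    if not ttls:
        return {"ttl_mean": 0.0, "ttl_var": 0.0}
    return {"ttl_mean": mean(ttls), "ttl_var": pvariance(ttls)}
```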
Further, the Whois information of the FQDNs in step C2 includes the creation time, number of updates and
completeness of the FQDNs, the maximum and average number of levels of the FQDNs, the number of kinds of top-level
domain (TLD), the number of kinds of second-level domain (2-LD), and the maximum length, average length, number of
contained words and character repetition degree of the second-level domain (2-LD) strings.
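Some of the items listed above can be derived from the name strings alone, without a Whois lookup. The sketch below covers only those; it assumes the 2-LD is simply the label next to the TLD (so multi-label public suffixes such as co.uk are not handled), and the function name `lexical_features` is hypothetical. Creation time, update count, completeness, word count and character repetition degree are left out because the text does not define how they are computed.

```python
from statistics import mean


def lexical_features(fqdns):
    """Level counts and TLD / 2-LD statistics for the FQDNs of one component."""
    if not fqdns:
        raise ValueError("at least one FQDN is required")
    split = [f.lower().rstrip(".").split(".") for f in fqdns]
    level_counts = [len(labels) for labels in split]
    tlds = {labels[-1] for labels in split}
    # Assumption: the 2-LD is the label immediately left of the TLD.
    slds = {labels[-2] for labels in split if len(labels) >= 2}
    sld_lengths = [len(s) for s in slds]
    return {
        "max_levels": max(level_counts),
        "avg_levels": mean(level_counts),
        "tld_kinds": len(tlds),
        "sld_kinds": len(slds),
        "sld_max_len": max(sld_lengths, default=0),
        "sld_avg_len": mean(sld_lengths) if sld_lengths else 0.0,
    }
```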
Further, the Whois information of the IP nodes in step C3 includes the status and update time of the IP nodes, the
number of countries to which they belong, the number of autonomous system numbers (ASNs) of the node IPs, and the
ratio of the number of ASNs to the number of IPs.
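The counting features among these can be sketched as below, assuming the Whois lookup has already been done and its result is available as a mapping from IP to a small record with `asn` and `country` keys. That input shape and the function name `ip_whois_features` are assumptions for illustration.

```python
def ip_whois_features(whois_by_ip):
    """Count distinct ASNs and countries and relate the ASN count to the IP count."""
    n_ips = len(whois_by_ip)
    asns = {info["asn"] for info in whois_by_ip.values()}
    countries = {info["country"] for info in whois_by_ip.values()}
    return {
        "ip_count": n_ips,
        "asn_count": len(asns),
        "country_count": len(countries),
        # Ratio of ASN count to IP count; 0.0 for an empty component.
        "asn_ip_ratio": len(asns) / n_ips if n_ips else 0.0,
    }
```

IPs of one component scattered over many ASNs and countries push `asn_ip_ratio` toward 1, which is the kind of dispersion these features are meant to expose.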
The beneficial effects of the botnet detection method based on a DNS mapping association graph according to the
present invention include:
1. It covers the detection of both Fast-flux and Domain-flux botnets.
2. Filtering the DNS traffic down to the response packets of A records greatly reduces the volume of data for
subsequent processing.
3. Constructing bipartite graph sets with FQDNs and IPs as center nodes provides a new approach to DNS traffic
processing.
4. Merging FQDN-centered and IP-centered components with different algorithms greatly reduces the graph component
dataset and also better matches the technical characteristics of Fast-flux and Domain-flux.
5. Feature analysis of the DNS mapping association graph greatly improves the accuracy of botnet detection and
is also suitable for processing the massive data of high-speed networks.
The above content of the invention is described in further detail below with reference to specific embodiments.
This should not, however, be understood as limiting the scope of the above subject matter of the invention to the
following examples. Various substitutions or changes made according to ordinary technical knowledge and customary
means in the art, without departing from the above technical idea of the invention, shall all fall within the
scope of the invention.
Specific embodiment
This embodiment uses CentOS, a Linux-based operating system distribution, version 7.6.1810.
As shown in Fig. 1, the botnet detection method based on a DNS mapping association graph according to the present
invention comprises:
A. DNS traffic filtering and pre-processing: on the egress devices of the network under test, such as switches and
routers, port mirroring is configured to direct the traffic to a specific network interface of a server on which
the PF_RING package is installed. If the data volume is large, PF_RING with Zero Copy can be used to capture
traffic at the 10 Gbps level. The DNS traffic is filtered according to BPF (Berkeley Packet Filter) rules, and
then only the response packet traffic that contains A records is kept (an A (Address) record specifies the IP
address corresponding to a host name or domain name).
The filtered response packet traffic is then pre-processed: a secondary filtering of the A-record response packet
traffic is performed according to whitelists of FQDNs and IPs and, with the timestamp of the traffic as the ID,
multiple fields are extracted from every A record, including the timestamp, source MAC address, destination MAC
address, source IP, destination IP, TTL value, source port and destination port.
B. Graph mapping association processing: for the pre-processed response packet traffic, according to the DNS query
responses, take the fully qualified domain name (FQDN) and the IP respectively as the key, extract the association
mappings between them, and construct two sets of bipartite graph components, one with FQDNs as center nodes and one
with IPs as center nodes; the graph components of the FQDN-centered set and of the IP-centered set are then each
merged with its own corresponding method.
When graph components whose center nodes are FQDNs are merged, the difference degree DD between similar domain
names is first calculated according to the hierarchical structure of the FQDNs, and similar graph components are
then merged using the k-means clustering algorithm, where the difference degree DD between similar domain names is
calculated as:
where ω_λ is an intermediate value of the domain name difference calculation, λ is the level of the domain name,
X and Y each denote an FQDN, X_λ denotes level λ of FQDN X, Y_λ denotes level λ of FQDN Y, |X_λ| denotes the length
of X_λ, |Y_λ| denotes the length of Y_λ, |X| denotes the number of levels of X, |Y| denotes the number of levels of
Y, and α is a preset parameter initialized to 2. The role of α is to balance the weights; its initial value is
empirical and can be tuned later. dd_λ and Ω are intermediate values of the calculation.
When graph components whose center nodes are IPs are merged, on the condition that an IP address adjacent to the
center node provides a similar service and that a specific time span is satisfied, the similarity IS of the two
IPs is calculated; if it reaches the threshold, the similar graph components are merged. The time span refers to
the time interval of data processing in actual implementation, with a typical initial value of 12 hours. The
similarity IS of two IPs is calculated as:
In the above formula, X denotes the IP address of the center node of the graph component, Y denotes the adjacent
IP address, X_m denotes the numeric value of X, Y_m denotes the numeric value of Y, X_t denotes the timestamp of X,
Y_t denotes the timestamp of Y, α and β are preset parameters with initial values 1.8 and 0.2 respectively, and
λ denotes the class difference between the two IP addresses; for example, the difference between Class A and
Class B is 1.
C. Graph component feature analysis and extraction: analyze the elements of the bipartite graph component sets and,
combined with the information obtained in pre-processing, extract graph feature vectors. The graph component
feature analysis includes:
C1. Analyzing the structural features of the graph component: count the nodes in the graph component, including
FQDN nodes and IP nodes, and calculate the maximum degree and the average degree of all center nodes;
C2. Analyzing FQDN node features: using the information from the traffic pre-processing of step A and the public
data of the Whois database, calculate the Whois information of the FQDNs, including the creation time, number of
updates and completeness of the FQDNs, the maximum and average number of levels of the FQDNs, the number of kinds
of TLD (top-level domain), the number of kinds of 2-LD (second-level domain), and the maximum length, average
length, number of contained words and character repetition degree of the 2-LD strings;
C3. Analyzing IP node features: using the information from the traffic pre-processing of step A and the public
data of the Whois database, calculate the Whois information of the IP nodes, including the status, update time,
country, registrant and region of the IP nodes, the number of ASNs (autonomous system numbers) of the node IPs,
and the ratio of the number of ASNs to the number of IPs;
C4. Analyzing edge features: nodes within a graph component are connected by edges, and each edge is one DNS query
response; the TTL information on the edges (Time To Live, the field that specifies the maximum number of network
segments an IP packet may pass through before a router discards it), including its mean and variance, is selected
as the edge feature;
C5. Calculating blacklist features: the blacklists include an FQDN blacklist and an IP blacklist. When the
blacklist features of a graph component are analyzed, published blacklist libraries are used to calculate the
number of flagged FQDNs in the graph component, the number of flagged 2-LD+TLD (second-level domain + top-level
domain) names, the maximum degree of the flagged FQDN nodes, the number of flagged IP nodes, the maximum degree of
the flagged IP nodes, and the ratio of flagged nodes to total nodes.
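The six blacklist features of C5 can be sketched as follows. This is an illustration under assumptions: the component is given as a mapping from FQDN to its set of resolved IPs, the blacklists are plain sets, and the 2-LD+TLD name is taken as the last two labels of the FQDN. The names `registered_domain` and `blacklist_features` are hypothetical.

```python
def registered_domain(fqdn):
    """Assumed 2-LD+TLD form: the last two labels of the FQDN."""
    return ".".join(fqdn.lower().rstrip(".").split(".")[-2:])


def blacklist_features(fqdn_to_ips, fqdn_blacklist, domain_blacklist, ip_blacklist):
    """Count flagged nodes, their maximum degrees, and the flagged-node ratio."""
    # Invert the mapping so that IP node degrees can be read off directly.
    ip_to_fqdns = {}
    for fqdn, ips in fqdn_to_ips.items():
        for ip in ips:
            ip_to_fqdns.setdefault(ip, set()).add(fqdn)

    bad_fqdns = [f for f in fqdn_to_ips if f in fqdn_blacklist]
    bad_domains = {registered_domain(f) for f in fqdn_to_ips} & set(domain_blacklist)
    bad_ips = [ip for ip in ip_to_fqdns if ip in ip_blacklist]
    total = len(fqdn_to_ips) + len(ip_to_fqdns)
    return {
        "flagged_fqdns": len(bad_fqdns),
        "flagged_domains": len(bad_domains),
        "flagged_fqdn_max_degree": max((len(fqdn_to_ips[f]) for f in bad_fqdns), default=0),
        "flagged_ips": len(bad_ips),
        "flagged_ip_max_degree": max((len(ip_to_fqdns[i]) for i in bad_ips), default=0),
        "flagged_ratio": (len(bad_fqdns) + len(bad_ips)) / total if total else 0.0,
    }
```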
D. Graph component classification: published Fast-flux and Domain-flux botnet datasets are taken as input data and
replayed in a laboratory environment with TCPReplay to build a mixed dataset that includes real traffic. The
public Fast-flux datasets are the pure Fast-flux malicious traffic in CTU-13 and the sample traffic of the Storm,
Waledac and Zeus botnets in ISOT. The public Domain-flux dataset is the ISOT HTTP Botnet dataset built by
Alenazi A et al. Steps A to C are executed, the data are standardized according to the extracted graph feature
vectors, the standardized data are divided into a training set and a test set, and a classification model is
obtained using the LightGBM algorithm.
E. Input the information of the traffic to be tested into the classification model, which determines whether the
traffic is malicious; if it is, the classification model further determines the class of the malicious traffic,
namely a Fast-flux or a Domain-flux botnet.
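The data preparation in step D can be sketched as below. The text does not say which standardization is used, so z-score scaling is assumed here; the 80/20 split ratio, the fixed seed and the function names are likewise assumptions. Model training is left out to keep the sketch free of external dependencies; in the method described, the resulting training set would be passed to a LightGBM classifier.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each feature column (assumed form of 'standardization')."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; use 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def split(rows, labels, test_ratio=0.2, seed=0):
    """Shuffle and divide the samples into a training set and a test set."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_ratio))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    def pick(ids):
        return [rows[i] for i in ids], [labels[i] for i in ids]

    return pick(train_idx), pick(test_idx)
```

The order follows the text (standardize first, then split). A common alternative is to fit the scaling on the training set only, so that no test-set statistics reach the model.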