CN115102714A

CN115102714A - Malicious domain name detection method and device based on dynamic evolution diagram

Info

Publication number: CN115102714A
Application number: CN202210538664.8A
Authority: CN
Inventors: 孙嘉伟; 印君男; 刘俊矫; 吴广君
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2022-05-17
Filing date: 2022-05-17
Publication date: 2022-09-23

Abstract

The invention discloses a malicious domain name detection method and device based on a dynamic evolution diagram. The method comprises the following steps: dividing DNS traffic into T different time snapshot windows to construct T domain name request graphs, wherein the k-hop neighborhood of each node is used as the receptive field of a graph convolution network; extracting features from the high-order local neighborhood of each node to calculate the intermediate representation of the node in each domain name request graph, and calculating the time span feature of each node in a time snapshot window t based on the receptive field; extracting node propagation evolution features from all domain name request graphs based on the intermediate representations and the time span features to obtain the final representation of each requested domain name; and classifying the final representations to obtain a malicious domain name detection result. The invention can establish association relations for malicious domain names that have just been registered or are sparsely associated, and can capture the propagation evolution of malicious domain names over time, thereby rapidly detecting newly generated malicious domain names with better robustness.

Description

Malicious domain name detection method and device based on dynamic evolution diagram

Technical Field

The invention belongs to the field of computer network security, relates to a malicious domain name detection technology, and particularly relates to a malicious domain name detection method and device based on a dynamic evolution diagram.

Background

The flexibility and accessibility of domain names have led a variety of malware to use DNS for different purposes: phishing, spam, command-and-control (C&C) communication, malware propagation, and the like. Graph-based methods have shown strong advantages in detecting malicious domain names, but existing graph detection systems all require a time window to accumulate a sufficient number of domain name accesses or resolution relationships before they can perform inference and classification. Moreover, existing graph detection methods for malicious domain names mainly target static graphs, while actual domain name graphs evolve over time. By using a dynamic evolution diagram, the method and device can capture the propagation evolution of a malicious domain name, so that it can be detected more quickly, at an early stage of malware propagation. One key challenge is that the temporal graph keeps growing, so memory becomes an issue over time. This is particularly important for graph neural networks, since a graph must fit in memory for prediction. Therefore, depending on the available resources, it may be necessary to cut off the history of vertices and edges that are no longer needed.

Existing malicious domain name detection techniques can be divided into the following two categories:

Feature-based methods typically use feature engineering and machine learning to detect DGA domain names. Pleiades, the DGA botnet detection technique of Antonakakis et al., clusters NXDomain responses by their statistical features (domain name length, access frequency, etc.), and then labels the data sets with a statistical multi-class model and detects C&C domain names with a hidden Markov model. This line of research relies on lexical and structural features of domain names and is dedicated to finding domain names used for C&C communication.

Behavior-based methods detect malicious domain names from network or host association behaviors, typically using periodic behavior features, group behavior features, co-occurrence of domain names, and the like. At ESORICS 2014, Manadhata et al. constructed a host-domain bipartite graph of domain name accesses from server logs and ran a belief propagation algorithm on it to predict the probability that a domain name is malicious. Issa Khalil et al. (AsiaCCS 2016, pp. 663-674) proposed using the co-occurrence between the IPs that domain names resolve to, starting from malicious domain name seeds and applying loose co-occurrence rules to infer the malicious probability of associated domain names.

Disclosure of Invention

The invention mainly aims to provide a malicious domain name detection method and device based on a dynamic evolution diagram, which are used for rapidly detecting a malicious domain name.

The technical content of the invention comprises:

A malicious domain name detection method based on a dynamic evolution diagram comprises the following steps:

dividing DNS traffic into T different time snapshot windows to construct T domain name request graphs, wherein the nodes of a domain name request graph are requested domain names, the edges are relations between requested domain names, and, for the neighborhood structure information of each node, the k-hop neighborhood of the node is used as the receptive field of a graph convolution network;

extracting features from the high-order local neighborhood of each node to calculate the intermediate representation of the node in each domain name request graph, and calculating the time span feature of each node in each domain name request graph in the time snapshot window t based on the receptive field;

extracting node propagation evolution features from all domain name request graphs based on the intermediate representations and the time span features to obtain the final representation of each requested domain name;

and classifying the final representations to obtain a malicious domain name detection result for each requested domain name.

Further, dividing the DNS traffic into T different time snapshot windows to construct T domain name request graphs includes:

parsing the DNS traffic to obtain request records, wherein each request record comprises a source client, a requested domain name and a request time;

calculating the time intervals between different domain name requests of each source client according to the request times, and assigning requested domain names whose time interval is larger than a first set threshold to different time snapshot windows t;

taking each requested domain name assigned to the time snapshot window t as a node of a domain name request graph;

in the time snapshot window t, if two domain names are accessed by the same source client or resolve to the same address, connecting the corresponding nodes with an edge, the weight of which is the number of shared source clients or the number of shared resolved addresses;

and constructing the domain name request graph from the nodes and the weighted edges.
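The graph construction for a single time snapshot window can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_request_graph` and the sample data are hypothetical, and only the shared-client rule is shown (the shared resolved address rule would follow the same pattern with addresses in place of clients).

```python
from itertools import combinations

def build_request_graph(window_requests):
    """Build one window's graph from (client, domain) pairs.

    Returns (nodes, edges); edges maps a sorted domain pair to the
    number of source clients that requested both domains.
    """
    clients_of = {}
    for client, domain in window_requests:
        clients_of.setdefault(domain, set()).add(client)
    edges = {}
    for d1, d2 in combinations(sorted(clients_of), 2):
        shared = clients_of[d1] & clients_of[d2]
        if shared:  # connect only domains with at least one common client
            edges[(d1, d2)] = len(shared)
    return set(clients_of), edges

# Hypothetical requests observed in a single time snapshot window.
window = [("c1", "a.example"), ("c1", "b.example"),
          ("c2", "a.example"), ("c2", "b.example"),
          ("c3", "c.example")]
nodes, edges = build_request_graph(window)
```

Here `a.example` and `b.example` are linked with weight 2 because two clients requested both, while `c.example` remains an isolated node.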

Further, when the DNS traffic is analyzed, the acquired request domain name is filtered through white list filtering, one-time domain name filtering, public service domain filtering, cloud server filtering and popular content delivery network filtering.
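A rough sketch of such filtering is given below. It rests on two assumptions that the text does not spell out: the white list, public service, cloud server and content delivery network lists are treated as one set of excluded domain suffixes, and a "one-time" domain name is taken to mean a domain queried fewer than `min_queries` times in the data. All names and sample records are hypothetical.

```python
from collections import Counter

def matches_suffix(domain, suffixes):
    """True if domain equals a listed suffix or is a subdomain of one."""
    return any(domain == s or domain.endswith("." + s) for s in suffixes)

def filter_noise(records, excluded_suffixes, min_queries=2):
    """Drop records for excluded suffixes and for rarely queried domains."""
    counts = Counter(domain for _, domain, _ in records)
    return [
        (src, domain, time)
        for src, domain, time in records
        if counts[domain] >= min_queries
        and not matches_suffix(domain, excluded_suffixes)
    ]

# Hypothetical (src, domain, time) records and exclusion list.
records = [
    ("c1", "www.example.com", 1), ("c2", "www.example.com", 2),
    ("c1", "cdn.example.net", 3), ("c2", "cdn.example.net", 4),
    ("c1", "xyz.test", 5),
]
kept = filter_noise(records, excluded_suffixes={"example.net"})
```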

Further, constructing the T domain name request graphs from the DNS traffic further includes:

calculating the client sharing degree, the number of clients and the number of queries corresponding to each edge;

cutting an edge if its client sharing degree is smaller than a second set threshold, its number of clients is smaller than a third set threshold, or its number of queries is smaller than a fourth set threshold, wherein increasing thresholds and iterative cutting are used in the cutting process;

and, after all edges in the domain name request graph have been processed, taking the pruned graph as the domain name request graph.
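The pruning rule can be illustrated with the sketch below. The three thresholds come from the text; the stopping rule (raise the client sharing threshold step by step until the graph fits an edge budget `max_edges`) is an assumption made only for this illustration, since the text does not state when the iteration ends. Function names, default values and the sample edges are hypothetical.

```python
def prune_edges(edges, csr_min, clients_min, queries_min):
    """Keep only edges that meet all three thresholds."""
    return {
        pair: s for pair, s in edges.items()
        if s["csr"] >= csr_min
        and s["clients"] >= clients_min
        and s["queries"] >= queries_min
    }

def iterative_prune(edges, max_edges, csr_start=0.1, csr_step=0.1,
                    clients_min=2, queries_min=2):
    """Prune repeatedly with an increasing client sharing threshold.

    Assumed stopping rule: stop once at most max_edges edges remain.
    """
    csr_min = csr_start
    kept = prune_edges(edges, csr_min, clients_min, queries_min)
    while len(kept) > max_edges and csr_min < 1.0:
        csr_min += csr_step
        kept = prune_edges(kept, csr_min, clients_min, queries_min)
    return kept, csr_min

# Hypothetical per-edge statistics.
edges = {
    ("a", "b"): {"csr": 0.9, "clients": 5, "queries": 10},
    ("a", "c"): {"csr": 0.25, "clients": 3, "queries": 4},
    ("b", "d"): {"csr": 0.05, "clients": 1, "queries": 1},
    ("c", "d"): {"csr": 0.6, "clients": 4, "queries": 8},
}
kept, final_threshold = iterative_prune(edges, max_edges=2)
```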

Further, extracting features from the higher-order local neighborhood of each node to compute the intermediate representation of the node in each domain name request graph includes:

constructing a feature extraction network by using a plurality of stacked graph convolution networks;

establishing a domain name request graph set;

and inputting the domain name request graph set into the feature extraction network to obtain the intermediate representation of the node in each domain name request graph.
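A toy forward pass through stacked graph convolution layers is sketched below in plain Python. The text only states that several graph convolution networks are stacked; the specific propagation rule used here (symmetric normalization of the adjacency matrix with self-loops, followed by ReLU) is the commonly used GCN formulation and is an assumption, as are the weights and features, which would be learned or measured in practice.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for adjacency matrix A."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def gcn_layer(a_norm, h, w):
    """One layer: ReLU(A_norm @ H @ W)."""
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, v) for v in row] for row in z]

def stacked_gcn(adj, features, weights):
    """Apply one GCN layer per weight matrix; k layers see k hops."""
    a_norm = normalized_adjacency(adj)
    h = features
    for w in weights:
        h = gcn_layer(a_norm, h, w)
    return h

# Hypothetical 3-node path graph with 2 input features per node.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[[0.5, -0.2], [0.1, 0.3]], [[1.0], [1.0]]]
embeddings = stacked_gcn(adj, features, weights)
```

With two stacked layers, each node's output depends on nodes up to two hops away, which matches the idea of a k-hop receptive field.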

Further, extracting propagation evolution features from the intermediate representations of each node in all domain name request graphs to obtain the final representation of each requested domain name includes:

constructing a representation sequence of the node across the different time snapshot windows t based on the intermediate representation of the node in each domain name request graph;

and inputting the representation sequence into a recurrent neural network to obtain the final representation of each requested domain name.

Further, calculating the time span feature of each node in each domain name request graph in the time snapshot window t based on the receptive field includes:

calculating the temporal distance distribution of the k hops in the time snapshot window t from the time distribution of the k hops, as the longest time minus the shortest time over the vertex set of the nodes within k hops;

and taking the temporal distance distribution as the time span feature of the corresponding node in the time snapshot window t.
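One way to read this step is sketched below: collect the nodes within k hops of a node by breadth-first search, then subtract the earliest timestamp in that set from the latest. The per-node timestamp (for example, the first request time of the domain in the window) and the choice to include the node itself in its own neighborhood are assumptions made for illustration; the function names and data are hypothetical.

```python
from collections import deque

def k_hop_neighborhood(adjacency, u, k):
    """Return the set of nodes within k hops of u (including u)."""
    seen = {u}
    frontier = deque([(u, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue  # do not expand beyond k hops
        for nb in adjacency.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen

def time_span(adjacency, timestamps, u, k):
    """Longest time minus shortest time over the k-hop neighborhood."""
    times = [timestamps[v] for v in k_hop_neighborhood(adjacency, u, k)]
    return max(times) - min(times)

# Hypothetical path graph a-b-c-d with one timestamp (seconds) per node.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
timestamps = {"a": 100, "b": 130, "c": 400, "d": 900}
spans = [time_span(adjacency, timestamps, "a", k) for k in (1, 2, 3)]
```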

Further, the inputting the representation sequence into a recurrent neural network to obtain a final representation of each requested domain name includes:

capturing sequence information of a time snapshot window t by using a position embedding method;

combining the sequence information of the time snapshot window t with the neighborhood information of the domain name request graph to obtain a node representation sequence containing the sequence information;

and obtaining the final representation of each request domain name based on the node representation sequence containing the sequence information.
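The sketch below illustrates the idea with a gated recurrent unit (GRU), the recurrent network named elsewhere in this description. Two points are assumptions rather than statements from the text: the position embedding is combined with the node representation by element-wise addition, and the GRU uses the standard update gate, reset gate and candidate state equations. The weights are random placeholders; in practice they would be learned.

```python
import math
import random

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One GRU update; p holds W*, U*, b* for gates z, r and candidate c."""
    def gate(name, h_in, act):
        pre = [a + b + c for a, b, c in zip(matvec(p["W" + name], x),
                                            matvec(p["U" + name], h_in),
                                            p["b" + name])]
        return [act(v) for v in pre]
    z = gate("z", h, sigmoid)  # update gate
    r = gate("r", h, sigmoid)  # reset gate
    c = gate("c", [ri * hi for ri, hi in zip(r, h)], math.tanh)
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

def final_representation(sequence, positions, params, hidden_size):
    """Add position embeddings (assumed) and run the GRU over T windows."""
    h = [0.0] * hidden_size
    for x, pos in zip(sequence, positions):
        h = gru_step([a + b for a, b in zip(x, pos)], h, params)
    return h

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]

rng = random.Random(0)
T, d, hdim = 3, 2, 4  # windows, input size, hidden size (hypothetical)
params = {}
for g in "zrc":
    params["W" + g] = random_matrix(hdim, d, rng)
    params["U" + g] = random_matrix(hdim, hdim, rng)
    params["b" + g] = [0.0] * hdim
sequence = [[0.1, 0.2], [0.4, 0.1], [0.9, 0.8]]  # one node, T windows
positions = random_matrix(T, d, rng)             # placeholder embeddings
h_final = final_representation(sequence, positions, params, hdim)
```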

Further, the final representation is classified using a SOFTMAX classifier.
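For illustration, a softmax classification of a final node representation could look like the following sketch. The two-class label set, the linear layer in front of the softmax and the weight values are assumptions; the text only names the SOFTMAX classifier.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(representation, weights, biases,
             labels=("benign", "malicious")):
    """Linear layer followed by softmax; returns (label, probabilities)."""
    logits = [sum(w * x for w, x in zip(row, representation)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    return labels[probs.index(max(probs))], probs

# Hypothetical final representation and classifier parameters.
label, probs = classify([1.0, 2.0],
                        weights=[[0.5, -0.5], [-0.5, 0.5]],
                        biases=[0.0, 0.0])
```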

A storage medium having a computer program stored therein, wherein the computer program is arranged to perform any of the above methods when executed.

An electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform any of the methods described above.

Compared with the prior art, the invention has the beneficial effects that:

1. The method can quickly detect newly generated malicious domain names while achieving detection performance on par with other detection systems for malicious domain name detection over conventional time ranges; in particular, the dynamic evolution diagram is adapted to the memory requirements of large-scale graphs.

2. The propagation evolution of a malicious domain name over time can be captured, so that the malicious domain name can be detected more quickly, at an early stage of malware propagation.

3. Through unsupervised learning, more robust malicious domain name detection can be carried out in the two dimensions of structural neighborhood and temporal neighborhood, making detection harder for an adversary to bypass.

4. The domain name time evolution graph captures the propagation evolution of a newly generated malicious domain name in the two dimensions of structural neighborhood and temporal neighborhood, using not only the local association structure of each domain name node within a time snapshot but also the temporal evolution of node edges as global change information. Sub-graph features of specific shapes within the time snapshots, together with the evolution of the sub-graphs across snapshots, distinguish the propagation evolution of malicious domain name nodes well. This addresses the problem that a newly registered malicious domain name node, just after registration or at the start of propagation, has few associations with suspicious domain names, is accessed sparsely, and lacks the useful information needed to establish effective associations.

5. The temporal graph keeps growing, so memory becomes an issue over time. This is particularly important for graph neural networks, since a graph must fit in memory for prediction. Depending on the available resources, the method can therefore cut off the history of vertices and edges that are no longer needed.

Drawings

FIG. 1 is a flowchart of an extensible malicious domain name detection method using a domain name node dynamic evolution diagram.

FIG. 2 is a schematic diagram of domain name map clipping.

FIG. 3 is a schematic diagram of domain name structure neighborhood information and time evolution feature modules according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the embodiments. Obviously, the described embodiments are only some specific embodiments of the present invention, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments of the invention without creative effort fall within the protection scope of the invention.

An extensible malicious domain name detection method utilizing a domain name node dynamic evolution diagram comprises the following steps:

(1) Parse the DNS traffic to obtain request information, which comprises the source client address, the domain name requested for resolution and the request time.

(2) Based on the request information parsed in step (1), use the request time to cut the DNS traffic into a plurality of time snapshot windows t by time interval.

(3) Based on the time snapshot windows t from step (2), represent domain names as nodes of a graph and the relationship between two domain names as an edge, and generate the domain name graph structure for each time snapshot window t from the request sequence relationships between domain names. The key feature is that the DNS query activity of a time snapshot window is accumulated on one graph.

(4) The size of the time snapshot window t determines the size of the graph. The invention uses the time difference distribution in the K-hop neighborhood of each node as the K-neighborhood time difference distribution metric. For the neighborhood structure information of each node, the invention uses the K-hop neighborhood of the node as the receptive field of the GCN, so that the features of all nodes within the K-hop neighborhood are contained in the neighborhood structure representation of the node.

(5) Based on the domain name graph structure from step (3), cut edges with a low degree of association to obtain a new domain name graph structure, so that the resources required by large-scale graph computation fit the limited computing resources available. This is accomplished by cutting edges whose domain name request association (e.g., client sharing ratio, CSR), number of clients, or number of queries is less than the corresponding threshold. Incremental thresholds and iterative cuts are used in the cutting process. The degree of sequential association is determined by the Client Sharing Ratio (CSR) between the connected domain names,

where $v_i$ and $v_j$ are two domain names, and the two client sets are, respectively, the sets of clients that access these two domain names.

(6) Based on the new domain name graph structure from step (5), extract the domain name components, capture the local structural attributes within each snapshot using a GCN graph neural network, and represent each domain name node by embedding its neighbor nodes while preserving the association relations of domain name requests.

(7) Based on the node embeddings from step (6) for the time snapshot windows t, represent the node neighborhood of each domain name node in the different time snapshot windows, expressing the change of the graph structure over a plurality of time steps.

(8) Based on the embedded representations of the domain name nodes in the multiple time snapshot windows t output in step (7), use a recurrent neural network (GRU) to capture the dynamics and evolutionary features behind the sequence of time snapshots.

(9) Based on the final dynamic representation of the domain name nodes from step (8), classify the domain name nodes using SOFTMAX, train a malicious domain name classifier, and detect malicious domain names.

Further, regarding the receptive field of the GCN in (4), covering all vertices in the graph is the ideal state, since the invention can then obtain the most comprehensive and complete structural neighborhood information of a node, but a larger window size results in a larger memory requirement. To ensure comparability between historical time snapshots with different numbers of data nodes, the invention defines a temporal measure that gives the distribution of temporal distances within the receptive field of the GCN in a time snapshot. This temporal measure is used as the time span feature of the node in the time window and also serves as a feature input in (8).

Here the time distribution of the k hops is taken in time snapshot $g$, where $g$ is the current time snapshot; $N^k(u)$ is the vertex set of the nodes within $k$ hops of $u$, with the $k$-hop neighborhood defined in (4); and, using $ts_{\min}(u)$ and $ts_{\min}(v)$, the shortest time is subtracted from the longest time of the k-hop time distribution.

Further, in (5), when an edge is removed, the nodes it connected lose that connection, and the nodes separated in this way form connected components. Domain name associations with a high CSR include high CSR among all members of a domain group, such as the domains used by malware, as well as both server-driven high CSR (such as cross-domain access in web content and domain names used for load balancing and content distribution) and client-driven high CSR (such as domain names that programs connect to periodically). Memory is especially important for graph neural networks because a graph must fit into memory for prediction. Therefore, depending on the available resources, it may be necessary to cut off the history of vertices and edges that are no longer needed. The invention cuts off the history of unnecessary vertices and their edges according to the client sharing degree.

Further, in (5), if the CSR of the edge connecting two domain names is lower than that of the other edges, the domain names linked by that edge are considered not to be used or accessed by the same set of clients. Incremental thresholds and iterative cuts are used in the cutting process because each malicious or benign domain name component has a different edge cutting threshold.
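As a concrete illustration of a client sharing ratio, the sketch below computes the number of clients that accessed both domain names divided by the number of clients that accessed either one. The CSR formula itself is not reproduced in this text, so this Jaccard-style ratio is an assumption chosen to match the description of the two client sets; the function name and sample data are hypothetical.

```python
def client_sharing_ratio(clients_i, clients_j):
    """Assumed CSR: shared clients divided by the union of both client sets."""
    union = clients_i | clients_j
    if not union:
        return 0.0  # neither domain was accessed by any client
    return len(clients_i & clients_j) / len(union)

# Hypothetical client sets for two connected domain names.
csr = client_sharing_ratio({"c1", "c2", "c3"}, {"c2", "c3", "c4"})
```

Under this assumed definition, two domains accessed by exactly the same clients have a CSR of 1, and domains with no common client have a CSR of 0.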

Further, the output of (7) comprises a final dynamic node representation spanning a plurality of time steps. The node representation can not only retain local structural information around the node, but can also capture the evolving characteristics of the structure.

In particular, the invention mainly solves the problem of node classification on a dynamic domain name graph across a plurality of time snapshot windows. The difficulty is that a newly registered malicious domain name node, just after registration or at the start of propagation, has few associations with suspicious domain names and is accessed sparsely, so effective associations cannot be established for lack of useful information. The solution of the present invention, as shown in FIG. 1, is as follows:

(1) Parsing and preprocessing of the DNS request traffic is implemented as follows:

a) For all DNS request data, parse request records (src, domain, time) from the request stream, where src is the source client of the request, domain is the requested domain name, and time is the request time.

b) Aggregate the DNS requests from the same source client to form the domain name request data sets $D_i$ of the source clients, where each element of $D_i$ represents a domain name requested by source client $s_i$ at a given request time, and $n_i$ is the total number of requested domain names for source client $i$.

(2) Cutting the DNS requests into time snapshot windows, as shown in FIG. 2, is implemented as follows:

a) Cut the domain name request set $D_i$ of source client $s_i$ by time interval: domain name requests of the same source client whose time interval $m$ is greater than a threshold $M$ are divided into different time snapshot windows t.

b) Merge the domain name requests of all source clients that belong to the same time slice into the same time window snapshot, forming the domain name request set $U = \{u_1, u_2, \ldots, u_n\}$ of a time snapshot, where $u_i = \{d_1, d_2, \ldots, d_{n_i}\}$ is the set of $n_i$ domain names requested by the $i$-th source client, and $n$ is the number of source clients in the time snapshot t.
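The per-client cutting in a) can be sketched as below: one client's time-sorted requests are split wherever the gap between consecutive requests exceeds the threshold. How segments from different clients are aligned to a common time slice in b) is not specified here, so the sketch stops at the per-client segments and the resulting domain sets. The function name, the 600-second threshold and the data are hypothetical.

```python
def split_by_gap(requests, max_gap):
    """Split time-sorted (time, domain) requests where the gap exceeds max_gap."""
    segments = []
    for time, domain in requests:
        # Compare with the time of the last request in the current segment.
        if segments and time - segments[-1][-1][0] <= max_gap:
            segments[-1].append((time, domain))
        else:
            segments.append([(time, domain)])
    return segments

# Hypothetical requests of one source client (time in seconds).
requests = [(0, "a.example"), (30, "b.example"),
            (700, "c.example"), (720, "a.example")]
segments = split_by_gap(requests, max_gap=600)
domain_sets = [{d for _, d in seg} for seg in segments]
```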

(3) Constructing a domain name request graph of a time snapshot window t, wherein the specific implementation mode is as follows:

a) Using a graph representation, domain names are represented as nodes and the relationship between two domain names as an edge. If two domain names are accessed by the same source client or resolve to the same address within the same time snapshot, an edge connects the corresponding nodes in the graph.

b) The construction of the graph aggregates DNS query information on DNS request flows in a cumulative manner.

c) The weight of an edge represents the number of request source clients or the number of resolved addresses that are the same for both domain names.

(4) Pruning of the domain name graph is implemented as follows:

a) During pruning, delete edges whose client sharing ratio (CSR) is low or whose number of clients or number of queries is below the respective threshold.

b) Incremental thresholds and iterative cuts are used in the cutting process.

(5) The domain name structure neighborhood information is extracted, and the specific implementation mode is as follows:

a) Domain name structure neighborhood information extraction extracts features from the higher-order local neighborhood of each node to compute the intermediate node representation for each snapshot.

b) The module consists of a plurality of stacked graph convolution networks (GCN) that extract features from the subgraphs of domain name nodes in different time windows. The input is a set of domain name node graphs $U_m = \{U_1, U_2, \ldots, U_m\}$, $m \in T$, and the output is the node representation for each time snapshot window. In time snapshot window t, this representation encodes the current local structural neighborhood of domain name node v.

(6) Domain name propagation evolution feature extraction, as shown in fig. 3, the specific implementation is as follows:

a) To further capture the propagation evolution pattern of malicious domain names across the time snapshots, the invention uses a recurrent neural network to extract propagation evolution features from the intermediate representations of the domain name nodes obtained in the previous step.

b) The input to this layer is the sequence of representations of a particular node v in the different time windows. For each node v, the input is this sequence over the T time snapshot windows, where T is the total number of time snapshot windows; the output of the layer is a new representation sequence in which node v has a context representation for each time snapshot window.

c) Capture the order of the snapshots in the time snapshot evolution module using position embeddings $\{p^1, \ldots, p^T\}$, and combine this snapshot window sequence information with the structural neighborhood information to obtain a node representation sequence that contains the sequence information.

d) Obtain the final representation of the node using the recurrent neural network.

(7) The specific implementation mode of the malicious domain name detection is as follows:

a) Label the malicious domain names in the data set using a domain name blacklist to construct a training set, and train a SOFTMAX classifier on it to classify malicious domain name data.

b) Use the trained neural network classifier on the domain name data of the test set to detect the malicious domain names in it.

The extensible malicious domain name detection system based on the domain name dynamic time evolution diagram is implemented according to the steps of the method above, with C and Python as the back-end development languages and CSV as the data storage format.

The system consists of a noise preprocessing module, a domain name time snapshot constructing module, a domain name structure neighborhood information extracting module, a domain name time evolution module and a domain name classification training and detecting module, and is described in detail as follows:

(1) Noise preprocessing module. The module parses DNS traffic and, to improve detection performance, filters domain names by white list filtering and one-time domain name filtering, and filters out public service domains, cloud servers and popular content delivery networks.

(2) Domain name time snapshot graph construction module. The module divides the parsed domain name request sequence into multiple time windows by time interval, and iteratively accumulates the domain name request sequence data of each time window into a domain name request graph.

(3) Domain name graph pruning module. The module cuts the edges between domain name nodes with a weak association (low client sharing degree) in the domain name graph, using incremental thresholds and iterative cutting.

(4) Domain name structure neighborhood information extraction module. The module computes the structural neighborhood representation of each domain name node within a single time window from its structural neighborhood nodes, using the stacked GCN.

(5) Domain name time evolution module. The module uses a recurrent neural network (GRU) to capture the propagation evolution of the graph structure over multiple time windows.

(6) Domain name classification training and detection module. The module is divided into a training module and a detection module. The training module uses the propagation evolution graphs of labeled malicious domain name nodes in the DNS request stream as graph evolution training data to train a malicious domain name classifier; the detection module uses the trained classifier to classify unknown domain names.

In summary, the present invention uses not only the local association structure of each domain name node within a time snapshot but also the temporal evolution of node edges as global change information. Sub-graph features of specific shapes within the time snapshots, together with the evolution of the sub-graphs across snapshots, distinguish the propagation evolution of malicious domain name nodes well. To address the problem that a newly registered malicious domain name node, just after registration or at the start of propagation, has few associations with suspicious domain names, is accessed sparsely, and lacks the useful information needed to establish effective associations, the method computes node features along the two dimensions of structural neighborhood and temporal neighborhood, combined with the node attribute features of the domain names, and learns latent high-order features of the domain names across the different time snapshots. The evolution embeddings of the nodes are then classified using the labeled malicious domain names to detect malicious domain names.

Examples

This example uses two days of DNS traffic data from a large ISP network. The DNS data was collected on the ISP's recursive servers on December 7 and December 8, 2019. The data volume is 1.5 TB per day and 8 billion records per hour, containing various types of DNS records. Due to limitations of storage space and computational resources, the data is sampled at a ratio of 1/10 every hour, for a total of 4 samples. The main parameters of the proposed system are the time snapshot cutting threshold and some hyper-parameters of the GCN and RNN. The hyper-parameters are tuned on the validation set, and test results are reported under the parameters that perform best on the validation set. The time snapshot cutting threshold can be set to 10 minutes, 30 minutes or 60 minutes. The experiment is evaluated using a standard 5-fold cross-validation protocol: the training data set is randomly and evenly divided into 5 parts, of which 3 are used as the training set, 1 as the validation set, and the remaining 1 as the test set, on which the detection result is obtained. This is repeated 5 times, and the average over the 5 runs is taken as the final experimental result for each set of parameters.
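The 3/1/1 split described above could be produced as in the following sketch. The rotation scheme (fold r is the test set, the next fold is the validation set, the other three are the training set) is an assumption, since the text only gives the proportions and the number of repetitions; the function name and seed are hypothetical.

```python
import random

def five_fold_splits(items, seed=0):
    """Yield 5 (train, validation, test) splits with a 3/1/1 fold ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::5] for i in range(5)]
    for r in range(5):
        val_index = (r + 1) % 5
        train = [x for i in range(5) if i not in (r, val_index)
                 for x in folds[i]]
        yield train, folds[val_index], folds[r]

# Hypothetical data set of 20 labeled samples, identified by index.
splits = list(five_fold_splits(range(20)))
```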

Using 5-fold cross-validation, the experiment compares the results of the proposed model with those of the GAT and NODE2VEC algorithms on the training set data under these parameter settings. As shown in Table 1, the model of the present invention achieves on average a true positive rate of 96.72% and an accuracy of 96.94%. NODE2VEC achieves on average a true positive rate of 91.3% and an accuracy of 93.9%, and FANCI achieves on average a true positive rate of 91% and an accuracy of 93.9%. Malicious domain names can therefore be detected effectively with the classification detection method of this model.

TABLE 1 test results
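For reference, the two reported metrics follow the usual definitions: true positive rate is TP / (TP + FN) and accuracy is (TP + TN) divided by the total number of samples. The sketch below computes both from label lists, with 1 for malicious and 0 for benign; the labels shown are made up for illustration and are not the experimental data.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, 1 = malicious."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def true_positive_rate(y_true, y_pred):
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
tpr = true_positive_rate(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```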

Finally, it should be noted that: the described embodiments are only some, but not all embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.

Claims

1. A malicious domain name detection method based on a dynamic evolution diagram comprises the following steps:

2. The method of claim 1, wherein the partitioning of DNS traffic into different time snapshot windows T to construct T domain name request graphs comprises:

3. The method of claim 2, wherein the obtained request domain name is filtered by whitelist filtering, one-time domain name filtering, public service domain filtering, cloud server filtering, and popular content delivery network filtering while resolving DNS traffic.

4. The method of claim 2, wherein said constructing T domain name request graphs based on DNS traffic further comprises:

calculating the corresponding client sharing degree, the number of clients and the number of queries of each edge;

5. The method of claim 1, wherein extracting features from a higher-order local neighborhood of each node to compute an intermediate representation of the node in each domain name request graph comprises:

establishing a domain name request graph set;

and inputting the domain name request graph set into the feature extraction network to obtain the intermediate representation of the nodes in each domain name request graph.

6. The method according to claim 1, wherein the extracting the propagation evolution feature from the intermediate representation of each node in all domain name request graphs to obtain the final representation of each request domain name includes:

7. The method of claim 1, wherein calculating the time span characteristics of each node in each domain name request graph in the time snapshot window t based on the receiving field comprises:

8. The method of claim 1, wherein inputting the sequence of representations into a recurrent neural network to obtain a final representation for each requested domain name comprises:

and obtaining a final representation of each request domain name based on the node representation sequence containing the sequence information.

9. The method of claim 1, wherein the method of classifying the final representation comprises: a SOFTMAX classifier is used.

10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-9.