CN114422211B

CN114422211B - HTTP malicious traffic detection method and device based on graph attention network

Info

Publication number: CN114422211B
Application number: CN202111653905.5A
Authority: CN
Inventors: 芦斌; 翟懿; 刘龙; 吴魏; 费金龙; 郭茂华; 潘雁; 孟轶同
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2023-07-18
Anticipated expiration: 2041-12-30
Also published as: CN114422211A

Abstract

The invention belongs to the technical field of network communication security, and particularly relates to a method and a device for detecting HTTP malicious traffic based on a graph attention network. The method comprises: constructing a host-level communication behavior graph based on header field information in HTTP data packets, wherein the communication behavior graph is expressed as G= (N, E), N is a node, and E is an edge; and classifying the communication behavior graph by using a graph attention network-based detection model, which includes 3 improved graph attention layers, a graph pooling layer and a fully connected layer. The communication behavior graph constructed by the invention describes the macroscopic network behavior of malicious code during communication, and the detection model uses an improved graph attention layer so that the graph attention network can process node features and edge features at the same time, using the combined features to detect malicious traffic accurately.

Description

HTTP malicious traffic detection method and device based on graph attention network

Technical Field

The invention belongs to the technical field of network communication security, and particularly relates to a method and a device for detecting HTTP malicious traffic based on a graph attention network.

Background

A network session flow captures the complete interaction between the two communicating parties and contains rich traffic information, so most current malicious traffic detection methods take the session flow as the detection object. However, the network behavior of malicious code shows that infection and propagation often involve multiple interactions at different stages and span more than one session flow, and that some network communication which looks normal is an essential link in the attack chain. The dependency relationships between traffic therefore need to be analyzed from a macroscopic perspective in order to mine the information contained in the communication patterns of malicious code. For this reason the invention shifts the research perspective, takes all communication traffic of an infected host as the detection object, and realizes a host-level malicious traffic detection method.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides an HTTP malicious traffic detection method and device based on a graph attention network. The constructed communication behavior graph describes the macroscopic network behavior of malicious code during communication, and the graph attention network-based detection model uses an improved graph attention layer so that the graph attention network can process node features and edge features at the same time, using the combined features to detect malicious traffic accurately.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

the invention provides a method for detecting HTTP malicious traffic based on a graph attention network, which comprises the following steps:

constructing a host-level communication behavior diagram based on header field information in an HTTP data packet, wherein the communication behavior diagram is expressed as G= (N, E), N is a node, and E is an edge;

the communication behavior graph is classified by using a graph attention network-based detection model, which includes 3 improved graph attention layers, a graph pooling layer and a full connection layer.

Further, constructing the host-level communication behavior graph includes: first extracting HTTP request-response pairs from the original PCAP file, then generating the nodes of the communication behavior graph according to the node definition, and generating the edges of the communication behavior graph according to the edge generation rules.

Further, in the communication behavior diagram, a node in the diagram represents an HTTP request-response process, and the node is defined as follows:

N＝(time,content,node_feature)

where time represents a time stamp of an HTTP request packet, content represents node content for generating an edge connection relationship, and node_feature represents a node feature.

Further, the content comprises the URL, Referer and Host information obtained from the corresponding fields of the HTTP request data packet, and the Location information obtained from the corresponding field of the HTTP response data packet;

the node_feature comprises a session stream statistics feature, a URL feature, a request feature and a response feature;

the session flow statistical characteristics refer to the statistical characteristics of session flows where the HTTP request-response process expressed by the current node is located; the URL features comprise URL length, special character proportion in the URL, whether the URL comprises an IP address and a URL entropy value; the request features comprise a request mode and whether Cookies are contained or not; the response characteristics include a response status code, a request resource type, and a request resource size.

Further, in the communication behavior diagram, the edges in the diagram represent the connection relationship between the nodes, and the edges are defined as follows:

E＝(N_s, N_d, edge_feature)

where N_s denotes the start node of the directed edge, N_d denotes the end node of the directed edge, and edge_feature denotes the edge features.

Further, the rule of edge generation of the communication behavior graph is as follows:

if the Location content of the node i is the same as the URL content of the node j, and the time stamp of the node i is smaller than that of the node j, an edge of the Location type pointing to the node j from the node i exists;

if the URL content of node i is the same as the Referer content of node j, and the timestamp of node i is smaller than that of node j, there is an edge of the Referer type pointing from node i to node j.

Further, the edge features include the edge type, the edge length, and the edge connection host difference. The edge type covers the two kinds of edges produced by the edge generation rules, namely Location edges and Referer edges: for a Location edge the edge type feature value is 1, and for a Referer edge it is 0. The edge length is the difference between the timestamps of the end node and the start node connected by the edge. The edge connection host difference indicates whether the Host content of the two nodes connected by the edge is the same: if it is the same, the feature value is 1, and otherwise it is 0.

Further, the detection model based on the graph attention network extracts node feature vectors of the communication behavior graph by using 3 improved graph attention layers, all the node feature vectors subjected to multiple iterations are subjected to aggregation and splicing through a graph pooling layer to obtain global representation of the graph, and finally the aggregated feature vectors are mapped to corresponding types by utilizing the nonlinear fitting capacity of a full connection layer.

Further, the improved graph attention layer attends not only to the features of the node itself and of the nodes connected to it, but also to the features of the edges.

The invention also provides an HTTP malicious traffic detection device based on the graph attention network, which comprises:

the communication behavior diagram construction module is used for constructing a host-level communication behavior diagram based on header field information in the HTTP data packet, wherein the communication behavior diagram is expressed as G= (N, E), N is a node, and E is an edge;

the malicious traffic detection module is used for classifying the communication behavior graph by using a graph attention network-based detection model, wherein the graph attention network-based detection model comprises 3 improved graph attention layers and a full connection layer.

Compared with the prior art, the invention has the following advantages:

In the HTTP malicious traffic detection method based on the graph attention network, a communication behavior graph is first constructed from Web sessions, based on the correlation between HTTP traffic, to describe the macroscopic network behavior of malicious code during communication. The feature dimensions of the communication behavior graph are enriched by combining four kinds of node features (session flow statistics features, URL features, request features and response features) with three kinds of edge features (edge type, edge length and edge connection host difference). A detection model based on the graph attention network, EGATM, is then provided. The model uses an improved graph attention layer so that the graph attention network can process node features and edge features at the same time and automatically extract the semantic information contained in the communication behavior graph, which solves the problem that the basic graph attention network cannot process edge features. The detection method uses the combined network behavior features obtained at the host level to detect malicious traffic more accurately, and has a lower false alarm rate and stronger anti-interference capability than detection methods based on the session flow level.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for HTTP malicious traffic detection based on a graph attention network in accordance with an embodiment of the present invention;

FIG. 2 is an exemplary diagram of the GCN message passing process according to an embodiment of the invention;

FIG. 3 is an exemplary diagram of edge connections of a communication behavior diagram of an embodiment of the present invention;

FIG. 4 is a diagram of communication behavior of different hosts according to an embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating the impact of key node loss on graph structure according to an embodiment of the present invention;

FIG. 6 is a graph of a test model training scenario based on a graph attention network in accordance with an embodiment of the present invention;

FIG. 7 is a diagram of a graph attention network based detection model two-class confusion matrix in accordance with an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

First, the following graph attention network is introduced:

In recent years, artificial intelligence technology represented by deep learning has brought tremendous development and change to many fields, driven by the growth of big data and computing power. However, besides Euclidean data such as text, speech and images, a large amount of real-world data comes from non-Euclidean space, where elements are interconnected to form complex networks such as the World Wide Web, traffic networks and social networks. General neural networks are difficult to adapt to tasks on such graph data. This is mainly because graph data is non-Euclidean: each node has a different local structure, so some important operations (such as convolution and pooling) cannot be applied directly. In addition, most neural networks assume that data samples are independent of each other, whereas for graph data the interdependence between nodes is indispensable. Widespread real-world demand has prompted researchers to focus on neural networks capable of processing graph data, i.e., graph neural networks (Graph Neural Network, GNN).

The graph convolutional network (Graph Convolutional Network, GCN) was proposed by Kipf et al. in 2016 based on spectral graph theory. The essence of the convolution operation in a CNN is to use a convolution kernel with shared parameters to extract spatial features by computing the weighted sum of a central pixel and its neighboring pixels. A GCN performs a similar operation: through message passing, a node and its neighboring nodes send messages to each other and exchange information. Fig. 2 intuitively illustrates the message passing process of the GCN, which can be roughly divided into two steps: in the first step, each node constructs a feature vector representing its own features, which is the message to be sent to all of its neighbors; in the second step, the messages are sent to the neighboring nodes, so that each node receives one message from each of its neighbors.

When a GCN aggregates neighboring nodes, all neighbors are treated equally, and the fact that different neighbors may differ in importance is not considered. To address this problem, Veličković et al. proposed the graph attention network (Graph Attention Network, GAT), which applies an attention mechanism to the neighbor aggregation operation of the graph neural network. As in a GCN, the graph attention layer first creates a message for each node using a linear layer, and then combines the features of the node itself with the features of the other nodes to compute attention.

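Let $h_i^{(l)}, h_j^{(l)} \in \mathbb{R}^{d^{(l)}}$ be the feature vectors of node i and node j at layer l ($d^{(l)}$ denotes the feature vector length of a layer-l node), and let $W^{(l)}$ be the weight parameter of the node feature transformation of that layer. The activation function is LeakyReLU, and, to distribute the weights better, Veličković et al. normalize the correlations computed over all neighbors with Softmax. The weight coefficient $\alpha_{ij}$ is calculated as follows:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[W^{(l)}h_i^{(l)} \,\|\, W^{(l)}h_j^{(l)}\right]\right)\right)}{\sum_{k \in \mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[W^{(l)}h_i^{(l)} \,\|\, W^{(l)}h_k^{(l)}\right]\right)\right)} \qquad (1)$$

where $a$ is the attention weight vector, $\|$ denotes vector concatenation, and $\mathcal{N}_i$ is the set of neighbors of node i.

Once the weight coefficients have been calculated, the new feature vector of node i is obtained, following the weighted-sum idea of the attention mechanism, by aggregating the information of all neighbors:

$$h_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, W^{(l)} h_j^{(l)}\right) \qquad (2)$$

where $\sigma$ is a nonlinear activation function.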

as shown in fig. 1, the method for detecting HTTP malicious traffic based on graph attention network of the present embodiment includes the following steps:

step S11, constructing a communication behavior diagram of a host level based on header field information in an HTTP data packet, wherein the communication behavior diagram is expressed as G= (N, E), N is a node, and E is an edge;

step S12, classifying the communication behavior graph by using a graph attention network-based detection model, wherein the graph attention network-based detection model comprises 3 improved graph attention layers and a full connection layer.

The communication behavior graph of this embodiment is constructed based on the correlation between HTTP traffic, that is, based on header field information in HTTP packets: the header fields of HTTP packets indicate the access and jump relationships between HTTP requests, and these relationships constitute the correlation between HTTP traffic. The principle is analyzed as follows. When a user browses a web page normally, the user accesses web page A and the browser sends an HTTP request to obtain the corresponding page resource. After receiving the response data, the browser parses the page code and initiates further HTTP requests for the JavaScript, CSS, fonts, images and other resources needed to render the page; after the page has been loaded and rendered, the user's interaction with the page can generate more HTTP requests. If the user then clicks a link to web page B, the browser repeats similar operations to load and render web page B. These HTTP request packets typically contain a Referer field that indicates their source, which is used for traffic statistics and to prevent hotlinking. Therefore, in the HTTP traffic generated by a user's normal access, one HTTP request often triggers multiple associated HTTP requests.

HTTP traffic generated by malicious code, by contrast, has its own characteristics. In the infection and propagation chains of various exploit kits (EKs), an attacker typically redirects multiple times across different domain names to hide their tracks and make tracing harder, whereas a normal access path generally involves few domain names. In addition, because the Referer field can be used to track users, the attacker also modifies the Referer field in HTTP traffic to hide the redirection relationship. When establishing communication between malicious code and the C&C server, some malicious code designers choose the HTTP protocol for C&C communication in order to pass through firewalls and avoid discovery by network inspectors; however, HTTP-based C&C traffic is usually initiated independently and lacks the association characteristics of normal HTTP traffic.

From the above analysis, it can be seen that if the communication behavior diagram of the current host can be constructed by using the correlation characteristics between HTTP traffic, the communication behavior diagram constructed by the host infected with the malicious code may have a distinct structure from the communication behavior diagram of the normal host due to the difference between the malicious behavior and the normal behavior. Based on such consideration, the present embodiment designs a method for constructing an HTTP traffic communication behavior diagram of a user based on the header field of the HTTP packet, for malicious traffic detection.

Specifically, constructing the host-level communication behavior graph includes: first extracting HTTP request-response pairs from the original PCAP file, then generating the nodes of the communication behavior graph according to the node definition, and generating the edges of the communication behavior graph according to the edge generation rules.

Current methods for constructing a graph from HTTP traffic mainly use the Referer field and the Location field of HTTP data packets: the Referer field of an HTTP request packet indicates the source of the request, and the Location field of an HTTP response packet indicates the target of the jump, so a directed graph can be constructed from the field information in HTTP data packets. On the basis of this directed graph, the communication behavior graph of this embodiment gives the nodes and edges features of more dimensions, so that the graph attention network-based detection model can make combined use of the multidimensional features to detect malicious traffic.

In the communication behavior diagram, one node in the diagram represents an HTTP request-response process, and the node is defined as follows:

N＝(time,content,node_feature)

The content contains the URL, Referer and Host information obtained from the corresponding fields of the HTTP request data packet and the Location information obtained from the corresponding field of the HTTP response data packet. These are used to generate the connection relationships of the edges and the corresponding edge features.

node_feature represents node characteristics including session flow statistics characteristics, URL characteristics, request characteristics, and response characteristics, as shown in table 1.

Session flow statistics: the session flow statistics features are the statistical features of the session flow that contains the HTTP request-response process represented by the current node. Malicious code can use HTTP traffic for malicious activities such as malicious downloads, covert uploads and C&C communication. The session flow statistics features reflect well both the difference between normal and malicious network activity and the similarity among malicious activities, so this embodiment extracts them as part of the node features. It should be noted that, since HTTP/1.1, the same session may contain multiple HTTP request-response processes, and the corresponding nodes share the statistics of that session flow.

URL features: as a resource access identifier, the URL also contains rich information. By comparing and analyzing normal URLs and malicious URLs, four features are extracted: the URL length, the proportion of special characters in the URL, whether the URL contains an IP address, and the URL entropy value.
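The four URL features can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the function names are hypothetical, "special characters" is assumed to mean any character that is not a letter or digit, the IP check is assumed to cover dotted IPv4 hosts only, and the entropy is assumed to be the character-level Shannon entropy.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: only dotted-quad IPv4 hosts count as "URL contains an IP address".
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_entropy(url: str) -> float:
    """Shannon entropy (bits per character) of the URL string."""
    if not url:
        return 0.0
    n = len(url)
    return -sum(c / n * math.log2(c / n) for c in Counter(url).values())


def url_features(url: str) -> dict:
    """Return the four URL features of a node (hypothetical helper)."""
    host = urlsplit(url).hostname or ""
    # Assumption: a special character is anything str.isalnum() rejects.
    special = sum(1 for ch in url if not ch.isalnum())
    return {
        "length": len(url),
        "special_ratio": special / len(url) if url else 0.0,
        "has_ip": int(bool(IPV4.match(host))),
        "entropy": url_entropy(url),
    }


features = url_features("http://192.168.1.10/a.exe")
```

A URL that addresses the server by IP, as in the call above, sets `has_ip` to 1, while a URL such as http://a.com/index.html sets it to 0.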

Request features: the request features mainly comprise two features, the request method and whether Cookies are contained. The standard HTTP protocol supports six request methods, GET, POST, PUT, HEAD, DELETE and OPTIONS, of which GET and POST are the most commonly used; the request method represents the HTTP traffic behavior of the current node, so the feature value of the GET request method is set to 1, POST to 2, and all others to 3. Cookie is also one of the common fields in the header of an HTTP request packet. It is mainly used to store user information so as to provide personalized service and improve service quality. Researchers have found that 90% of the top-ranked Alexa websites set the Cookie field, whereas malicious code does not, so whether the HTTP request packet contains a Cookie field is also used as one of the features.

Response features: the response features mainly comprise three features: the response status code, the requested resource type and the requested resource size. HTTP response status codes are grouped into classes by their first digit. In normal access, public websites generally aim to provide stable and reliable services, so the response type is usually 2XX; in malicious traffic, redirection, migration of malicious sites and similar situations occur, so the response types are more varied. The feature value of a 2XX response status code is therefore set to 1, and correspondingly 3XX is 2, 4XX is 3, and 5XX is 5. Part of the HTTP requests initiated by malicious code are used for malicious downloads and request binary files, for which the Content-Type field in the response packet is application/octet-stream, whereas in normal traffic the Content-Type field of the response packet is mainly image/jpg, text/html and the like. The requested resource type is therefore used as a node feature: the feature value is 1 when the Content-Type is application, 2 for image, 3 for text, and 4 otherwise. To evade detection, some malicious code disguises the type of the file actually transmitted; in that case, checking the Content-Length field of the HTTP response packet reveals a mismatch between the requested resource size and the requested resource type, so the requested resource size given by Content-Length is also taken as one of the node features.
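The categorical encodings of the request method and the requested resource type described above can be sketched as follows. The helper names and the dictionary-based header representation are assumptions made for illustration; only the numeric values come from the text.

```python
def encode_method(method: str) -> int:
    """GET -> 1, POST -> 2, any other method -> 3 (values from the text)."""
    return {"GET": 1, "POST": 2}.get(method.upper(), 3)


def encode_content_type(content_type: str) -> int:
    """application/* -> 1, image/* -> 2, text/* -> 3, anything else -> 4."""
    top_level = content_type.split("/", 1)[0].strip().lower()
    return {"application": 1, "image": 2, "text": 3}.get(top_level, 4)


def request_response_features(method: str, req_headers: dict, resp_headers: dict) -> dict:
    """Hypothetical helper: headers are passed as plain name -> value dicts."""
    # HTTP header names are case-insensitive, so normalise them first.
    req = {k.lower(): v for k, v in req_headers.items()}
    resp = {k.lower(): v for k, v in resp_headers.items()}
    return {
        "method": encode_method(method),
        "has_cookie": int("cookie" in req),
        "resource_type": encode_content_type(resp.get("content-type", "")),
        # Assumption: a missing Content-Length is recorded as size 0.
        "resource_size": int(resp.get("content-length", 0)),
    }


example = request_response_features(
    "GET",
    {"Host": "a.com"},
    {"Content-Type": "application/octet-stream", "Content-Length": "204800"},
)
```

The example call models a binary download without a Cookie header, which yields `method` 1, `has_cookie` 0, `resource_type` 1 and `resource_size` 204800.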

TABLE 1 node characterization of communication behavior diagram

In the communication behavior diagram, edges in the diagram represent the connection relation between nodes, and the edges are defined as follows:

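E＝(N_s, N_d, edge_feature)

where N_s denotes the start node of the directed edge, N_d denotes the end node of the directed edge, and edge_feature denotes the edge features. The edge generation rules of the communication behavior graph are as follows:

if the Location content of node i is the same as the URL content of node j, and the timestamp of node i is smaller than that of node j, there is a Location-type edge pointing from node i to node j;

if the URL content of node i is the same as the Referer content of node j, and the timestamp of node i is smaller than that of node j, there is a Referer-type edge pointing from node i to node j.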

As shown in fig. 3, the solid lines in the figure represent Referer-type edges and the broken line represents a Location-type edge; the solid line on the right side indicates that the two nodes connected by the edge have the same host, while the broken line on the right side and the solid line on the left side indicate that the two connected nodes have different hosts. In this process, the user first accesses a site (node 1) whose URL is http://a.com/index.html and, during page loading, requests the script resource http://a.com/sample.js (node 2); however, for some reason the script resource is no longer at the location indicated by that URL, so the server redirects the user to a new URL, http://b.com/sample.js (node 3), to obtain the script resource. According to the edge generation rules: the Location content of node 2 is the same as the URL content of node 3, so there is a Location-type edge from node 2 to node 3; the URL content of node 1 is the same as the Referer content of node 2 and of node 3, so there are Referer-type edges from node 1 to node 2 and to node 3, respectively.

The edge features in the communication behavior graph comprise the edge type, the edge length and the edge connection host difference. The edge type covers the two kinds of edges produced by the edge generation rules, namely Location edges and Referer edges: for a Location edge the edge type feature value is 1, and for a Referer edge it is 0. The edge length is the difference between the timestamps of the end node and the start node connected by the edge. The edge connection host difference indicates whether the Host content of the two nodes connected by the edge is the same: if it is the same, the feature value is 1, and otherwise it is 0.
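A minimal sketch of the three-dimensional edge feature vector follows. The function name and argument layout are assumptions; the feature values follow the definitions above, and host names are compared case-insensitively because DNS names are case-insensitive.

```python
def edge_features(edge_type: str, t_start: float, t_end: float,
                  host_start: str, host_end: str) -> list:
    """Return [edge type, edge length, host feature] for one directed edge."""
    # Location edge -> 1, Referer edge -> 0.
    type_value = 1 if edge_type == "Location" else 0
    # Edge length: end-node timestamp minus start-node timestamp.
    length = t_end - t_start
    # Per the text, the host feature is 1 when both nodes share the same Host.
    same_host = 1 if host_start.lower() == host_end.lower() else 0
    return [type_value, length, same_host]


# Redirect from a.com to b.com observed 2.5 time units after the request.
redirect_edge = edge_features("Location", 10.0, 12.5, "a.com", "b.com")
```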

In FIG. 4, (a), (b) and (c) are communication behavior graphs generated by hosts after infection with malicious code, and (d), (e) and (f) are communication behavior graphs generated by normal hosts. As can be seen from fig. 4, the communication behavior graph of a normal host generally has shorter paths and more consistent edge types, so that visually its structure is simple, clear and regular. In the communication behavior graph of a malicious host, the paths are often longer, the edge types are more varied, and the graph contains a certain number of isolated nodes, so the graph structure is more complex and more variable.

The graph attention network-based detection model, EGATM, extracts the node feature vectors of the communication behavior graph using 3 improved graph attention layers (EGAT layers for short), aggregates and concatenates all node feature vectors that have undergone multiple iterations through a graph pooling layer to obtain a global representation of the graph, and finally maps the aggregated feature vector to the corresponding class using the nonlinear fitting capacity of a fully connected layer, thereby identifying malicious traffic.

1. EGAT layer

The graph attention network of Veličković et al. attends only to the features of the node itself and of the nodes connected to it when computing attention, and does not consider the features of the edges connecting the nodes. However, in the communication behavior graph, the edge features also help characterize the network behavior of the host. Therefore, to enable the graph attention network to attend to both node features and edge features, equation (1) is modified as follows:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[W^{(l)}h_i^{(l)} \,\|\, W^{(l)}h_j^{(l)} \,\|\, W_e^{(l)}e_{i,j}\right]\right)\right)}{\sum_{k \in \mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[W^{(l)}h_i^{(l)} \,\|\, W^{(l)}h_k^{(l)} \,\|\, W_e^{(l)}e_{i,k}\right]\right)\right)}$$

where $e_{i,j} \in \mathbb{R}^{e}$ denotes the features of the edge connecting node i and node j, $W_e^{(l)}$ denotes the weight parameter of the edge feature transformation of the layer, $\|$ denotes vector concatenation, $\mathcal{N}_i$ is the set of neighbors of node i, and the attention weight parameter $a$ is extended to match the length of the concatenated vector. The meaning of the other symbols is consistent with equation (1). Once the weight coefficients have been calculated, the feature vector of the node can be calculated according to formula (2). It should be noted that EGAT only introduces the edge features when calculating the attention coefficients and does not update or maintain the edge features.
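The attention computation for a single node can be sketched in plain Python. This is an illustration under stated assumptions, not the embodiment's implementation: the transformed edge feature is assumed to enter the score by concatenation with the two transformed node features, the LeakyReLU slope of 0.2 is an assumed value, the activation applied after aggregation is omitted, and all names and numbers are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def egat_attention(h_i, neighbours, W, W_e, a):
    """neighbours: list of (h_j, e_ij) pairs. Returns (alphas, aggregated h_i)."""
    z_i = matvec(W, h_i)
    scores, messages = [], []
    for h_j, e_ij in neighbours:
        z_j = matvec(W, h_j)
        # Concatenate transformed node features and transformed edge features.
        concat = z_i + z_j + matvec(W_e, e_ij)
        scores.append(leaky_relu(sum(p * q for p, q in zip(a, concat))))
        messages.append(z_j)
    # Softmax over the neighbours (shifted by the maximum for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]
    # Weighted sum of the neighbour messages; edge features are not updated.
    new_h = [sum(al * m[k] for al, m in zip(alphas, messages))
             for k in range(len(z_i))]
    return alphas, new_h


W = [[1.0, 0.0], [0.0, 1.0]]      # node feature transform (2 -> 2)
W_e = [[1.0, 0.0, 0.0]]           # edge feature transform (3 -> 1)
a = [0.1, 0.1, 0.1, 0.1, 1.0]     # attention vector of length 2 + 2 + 1
neighbours = [
    ([0.0, 1.0], [1.0, 0.5, 1.0]),  # reached through a Location edge
    ([1.0, 1.0], [0.0, 2.0, 0.0]),  # reached through a Referer edge
]
alphas, new_h = egat_attention([1.0, 0.0], neighbours, W, W_e, a)
```

With these made-up weights the neighbour reached through the Location edge receives the larger attention coefficient, which shows how an edge feature can change the weighting even though only node features are aggregated.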

2. Graph pooling layer

Graph classification, like image classification in computer vision, requires fusing and learning global information. In a CNN model, a hierarchical pooling (Hierarchical Pooling) mechanism is typically used to extract global information step by step. In the graph classification task of a GNN model, all node features that have undergone multiple iterations likewise need to be aggregated to read out a global representation of the graph, on which the final classifier is trained. The readout operation in a GNN plays the same role as the pooling operation in a CNN: both obtain a global representation by aggregating all inputs at once, which is why it is called graph pooling. Mean pooling takes the average of all node features as the global representation of the graph:

$$h_G = \frac{1}{|V|}\sum_{v \in V} h_v^{(K)}$$

where $h_G$ denotes the global representation of the graph, $V$ denotes the set of nodes in the graph, and $h_v^{(K)}$ denotes the features of node v after K iterations.
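The mean readout amounts to an element-wise average over the node feature vectors, as the following small sketch shows; the function name and the list-of-lists representation are assumptions.

```python
def mean_readout(node_features):
    """Average all node feature vectors into one graph-level vector."""
    if not node_features:
        raise ValueError("the graph must contain at least one node")
    n = len(node_features)
    dim = len(node_features[0])
    return [sum(h[k] for h in node_features) / n for k in range(dim)]


# Three nodes with two-dimensional features after the last EGAT layer.
graph_vector = mean_readout([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Because every node contributes equally, isolated nodes also influence the graph representation even though they exchange no messages in the attention layers.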

Through multiple experiments, the EGATM model architecture is set as shown in Table 2: 3 EGAT layers and 1 fully connected layer are used, with Softmax as the final classifier. In the training stage of the model, cross entropy is used as the loss function, the optimization method is Adam, and the Dropout value is 0.5.

TABLE 2 EGATM model architecture

The invention will be better illustrated by experiments and analyses.

(1) Experimental data

Table 3 shows HTTP traffic data used, where one sample represents traffic of one host, and the dataset contains 3917 samples in total, where the number of malicious samples is 1805 and the number of normal samples is 2112.

Table 3 HTTP traffic data set

Malicious samples: Malware-traffic-analysis.net is a website focusing on malicious traffic analysis, which has collected the network traffic of various malicious codes since 2013 and provides manual analysis records for security researchers. The experiment collects the data published by Malware-traffic-analysis.net from June 2013 to February 2020 and extracts the HTTP traffic from it. These malicious samples span 7 years, and each sample was extracted and analyzed from an actual case by security researchers, so they effectively represent the real threats in HTTP traffic in recent years.

Normal samples: the normal samples consist of three parts of traffic. The first part is the daily HTTP traffic of users captured at the campus gateway port; the second part is the HTTP traffic of the CTU-Normal data set, which mainly contains traffic generated by daily activities such as file downloading, online chat and video browsing, as well as traffic generated by accessing websites ranked in the Alexa top 1000; the last part is a portion of the traffic of the ClickMiner data set, which collects the normal traffic of users accessing websites through a browser.

(2) Analysis of experimental data

To keep the numbers of positive and negative samples balanced, the experiment splits the large PCAP files of normal traffic by session flow so as to increase the number of normal samples. Meanwhile, to ensure the fairness of the data and prevent the model from judging the sample class simply from the number of nodes in the graph (namely the number of sessions contained in the sample), the session-count distribution of the malicious samples is counted, and the session-count distribution of the split normal samples is kept consistent with it. This splitting may cause some key nodes of a communication behavior graph to be separated into different samples, thereby greatly changing the structure of the graph and its statistical indexes, as shown in fig. 5. However, in a real network environment, differences in traffic capture environments and timing are also likely to cause key nodes of a communication behavior graph to be lost, so the splitting simulates a real traffic capture scenario to a certain extent and can test the anti-interference capability of the model.

Another notable point is that, when constructing the communication behavior graph from the original traffic, the average time spent is 533 ms for malicious samples and 373 ms for normal samples. This indicates that the communication behavior graphs of malicious samples tend to be structurally more complex, which indirectly supports the observed differences between the communication behavior graphs of normal hosts and malicious hosts.

(3) Experimental setup

The experiment uses NFStream to preprocess the original PCAP files and NetworkX to construct the communication behavior graph. In the model building stage, PyTorch Geometric is used as the graph neural network framework. The computer is configured with an 8-core, 16-thread CPU, 16 GB of memory and an NVIDIA GeForce RTX 2060 graphics card. The binary classification experiment on malicious and normal samples is performed with the HTTP traffic data set, with a training set to test set ratio of about 9:1. Accuracy, precision, recall, false alarm rate and F1 value are selected as the model evaluation indexes.

(4) Experimental results

The model training process is shown in fig. 6, which plots the loss function value and the training accuracy over the first 120 rounds of training. It can be seen that after 80 rounds of training the loss value is basically stable and the model converges successfully. The test result of the EGATM model is shown in the confusion matrix of fig. 7. The test set contains 205 normal samples and 187 malicious samples in total; the EGATM model correctly judges 196 normal samples and 180 malicious samples, misjudges 9 normal samples as malicious, and misjudges 7 malicious samples as normal. The confusion matrix shows that the EGATM model can accurately classify most traffic samples and can discover and identify the network behaviors of malicious code at the host level.
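As a worked check, the evaluation indexes can be derived directly from the confusion matrix counts given above. The sketch below assumes the usual definitions, with malicious as the positive class and the false alarm rate taken as FP / (FP + TN); the function name `binary_metrics` is illustrative.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common binary classification metrics from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        # Assumed definition: share of normal samples flagged as malicious.
        "false_alarm_rate": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts from the test set: 180 malicious and 196 normal samples judged correctly,
# 9 normal samples flagged as malicious, 7 malicious samples missed.
metrics = binary_metrics(tp=180, fp=9, tn=196, fn=7)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is about 95.9%, the precision about 95.2%, the recall about 96.3%, the false alarm rate about 4.4%, and the F1 value about 95.7%.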

(5) Comparative experiments

To check the effectiveness of the EGATM model, comparative experiments with different models were conducted on the same data set. First, the currently mainstream session-flow-based detection model is compared, which uses session-flow statistical features to judge host-level detection objects; second, since many researchers model traffic as a graph or tree and then detect malicious traffic behavior using the statistical features of the graph or tree, a detection model based on graph statistical features is also compared; finally, detection models based on graph neural networks are compared.

Session-RF: a random forest model based on session-flow statistical features, which takes the session flow as the detection object to distinguish malicious samples from normal samples. For each sample in the HTTP traffic data set, if all session flows of the sample are judged to be normal traffic by the random forest model, the sample is a normal sample; if any session flow of the sample is judged to be malicious traffic, the sample is a malicious sample.

Graph-RF: a random forest model based on graph statistical features, which takes the communication behavior graph as the detection object and is trained on statistical features extracted from the communication behavior graph, thereby classifying the traffic samples. The relevant graph statistical features are shown in Table 4.

GATM model: the GATM model is structurally similar to the EGATM model, but is built using the basic GAT layer.

EGATM model: the model proposed by the invention.
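The host-level decision rule of Session-RF described above can be sketched as follows. The per-session verdicts are assumed to come from an already trained classifier; the helper names are hypothetical. The second function is an illustration rather than a result from the experiments: assuming each session is independently misjudged with probability `p`, it shows how quickly an "any session is malicious" rule can raise the host-level false alarm probability.

```python
from typing import Iterable


def host_label_from_sessions(session_is_malicious: Iterable[bool]) -> str:
    """A host is malicious if any of its session flows is judged malicious."""
    return "malicious" if any(session_is_malicious) else "normal"


def host_false_alarm_probability(p: float, num_sessions: int) -> float:
    """Probability that a normal host is flagged, assuming independent
    per-session false alarms with probability p (illustrative assumption)."""
    return 1.0 - (1.0 - p) ** num_sessions


print(host_label_from_sessions([False, False, False]))  # normal
print(host_label_from_sessions([False, True, False]))   # malicious
print(host_false_alarm_probability(0.5, 2))             # 0.75
```

Under this assumption even a small per-session error accumulates over the many sessions of one host, which is one possible reason why fusing the behavior of multiple session flows is preferable at the host level.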

TABLE 4 Graph-RF communication behavior Graph statistics

The experimental results are shown in Table 5. For accuracy, precision, recall and F1 value, where higher values indicate a better detection effect, EGATM obtains the highest results; for the false alarm rate, where lower is better, EGATM also obtains the lowest value among the 4 methods. The results in Table 5 show that the false alarm rate of the Session-RF model is as high as more than 0.97, far higher than that of the other 3 methods. Therefore, for host-level detection objects, the correlation between flows should be analyzed from a macroscopic perspective, and the behavior characteristics of multiple session flows should be fused to make a comprehensive judgment. Comparing the Graph-RF model with the two graph neural network-based models shows that a graph neural network can automatically extract the rich information contained in the graph and assist researchers in making better judgments. Comparing the GATM model with the EGATM model shows that, by improving the basic graph attention layer so that the model considers edge features when calculating the attention coefficient, the model learns a better feature representation of the communication behavior graph, which improves its ability to detect malicious samples. Finally, combining the experimental data analysis with the results in Table 5 shows that, on top of traffic behavior characteristics, EGATM integrates the statistical characteristics and content characteristics of the traffic and makes judgments from multiple kinds of information, which effectively enhances the anti-interference capability of the model.
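To make the difference between a basic and an edge-aware attention layer concrete, the following is a minimal sketch of how edge features can enter the attention coefficient. It assumes a GAT-style score in which the transformed features of node i, node j and the connecting edge are concatenated, passed through a learnable vector `a` and LeakyReLU, and normalized with softmax over the neighbors; the exact form used by EGATM may differ. The names `W`, `W_e`, `a` and the toy values are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def matvec(W: Matrix, x: Sequence[float]) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def edge_aware_attention(h_i: Vector, neighbors: List[Tuple[Vector, Vector]],
                         W: Matrix, W_e: Matrix, a: Vector) -> List[float]:
    """Return softmax-normalized weights alpha_ij over (h_j, e_ij) neighbors."""
    z_i = matvec(W, h_i)
    scores = []
    for h_j, e_ij in neighbors:
        # Assumed form: concatenate node i, node j and edge representations.
        concat = z_i + matvec(W, h_j) + matvec(W_e, e_ij)
        scores.append(leaky_relu(sum(ak * ck for ak, ck in zip(a, concat))))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def aggregate(neighbors: List[Tuple[Vector, Vector]], W: Matrix,
              alphas: List[float]) -> Vector:
    """Weighted sum of transformed neighbor features (no output activation)."""
    out = [0.0] * len(W)
    for (h_j, _), alpha in zip(neighbors, alphas):
        for k, z in enumerate(matvec(W, h_j)):
            out[k] += alpha * z
    return out


# Two neighbors with identical node features but different edge features.
W = [[1.0, 0.0], [0.0, 1.0]]
W_e = [[1.0]]
a = [0.0, 0.0, 0.0, 0.0, 1.0]
neighbors = [([1.0, 2.0], [1.0]), ([1.0, 2.0], [0.0])]
alphas = edge_aware_attention([0.5, 0.5], neighbors, W, W_e, a)
h_new = aggregate(neighbors, W, alphas)
print(alphas, h_new)
```

In this toy setting a basic GAT layer would assign both neighbors the same weight because their node features are identical, whereas the edge-aware score separates them through the edge feature alone.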

TABLE 5 EGATM model comparison experiments (%)

In the experimental stage, a host-level network traffic sample data set is constructed and experiments are carried out. The experimental results show that, for host-level detection objects, the detection method based on the communication behavior graph is more effective than the detection method based on session flows, and that EGATM can automatically extract the rich information contained in the communication behavior graph, effectively reducing the false alarm rate while obtaining higher detection precision.

Correspondingly, the embodiment also provides an HTTP malicious traffic detection device based on the graph attention network, which comprises:

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Finally, it should be noted that: the foregoing description is only illustrative of the preferred embodiments of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. The HTTP malicious traffic detection method based on the graph attention network is characterized by comprising the following steps of:

step 1, constructing a host-level communication behavior graph based on header field information in HTTP data packets, wherein the communication behavior graph is expressed as G = (N, E), N is a node, and E is an edge;

the construction of the host-level communication behavior graph comprises the following steps: firstly, extracting HTTP request-response pairs from an original PCAP file, generating nodes of the communication behavior graph according to the node definition, and generating edges of the communication behavior graph according to the edge generation rule;

N＝(time,content,node_feature)

wherein time represents the time stamp of an HTTP request data packet, content represents the node content used for generating edge connection relationships, and node_feature represents the node features; the content comprises URL, Referer and Host information obtained from the corresponding fields of the HTTP request data packet and Location information obtained from the corresponding field of the HTTP response data packet; the node_feature comprises session flow statistical features, URL features, request features and response features; the session flow statistical features refer to the statistical features of the session flow in which the HTTP request-response process expressed by the current node is located; the URL features comprise the URL length, the proportion of special characters in the URL, whether the URL contains an IP address, and the URL entropy value; the request features comprise the request method and whether Cookies are contained; the response features comprise the response status code, the requested resource type and the requested resource size;

E = (N_s, N_d, edge_feature)

wherein N_s represents the start point of the directed edge, N_d represents the end point of the directed edge, and edge_feature represents the edge features;

the edge generation rule of the communication behavior graph is as follows:

if the Location content of node i is the same as the URL content of node j, and the time stamp of node i is smaller than that of node j, an edge of the Location type pointing from node i to node j exists; if the URL content of node i is the same as the Referer content of node j, and the time stamp of node i is smaller than that of node j, an edge of the Referer type pointing from node i to node j exists;

the edge features comprise the edge type, the edge length and the host difference of the connected nodes; the edge type distinguishes the two types of edges generated according to the edge generation rule, namely the Location edge and the Referer edge: if the edge is a Location edge, the edge type feature value is 1, and if the edge is a Referer edge, the edge type feature value is 0; the edge length refers to the difference between the time stamps of the end point and the start point connected by the edge; the host difference of the connected nodes indicates whether the Host content of the two nodes connected by the edge is the same: if so, the feature value is 1, and if not, the feature value is 0;

step 2, classifying the communication behavior graph by using a graph attention network-based detection model, wherein the graph attention network-based detection model comprises 3 improved graph attention layers, a graph pooling layer and a full connection layer; the method specifically comprises the following steps:

the graph attention network-based detection model extracts the node feature vectors of the communication behavior graph using the 3 improved graph attention layers, aggregates and splices all node feature vectors that have undergone multiple iterations through the graph pooling layer to obtain the global representation of the graph, and finally maps the aggregated feature vector to the corresponding class using the nonlinear fitting capability of the full connection layer;

the node update process of the improved graph attention layer comprises the following steps:

first, the weight coefficient α_ij is calculated, the formula being as follows:

wherein e_{i,j} ∈ R^e represents the edge features connecting node i and node j; the remaining symbols denote, respectively, the weight parameters of the edge feature transformation at layer l and the feature vectors corresponding to node i and node j at layer l; d^(l) represents the length of the node feature vectors at layer l; and LeakyReLU is the activation function;

then, following the weighted-summation idea of the attention mechanism, the new feature vector of node i is obtained by aggregating the information of all its neighbors, as follows:
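The node content, URL features and edge generation rules of claim 1 can be illustrated with a minimal sketch. It is an illustration under assumptions rather than the patented implementation: node contents are compared by exact string equality, "special characters" are taken to be non-alphanumeric characters, the URL entropy is the Shannon entropy over characters, and the names `Node`, `url_features` and `build_edges` are hypothetical.

```python
import ipaddress
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple
from urllib.parse import urlsplit


@dataclass
class Node:
    time: float                      # time stamp of the HTTP request packet
    url: str
    host: str
    referer: Optional[str] = None    # from the HTTP request, if present
    location: Optional[str] = None   # from the HTTP response, if present


def url_features(url: str) -> Tuple[int, float, bool, float]:
    """Return (length, special-character ratio, contains-IP flag, entropy)."""
    length = len(url)
    special = sum(1 for c in url if not c.isalnum())  # assumed definition
    ratio = special / length if length else 0.0
    hostname = urlsplit(url if "//" in url else "//" + url).hostname or ""
    try:
        ipaddress.ip_address(hostname)
        has_ip = True
    except ValueError:
        has_ip = False
    counts = Counter(url)
    entropy = -sum((n / length) * math.log2(n / length) for n in counts.values())
    return length, ratio, has_ip, entropy


def build_edges(nodes: List[Node]) -> List[Tuple[int, int, int, float, int]]:
    """Return edges as (start, end, edge_type, edge_length, host_feature)."""
    edges = []
    for i, a in enumerate(nodes):
        for j, b in enumerate(nodes):
            if i == j or not a.time < b.time:
                continue  # edges only point forward in time
            length = b.time - a.time
            host_feature = 1 if a.host == b.host else 0  # 1 if hosts are the same
            if a.location is not None and a.location == b.url:
                edges.append((i, j, 1, length, host_feature))  # Location edge
            if b.referer is not None and a.url == b.referer:
                edges.append((i, j, 0, length, host_feature))  # Referer edge
    return edges


nodes = [
    Node(0.0, "http://a.example/", "a.example", location="http://b.example/land"),
    Node(0.5, "http://b.example/land", "b.example"),
    Node(1.0, "http://b.example/img.png", "b.example", referer="http://b.example/land"),
]
edges = build_edges(nodes)
print(edges)  # [(0, 1, 1, 0.5, 0), (1, 2, 0, 0.5, 1)]
print(url_features("http://192.168.1.1/a")[2])  # True
```

In the toy example the redirect from the first request produces a Location edge across two different hosts, and the image request produces a Referer edge within the same host.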

2. An HTTP malicious traffic detection apparatus based on a graph attention network, comprising:

N = (time, content, node_feature)

wherein time represents the time stamp of an HTTP request data packet, content represents the node content used for generating edge connection relationships, and node_feature represents the node features; the content comprises URL, Referer and Host information obtained from the corresponding fields of the HTTP request data packet and Location information obtained from the corresponding field of the HTTP response data packet; the node_feature comprises session flow statistical features, URL features, request features and response features; the session flow statistical features refer to the statistical features of the session flow in which the HTTP request-response process expressed by the current node is located; the URL features comprise the URL length, the proportion of special characters in the URL, whether the URL contains an IP address, and the URL entropy value; the request features comprise the request method and whether Cookies are contained; the response features comprise the response status code, the requested resource type and the requested resource size;

E = (N_s, N_d, edge_feature)

the edge generation rule of the communication behavior graph is as follows:

the malicious traffic detection module is used for classifying the communication behavior graph by using a graph attention network-based detection model, wherein the graph attention network-based detection model comprises 3 improved graph attention layers and a full connection layer; the method specifically comprises the following steps:

wherein e_{i,j} ∈ R^e represents the edge features connecting node i and node j; the remaining symbols denote, respectively, the weight parameters of the edge feature transformation at layer l and the feature vectors corresponding to node i and node j at layer l; d^(l) represents the length of the node feature vectors at layer l; and LeakyReLU is the activation function;