CN116132095A

CN116132095A - Hidden malicious traffic detection method integrating statistical features and graph structural features

Info

Publication number: CN116132095A
Application number: CN202211477370.5A
Authority: CN
Inventors: 卢功利; 孙辉; 周国栋; 赵益平; 郑康锋; 武斌
Original assignee: Kunshan jiuhua electronic equipment factory; Beijing University of Posts and Telecommunications
Current assignee: Kunshan jiuhua electronic equipment factory; Beijing University of Posts and Telecommunications
Priority date: 2022-11-23
Filing date: 2022-11-23
Publication date: 2023-05-16

Abstract

The invention discloses a hidden malicious traffic detection method integrating statistical features and graph structural features, comprising the following steps: monitoring gateway traffic, aggregating data packets with the same source and destination addresses into data flows, and constructing a traffic interaction graph in which each node represents a host and each edge represents a data flow; extracting packet-by-packet features from the data packets in each data flow, generating flow feature histograms from the set of packet-by-packet features in the flow, and converting the flow feature histograms of differing lengths into flow feature vectors of equal length; converting the traffic interaction graph into a new graph structure, the flow association graph, according to the relationships among the data flows at each node, wherein this graph takes flows as nodes and flow feature vectors as node attributes; and training a graph convolutional neural network on the flow association graph to finally identify hidden malicious traffic. The method detects hidden malicious traffic efficiently and accurately, and consumes less time and space while maintaining high security.

Description

Hidden malicious traffic detection method integrating statistical features and graph structural features

Technical Field

The invention relates to the technical field of network security, in particular to a hidden malicious traffic detection method integrating statistical features and graph structural features.

Background

With the continuous development of network communication technology, infrastructure of all kinds has become informationized and digitized, and is operated and controlled through network traffic. New network attacks continue to emerge; they threaten the security of the military sector, the industrial sector, and infrastructure, and their scale is gradually expanding.

Novel network attack traffic generally has the following characteristics:

concealment: novel attack traffic is generally encrypted and disguised as normal traffic, so the packet content cannot be obtained directly, and detection by measures applied to the packet content, such as rule matching and natural language processing, becomes much more difficult;

slowness: an attacker often attacks slowly so that the attack traffic is submerged in normal traffic, making it difficult for a defender to distinguish the hidden malicious traffic from the large volume of normal traffic.

How to detect hidden malicious traffic has become a popular research direction over the last two years. Since current attack traffic is usually encrypted, it is usually detected by methods based either on statistical features or on traffic interaction graphs. Methods based on statistical features focus only on the intrinsic features of the traffic; because attack traffic is slow, normal and malicious data flows are difficult to distinguish by statistical features alone. Classification that combines many statistical features with machine learning also requires relatively large amounts of time and memory and loses some packet-level feature information, so it cannot meet the needs of network environments with high bandwidth and high security requirements. Methods based on the traffic interaction graph focus on the structural features of the traffic but ignore the features of the flows themselves, and so cannot accurately detect hidden malicious traffic. The existing detection schemes are therefore unsatisfactory for detecting novel hidden malicious traffic.

In view of these defects and shortcomings, the prior art needs to be improved, and a hidden malicious traffic detection method integrating statistical features and graph structural features is designed.

Disclosure of Invention

The main technical problem solved by the invention is to provide a hidden malicious traffic detection method integrating statistical features and graph structural features, which can detect hidden malicious traffic efficiently and accurately and consumes less time and space while maintaining high security.

To solve the above technical problem, the invention adopts the following technical scheme: a hidden malicious traffic detection method integrating statistical features and graph structural features, comprising the following steps:

step S1: monitoring gateway traffic, aggregating data packets into data flows, and constructing a traffic interaction graph in which each node represents a host and each edge represents a data flow;

step S2: extracting packet-level features from the data packets in each data flow, generating flow feature histograms from the set of packet-by-packet features in the flow, and converting the flow feature histograms of differing lengths into flow feature vectors of equal length;

step S3: converting the traffic interaction graph into a flow association graph according to the relationships among the data flows at each node, wherein this graph takes flows as nodes and flow feature vectors as node attributes;

step S4: training a graph convolutional neural network on the flow features and the graph structural features, and finally identifying hidden malicious traffic.

Preferably, in step S1, a traffic monitoring device is deployed at the gateway to monitor traffic inside the local area network.

Preferably, in step S1, the data packets monitored within a time period are aggregated into data flows according to source IP and destination IP.

Preferably, in step S2, packet-granularity features are extracted from each data flow, the packet-granularity feature vectors are then converted into a fixed-length flow feature vector, and the flow feature vectors are input into a graph convolutional neural network for learning and training to finally obtain a classification result.

Preferably, in step S3, the edges of the traffic interaction graph are converted into nodes of the flow association graph, and two data flow nodes are connected if the two data flows communicate with the same host. The flow association graph is input into the graph convolutional neural network for training; the hidden layers aggregate the features of neighboring nodes, and a final fully connected softmax layer completes the classification of the nodes.

Preferably, the data packets are parsed, and four types of features are extracted for each data packet: its length, its protocol, the time interval between it and the previous data packet, and its port number. The features of every data packet in the data flow are recorded, and the start timestamp and duration of the connection are saved.

Preferably, the packet-granularity features are processed as follows: the feature values in each data flow are first counted, with each feature counted separately, so that four histograms are generated per data flow. Because the number of distinct values of each feature cannot be predicted, the histogram lengths are also unpredictable and need to be unified.

The variable-length feature histograms are converted into fixed-length feature vectors using a locality-sensitive hash function: a hash value is computed separately for each feature histogram, and the Jaccard similarity between the hash values of two histograms approximates the Jaccard similarity between the histograms themselves.
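The approximation property stated above is the defining property of MinHash, one well-known locality-sensitive hashing scheme. The text does not name a specific hash family, so the following is only a sketch under the assumption that a MinHash-style function is intended; the symbols A, B, pi, and k are introduced here for illustration.

```latex
% Jaccard similarity of two sets A and B (e.g. the element sets of two histograms)
J(A, B) = \frac{|A \cap B|}{|A \cup B|}

% Min-wise hash under a random permutation \pi of the element universe
h_{\pi}(A) = \min_{a \in A} \pi(a)

% Collision probability equals the Jaccard similarity
\Pr_{\pi}\bigl[h_{\pi}(A) = h_{\pi}(B)\bigr] = J(A, B)

% Estimator from a fixed-length signature of k independent hashes
\hat{J}(A, B) = \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}\bigl[h_{\pi_i}(A) = h_{\pi_i}(B)\bigr]
```

Under this assumption, the signature length k is fixed in advance, which is what makes every flow feature vector the same length regardless of how many distinct values its histogram contains.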

Preferably, in step S4, the graph convolutional neural network can learn from the feature matrix and the graph structure at the same time, but it accepts only a feature matrix of nodes, whereas the flow feature vectors at this point are edge attributes. To turn the edge features into node features, the traffic interaction graph is converted into the flow association graph, and the feature vectors and the flow association graph are then input together into the graph convolutional neural network for training.
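The text does not specify the layer rule of the graph convolutional neural network. As an illustration only, a widely used formulation (assumed here, not stated in the source) shows why node features are required: the feature matrix H has one row per node and is multiplied by a normalized adjacency matrix of the same graph.

```latex
% A: adjacency matrix of the flow association graph, I: identity matrix
\tilde{A} = A + I, \qquad \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}

% Layer-wise propagation; H^{(0)} = X is the matrix of flow feature vectors
H^{(l+1)} = \sigma\!\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)

% Final layer: per-node class probabilities
Z = \operatorname{softmax}\!\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(L-1)} W^{(L-1)} \right)
```

In this formulation there is no slot for edge attributes, which is consistent with the stated reason for converting flows from edges into nodes before training.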

Compared with the prior art, the invention has the beneficial effects that:

the statistical features of the flows and the structural features are fused. The flow statistical features characterize the data flow between two hosts, while the structural features characterize traffic interaction within the local area network; the former focus on data packets and the latter on global structure. Normal traffic and malicious traffic differ in at least one of these features, and by transforming the traffic interaction graph, the graph convolutional neural network can fuse the two, so that the differences between hidden malicious traffic and normal traffic are learned more comprehensively;

the data packet features are counted and recorded packet by packet, and through the conversion from histogram to flow feature vector, the packet-level features of a flow are fully recorded in a fixed-length flow feature vector, which reduces the information loss of manually constructed statistical features; the fixed length of the feature vector also bounds the time and space consumed by detection.

Drawings

FIG. 1 is a system architecture diagram of the hidden malicious traffic detection method integrating statistical features and graph structural features.

FIG. 2 is a flow chart of the hidden malicious traffic detection method integrating statistical features and graph structural features.

Detailed Description

The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that the advantages and features of the present invention can be more easily understood by those skilled in the art, thereby making clear and defining the scope of the present invention.

Referring to FIG. 1 and FIG. 2, an embodiment of the present invention includes:

a hidden malicious traffic detection method integrating statistical features and graph structural features, comprising the following steps:

In step S1, a traffic monitoring device is deployed at the gateway to monitor traffic inside the local area network.

In step S1, the data packets monitored within a time period are aggregated into data flows according to source IP and destination IP. Each data flow serves as an edge of the traffic interaction graph, and each host serves as a node, so the traffic interaction graph contains the structural information of the data flows. Attack behavior generally follows an attack chain: the attacker moves laterally in an abnormal way within the local area network in order to infect other hosts, so the structure of some attack traffic differs from that of normal traffic, and malicious traffic can be distinguished by learning the traffic structure. The invention distinguishes nodes by IP only, because if nodes were distinguished by both port and IP, the random port allocation used by some programs would affect detection.
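The aggregation in step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Packet` record and the function names are hypothetical, and treating a flow as the ordered pair (source IP, destination IP) is an assumption, since the text does not say whether the two directions are merged.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical minimal packet record; field names are illustrative."""
    src_ip: str
    dst_ip: str
    timestamp: float
    length: int


def aggregate_flows(packets):
    """Group packets by (source IP, destination IP); ports are deliberately ignored."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[(pkt.src_ip, pkt.dst_ip)].append(pkt)
    return dict(flows)


def build_interaction_graph(flows):
    """Return (hosts, edges): each host IP is a node, each flow key is an edge."""
    hosts = sorted({ip for key in flows for ip in key})
    edges = sorted(flows)
    return hosts, edges


packets = [
    Packet("10.0.0.1", "10.0.0.2", 0.00, 60),
    Packet("10.0.0.1", "10.0.0.2", 0.50, 1500),
    Packet("10.0.0.2", "10.0.0.3", 0.70, 60),
]
flows = aggregate_flows(packets)
hosts, edges = build_interaction_graph(flows)
```

Keying only on IP addresses mirrors the stated design choice: a program that opens many connections from random ports still produces a single edge between the two hosts.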

In step S2, packet-granularity features are extracted from each data flow and converted into a fixed-length flow feature vector so that the graph convolutional neural network can learn the flow features; flows with similar features yield similar flow feature vectors. The vectors are input into the graph convolutional neural network for learning and training, and a classification result is finally obtained.

In step S3, the edges of the traffic interaction graph are converted into nodes of the flow association graph, and two data flow nodes are connected if the two data flows communicate with the same host: if two data flows are attached to the same host in the traffic interaction graph, they are connected in the flow association graph. This conversion turns the flow feature vectors, which are edge attributes in the traffic interaction graph, into node attributes, so that they can take part in the training of the graph convolutional neural network. The flow association graph is input into the graph convolutional neural network for training; from the input layer through the hidden layers the network learns the flow feature vectors and progressively aggregates node information, and the last layer predicts the result with a softmax classifier.
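The conversion in step S3 corresponds to what graph theory calls a line graph. A minimal sketch is given below, assuming each flow is identified by a hypothetical (host, host) tuple; the function name is illustrative.

```python
from itertools import combinations


def to_flow_association_graph(flow_keys):
    """Make every flow a node and connect two flows that share an endpoint host."""
    flow_keys = list(flow_keys)
    adjacency = {key: set() for key in flow_keys}
    for a, b in combinations(flow_keys, 2):
        if set(a) & set(b):  # the two flows communicate with a common host
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Hosts A..F; the flow E-F shares no host with the others and stays isolated.
flow_keys = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
graph = to_flow_association_graph(flow_keys)
```

The pairwise comparison is quadratic in the number of flows; for large captures, indexing flows by host first and connecting flows within each host's bucket would avoid comparing unrelated pairs.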

The data packets are parsed, and the length, protocol, inter-packet time interval, and port number of each data packet are extracted. The features of every data packet in the data flow are recorded, and the start timestamp and duration of the connection are saved.
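The per-packet extraction can be sketched as below. The `RawPacket` record is hypothetical, and two details are assumptions because the text leaves them open: the destination port is used as the port number, and the first packet of a flow is given an interval of 0.

```python
from typing import NamedTuple


class RawPacket(NamedTuple):
    """Hypothetical parsed packet; field names are illustrative."""
    timestamp: float
    length: int
    protocol: str
    dst_port: int


def extract_packet_features(flow_packets):
    """Return (features, start, duration) for one flow.

    Each feature tuple is (length, protocol, interval since previous packet, port).
    """
    pkts = sorted(flow_packets, key=lambda p: p.timestamp)
    features = []
    prev_ts = None
    for p in pkts:
        interval = 0.0 if prev_ts is None else p.timestamp - prev_ts
        features.append((p.length, p.protocol, interval, p.dst_port))
        prev_ts = p.timestamp
    start = pkts[0].timestamp if pkts else None
    duration = pkts[-1].timestamp - pkts[0].timestamp if pkts else 0.0
    return features, start, duration


flow_packets = [
    RawPacket(1.0, 60, "TCP", 443),
    RawPacket(3.0, 60, "TCP", 443),    # arrives out of order in this list
    RawPacket(1.5, 1500, "TCP", 443),
]
features, start, duration = extract_packet_features(flow_packets)
```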

The packet-granularity features are processed as follows: the feature values in each data flow are first counted, with each feature counted separately, so that four histograms are generated per data flow. Because the number of distinct values of each feature cannot be predicted, the lengths of the histograms must be unified.
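A sketch of the histogram step follows. The input is assumed to be a list of (length, protocol, interval, port) tuples; binning the continuous time interval with a fixed width (`interval_bin`, a hypothetical parameter) is an assumption, since the text does not say how intervals are discretized.

```python
from collections import Counter


def build_histograms(packet_features, interval_bin=0.25):
    """Count each of the four features separately, giving four histograms per flow."""
    hists = {"length": Counter(), "protocol": Counter(),
             "interval": Counter(), "port": Counter()}
    for length, protocol, interval, port in packet_features:
        hists["length"][length] += 1
        hists["protocol"][protocol] += 1
        hists["interval"][int(interval / interval_bin)] += 1  # bin index, not seconds
        hists["port"][port] += 1
    return hists


packet_features = [
    (60, "TCP", 0.0, 443),
    (1500, "TCP", 0.5, 443),
    (60, "TCP", 1.5, 443),
]
hists = build_histograms(packet_features)
# The four histograms have different numbers of bins, which is why
# a later step has to map them to a common fixed length.
sizes = {name: len(h) for name, h in hists.items()}
```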

A locality-sensitive hash function is used to convert the variable-length feature histograms into fixed-length feature vectors. A hash value is computed separately for each feature histogram, and the Jaccard similarity between the hash values of two histograms approximates the Jaccard similarity between the histograms themselves, so the invention uses the hash values as an approximate substitute for the histograms in training.
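The text does not name the hash family. The sketch below assumes MinHash, a standard locality-sensitive scheme for Jaccard similarity, with seeded BLAKE2b digests standing in for random permutations. Expanding each histogram bin into (value, i) elements so that counts influence the similarity is also an assumption, as are all function names.

```python
import hashlib


def _seeded_hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item!r}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def histogram_to_set(hist):
    """Expand {value: count} into elements (value, 0..count-1) so counts matter."""
    return {(value, i) for value, count in hist.items() for i in range(count)}


def minhash_signature(items, num_hashes=64):
    """Fixed-length signature: the minimum hash of the set under each seed."""
    if not items:
        return [0] * num_hashes  # convention chosen here for empty histograms
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


hist_a = {60: 2, 1500: 1}
hist_b = {60: 2, 1500: 1, 40: 5}   # shares 3 of 8 elements with hist_a
hist_c = {9000: 3}                 # shares nothing with hist_a
sig_a = minhash_signature(histogram_to_set(hist_a))
sig_b = minhash_signature(histogram_to_set(hist_b))
sig_c = minhash_signature(histogram_to_set(hist_c))
similar = estimate_jaccard(sig_a, sig_b)
dissimilar = estimate_jaccard(sig_a, sig_c)
```

Concatenating the signatures of the four histograms would give one flow feature vector of length 4 x `num_hashes`; a larger `num_hashes` tightens the estimate at the cost of a longer vector.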

In step S4, the graph convolutional neural network can learn from the feature matrix and the graph structure at the same time, but it accepts only a feature matrix of nodes, whereas the flow feature vectors at this point are edge attributes. To turn the edge features into node features, the traffic interaction graph is converted into the flow association graph, and the feature vectors and the flow association graph are then input together into the graph convolutional neural network for training.
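How the feature matrix and the graph structure enter the network together can be illustrated with a two-layer forward pass. This is only a sketch: the symmetric normalization with self-loops is a commonly used convention assumed here, the weights are arbitrary untrained values, and a real system would use a deep learning framework and a training loop.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense 0/1 adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(adj, features, w0, w1):
    """Two-layer GCN: softmax(A_norm * relu(A_norm * X * W0) * W1)."""
    a_norm = normalize_adjacency(adj)
    hidden = relu(matmul(matmul(a_norm, features), w0))
    return softmax_rows(matmul(matmul(a_norm, hidden), w1))


# Three flow nodes in a path 0-1-2, two features each, two classes (normal/malicious).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
w0 = [[0.2, -0.1], [0.4, 0.3]]    # arbitrary illustrative weights
w1 = [[0.5, -0.5], [-0.3, 0.3]]
probs = gcn_forward(adj, features, w0, w1)
```

Each row of `probs` is a class distribution for one flow node; multiplying by the normalized adjacency is what mixes a flow's own features with those of the flows it is connected to.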

In the hidden malicious traffic detection method integrating statistical features and graph structural features, the flow feature vector is formed by processing packet-granularity features, so the information of the data flow is retained more comprehensively. Computing the locality-sensitive hash yields a fixed-length feature vector, which reduces the consumption of space and time, so this feature extraction method lowers the detection cost without losing information. Through the conversion from the traffic interaction graph to the flow association graph, the graph convolutional neural network can fuse features of two dimensions, the flow features and the structural features, which increases detection accuracy.

The method fuses the features of the data flows with those of the traffic graph, combining the features of the flows themselves with global structural features to improve detection accuracy. It also improves on manually constructed flow statistical features: using the flow feature histogram as an intermediary, the packet-by-packet features of a flow are converted into a fixed-length flow feature vector, which reduces time and space consumption while preserving feature information more comprehensively.

The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims

1. A hidden malicious flow detection method integrating statistical features and graph structural features is characterized in that: the method comprises the following steps:

2. The hidden malicious traffic detection method integrating statistical features and graph structural features according to claim 1, wherein in step S1 a traffic monitoring device is deployed at the gateway to monitor traffic inside the local area network.

3. The hidden malicious traffic detection method integrating statistical features and graph structural features according to claim 1, wherein in step S1 the data packets monitored within a time period are aggregated into data flows according to source IP and destination IP.

4. The hidden malicious traffic detection method integrating statistical features and graph structural features according to claim 1, wherein in step S2 packet-granularity features are extracted from each data flow, the packet-granularity feature vectors are converted into a fixed-length flow feature vector, and the flow feature vectors are input into a graph convolutional neural network for learning and training to finally obtain a classification result.

5. The hidden malicious traffic detection method integrating statistical features and graph structural features according to claim 1, wherein in step S3 the edges of the traffic interaction graph are converted into nodes of the flow association graph, two data flow nodes are connected if the two data flows communicate with the same host, the flow association graph is input into a graph convolutional neural network for training, the hidden layers aggregate the features of neighboring nodes, and a final fully connected softmax layer completes the classification of the nodes.

6. The hidden malicious traffic detection method integrating statistical features and graph structural features according to claim 4, wherein the data packets are parsed, the length, protocol, inter-packet time interval, and port number of each data packet are extracted, the features of every data packet in the data flow are recorded, and the start timestamp and duration of the connection are saved.

7. The hidden malicious traffic detection method integrating statistical features and graph structural features according to claim 4, wherein the packet-granularity features are processed by first counting the feature values in each data flow, with each feature counted separately, so that four histograms are generated per data flow, and, because the number of distinct values of each feature cannot be predicted, unifying the lengths of the histograms.

8. The hidden malicious traffic detection method integrating statistical features and graph structural features according to claim 7, wherein the variable-length feature histograms are converted into fixed-length feature vectors using a locality-sensitive hash function, a hash value is computed separately for each feature histogram, and the Jaccard similarity between the hash values of two histograms approximates the Jaccard similarity between the histograms.

9. The hidden malicious traffic detection method integrating statistical features and graph structural features according to claim 1, wherein in step S4 the graph convolutional neural network can learn from the feature matrix and the graph structure at the same time but accepts only a feature matrix of nodes, whereas the flow feature vectors are edge attributes; to turn the edge features into node features, the traffic interaction graph is converted into the flow association graph, and the feature vectors and the flow association graph are then input together into the graph convolutional neural network for training.