CN114124551B

CN114124551B - Malicious encryption traffic identification method based on multi-granularity feature extraction under WireGuard protocol

Info

Publication number: CN114124551B
Application number: CN202111430097.6A
Authority: CN
Inventors: 李航; 丁建伟; 刘志洁; 汪明达; 陈周国
Original assignee: CETC 30 Research Institute
Current assignee: CETC 30 Research Institute
Priority date: 2021-11-29
Filing date: 2021-11-29
Publication date: 2023-05-23
Anticipated expiration: 2041-11-29
Also published as: CN114124551A

Abstract

The invention provides a method for identifying malicious encrypted traffic under the WireGuard protocol based on multi-granularity feature extraction, comprising the following steps: obtaining a pcap file of traffic data; preprocessing the pcap-format traffic data in the pcap file to obtain session data; extracting multi-granularity features from the session data to obtain a multi-granularity feature library; and, based on the multi-granularity feature library, training a model with a machine learning algorithm, performing encrypted traffic identification, and outputting the encrypted traffic identification result. The method thereby detects malicious encrypted traffic under the WireGuard protocol. Traffic features are extracted at several granularities, including the packet level, the session level and the host level, which improves the discriminative power and noise resistance of the features and therefore the accuracy of model detection.

Description

Malicious encryption traffic identification method based on multi-granularity feature extraction under WireGuard protocol

Technical Field

The invention relates to the technical field of network information security, and in particular to a method for identifying malicious encrypted traffic under the WireGuard protocol based on multi-granularity feature extraction.

Background

Traffic encryption protects the data security of enterprises and users, but it can also be exploited by criminals for illicit transmission. Because encrypted traffic is harder to audit, malicious traffic is difficult to detect accurately and comprehensively without decryption. Identifying encrypted traffic is therefore of great importance for keeping networks operating securely.

Traffic identification research mostly revolves around protocols, especially virtual private network protocols. Compared with the widely used IPSec protocol, WireGuard is a new virtual private network (VPN) protocol that uses more advanced encryption algorithms and a secure, trusted architecture, and delivers strong performance while ensuring network security. It has the following characteristics:

(1) It is simple to use: only public keys need to be configured and exchanged, and WireGuard completes the rest of the work automatically. End users do not need to manage connections, care about state, manage daemons, or worry about what happens under the hood, which greatly lowers the threshold for use and maintenance.

(2) It supports state-of-the-art cryptography, such as the Noise protocol framework, Curve25519, ChaCha20, Poly1305, BLAKE2, SipHash24 and HKDF.

(3) It is simple in design and easy to implement. It is implemented in a small amount of code and is open source; compared with implementations such as *Swan/IPsec or OpenVPN/OpenSSL, this greatly reduces the source code auditing workload, to the point that a single person can review it in full.

(4) It provides high-speed encrypted transmission. Because its code is concise and executes efficiently, WireGuard achieves good transmission performance both on PCs and on small embedded devices (such as smartphones and routers).

Owing to these characteristics, WireGuard has been rapidly accepted by the industry. It has been included in the Linux kernel as a kernel module since version 5.6, and is expected to become widespread quickly.

In recent years, techniques for identifying the encrypted traffic of common VPN protocols and implementations (such as IPSec, OpenVPN and OpenSSL) have matured and achieved good results, but research on identifying encrypted traffic under the newer WireGuard protocol remains insufficient.

Disclosure of Invention

The invention aims to provide a method for identifying malicious encrypted traffic under the WireGuard protocol, based on multi-granularity feature extraction, without decrypting the traffic.

The invention provides a method for identifying malicious encrypted traffic under the WireGuard protocol based on multi-granularity feature extraction, which comprises the following steps:

obtaining a pcap file of traffic data;

preprocessing the pcap-format traffic data in the pcap file to obtain session data;

extracting multi-granularity features from the session data to obtain a multi-granularity feature library;

based on the multi-granularity feature library, training a model with a machine learning algorithm, performing encrypted traffic identification, and outputting the encrypted traffic identification result.

Further, the method for preprocessing the traffic data in the pcap format includes:

filtering out broadcast traffic and ICMP traffic from the pcap-format WireGuard traffic data in the pcap file;

extracting the data packets of each session from the filtered WireGuard traffic data; each data packet comprises five-tuple information, payload data and the fields parsed from each protocol layer;

storing the pcap-format traffic data, session by session, as session data containing the data packets.
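The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Packet` record, the `is_broadcast` heuristic and the bidirectional session key are assumptions introduced here, and a real pipeline would fill `Packet` from a pcap parser.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical parsed-packet record (not defined in the source)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str      # e.g. "UDP", "TCP", "ICMP"
    timestamp: float
    payload: bytes


def is_broadcast(ip: str) -> bool:
    # Simplifying assumption: only the limited broadcast address and
    # addresses ending in .255 are treated as broadcast.
    return ip == "255.255.255.255" or ip.endswith(".255")


def session_key(pkt: Packet):
    # Sort the two endpoints so that both directions of a conversation
    # map to the same session (an assumption about what "session" means).
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    return (min(a, b), max(a, b), pkt.protocol)


def preprocess(packets):
    """Drop broadcast and ICMP packets, then group the rest by session."""
    sessions = defaultdict(list)
    for pkt in packets:
        if pkt.protocol == "ICMP" or is_broadcast(pkt.dst_ip):
            continue
        sessions[session_key(pkt)].append(pkt)
    for pkts in sessions.values():
        pkts.sort(key=lambda p: p.timestamp)   # keep packets in time order
    return dict(sessions)
```

Keying on the sorted endpoint pair keeps uplink and downlink packets in one session, which the later uplink/downlink ratio features rely on.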

Further, the method for extracting multi-granularity characteristics of the session data comprises the following steps:

from the session data containing data packets obtained by data preprocessing, extracting the packet-level features of the data packets and the session-level features, and computing the host-level features after aggregating by IP address;

according to the same session five-tuple information, splicing the packet-level features of the first N data packets into the session-level features in sequence; and splicing the host-level features of the same source IP address and the host-level features of the same destination IP address in the session into the session-level features to obtain a final multi-granularity feature library.
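The splicing step can be pictured as concatenating fixed-length vectors. The sketch below is only illustrative: the function name `splice_features`, the zero-padding of sessions shorter than N packets, and the ordering of the blocks (packet features first, then session, source-host and destination-host features) are assumptions not stated in the source.

```python
def splice_features(packet_feats, session_feats, src_host_feats, dst_host_feats, n=20):
    """Concatenate per-packet, session and host features into one flat vector.

    packet_feats: list of equal-length numeric lists, one per packet, in order.
    """
    if not packet_feats:
        raise ValueError("need at least one packet feature vector")
    width = len(packet_feats[0])
    first_n = list(packet_feats[:n])
    # Assumption: short sessions are zero-padded so that every session
    # yields a vector of the same length.
    first_n += [[0.0] * width] * (n - len(first_n))
    flat = [value for feats in first_n for value in feats]
    return flat + list(session_feats) + list(src_host_feats) + list(dst_host_feats)
```

A fixed output length matters because tree-based learners expect every row of the feature library to have the same columns.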

Further, the packet-level features include:

a port number;

a transport protocol type;

the payload length, i.e. the length of the payload of the data packet;

whether the payload contains plaintext;

the entropy value of the payload, i.e. the entropy value of the payload of the data packet;

payload characteristics: whether a record data type, a protocol version and a packet length are present.
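Several of these packet-level features can be computed directly from the raw payload bytes. The sketch below assumes the payload entropy is the Shannon entropy of the byte values, and uses a hypothetical printable-ASCII heuristic for the plaintext flag; neither detail is specified in the source, and the function names are illustrative.

```python
import math
from collections import Counter


def payload_entropy(payload: bytes) -> float:
    """Shannon entropy of the byte values, in bits per byte (0.0 to 8.0)."""
    if not payload:
        return 0.0
    n = len(payload)
    return -sum((c / n) * math.log2(c / n) for c in Counter(payload).values())


def looks_like_plaintext(payload: bytes, threshold: float = 0.9) -> bool:
    # Hypothetical heuristic: the share of printable ASCII bytes
    # (plus tab, LF, CR) reaches the threshold.
    if not payload:
        return False
    printable = sum(1 for b in payload if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(payload) >= threshold


def packet_features(port: int, protocol: str, payload: bytes) -> dict:
    """Collect the packet-level features that need no protocol parsing."""
    return {
        "port": port,
        "protocol": protocol,
        "payload_length": len(payload),
        "contains_plaintext": looks_like_plaintext(payload),
        "payload_entropy": payload_entropy(payload),
    }
```

Well-encrypted payloads tend to have entropy close to 8 bits per byte, which is why this value helps separate ciphertext from plaintext or structured headers.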

Further, the session-level feature includes:

packet length distribution of transmit/receive data streams: mean, variance, maximum, minimum and entropy values of packet length;

delay law of data stream: mean, variance, maximum and minimum of delay time;

sequence characteristics of the packets sent and received in the stream: the ratio of uplink data to downlink data, the total number of packets sent/received, and the total number of bytes sent/received;

byte distribution: information entropy value and average information entropy.
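As an illustration of the statistical session-level features, the sketch below computes packet-length and delay statistics and the uplink/downlink ratio for one session. It assumes that "delay" means the inter-arrival time between consecutive packets, that the ratio is computed on bytes, and that uplink means packets sent by a given `client_ip`; these choices and all names are assumptions.

```python
import statistics


def summarize(values):
    """Mean, population variance, maximum and minimum of a non-empty sequence."""
    return {
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),
        "max": max(values),
        "min": min(values),
    }


def session_features(packets, client_ip):
    """packets: list of (timestamp, src_ip, payload_length), sorted by time."""
    lengths = [length for _, _, length in packets]
    times = [t for t, _, _ in packets]
    # Inter-arrival times; a single-packet session gets one zero delay.
    delays = [b - a for a, b in zip(times, times[1:])] or [0.0]
    up = sum(length for _, src, length in packets if src == client_ip)
    down = sum(lengths) - up
    return {
        "length": summarize(lengths),
        "delay": summarize(delays),
        "total_packets": len(packets),
        "total_bytes": sum(lengths),
        # A session with no downlink bytes is given an infinite ratio.
        "up_down_ratio": up / down if down else float("inf"),
    }
```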

Further, the entropy value of the packet length is defined as:

wherein Entropy(P) represents the entropy of the packet length, m is the maximum payload length, x_i is the number of messages whose payload length is i, and n is the total number of messages.
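The formula itself is not reproduced in this text. Reading x_i as the number of messages whose payload length is i and n as the total number of messages, the symbol definitions are consistent with the Shannon entropy of the empirical payload-length distribution. The following is a reconstruction under that assumption; the logarithm base and the lower summation bound are not stated in the source, and terms with x_i = 0 are taken to contribute zero:

```latex
\mathrm{Entropy}(P) = -\sum_{i=1}^{m} \frac{x_i}{n} \log \frac{x_i}{n}
```

Under this reading, a session in which every packet has the same payload length has entropy 0, and the entropy grows as the lengths spread more evenly over different values.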

Further, the host level features include:

time distribution of the requests initiated by an IP: request frequency counted per hour;

frequency characteristics of the requests initiated by an IP: average, minimum, maximum and variance of the hourly frequency;

number of IP originated requests: the number of accesses/the number of corresponding ports/the number of requested domain names/the number of requested TCP sessions/the number of requested UDP sessions within 1 day/1 hour/5 minutes;

the proportion of the uplink data and the downlink data in all sessions in the IP;

packet length features in all sessions in IP: mean, maximum, minimum, and variance of packet lengths.
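The hourly frequency features can be sketched as a per-IP aggregation. The example assumes each request is given as a source IP and a Unix timestamp in seconds, and that only hours containing at least one request enter the statistics; both are assumptions, as are the function names.

```python
import statistics
from collections import Counter, defaultdict


def hourly_request_counts(requests):
    """requests: iterable of (src_ip, unix_timestamp_seconds) pairs.

    Returns {ip: Counter({hour_index: request_count})}.
    """
    per_ip = defaultdict(Counter)
    for ip, ts in requests:
        per_ip[ip][int(ts // 3600)] += 1   # whole hours since the epoch
    return per_ip


def frequency_features(hour_counts):
    """Average, minimum, maximum and variance of the hourly request counts."""
    # Assumption: hours without any request are not counted as zeros.
    counts = list(hour_counts.values())
    return {
        "mean": statistics.fmean(counts),
        "min": min(counts),
        "max": max(counts),
        "variance": statistics.pvariance(counts),
    }
```

The same aggregation pattern, with a different bucket width, would give the per-day and per-5-minute counts listed above.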

Further, the method for training a model and identifying encrypted traffic by using a machine learning algorithm based on the multi-granularity feature library comprises the following steps:

step 1: determining whether a trained model exists; if so, the processing is encrypted traffic identification and the method proceeds to step 2; otherwise, it proceeds to step 5;

step 2: inputting a multi-granularity feature library and a trained model, and entering a step 3;

step 3: predicting the multi-granularity feature library by using the trained model, and entering step 4;

step 4: outputting an encrypted flow identification result, and ending;

step 5: inputting the multi-granularity feature library and label data, and entering step 6; the label data is a label set that marks whether each session-level feature in the multi-granularity feature library belongs to malicious encrypted traffic;

step 6: setting machine learning algorithm parameters, and entering a step 7;

step 7: training and storing a model based on the multi-granularity feature library and the label data, and entering step 8;

step 8: and outputting the trained model, and ending.
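The control flow of steps 1 to 8 can be sketched as a single function that either predicts with a stored model or trains and stores one. To stay self-contained, the sketch uses a nearest-centroid classifier as a stand-in learner; this is not the gradient boosting tree algorithm preferred by the invention, and the names `train`, `predict`, `run` and the JSON model file are assumptions made for illustration.

```python
import json
import os


def train(X, y):
    """Stand-in learner (nearest class centroid), NOT a gradient boosting tree."""
    labels = sorted(set(y))
    centroids = []
    for label in labels:
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids.append([sum(col) / len(rows) for col in zip(*rows)])
    return {"labels": labels, "centroids": centroids}


def predict(model, X):
    def nearest(x):
        dists = [sum((p - q) ** 2 for p, q in zip(x, c)) for c in model["centroids"]]
        return model["labels"][dists.index(min(dists))]
    return [nearest(x) for x in X]


def run(features, model_path, labels=None):
    # Step 1: branch on whether a trained model already exists.
    if os.path.exists(model_path):
        # Steps 2-4: load the model, predict, return the identification result.
        with open(model_path) as fh:
            model = json.load(fh)
        return predict(model, features)
    # Steps 5-8: take the labelled data, train, store and return the model.
    # (Step 6, parameter setting, is empty here: the stand-in has no parameters.)
    if labels is None:
        raise ValueError("no trained model found and no labels supplied")
    model = train(features, labels)
    with open(model_path, "w") as fh:
        json.dump(model, fh)
    return model
```

Calling `run` once with labels trains and saves the model; later calls with the same `model_path` and no labels take the identification branch. Swapping in a gradient boosting tree would change only `train` and `predict`, plus the way the model is serialized.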

Preferably, the machine learning algorithm is a gradient boosting tree algorithm.

In summary, due to the adoption of the technical scheme, the beneficial effects of the invention are as follows:

the invention provides a method for identifying malicious encrypted traffic under the WireGuard protocol based on multi-granularity feature extraction, and thereby detects malicious encrypted traffic under the WireGuard protocol. Traffic features are extracted at several granularities, including the packet level, the session level and the host level, which improves the discriminative power and noise resistance of the features and therefore the accuracy of model detection.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will briefly describe the drawings in the embodiments, it being understood that the following drawings only illustrate some embodiments of the present invention and should not be considered as limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a method for identifying malicious encrypted traffic based on multi-granularity feature extraction under a WireGuard protocol according to an embodiment of the present invention.

FIG. 2 is a flow chart of multi-granularity feature extraction according to an embodiment of the invention.

FIG. 3 is a flow chart of training a model and performing encrypted traffic recognition using a machine learning algorithm according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Examples

As shown in fig. 1, this embodiment proposes a method for identifying malicious encrypted traffic based on multi-granularity feature extraction under a WireGuard protocol, including the following steps:

S1, obtaining a pcap file of traffic data. In this embodiment, the pcap file is collected as follows: a VPN tool using the WireGuard protocol is run continuously for a period of time as the Internet access tool. First, web services are accessed, and the traffic of that communication is collected as positive samples; second, malware, ransomware and the like are run in a sandbox environment, and the traffic they generate is collected as negative samples. The traffic of the positive and negative samples is saved as a pcap file containing pcap-format traffic data.

S2, carrying out data preprocessing on the flow data in the pcap format in the pcap file to obtain session data; the method specifically comprises the following steps:

S3, extracting multi-granularity characteristics of the session data to obtain a multi-granularity characteristic library; as shown in fig. 2, the method specifically includes:

according to the same session five-tuple information, splicing the packet-level features of the first N (for example, N=20) data packets into the session-level features in sequence; and splicing the host-level features of the same source IP address and the host-level features of the same destination IP address in the session into the session-level features to obtain a final multi-granularity feature library.

In this embodiment, the packet-level feature includes:

a port number;

a transport protocol type, such as UDP;

the payload length, i.e. the length of the payload of the data packet;

whether the payload contains plaintext;

In this embodiment, the session-level feature includes:

delay law of data stream: mean, variance, maximum and minimum of delay time;

byte distribution: information entropy value and average information entropy.

The entropy value for the packet length is defined as:

In this embodiment, the host-level feature includes:

S4, based on the multi-granularity feature library, training a model and performing encrypted traffic identification with a machine learning algorithm (optionally a gradient boosting tree algorithm), and outputting the encrypted traffic identification result; as shown in fig. 3, this specifically includes:

step 1: determining whether a trained model (namely a gradient boosting decision tree model, GBDT) exists; if so, the processing is encrypted traffic identification and the method proceeds to step 2; otherwise, it proceeds to step 5;

step 4: outputting an encrypted flow identification result, and ending;

step 6: setting machine learning algorithm parameters, and entering a step 7;

step 8: and outputting the trained model, and ending.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A malicious encryption traffic identification method based on multi-granularity feature extraction under a WireGuard protocol is characterized by comprising the following steps:

obtaining a pcap file of flow data;

training a model and carrying out encryption traffic identification by using a machine learning algorithm based on the multi-granularity feature library, and outputting an encryption traffic identification result;

the method for extracting the multi-granularity characteristics of the session data comprises the following steps:

according to the same session five-tuple information, splicing the packet-level features of the first N data packets into the session-level features in sequence; splicing the host-level features of the same source IP address and the host-level features of the same destination IP address in the session into the session-level features to obtain a final multi-granularity feature library;

the packet-level features include:

a port number;

a transport protocol type;

the payload length, i.e. the length of the payload of the data packet;

whether the payload contains plaintext;

payload characteristics: whether a record data type, a protocol version and a packet length are present;

the session-level features include:

delay law of data stream: mean, variance, maximum and minimum of delay time;

byte distribution: information entropy value and average information entropy;

the entropy value of the packet length is defined as:

2. The method for identifying malicious encrypted traffic based on multi-granularity feature extraction under the WireGuard protocol according to claim 1, wherein the method for performing data preprocessing on the traffic data in the pcap format comprises the following steps:

3. The method for malicious encrypted traffic identification under the WireGuard protocol based on multi-granularity feature extraction of claim 1, wherein the host-level features comprise:

4. The method for identifying malicious encrypted traffic based on multi-granularity feature extraction under the WireGuard protocol according to claim 1, wherein the method for training a model and identifying encrypted traffic by using a machine learning algorithm based on a multi-granularity feature library comprises the following steps:

step 4: outputting an encrypted flow identification result, and ending;

step 6: setting machine learning algorithm parameters, and entering a step 7;

step 8: and outputting the trained model, and ending.

5. The method for identifying malicious encrypted traffic based on multi-granularity feature extraction under the WireGuard protocol according to claim 1 or 4, wherein the machine learning algorithm is a gradient boosting tree algorithm.