CN115334005A - Encrypted flow identification method based on pruning convolution neural network and machine learning - Google Patents
- Publication number
- CN115334005A (Application CN202210337870.2A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- pruning
- convolutional neural
- model
- machine learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
- H04L47/24—Traffic characterised by specific attributes, e.g. priority or QoS
- H04L47/2441—Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
- H04L47/24—Traffic characterised by specific attributes, e.g. priority or QoS
- H04L47/2483—Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/04—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
- H04L63/0428—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
Abstract
The invention discloses an encrypted traffic identification method based on a pruned convolutional neural network and machine learning, comprising the steps of data preprocessing, CNN model construction, model pruning, extraction of high-level feature vectors by the CNN, and LightGBM classification. The method requires no manual feature extraction: the CNN model automatically extracts high-level features from raw traffic files and classifies them. Pruning the convolutional neural network model reduces the number of model parameters and the computational overhead. The LightGBM then classifies the encrypted traffic from these high-level features, achieving a strong classification effect with weak classifiers and improving accuracy, so that the final model attains higher performance and accuracy than other classification models.
Description
Technical Field
The invention relates to the technical field of network traffic identification, in particular to an encrypted traffic identification method based on a pruning convolutional neural network and machine learning.
Background
Network traffic identification techniques play an important role in applications such as network quality-of-service control, traffic charging, network resource planning, and malware detection. With the continuous development of network information technology, more and more software uses SSL, SSH, VPN, Tor, and other encryption or port-obfuscation technologies, and the proportion of encrypted traffic keeps rising.
According to the survey agency NetMarketShare, by October 2019 the proportion of encrypted Web traffic had exceeded ninety percent; 90 of the top 100 non-Google websites on the Internet use HTTPS by default; and globally the HTTPS proportion is 92% in the United States, 85% in Russia, 80% in Japan, and 74% in Indonesia. This change presents new challenges to current traffic-detection methods, making network traffic identification and analysis increasingly difficult.
The premise of traffic classification is that different traffic types exhibit distinctive characteristics. Current traffic classification methods can be roughly divided into the following categories:
1) Port-based classification. This method distinguishes traffic types by the port numbers the traffic uses, on the premise that application services use the ports allocated by the IANA and keep them unchanged.
2) Payload-based classification, also called deep packet inspection. Protocols are distinguished by static payload signatures, and the method can be used for some coarse-grained traffic classification.
3) Statistics-based classification. This category mostly uses machine learning techniques and distinguishes traffic types by statistical characteristics, which fall roughly into two levels: packet-level features such as packet length, inter-arrival time, and direction; and flow-level features such as the number of uplink and downlink packets, flow duration, and the proportions of different packet types.
Current traffic classification methods have the following disadvantages:
1) Port-based classification suffers a sharp drop in accuracy when application software uses ports outside the IANA assignments, and it cannot identify malware traffic, which typically uses random or dynamic ports.
2) Payload-based classification fails once encryption destroys the payload signatures it depends on; it is only suitable for coarse-grained classification or incompletely encrypted scenarios.
3) Deep-learning-based classification produces trained models with a huge number of parameters, which restricts where the model can be deployed.
Disclosure of Invention
Aiming at the technical problem that the classification model trained by a deep-learning-based method has a huge number of parameters, the invention provides an encrypted traffic identification method based on a pruned convolutional neural network and machine learning. Features need not be extracted manually: high-level features are automatically extracted from raw traffic files and classified. The model is pruned to reduce its parameter count, the convolutional neural network extracts features automatically, and the LightGBM achieves a strong classification effect with weak classifiers, so that the final model attains higher performance and accuracy than other classification models and is suitable for efficient detection of encrypted traffic.
In order to achieve the above purpose, the invention provides the following technical scheme:
the invention provides an encrypted flow identification method based on a pruning convolutional neural network and machine learning, which comprises the following steps of:
s1: preprocessing data;
S2: a CNN model is constructed; the convolutional neural network mainly comprises the following layers: an input layer, convolutional layers, ReLU layers, pooling layers, and a fully connected layer;
s3: pruning the model, retraining the model, and obtaining an optimized CNN model after a plurality of iterations;
s4: outputting a 256-dimensional characteristic vector serving as the input of the LightGBM classifier by the optimized CNN model;
s5: the LightGBM classification is characterized in that a gradient decision tree in a LightGBM algorithm is obtained by carrying out multiple iterations on a given training data set, during each iteration, a new tree is readjusted by using gradient information to add into a previous iteration tree, the process is a continuously-changing linear combination process in a function space, the LightGBM integrates weights of all leaf nodes as references for building the tree, then partition points are determined, a first-order gradient and a second-order gradient are calculated, and after multiple iterations, the performance of the LightGBM classifier is enabled to reach the optimum.
Compared with the prior art, the invention has the following beneficial effects:
According to the encrypted traffic identification method based on the pruned convolutional neural network and machine learning, no manual feature extraction is needed: high-level features are automatically extracted from raw traffic files by the CNN model and classified. Meanwhile, pruning the convolutional neural network model reduces the number of model parameters and the computational cost. The LightGBM classifies the encrypted traffic from these high-level features, achieving a strong classification effect with weak classifiers and improving accuracy, so that the final model attains higher performance and accuracy than other classification models.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an encrypted traffic identification method based on a pruning convolutional neural network and machine learning according to an embodiment of the present invention.
Fig. 2 is a flow chart of data preprocessing according to an embodiment of the present invention.
Fig. 3 is a flowchart of pruning steps provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention provides an encrypted traffic identification method based on a pruning convolutional neural network and machine learning, which comprises the following steps as shown in figure 1:
s1: data preprocessing: processing the original flow file to be suitable for standard input of a CNN model;
the encrypted traffic input at step S1 uses public data set iscxnvpn 2016, which contains 6 traditional encrypted traffic types: email, chat, streaming, file transfer, voIP and P2P,6 corresponding VPN encrypted flows: VPN-Email, VPN-Chat, VPN-Streaming, VPN-File transfer, VPN-VoIP, and VPN-P2P. The traffic data are obtained by Wireshark and tcpdump tools in real environment, and the total volume is 28GB.
The specific flow of the data preprocessing step is shown in fig. 2. The key points are as follows:
removing irrelevant messages: i.e. removing packets that affect the model prediction or that are payload empty. The traffic in the real environment may include some packets for establishing and disconnecting TCP, such as packets including SYN, ACK, or FIN flag bits, and some packets for domain name resolution and packets with empty payload, which do not work for traffic classification, but rather affect classification accuracy, and therefore need to be removed.
Removing the Ethernet frame header: the Ethernet frame header contains the MAC addresses used to locate network devices and to transfer packets between network nodes, but it is of little use for traffic classification, so it is deleted.
Masking the IP address: IP addresses cause the model to overfit in traffic classification, so the source and destination IP addresses are set to 0.
Checking the packet length: the method uses a convolutional neural network, which requires a fixed-size input, but packet lengths vary. Each packet's length is therefore checked: if it is smaller than the specified input size, the packet is zero-padded at the end; if it is larger, the packet is truncated. This ensures that every traffic packet matches the input size of the CNN model.
Normalization: different features often have different scales. To make the inputs comparable, each packet is normalized by dividing every byte by 255, so that all input values lie between 0 and 1.
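The length check, truncation/zero-padding, and byte-wise normalization above can be sketched as follows. This is a minimal illustration: the 30×30 input size is taken from Table 1, and the function name is an assumption, not from the patent.

```python
import numpy as np

PACKET_LEN = 900  # 30*30 bytes, matching the CNN input size in Table 1 (assumption)

def preprocess_packet(raw_bytes: bytes) -> np.ndarray:
    """Truncate or zero-pad a packet to PACKET_LEN bytes, then normalize to [0, 1]."""
    data = bytearray(raw_bytes[:PACKET_LEN])         # truncate if too long
    data.extend(b"\x00" * (PACKET_LEN - len(data)))  # zero-pad at the end if too short
    arr = np.frombuffer(bytes(data), dtype=np.uint8).astype(np.float32)
    return (arr / 255.0).reshape(30, 30)             # byte-wise division by 255

# A 20-byte packet is zero-padded up to 900 bytes and scaled into [0, 1].
img = preprocess_packet(b"\x10\xff" * 10)
```

Masking the IP addresses and stripping the Ethernet header would happen before this step, on the raw pcap records.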
S2: and (5) constructing a CNN model.
A convolutional neural network is a feed-forward neural network with a deep structure that includes convolution operations, and it is one of the most popular deep-learning algorithms today. With deepening learning theory and improving computing performance, convolutional neural networks have developed rapidly and are applied to computer vision, natural language processing, and other fields. The convolutional neural network mainly consists of the following layers: an input layer, convolutional layers, ReLU layers, pooling layers, and a fully connected layer. Stacking these layers forms a complete convolutional neural network, which finally outputs a 256-dimensional feature vector for the subsequent LightGBM classifier. Too high an output dimension overfits the result and increases cost, while too low a dimension reduces classification accuracy. The structure of the CNN model used in the invention is shown in Table 1; the key points are as follows:
a convolutional layer: conv2D, two-dimensional convolution, the flow data packet can be converted into a gray image, more suitable for processing with two-dimensional convolution.
Activation function: reLU, as shown in equation (1), activates a node only when the input is greater than 0, the output is 0 when the input is less than 0, and the output is equal to the input when the input is greater than 0. The function can remove negative values in the convolution result, leaving positive values unchanged.
ReLU(x)=max(0,x) (1)
Batch standardization: batch Normalization, similar to normal data Normalization, is a method for unifying scattered data and optimizing a neural network, and divides data into small batches for random gradient descent. As shown in formula (2), wherein α i Is the value of the original activation of a certain neuron,is a standard value after standardized operation.
Loss function: the cross-entropy loss (CrossEntropy Loss) measures the difference between the true probability distribution p and the predicted probability distribution q, as shown in formula (3); the smaller the cross-entropy, the better the model's predictions:

L = −Σ_x p(x) log q(x)    (3)

Activation function of the output layer: Softmax. When a sample passes through the Softmax layer, a T×1 vector is output, and the index of its largest entry is taken as the sample's predicted label. The formula is shown in (4):

Softmax(z)_j = e^{z_j} / Σ_{t=1}^{T} e^{z_t}    (4)
Dropout: some neurons are randomly deactivated during training to improve the robustness of the model; the model sets dropout to 0.5.
TABLE 1 CNN model Main parameters
Network layer | Operation | Input | Convolution kernel | Stride | Padding | Output | Number of weights
---|---|---|---|---|---|---|---
1 | Conv2D+ReLU+BN | 30×30 | 3×3 | 1 | Same | 8×30×30 | 80
2 | Conv2D+ReLU+BN | 8×30×30 | 3×3 | 2 | Same | 16×14×14 | 1168
3 | Conv2D+ReLU+BN | 16×14×14 | 3×3 | 2 | Same | 32×6×6 | 4640
4 | Conv2D+ReLU+BN | 32×6×6 | 3×3 | 1 | Same | 64×4×4 | 18496
5 | Fully connected + Dropout | 64×4×4 | — | — | — | 256 | 262400
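As a cross-check on Table 1, the layer shapes follow the standard convolution output-size formula. The small helper below is hypothetical, not part of the patent; note that the stride-2 rows (16×14×14 and 32×6×6) match valid (zero) padding rather than "same" padding, so the "Same" entries on those rows appear to be a translation artifact.

```python
def conv2d_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 0) -> int:
    """Standard convolution output size: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Layer 1: 30x30 input, 3x3 kernel, stride 1, "same" padding (p=1) -> 30x30
layer1 = conv2d_out(30, 3, 1, 1)
# Layer 2: stride 2 -- the table's 14 matches valid padding (p=0), not "same"
layer2 = conv2d_out(30, 3, 2, 0)
# Layer 3: stride 2, valid padding -> 6
layer3 = conv2d_out(14, 3, 2, 0)
# Layer 4: stride 1, valid padding -> 4
layer4 = conv2d_out(6, 3, 1, 0)
```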
S3: pruning the model, retraining the model, and obtaining an optimized CNN model after a plurality of iterations;
Generally speaking, the more layers and parameters a neural network has, the better its results, but also the more computing resources it consumes. Pruning can therefore remove parameters that have little influence on the prediction: neurons are ranked by their contribution to the output, low-contribution neurons are discarded, and the model gains a higher running speed and a smaller model file. As shown in Fig. 3, assuming the first layer has 4 neurons and the second layer has 5, the corresponding weight matrix is of size 4×5. The pruning process is as follows:
sorting the weights of the adjacent two layers of neurons according to the absolute value;
the smaller absolute value (e.g. 0.4) weight is pruned, i.e. set to 0, based on the pruning rate P.
After pruning, the model is retrained and an optimized CNN model is obtained after a plurality of iterations.
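The magnitude-based pruning in the steps above can be sketched in NumPy. This is a minimal illustration, not the patent's implementation: the function name is assumed, retraining is not shown, and ties at the threshold may prune slightly more than the nominal rate.

```python
import numpy as np

def prune_by_magnitude(weights: np.ndarray, prune_rate: float) -> np.ndarray:
    """Zero out the fraction `prune_rate` of weights with the smallest |w|."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * prune_rate)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    # Strict comparison: weights tied with the threshold are pruned as well.
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask

# Toy example: with a 50% pruning rate, the two smallest-|w| weights
# (0.4 and 0.05) are set to 0, and the large weights are kept.
w = np.array([[0.4, -1.2], [0.05, 2.0]])
pruned = prune_by_magnitude(w, 0.5)
```

After zeroing, the model would be retrained with the mask held fixed, then pruned again for several iterations.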
S4: outputting a 256-dimensional characteristic vector serving as the input of the LightGBM classifier by the optimized CNN model;
s5: lightGBM classification.
LightGBM is a framework implementing the GBDT algorithm. GBDT is a long-standing model in machine learning whose main idea is to iteratively train weak classifiers (decision trees) to obtain an optimal model; it trains well and is hard to overfit. Compared with a conventional CNN fully connected layer used as a classifier, the LightGBM classifier supports efficient parallel training, trains faster with lower memory consumption, supports distributed fast processing of massive data, and lowers the deployment requirements of the detection model.
The gradient decision trees in the LightGBM algorithm are obtained by iterating multiple times over a given training data set; in each iteration, a new tree is fitted using gradient information and added to the trees from previous iterations. In function space, this is an evolving linear combination, as shown in formula (6):

ŷ_i = Σ_{q=1}^{Q} f_q(x_i),  f_q ∈ χ    (6)

where χ is the function space of the iterated trees and f_q(x_i) denotes the predicted value of the i-th instance in the q-th tree.
Each split node of a tree uses the optimal split point, so the tree-building process is in fact greedy. LightGBM aggregates the weights of all leaf nodes as the reference for building a tree, then determines the split points and computes the first-order and second-order gradients.
For any given tree structure, LightGBM defines two metrics of feature importance: T_Split, the total number of times each feature is used for splitting across the iterated trees, and T_Gain, the sum of the gains obtained whenever the feature is used for splitting in all decision trees, where K is the number of decision trees generated by K rounds of iteration.
After multiple iterations, the LightGBM classifier performance is optimized.
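The additive process of formula (6) can be illustrated with a toy pure-Python sketch that fits squared-error regression stumps to residuals. This shows only the gradient-fitting idea; it is not LightGBM itself, which uses histogram-based splits, leaf-wise tree growth, and both first- and second-order gradients.

```python
def fit_stump(xs, residuals):
    """Fit the best single-split regression stump minimizing squared error."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, split, lv, rv)
    _, split, lv, rv = best
    return lambda x: lv if x <= split else rv

def boost(xs, ys, rounds=20, lr=0.5):
    """F_q(x) = F_{q-1}(x) + lr * f_q(x): each stump fits the current residuals,
    which are the negative gradients of the squared loss."""
    trees, preds = [], [0.0] * len(xs)
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, resid)
        trees.append(tree)
        preds = [p + lr * tree(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * t(x) for t in trees)

# The ensemble of weak stumps converges toward the step-shaped target.
F = boost([0, 1, 2, 3], [0.0, 0.0, 1.0, 1.0])
```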
Compared with the original CNN model classification, the LightGBM improves the accuracy and recall rate and increases the recognition speed.
According to the encrypted traffic identification method based on the pruned convolutional neural network and machine learning provided by the invention, no manual feature extraction is needed: the CNN model automatically extracts high-level features from raw traffic files and classifies them. Meanwhile, pruning the convolutional neural network model reduces the number of model parameters and the computational cost. The LightGBM classifies the encrypted traffic from these high-level features, achieving a strong classification effect with weak classifiers and improving accuracy, so that the final model attains higher performance and accuracy than other classification models (see Table 2).
TABLE 2 comparison of the models of the present application with other classification models
Method | Accuracy | Recall | F1 score
---|---|---|---
1D-CNN | 0.89 | 0.89 | 0.89
CNN+LSTM | 0.91 | 0.91 | 0.91
SAE | 0.92 | 0.92 | 0.92
2D-CNN | 0.91 | 0.91 | 0.91
Model before pruning | 0.90 | 0.86 | 0.88
Model after pruning (this application) | 0.94 | 0.93 | 0.93
It should be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between them. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element preceded by "comprising a/an" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, apparatus embodiments, electronic device embodiments, computer-readable storage medium embodiments, and computer program product embodiments are described with relative simplicity as they are substantially similar to method embodiments, where relevant only as described in portions of the method embodiments.
The above embodiments are merely specific embodiments of the present application, used to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited to them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications, changes, or equivalent substitutions to the technical solutions described in the foregoing embodiments remain within the technical scope of the present disclosure and do not depart from its spirit and scope; such modifications, changes, or substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (9)
1. A method for identifying encrypted traffic based on a pruning convolutional neural network and machine learning is characterized by comprising the following steps:
s1: preprocessing data;
S2: a CNN model is constructed; the convolutional neural network mainly comprises the following layers: an input layer, convolutional layers, ReLU layers, pooling layers, and a fully connected layer;
s3: pruning the model, retraining the model, and obtaining an optimized CNN model after a plurality of iterations;
s4: outputting a 256-dimensional characteristic vector serving as input of the LightGBM classifier by the optimized CNN model;
s5: the LightGBM classification is characterized in that a gradient decision tree in a LightGBM algorithm is obtained by carrying out multiple iterations on a given training data set, during each iteration, a new tree is readjusted by using gradient information to add into a previous iteration tree, the process is a continuously-changing linear combination process in a function space, the LightGBM integrates weights of all leaf nodes as references for building the tree, then partition points are determined, a first-order gradient and a second-order gradient are calculated, and after multiple iterations, the performance of the LightGBM classifier is enabled to reach the optimum.
2. The method for identifying encrypted traffic based on a pruned convolutional neural network and machine learning of claim 1, wherein the encrypted traffic input in step S1 uses the public data set ISCXVPN2016, which contains 6 conventional encrypted traffic types: Email, Chat, Streaming, File transfer, VoIP, and P2P, and 6 corresponding VPN-encrypted flows: VPN-Email, VPN-Chat, VPN-Streaming, VPN-File transfer, VPN-VoIP, and VPN-P2P.
3. The encrypted traffic identification method based on the pruned convolutional neural network and machine learning of claim 2, wherein the traffic data input in step S1 are all captured with the Wireshark and tcpdump tools in a real environment, totaling 28 GB.
4. The encrypted traffic identification method based on the pruning convolutional neural network and the machine learning according to claim 1, wherein the data preprocessing process in the step S1 comprises:
s11: reading a pcap file;
s12: irrelevant messages are removed;
s13: removing the Ethernet frame header;
s14: covering the IP address;
s15: checking whether the packet length is larger than a specified input size, if so, truncating the data packet, otherwise, performing zero padding at the tail of the data packet to generate a byte matrix;
s16: the packets are normalized and divided by 255 in bytes so that the input sizes are all between 0 and 1.
5. The encrypted traffic identification method based on pruning convolutional neural network and machine learning according to claim 1, wherein step S2 is to form a complete convolutional neural network by stacking the input layer, convolutional layer, reLU layer, pool layer and full connection layer;
wherein the convolutional layer is a two-dimensional convolution;
the activation function ReLU is shown in equation (1):
ReLU(x)=max(0,x) (1)
batch normalization is shown in equation (2):
α̂_i = (α_i − μ) / √(σ² + ε)    (2)
where α_i is the original activation of a neuron, μ and σ² are the mini-batch mean and variance, ε is a small constant, and α̂_i is the value after the normalization operation;
the loss function is shown in equation (3):
L = −Σ_x p(x) log q(x)    (3)
the activation function Softmax of the output layer is shown in equation (4):
Softmax(z)_j = e^{z_j} / Σ_{t=1}^{T} e^{z_t}    (4)
the model sets dropout to 0.5.
6. The encrypted traffic identification method based on the pruning convolutional neural network and the machine learning according to claim 1, wherein the main parameters of the CNN model in the step S2 are:
7. The encrypted traffic identification method based on the pruned convolutional neural network and machine learning according to claim 1, wherein the pruning process in step S3 is as follows:
S31: sort the weights of the neurons in each pair of adjacent layers by absolute value;
S32: according to the pruning rate P, prune the weights whose absolute value is less than 0.4, i.e., set them to 0;
S33: after pruning, retrain the model; an optimized CNN model is obtained after several iterations.
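The magnitude pruning of S31–S32 can be sketched in NumPy (the function name is illustrative; sorting by absolute value is implicit in the thresholding, and the retraining of S33 is omitted):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, threshold: float = 0.4) -> np.ndarray:
    """S32: set to zero every weight whose absolute value is below the threshold."""
    pruned = weights.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

w = np.array([0.05, -0.6, 0.39, 0.8, -0.2])
pw = magnitude_prune(w)   # -> [0.0, -0.6, 0.0, 0.8, 0.0]
```

After this step the model would normally be retrained and the prune/retrain cycle repeated, as S33 describes.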
8. The encrypted traffic identification method based on the pruned convolutional neural network and machine learning of claim 1, wherein the continuously varying linear combination process of step S5 is as shown in equation (6):
ŷ_i = Σ_q f_q(x_i), f_q ∈ χ (6)
wherein χ is the function space of the iterated trees and f_q(x_i) denotes the predicted value of the i-th example in the q-th tree.
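The linear combination of equation (6) — the model output as the sum of all iterated trees' predictions — can be sketched as follows (the trees are stubbed as simple functions purely for illustration):

```python
# Each "tree" is stubbed as a function of the input that returns its score f_q(x).
trees = [lambda x: 0.5 * x, lambda x: x - 1.0, lambda x: 0.1]

def ensemble_predict(x, trees):
    # Equation (6): sum the per-tree predictions f_q(x) over all trees
    return sum(f(x) for f in trees)

y = ensemble_predict(2.0, trees)   # 0.5*2 + (2 - 1) + 0.1 = 2.1
```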
9. The encrypted traffic identification method based on the pruned convolutional neural network and machine learning of claim 1, wherein in step S5, for any given tree structure, LightGBM defines the total number of times T_Split that each feature is split in the iterated trees and the total gain T_Gain of all splits of the feature in all decision trees as the metrics for measuring feature importance, specifically defined as follows:
wherein K is the number of decision trees generated by K rounds of iteration.
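The T_Split and T_Gain definitions can be illustrated with a toy tree representation (the `(feature_index, split_gain)` records are a hypothetical structure; in LightGBM itself these quantities correspond to `Booster.feature_importance(importance_type="split")` and `importance_type="gain"`):

```python
from collections import Counter

# Hypothetical toy structure: each tree is a list of (feature_index, split_gain) records.
trees = [
    [(0, 1.2), (1, 0.5), (0, 0.3)],   # tree 1 splits feature 0 twice, feature 1 once
    [(1, 0.9), (2, 0.4)],             # tree 2 splits features 1 and 2 once each
]

t_split, t_gain = Counter(), Counter()
for tree in trees:                    # K = 2 decision trees
    for feat, gain in tree:
        t_split[feat] += 1            # T_Split: total number of splits on the feature
        t_gain[feat] += gain          # T_Gain: total gain of the feature over all trees

# t_split[0] == 2, t_split[1] == 2, t_split[2] == 1
```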
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210337870.2A CN115334005B (en) | 2022-03-31 | 2022-03-31 | Encryption flow identification method based on pruning convolutional neural network and machine learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115334005A true CN115334005A (en) | 2022-11-11 |
CN115334005B CN115334005B (en) | 2024-03-22 |
Family
ID=83916441
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210337870.2A Active CN115334005B (en) | 2022-03-31 | 2022-03-31 | Encryption flow identification method based on pruning convolutional neural network and machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115334005B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180357542A1 (en) * | 2018-06-08 | 2018-12-13 | University Of Electronic Science And Technology Of China | 1D-CNN-Based Distributed Optical Fiber Sensing Signal Feature Learning and Classification Method |
CN110472778A (en) * | 2019-07-29 | 2019-11-19 | 上海电力大学 | A kind of short-term load forecasting method based on Blending integrated study |
CN111860628A (en) * | 2020-07-08 | 2020-10-30 | 上海乘安科技集团有限公司 | Deep learning-based traffic identification and feature extraction method |
CN112380781A (en) * | 2020-11-30 | 2021-02-19 | 中国人民解放军国防科技大学 | Satellite observation completion method based on reanalysis data and unbalanced learning |
WO2021088499A1 (en) * | 2019-11-04 | 2021-05-14 | 西安交通大学 | False invoice issuing identification method and system based on dynamic network representation |
CN113159109A (en) * | 2021-03-04 | 2021-07-23 | 北京邮电大学 | Wireless network flow prediction method based on data driving |
WO2021190379A1 (en) * | 2020-03-25 | 2021-09-30 | 第四范式(北京)技术有限公司 | Method and device for realizing automatic machine learning |
CN113489751A (en) * | 2021-09-07 | 2021-10-08 | 浙江大学 | Network traffic filtering rule conversion method based on deep learning |
CN113537497A (en) * | 2021-06-07 | 2021-10-22 | 贵州优联博睿科技有限公司 | Gradient lifting decision tree model construction optimization method based on dynamic sampling |
CN113779608A (en) * | 2021-09-17 | 2021-12-10 | 神谱科技(上海)有限公司 | Data protection method based on WOE mask in multi-party longitudinal federal learning LightGBM training |
CN113901448A (en) * | 2021-09-03 | 2022-01-07 | 燕山大学 | Intrusion detection method based on convolutional neural network and lightweight gradient elevator |
WO2022041394A1 (en) * | 2020-08-28 | 2022-03-03 | 南京邮电大学 | Method and apparatus for identifying network encrypted traffic |
CN114189350A (en) * | 2021-10-20 | 2022-03-15 | 北京交通大学 | LightGBM-based train communication network intrusion detection method |
Non-Patent Citations (5)
Title |
---|
CHAOQUN KANG: "Research on condition assessment for distribution vacuum switch cabinets based on multi-source information fusion", 2015 5th International Conference on Electric Utility Deregulation and Restructuring and Power Technologies (DRPT) * |
LI Daoquan; WANG Xue; YU Bo; HUANG Taiming: "Network traffic classification method based on one-dimensional convolutional neural network", Computer Engineering and Applications, no. 03 * |
DONG Hao; LI Ye: "Encrypted traffic identification in complex networks based on convolutional neural networks", Software Guide, no. 09 * |
CHEN Shiyu; LI Xiaoyong; DU Yangyang; XIE Fuqi: "Research on nonlinear fitting performance optimization of Fourier neural networks", Engineering Journal of Wuhan University, no. 03 * |
GU Zhaojun; WU You; ZHAO Chundi; ZHOU Jingxian: "Ensemble learning and resampling balanced classification method for traffic", Computer Engineering and Applications, no. 06 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116743506A (en) * | 2023-08-14 | 2023-09-12 | 南京信息工程大学 | Encrypted flow identification method and device based on quaternion convolutional neural network |
CN116743506B (en) * | 2023-08-14 | 2023-11-21 | 南京信息工程大学 | Encrypted flow identification method and device based on quaternion convolutional neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110730140B (en) | Deep learning flow classification method based on combination of space-time characteristics | |
CN109951444B (en) | Encrypted anonymous network traffic identification method | |
CN111191767B (en) | Vectorization-based malicious traffic attack type judging method | |
CN112769752B (en) | Network intrusion detection method based on machine learning integration model | |
CN111565156B (en) | Method for identifying and classifying network traffic | |
CN113989583A (en) | Method and system for detecting malicious traffic of internet | |
CN113472751B (en) | Encrypted flow identification method and device based on data packet header | |
CN114615093A (en) | Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning | |
CN111817971B (en) | Data center network flow splicing method based on deep learning | |
CN113821793B (en) | Multi-stage attack scene construction method and system based on graph convolution neural network | |
Zhang et al. | Autonomous model update scheme for deep learning based network traffic classifiers | |
Soleymanpour et al. | An efficient deep learning method for encrypted traffic classification on the web | |
CN111367908A (en) | Incremental intrusion detection method and system based on security assessment mechanism | |
CN114513367B (en) | Cellular network anomaly detection method based on graph neural network | |
CN115334005A (en) | Encrypted flow identification method based on pruning convolution neural network and machine learning | |
Chen et al. | Ride: Real-time intrusion detection via explainable machine learning implemented in a memristor hardware architecture | |
Yujie et al. | End-to-end android malware classification based on pure traffic images | |
CN112839051A (en) | Encryption flow real-time classification method and device based on convolutional neural network | |
US11461590B2 (en) | Train a machine learning model using IP addresses and connection contexts | |
Dener et al. | RFSE-GRU: Data balanced classification model for mobile encrypted traffic in big data environment | |
CN111291078A (en) | Domain name matching detection method and device | |
CN113746707B (en) | Encrypted traffic classification method based on classifier and network structure | |
Wanode et al. | Optimal feature set selection for IoT device fingerprinting on edge infrastructure using machine intelligence | |
Li et al. | Fden: Mining effective information of features in detecting network anomalies | |
Nigmatullin et al. | Accumulated Generalized Mean Value-a New Approach to Flow-Based Feature Generation for Encrypted Traffic Characterization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||