CN115334005B - Encryption flow identification method based on pruning convolutional neural network and machine learning - Google Patents
- Publication number: CN115334005B (application CN202210337870.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H04L47/2441 — Traffic characterised by specific attributes, e.g. priority or QoS, relying on flow classification, e.g. using integrated services [IntServ]
- H04L47/2483 — Traffic characterised by specific attributes, e.g. priority or QoS, involving identification of individual flows
- H04L63/0428 — Network architectures or network communication protocols for network security wherein the data content is protected, e.g. by encrypting or encapsulating the payload
- G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Abstract
The invention discloses an encrypted traffic identification method based on a pruned convolutional neural network and machine learning, comprising the steps of data preprocessing, CNN model construction, model pruning, CNN extraction of high-level feature vectors, and LightGBM classification. The method requires no manual feature extraction: the CNN model automatically extracts high-level features from the raw traffic file and classifies them. Pruning the convolutional neural network reduces the number of model parameters and the computational cost, and the LightGBM classifies the encrypted traffic according to its high-level features, combining weak classifiers into a strong classifier and thereby improving accuracy. The final model achieves higher performance and accuracy than other classification models.
Description
Technical Field
The invention relates to the technical field of network traffic identification, in particular to an encrypted traffic identification method based on a pruned convolutional neural network and machine learning.
Background
Network traffic identification plays an important role in applications such as network quality-of-service control, traffic billing, network resource planning, and malware detection. With the continuous development of network information technology, more and more software uses encryption or port-obfuscation technologies such as SSL, SSH, VPN, and Tor, so the share of encrypted traffic keeps growing.
The research statistics agency Netmarketshare reported that by October 2019 the proportion of encrypted Web traffic had exceeded 90%: 90 of the top 100 non-Google Web sites on the Internet enabled HTTPS by default, and the proportion of HTTPS traffic was 92% in the United States, 85% in Russia, 80% in Japan, and 74% in Indonesia. This shift presents new challenges to current traffic detection methods, making network traffic identification and analysis increasingly difficult.
The premise of traffic classification is that different traffic types have distinguishing characteristics. Current traffic classification methods can be roughly divided into the following categories:
1) Port-based classification methods. On the premise that an application service uses the ports allocated by the IANA and keeps them unchanged, different traffic types are distinguished by the port numbers the flows use.
2) Payload-based classification methods. This approach, also known as deep packet inspection, distinguishes protocols based on static payload characteristics and can be used for some coarse-grained traffic classification.
3) Statistics-based classification methods. These methods apply machine learning techniques to distinguish traffic types by their statistical characteristics. The features can be broadly divided into the packet level (packet length, packet inter-arrival time, direction, etc.) and the flow level (number of upstream and downstream packets, network flow duration, proportion of different packet types, etc.).
The current traffic classification methods have the following disadvantages:
1) Port-based classification loses much of its accuracy when applications use ports outside the IANA allocation or random/dynamic ports, and it cannot identify malware traffic.
2) Payload-based classification depends on payload characteristics that encryption destroys, so it is only suitable for coarse-grained classification or for scenarios where traffic is not fully encrypted.
3) Deep-learning-based classification produces models with huge numbers of parameters, which limits where the models can be deployed.
Disclosure of Invention
Aiming at the technical problem that classification models trained by deep-learning-based methods have huge numbers of parameters, the invention provides an encrypted traffic identification method based on a pruned convolutional neural network and machine learning. The method requires no manual feature extraction: the convolutional neural network automatically extracts high-level features from the raw traffic file and classifies them, pruning reduces the number of model parameters, and the LightGBM combines weak classifiers into a strong classifier. The final model achieves higher performance and accuracy than other classification models and is suitable for efficient detection of encrypted traffic.
In order to achieve the above object, the present invention provides the following technical solutions:
The invention provides an encrypted traffic identification method based on a pruned convolutional neural network and machine learning, which comprises the following steps:
S1: data preprocessing;
S2: constructing the CNN model; the convolutional neural network mainly comprises the following layers: an input layer, convolutional layers, ReLU layers, pooling layers, and a fully connected layer;
S3: pruning the model and retraining it; an optimized CNN model is obtained after several iterations;
S4: the optimized CNN model outputs a 256-dimensional feature vector as the input of the LightGBM classifier;
S5: LightGBM classification; the gradient decision trees in the LightGBM algorithm are obtained by iterating over a given training data set multiple times: at each iteration, a new tree is fitted to the gradient information and added to the trees from previous iterations. In function space, this is an additive linear combination process. The LightGBM uses the weights of all leaf nodes as a reference for building a tree, then determines the split points and computes the first-order and second-order gradients; after multiple iterations, the performance of the LightGBM classifier is optimized.
Compared with the prior art, the invention has the beneficial effects that:
according to the encryption flow identification method based on the pruned convolutional neural network and the machine learning, the characteristics are not required to be manually extracted, the CNN model is utilized to automatically extract the high-level characteristics from the original flow file and classify the high-level characteristics, meanwhile, the pruned convolutional neural network model is constructed, the model parameter is reduced, the calculation cost is reduced, the LightGBM is used for classifying according to the high-level characteristics of the encryption flow, the strong classification effect is achieved through the weak classifier, the accuracy is improved, and the final model can achieve higher performance and accuracy than other classification models.
Drawings
In order to more clearly illustrate the embodiments of the present application and the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is evident that the following drawings cover only some embodiments of the invention; a person of ordinary skill in the art may derive other drawings from them.
Fig. 1 is a flowchart of an encryption traffic identification method based on a pruned convolutional neural network and machine learning according to an embodiment of the present invention.
Fig. 2 is a flow chart of data preprocessing according to an embodiment of the present invention.
Fig. 3 is a flowchart of a pruning step according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides an encryption flow identification method based on a pruned convolutional neural network and machine learning, which is shown in figure 1 and comprises the following steps:
s1: data preprocessing: processing the original flow file to be suitable for standard input of the CNN model;
the encrypted traffic entered in step S1 uses the public data set ISCXVPN2016, which contains 6 conventional encrypted traffic: email, chat, streaming, file transfer, voIP and P2P,6 corresponding VPN encrypted traffic: VPN-Email, VPN-Chat, VPN-Streaming, VPN-File transfer, VPN-VoIP, and VPN-P2P. Traffic data were obtained by both the Wireshark and tcpdump tools in a real environment for a total of 28GB.
The specific flow of the data preprocessing step is shown in fig. 2. The key points are as follows:
removing irrelevant messages: i.e. removing packets that affect model predictions or have empty payloads. Traffic in the real environment may contain some packets for TCP connection and disconnection, such as packets containing SYN, ACK or FIN flags, and some packets for domain name resolution and packets with empty payload, which are not effective for traffic classification, but rather affect classification accuracy, so that removal is required.
Removing the Ethernet frame header: the Ethernet header contains MAC addresses, which identify network device locations and are used to transfer packets between network nodes, but they contribute little to traffic classification, so the Ethernet header is deleted.
Masking the IP address: IP addresses cause the model to overfit in traffic classification, so the source and destination IP addresses are set to 0.
Checking the packet length: the method uses a convolutional neural network, which requires fixed-size input, but packet lengths vary. The packet length is therefore checked: if it is smaller than the prescribed input size, the packet is zero-padded at the end; if it is larger, the packet is truncated. This ensures that the traffic packet matches the input size of the CNN model.
Normalization: different evaluation indicators often have different scales. To make the data comparable, the packet is normalized: each byte is divided by 255, so that every input value lies between 0 and 1.
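The padding/truncation and normalization steps above can be sketched as follows (a minimal sketch: the 30×30 input size is taken from Table 1, and the sample packet bytes are hypothetical):

```python
import numpy as np

INPUT_SIZE = 30 * 30  # 900 bytes, matching the 30*30 CNN input in Table 1

def preprocess_packet(payload: bytes) -> np.ndarray:
    """Pad/truncate a raw packet to a fixed length and scale bytes to [0, 1]."""
    data = payload[:INPUT_SIZE]                       # truncate if too long
    data = data + b"\x00" * (INPUT_SIZE - len(data))  # zero-pad at the end if too short
    arr = np.frombuffer(data, dtype=np.uint8).astype(np.float32)
    return (arr / 255.0).reshape(30, 30)              # normalize and reshape to a grayscale image

img = preprocess_packet(b"\x10\xff" * 100)  # hypothetical 200-byte packet
```

Masking would additionally zero the source/destination IP address bytes before this step.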
S2: constructing the CNN model.
A convolutional neural network is a feedforward neural network with convolution computations and a deep structure, and is one of the most popular deep learning algorithms today. With the deepening of learning theory and the improvement of computing performance, convolutional neural networks have developed rapidly and been applied to computer vision, natural language processing, and other fields. The convolutional neural network is mainly composed of the following layers: an input layer, convolutional layers, ReLU layers, pooling layers, and a fully connected layer. Stacking these layers forms a complete convolutional neural network, which finally outputs a 256-dimensional feature vector for the subsequent LightGBM classifier. Too high a feature dimension easily causes overfitting and increases cost, while too low a dimension reduces classification accuracy. The structure of the CNN model used in the invention is shown in Table 1; the key points are as follows:
Convolution layer: Conv2D, two-dimensional convolution. A traffic packet can be converted into a grayscale image, which is better suited to two-dimensional convolution.
Activation function: reLU, as shown in equation (1), activates a node only when the input is greater than 0, outputs 0 when the input is less than 0, and outputs equal to the input when the input is greater than 0. The function may remove negative values from the convolution result, leaving positive values unchanged.
ReLU(x)=max(0,x) (1)
Batch normalization: Batch Normalization, like conventional data normalization, is a way to unify scattered data and to optimize the neural network, splitting the data into small batches for stochastic gradient descent. It is shown in equation (2):

$\hat{\alpha}_i = \dfrac{\alpha_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$  (2)

where $\alpha_i$ is the original activation value of a neuron, $\hat{\alpha}_i$ is its normalized value, and $\mu_B$ and $\sigma_B^2$ are the mean and variance over the mini-batch.
Loss function: the cross-entropy loss (CrossEntropyLoss) measures the difference between the true probability distribution and the predicted probability distribution, as shown in equation (3); the smaller the cross entropy, the better the model's predictions.

$L = -\sum_{c} y_c \log(p_c)$  (3)

where $y_c$ is the true probability of class $c$ and $p_c$ the predicted probability.
Activation function of the output layer: Softmax. When a sample passes through the Softmax layer, a T×1 vector is output, and the index of the largest value in the vector is taken as the sample's predicted label. The formula is shown in (4):

$S_j = \dfrac{e^{z_j}}{\sum_{t=1}^{T} e^{z_t}}$  (4)
Dropout: during training, some neurons are randomly deactivated, which improves the robustness of the model; this model sets dropout to 0.5.
TABLE 1. Main parameters of the CNN model

| Layer | Operation | Input | Kernel | Stride | Padding | Output | Weights |
|---|---|---|---|---|---|---|---|
| 1 | Conv2D + ReLU + BN | 30×30 | 3×3 | 1 | Same | 8×30×30 | 80 |
| 2 | Conv2D + ReLU + BN | 8×30×30 | 3×3 | 2 | Same | 16×14×14 | 1168 |
| 3 | Conv2D + ReLU + BN | 16×14×14 | 3×3 | 2 | Same | 32×6×6 | 4640 |
| 4 | Conv2D + ReLU + BN | 32×6×6 | 3×3 | 1 | Same | 64×4×4 | 18496 |
| 5 | Fully connected + Dropout | 64×4×4 | — | — | None | 256 | 262400 |
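As an illustrative sketch, the stack in Table 1 can be expressed in PyTorch. The stride/padding combinations listed in Table 1 do not all reproduce the stated output shapes exactly, so this sketch uses padding of 1 throughout and strides chosen to reach the 64×4×4 feature map; the class count of 12 matches the ISCXVPN2016 categories, but the exact layer hyperparameters here are assumptions:

```python
import torch
import torch.nn as nn

class PrunableCNN(nn.Module):
    """Conv stack loosely following Table 1; emits a 256-d feature vector."""
    def __init__(self, n_classes: int = 12):
        super().__init__()
        def block(c_in, c_out, stride):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
                nn.ReLU(),
                nn.BatchNorm2d(c_out),
            )
        # 1x30x30 -> 8x30x30 -> 16x15x15 -> 32x8x8 -> 64x4x4
        self.features = nn.Sequential(
            block(1, 8, 1), block(8, 16, 2), block(16, 32, 2), block(32, 64, 2)
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.Dropout(0.5), nn.Linear(64 * 4 * 4, 256))
        self.head = nn.Linear(256, n_classes)  # Softmax is folded into CrossEntropyLoss

    def forward(self, x, return_features=False):
        feats = self.fc(self.features(x))
        return feats if return_features else self.head(feats)

model = PrunableCNN().eval()
x = torch.rand(2, 1, 30, 30)            # two preprocessed packets as grayscale images
feats = model(x, return_features=True)  # 256-d vectors for the LightGBM stage (step S4)
logits = model(x)
```

During training the `head` output would be fed to `nn.CrossEntropyLoss`; after training, `return_features=True` yields the vectors passed to the LightGBM classifier.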
S3: pruning the model, retraining the model, and obtaining an optimized CNN model after a plurality of iterations;
Generally speaking, the more layers and parameters a neural network has, the better the result, but also the more computational resources it consumes. Pruning removes parameters that have little influence on the prediction: the model's neurons are ranked by their contribution to the output, and low-contribution neurons are discarded, making the model run faster and the model file smaller. As shown in fig. 3, assume the first layer has 4 neurons and the second layer has 5 neurons, so the corresponding weight matrix is 4×5. The pruning process is as follows:
sorting weights of two adjacent layers of neurons according to absolute values;
the weight with smaller absolute value (e.g. 0.4) is clipped according to pruning rate P, i.e. set to 0.
After pruning, the model is retrained, and an optimized CNN model is obtained after a plurality of iterations.
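The magnitude-based pruning steps above can be sketched with a weight mask. This is a minimal sketch: the 4×5 matrix mirrors the example in fig. 3, and the 40% pruning rate is an illustrative choice, not a value fixed by the patent:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, prune_rate: float) -> np.ndarray:
    """Zero out the fraction `prune_rate` of weights with the smallest |w|."""
    k = int(w.size * prune_rate)
    if k == 0:
        return w.copy()
    # Sort all |weights| and take the k-th smallest as the clipping threshold.
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    mask = np.abs(w) > threshold       # keep only weights above the threshold
    return w * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 5))            # 4x5 weight matrix, as in the fig. 3 example
pruned = magnitude_prune(w, 0.4)       # prune 40% of the weights
```

In practice the retraining step would follow, with the mask reapplied so clipped weights stay zero.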
S4: the optimized CNN model outputs a 256-dimensional feature vector as input of the LightGBM classifier;
s5: lightGBM classification.
LightGBM is a framework implementing the GBDT algorithm. GBDT is a long-standing model in machine learning whose main idea is to train weak classifiers (decision trees) iteratively to obtain an optimal model; it trains well and resists overfitting. Compared with the conventional CNN fully connected layer used as a classifier, the LightGBM classifier supports efficient parallel training, trains faster, consumes less memory, supports distributed processing of massive data, and lowers the deployment requirements of the detection model.
The gradient decision trees in the LightGBM algorithm are obtained by iterating over a given training data set multiple times: at each iteration, a new tree is fitted to the gradient information and added to the trees from previous iterations. In function space, this process is an additive linear combination, as shown in equation (6):

$\hat{y}_i = \sum_{q=1}^{Q} f_q(x_i), \quad f_q \in \chi$  (6)

where χ is the function space of the iteration trees and $f_q(x_i)$ is the predicted value of the i-th instance in the q-th tree.
Each split node of a tree uses the optimal split point; building the tree model is in fact a greedy method. The LightGBM uses the weights of all leaf nodes as a reference for building a tree, then determines the split points and computes the first-order and second-order gradients.
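For illustration, split-point selection from first- and second-order gradients can be sketched as below. This follows the standard second-order GBDT gain formula (the XGBoost/LightGBM-style objective); the regularization constant `lam` and the toy gradients are assumptions, not values from the patent:

```python
import numpy as np

def split_gain(g: np.ndarray, h: np.ndarray, order: np.ndarray, lam: float = 1.0):
    """Scan candidate split points of one feature; return (best_gain, best_pos).

    g, h are per-instance first- and second-order gradients; `order` sorts
    the instances by the feature value being scanned."""
    g, h = g[order], h[order]
    G, H = g.sum(), h.sum()
    gl = np.cumsum(g)[:-1]   # left-side gradient sums for each candidate split
    hl = np.cumsum(h)[:-1]
    # Standard second-order gain: left score + right score - unsplit score.
    gain = gl**2 / (hl + lam) + (G - gl)**2 / (H - hl + lam) - G**2 / (H + lam)
    best = int(np.argmax(gain))
    return float(gain[best]), best + 1   # split between positions best and best+1

# Toy check: gradients flip sign halfway, so the best split is in the middle.
g = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
h = np.ones(6)
gain, pos = split_gain(g, h, np.arange(6))
```

A real implementation would evaluate this over histogram bins per feature rather than every instance boundary.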
For any given tree structure, the LightGBM defines, as measures of feature importance, the total number of times each feature is used for splitting across the iteration trees, T_split, and the sum of the gains the feature brings when used for splitting in all decision trees, T_gain:

$T\_split_j = \sum_{k=1}^{K} n_{split}(j, k), \qquad T\_gain_j = \sum_{k=1}^{K} Gain(j, k)$

where K is the number of decision trees generated by K rounds of iteration, $n_{split}(j,k)$ is the number of splits on feature j in the k-th tree, and $Gain(j,k)$ is the gain those splits bring.
And after multiple iterations, the performance of the LightGBM classifier is optimized.
Compared with classification by the original CNN model, the LightGBM improves both accuracy and recall, and recognition is also faster.
The encrypted traffic identification method based on a pruned convolutional neural network and machine learning requires no manual feature extraction: the CNN model automatically extracts high-level features from the raw traffic file and classifies them. Constructing a pruned convolutional neural network model reduces the number of model parameters and the computational cost, and the LightGBM classifies the encrypted traffic according to its high-level features, combining weak classifiers into a strong classifier and improving accuracy. The final model achieves higher performance and accuracy than other classification models (see Table 2).
TABLE 2. Comparison of the present model with other classification models

| Method | Accuracy | Recall | F1 |
|---|---|---|---|
| 1D CNN | 0.89 | 0.89 | 0.89 |
| CNN+LSTM | 0.91 | 0.91 | 0.91 |
| SAE | 0.92 | 0.92 | 0.92 |
| 2D-CNN | 0.91 | 0.91 | 0.91 |
| Model before pruning | 0.90 | 0.86 | 0.88 |
| Pruned model (this application) | 0.94 | 0.93 | 0.93 |
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the apparatus embodiments, the electronic device embodiments, the computer-readable storage medium embodiments, and the computer program product embodiments, the description is relatively simple, as relevant to the description of the method embodiments in part, since they are substantially similar to the method embodiments.
The foregoing examples are merely specific embodiments of the present application and are not intended to limit its protection scope. Any person skilled in the art may, within the technical scope disclosed in the present application, modify or easily conceive of changes to the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the corresponding technical solutions and are intended to be encompassed within the protection scope of this application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (7)
1. An encrypted traffic identification method based on a pruned convolutional neural network and machine learning is characterized by comprising the following steps:
S1: data preprocessing; the encrypted traffic input in step S1 uses the public data set ISCXVPN2016, which contains 6 conventional encrypted traffic types: Email, Chat, Streaming, File Transfer, VoIP, and P2P, and 6 corresponding VPN encrypted traffic types: VPN-Email, VPN-Chat, VPN-Streaming, VPN-File Transfer, VPN-VoIP, and VPN-P2P;
S2: constructing the CNN model; the convolutional neural network mainly comprises the following layers: an input layer, convolutional layers, ReLU layers, pooling layers, and a fully connected layer;
S3: pruning the model and retraining it; an optimized CNN model is obtained after several iterations;
S4: the optimized CNN model outputs a 256-dimensional feature vector as the input of the LightGBM classifier;
S5: LightGBM classification; the gradient decision trees in the LightGBM algorithm are obtained by iterating over a given training data set multiple times: at each iteration, a new tree is fitted to the gradient information and added to the trees from previous iterations; in function space, this is an additive linear combination process; the LightGBM uses the weights of all leaf nodes as a reference for building a tree, then determines the split points and computes the first-order and second-order gradients; after multiple iterations, the performance of the LightGBM classifier is optimized.
2. The encrypted traffic identification method based on a pruned convolutional neural network and machine learning according to claim 1, wherein the traffic data input in step S1 are captured with the Wireshark and tcpdump tools in a real environment, totaling 28 GB.
3. The encrypted traffic recognition method based on a pruned convolutional neural network and machine learning according to claim 1, wherein the step S1 data preprocessing process comprises:
s11: reading a pcap file;
s12: removing irrelevant messages;
s13: removing the Ethernet frame header;
s14: masking the IP address;
s15: checking whether the packet length is larger than the specified input size, if so, cutting off the data packet, otherwise, performing zero padding at the end of the data packet to generate a byte matrix;
s16: the data packet is normalized by dividing 255 in bytes so that the input size is between 0 and 1.
4. The encrypted traffic recognition method based on pruned convolutional neural network and machine learning according to claim 1, wherein step S2 forms a complete convolutional neural network by superimposing an input layer, a convolutional layer, a ReLU layer, a pool layer, and a full-connection layer;
wherein the convolution layer is a two-dimensional convolution;
the activation function ReLU is shown in equation (1):
ReLU(x)=max(0,x) (1)
batch normalization is shown in equation (2):

$\hat{\alpha}_i = \dfrac{\alpha_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$  (2)

wherein $\alpha_i$ is the original activation value of a neuron and $\hat{\alpha}_i$ is its normalized value after the normalization operation;
the loss function is shown in equation (3):

$L = -\sum_{c} y_c \log(p_c)$  (3)

the activation function Softmax of the output layer is shown in equation (4):

$S_j = \dfrac{e^{z_j}}{\sum_{t=1}^{T} e^{z_t}}$  (4)

the present model sets dropout to 0.5.
5. The encrypted traffic recognition method based on pruning convolutional neural network and machine learning according to claim 1, wherein the pruning process of step S3 is as follows:
s31: sorting weights of two adjacent layers of neurons according to absolute values;
s32: cutting out a weight with an absolute value smaller than 0.4 according to the pruning rate P, namely setting the weight to 0;
s33: after pruning, the model is retrained, and an optimized CNN model is obtained after a plurality of iterations.
6. The encrypted traffic recognition method based on a pruned convolutional neural network and machine learning according to claim 1, wherein the additive linear combination process of step S5 is as shown in formula (6):
ŷ_i = Σ_{q=1}^{Q} f_q(x_i), f_q ∈ χ (6)
wherein χ is the function space of the iteration trees, and f_q(x_i) represents the predicted value of the i-th instance given by the q-th tree.
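The additive combination of formula (6) can be sketched as summing per-tree predictions; the trees here are stand-in callables for illustration, not actual fitted decision trees:

```python
def ensemble_predict(trees, x):
    # Formula (6), sketched: the ensemble output for instance x is the sum of
    # the predictions f_q(x) of the Q iteration trees
    return sum(tree(x) for tree in trees)
```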
7. The encrypted traffic recognition method based on a pruned convolutional neural network and machine learning according to claim 1, wherein, for any given tree structure, the LightGBM defines the total number of times each feature is used for splitting in the iterative trees, T_split, and the sum of the gains obtained by the feature when used for splitting in all decision trees, T_gain, as measures of feature importance, specifically defined as follows:
wherein K denotes the K decision trees generated by K rounds of iteration.
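A hypothetical sketch of accumulating T_split and T_gain over the K trees is given below; representing each tree as a list of (feature, gain) split records is an assumption for illustration — LightGBM itself exposes these statistics through its built-in importance API rather than requiring manual accumulation:

```python
from collections import defaultdict


def feature_importance(trees):
    """Accumulate T_split (split counts) and T_gain (total split gain) per feature.

    `trees` is assumed to be a list of K trees, each given as a list of
    (feature, gain) pairs describing the splits made in that tree.
    """
    t_split = defaultdict(int)
    t_gain = defaultdict(float)
    for tree in trees:
        for feature, gain in tree:
            t_split[feature] += 1    # times the feature is used for splitting
            t_gain[feature] += gain  # total gain contributed by the feature
    return dict(t_split), dict(t_gain)
```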
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210337870.2A CN115334005B (en) | 2022-03-31 | 2022-03-31 | Encryption flow identification method based on pruning convolutional neural network and machine learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115334005A CN115334005A (en) | 2022-11-11 |
CN115334005B true CN115334005B (en) | 2024-03-22 |
Family
ID=83916441
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210337870.2A Active CN115334005B (en) | 2022-03-31 | 2022-03-31 | Encryption flow identification method based on pruning convolutional neural network and machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115334005B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116743506B (en) * | 2023-08-14 | 2023-11-21 | 南京信息工程大学 | Encrypted flow identification method and device based on quaternion convolutional neural network |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110472778A (en) * | 2019-07-29 | 2019-11-19 | 上海电力大学 | Short-term load forecasting method based on Blending ensemble learning |
CN112380781A (en) * | 2020-11-30 | 2021-02-19 | 中国人民解放军国防科技大学 | Satellite observation completion method based on reanalysis data and unbalanced learning |
WO2021088499A1 (en) * | 2019-11-04 | 2021-05-14 | 西安交通大学 | False invoice issuing identification method and system based on dynamic network representation |
CN113159109A (en) * | 2021-03-04 | 2021-07-23 | 北京邮电大学 | Wireless network flow prediction method based on data driving |
WO2021190379A1 (en) * | 2020-03-25 | 2021-09-30 | 第四范式(北京)技术有限公司 | Method and device for realizing automatic machine learning |
CN113489751A (en) * | 2021-09-07 | 2021-10-08 | 浙江大学 | Network traffic filtering rule conversion method based on deep learning |
CN113537497A (en) * | 2021-06-07 | 2021-10-22 | 贵州优联博睿科技有限公司 | Gradient lifting decision tree model construction optimization method based on dynamic sampling |
CN113779608A (en) * | 2021-09-17 | 2021-12-10 | 神谱科技(上海)有限公司 | Data protection method based on WOE mask in multi-party longitudinal federal learning LightGBM training |
CN113901448A (en) * | 2021-09-03 | 2022-01-07 | 燕山大学 | Intrusion detection method based on convolutional neural network and light gradient boosting machine |
WO2022041394A1 (en) * | 2020-08-28 | 2022-03-03 | 南京邮电大学 | Method and apparatus for identifying network encrypted traffic |
CN114189350A (en) * | 2021-10-20 | 2022-03-15 | 北京交通大学 | LightGBM-based train communication network intrusion detection method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108932480B (en) * | 2018-06-08 | 2022-03-15 | 电子科技大学 | Distributed optical fiber sensing signal feature learning and classifying method based on 1D-CNN |
CN111860628A (en) * | 2020-07-08 | 2020-10-30 | 上海乘安科技集团有限公司 | Deep learning-based traffic identification and feature extraction method |
Non-Patent Citations (4)
Title |
---|
Research on optimizing the nonlinear fitting performance of Fourier neural networks; 陈诗雨; 李小勇; 杜杨杨; 谢福起; Journal of Wuhan University (Engineering Edition), No. 03; full text *
Research on condition assessment for distribution vacuum switch cabinets based on multi-source information fusion; chaoqun kang; 2015 5th International Conference on Electric Utility Deregulation and Restructuring and Power Technologies (DRPT); full text *
Network traffic classification method based on a one-dimensional convolutional neural network; 李道全; 王雪; 于波; 黄泰铭; Computer Engineering and Applications, No. 03; full text *
Ensemble learning and resampling balanced classification method for traffic; 顾兆军; 吴优; 赵春迪; 周景贤; Computer Engineering and Applications, No. 06; full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113162908B (en) | Encrypted flow detection method and system based on deep learning | |
CN108768986B (en) | Encrypted traffic classification method, server and computer readable storage medium | |
CN114615093B (en) | Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning | |
CN111565156B (en) | Method for identifying and classifying network traffic | |
CN112769752B (en) | Network intrusion detection method based on machine learning integration model | |
CN109831392A (en) | Semi-supervised net flow assorted method | |
CN108173704A (en) | A kind of method and device of the net flow assorted based on representative learning | |
CN111817971B (en) | Data center network flow splicing method based on deep learning | |
CN113992349B (en) | Malicious traffic identification method, device, equipment and storage medium | |
Coelho et al. | BACKORDERS: using random forests to detect DDoS attacks in programmable data planes | |
CN115334005B (en) | Encryption flow identification method based on pruning convolutional neural network and machine learning | |
CN110351303B (en) | DDoS feature extraction method and device | |
CN113472751A (en) | Encrypted flow identification method and device based on data packet header | |
Chen et al. | Ride: Real-time intrusion detection via explainable machine learning implemented in a memristor hardware architecture | |
Yan et al. | Principal Component Analysis Based Network Traffic Classification. | |
Min et al. | Online Internet traffic identification algorithm based on multistage classifier | |
CN112839051A (en) | Encryption flow real-time classification method and device based on convolutional neural network | |
CN113746707B (en) | Encrypted traffic classification method based on classifier and network structure | |
CN114124437B (en) | Encrypted flow identification method based on prototype convolutional network | |
CN116248530A (en) | Encryption flow identification method based on long-short-time neural network | |
CN115473748A (en) | DDoS attack classification detection method, device and equipment based on BiLSTM-ELM | |
CN112367325A (en) | Unknown protocol message clustering method and system based on closed frequent item mining | |
Chen et al. | Encapsulated and Anonymized Network Video Traffic Classification With Generative Models | |
Kong et al. | Fast abnormal identification for large scale internet traffic | |
Tatarnikova et al. | Detection of network attacks by deep learning method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||