CN114884894B

CN114884894B - Semi-supervised network traffic classification method based on transfer learning

Info

Publication number: CN114884894B
Application number: CN202210415447.XA
Authority: CN
Inventors: 李涛; 周明睿
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2022-04-18
Filing date: 2022-04-18
Publication date: 2023-10-20
Anticipated expiration: 2042-04-18
Also published as: CN114884894A

Abstract

The invention provides a semi-supervised network traffic classification method based on transfer learning, which comprises the following steps: step 1: extracting statistical features of network traffic data and screening them using the Pearson correlation coefficient; step 2: inputting the statistical features into a pre-training model and saving the parameters learned by the pre-training model; step 3: migrating the pre-training model into a retraining model with more linear layers, and retraining the retraining model to obtain a classifier for classifying network traffic. By extracting statistical features of the network traffic data, screening them with the Pearson correlation coefficient, training a pre-training model and saving its parameters, and finally retraining with a retraining model, the invention addresses three problems of existing network traffic classification methods: large data sets are difficult to collect and label, outdated data sets are easily wasted, and model classification accuracy drops when the samples undergo concept drift.

Description

Semi-supervised network traffic classification method based on transfer learning

Technical Field

The invention relates to a semi-supervised network traffic classification method based on transfer learning, and belongs to the field of convolutional neural networks.

Background

With the rapid growth of network data and network complexity, the rise of network traffic diversity presents a great challenge to network traffic control and detection. The study of network traffic classification helps to schedule different types of network traffic and to protect against attacks by malicious traffic. How to automatically detect and correctly classify different types of network traffic categories has become a key issue in improving network quality of service and network security quality.

Traditional classification methods of network traffic can be classified into the following three categories:

Port number based methods: in the 1990s the Internet Assigned Numbers Authority (IANA) assigned port numbers to different network protocols, which opened the way for network traffic classification. This method determines the application class of unknown traffic by reading the port number in the packet header and looking it up in a port mapping table. Because early Internet applications were simple, port-based classification achieved high accuracy and high speed. However, with the spread of technologies such as network address translation, port forwarding, protocol embedding, and dynamic port allocation, the accuracy of port-based classification has dropped sharply.
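The port lookup described above can be sketched as follows. This is a minimal illustration, not part of the claimed method: the function name `classify_by_port` and the five-entry table are hypothetical, although the listed ports are standard IANA well-known assignments.

```python
# Hypothetical, deliberately tiny port map (standard well-known ports).
PORT_MAP = {22: "ssh", 25: "smtp", 53: "dns", 80: "http", 443: "https"}


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Return the label of the first known port, checking the destination first."""
    for port in (dst_port, src_port):
        if port in PORT_MAP:
            return PORT_MAP[port]
    # Dynamic or forwarded ports end up here, which is the weakness noted above.
    return "unknown"
```

A flow that uses two dynamically allocated ports falls through to "unknown", which illustrates why the accuracy of this approach declined.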

Payload-based methods: researchers found that the payload of a packet contains much information useful for classification, and deep packet inspection (DPI) technology has attracted increasing interest. DPI classifies network traffic by analyzing the payload portion of a packet. It does not need the port number of the packet, so it is not affected by dynamic ports. However, this method has the following problems: encrypted packets cannot be analyzed, and performance degrades on traffic that uses protocol obfuscation or protocol encapsulation; the method also inspects the specific content of the data transmitted by the user, which may infringe on user privacy.

Machine learning based methods: these methods classify network traffic using statistical features of the traffic, such as packet size, packet inter-arrival time, and packet rate. The statistical features representing the network traffic are input into a machine learning model, which, after suitable training, can identify network traffic. However, this method has the following problems: a set of features reflecting the network traffic must be designed, and mining such features requires expertise and a great deal of time; the method also requires a large amount of labeled data to train the classifier, collecting and labeling that data consumes considerable manpower and material resources, and as network technology develops, previously collected and labeled data sets are likely to become outdated.

In view of the foregoing, it is necessary to propose a new semi-supervised network traffic classification method based on transfer learning to solve the above-mentioned problems.

Disclosure of Invention

The invention aims to provide a semi-supervised network traffic classification method based on transfer learning, so as to solve the problems that, in existing network traffic classification, large data sets are difficult to collect and label, outdated data sets are easily wasted, and model classification accuracy drops when the samples undergo concept drift.

In order to achieve the above purpose, the present invention provides a semi-supervised network traffic classification method based on transfer learning, which specifically comprises the following steps:

step 1: extracting statistical features of network traffic data and screening them using the Pearson correlation coefficient;

step 2: inputting the statistical features into a pre-training model and saving the parameters learned by the pre-training model;

step 3: migrating the pre-training model into a retraining model with more linear layers, and retraining the retraining model to obtain a classifier for classifying network traffic.

As a further improvement of the present invention, the step 1 specifically includes:

step 11: capturing network traffic data by software, the network traffic data comprising unlabeled data and labeled data, the number of unlabeled data being greater than the number of labeled data;

step 12: extracting key information of each data packet in the network flow data;

step 13: calculating to obtain statistical characteristics according to key information of the first 45 data packets of each flow in the network flow data;

step 14: using the Pearson correlation coefficient to screen, from the statistical features of the flows, key statistical features capable of effectively identifying network traffic categories, and dividing the key statistical features into a labeled part and an unlabeled part.

As a further improvement of the present invention, the step 2 specifically includes:

step 21: converting the key statistical features of the unlabeled data into an unlabeled data vector matrix as input of the pre-training model;

step 22: building the pre-training model, setting the initial learning rate of the pre-training model to be 0.001, setting the batch size to be 32 and setting the iteration number to be 150;

step 23: inputting the unlabeled data vector matrix in the step 21 into the pre-training model for pre-training;

step 24: after the pre-training is finished, saving the parameters of each layer of the pre-training model.

As a further improvement of the present invention, the step 3 specifically includes:

step 31: dividing the labeled data into a training set and a test set at a ratio of 7:3, and converting the key statistical features of the training set into a training set vector matrix serving as input of the retraining model;

step 32: building the retraining model, and migrating the trained pre-training model in the step 24 to the retraining model with more linear layers;

step 33: setting the initial learning rate of the retraining model to be 0.001, the batch size to be 64 and the iteration number to be 100;

step 34: using the trained retraining model as a classifier for classifying network traffic.
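The training settings named in steps 22 and 33 can be collected in one place, as in the sketch below. The class name `TrainConfig`, the helper `batches_per_pass`, and the reading of "iteration number" as the number of passes over the data are assumptions made for illustration; the text does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float  # initial learning rate
    batch_size: int
    iterations: int       # assumed here to mean passes over the data


# Values taken from step 22 (pre-training) and step 33 (retraining).
PRETRAIN = TrainConfig(learning_rate=0.001, batch_size=32, iterations=150)
RETRAIN = TrainConfig(learning_rate=0.001, batch_size=64, iterations=100)


def batches_per_pass(num_samples: int, cfg: TrainConfig) -> int:
    """Number of mini-batches needed to cover num_samples once."""
    return math.ceil(num_samples / cfg.batch_size)
```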

As a further improvement of the present invention, there are 116 statistical features described in step 1.

As a further improvement of the invention, 90 statistical features whose correlation is less than 0.9 are selected using the Pearson correlation coefficient.

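As a further improvement of the present invention, the Pearson correlation coefficient is calculated as:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$

where $X$ and $Y$ represent the sequences of two different statistical features across the samples, $\bar{X}$ and $\bar{Y}$ are their means, and $n$ is the number of samples.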

As a further improvement of the present invention, the key information in step 12 includes packet arrival time, packet protocol, packet source IP address, packet destination IP address, packet size.

As a further improvement of the invention, the pre-training model does not contain a Softmax layer for classification, and the final output of the pre-training model is the statistical features.

As a further improvement of the invention, two fully connected layers are added to the retraining model to reduce the dimensionality of the statistical features, and a Softmax layer is used to obtain the final classification result.

The beneficial effects of the invention are as follows: the semi-supervised network traffic classification method based on transfer learning extracts statistical features of network traffic data, screens them using the Pearson correlation coefficient, trains a pre-training model and saves its parameters, and finally retrains with a retraining model. It thereby solves the problems that, in existing network traffic classification, large data sets are difficult to collect and label, outdated data sets are easily wasted, and model classification accuracy drops when the samples undergo concept drift.

Drawings

Fig. 1 is a step diagram of a semi-supervised network traffic classification method based on transfer learning according to the present invention.

Fig. 2 is a flow chart of the method shown in Fig. 1.

Fig. 3 is a schematic diagram of the pre-training model structure and the parameters of each layer according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of the retraining model structure and the parameters of each layer according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of traffic classification according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

As shown in fig. 1 and fig. 2, the present invention provides a semi-supervised network traffic classification method based on transfer learning, which specifically includes the following steps:

Step 1 collects and extracts the original network traffic data so that it can be applied to the training models of the invention. Step 1 specifically comprises the following steps:

step 11: capturing network traffic data by software, wherein the network traffic data comprises unlabeled data and labeled data, and the quantity of the unlabeled data is larger than that of the labeled data;

step 12: extracting key information of each data packet in the network traffic data;

step 13: calculating statistical features from the key information of the first 45 data packets of each flow in the network traffic data;

step 14: using the Pearson correlation coefficient to screen, from the statistical features of the flows, key statistical features capable of effectively identifying network traffic categories, and dividing the key statistical features into two parts, namely labeled and unlabeled.

Step 11 captures network traffic data through software. Network traffic data usually has no labels, yet traditional machine learning and deep learning based traffic classification methods usually need a large amount of labeled data to train a classifier that performs well. Collecting and labeling a large number of data samples consumes huge manpower and material resources, and with the rapid development of network technology, samples collected and labeled at such cost are likely to become outdated. The invention needs only a large amount of unlabeled data and a small amount of labeled data when capturing network traffic data, which greatly reduces the difficulty of collecting the original data.

The key information of each data packet in step 12 includes the arrival time of the data packet, the protocol of the data packet, the source IP address of the data packet, the destination IP address of the data packet, and the size of the data packet, and the key information of each data packet is saved as a txt file.

Step 13 calculates the statistical features from the key information of the first 45 data packets of each flow in the network traffic data; there are 116 statistical features in total.
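The calculation in step 13 can be sketched as follows. This is an illustration under stated assumptions: the four statistics shown (mean, standard deviation, minimum, maximum) are placeholders, since the patent's own 17 statistics are listed in Table 1 and are not reproduced here, and the function names `summarize` and `flow_features` are hypothetical.

```python
import numpy as np

MAX_PACKETS = 45  # only the first 45 packets of each flow are used


def summarize(seq):
    """Four example statistics (placeholders for the 17 listed in Table 1)."""
    seq = np.asarray(seq, dtype=float)
    return {"mean": float(seq.mean()), "std": float(seq.std()),
            "min": float(seq.min()), "max": float(seq.max())}


def flow_features(packets):
    """packets: non-empty list of (arrival_time, size) pairs in arrival order."""
    head = packets[:MAX_PACKETS]
    times = [t for t, _ in head]
    sizes = [s for _, s in head]
    # Inter-arrival gaps; a single-packet flow gets one zero gap.
    gaps = np.diff(times) if len(times) > 1 else np.zeros(1)
    features = {}
    for name, seq in (("size", sizes), ("gap", gaps)):
        for stat, value in summarize(seq).items():
            features[f"{name}_{stat}"] = value
    return features
```

Truncating to the first 45 packets keeps the cost per flow bounded, which matches the real-time motivation given in the embodiment.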

Features are correlated with one another: the higher the correlation between two features, the more similar they are, and the higher the probability of category confusion during classification. Step 14 therefore uses the Pearson correlation coefficient to calculate the correlation between features and retains the 90 statistical features whose correlation is less than 0.9. The purpose of this step is to reduce the feature dimension and the computational complexity while improving classification accuracy.

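The Pearson correlation coefficient is calculated as:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$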
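One possible way to apply the 0.9 threshold is sketched below. The text states only that features with correlation below 0.9 are retained; the greedy, left-to-right order of comparison, the use of the absolute value, and the function name `select_features` are assumptions. The sketch also assumes no feature column is constant, since the correlation is undefined in that case.

```python
import numpy as np


def select_features(X, threshold=0.9):
    """Return indices of columns whose |Pearson r| with every kept column is below threshold.

    X has shape (num_samples, num_features).
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-by-feature matrix
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return kept
```

With this rule, a feature that is nearly a scaled copy of an earlier one is dropped, while weakly related features are kept.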

The step 2 specifically comprises the following steps:

step 21: converting key statistical features of the unlabeled data into an unlabeled data vector matrix as input of a pre-training model;

step 22: building a pre-training model, setting the initial learning rate of the pre-training model to be 0.001, setting the batch size to be 32 and setting the iteration number to be 150;

step 23: inputting the unlabeled data vector matrix in the step 21 into the pre-training model for pre-training;

step 24: after the pre-training is finished, saving the parameters of each layer of the pre-training model.

Specifically, in step 21, the 90 key statistical features of each of the large number of unlabeled samples are converted into an unlabeled data vector matrix of size 2×45, which facilitates input into the pre-training model.
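The conversion of 90 features into a 2×45 matrix can be sketched as a reshape. The row-major ordering (features 0 to 44 in the first row, 45 to 89 in the second) and the function name `to_matrix` are assumptions; the text gives only the target size.

```python
import numpy as np


def to_matrix(features):
    """Reshape (..., 90) feature vectors into (..., 2, 45) matrices."""
    features = np.asarray(features, dtype=np.float32)
    if features.shape[-1] != 90:
        raise ValueError("expected 90 screened features per sample")
    return features.reshape(*features.shape[:-1], 2, 45)
```

The same helper works for a single sample or a whole batch, so it could serve both the unlabeled and the labeled data.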

Referring to fig. 3, the pre-trained model built in step 22 does not include a Softmax layer for classification.

In step 24, once the pre-training model has been trained, it can predict the feature distribution of the network traffic data.

The pre-training model in the invention is a one-dimensional convolutional neural network model.

The step 3 specifically comprises the following steps:

step 31: dividing the labeled data into a training set and a test set at a ratio of 7:3, and converting key statistical features of the training set into a training set vector matrix serving as input of a retraining model;

step 32: building a retraining model, and migrating the pre-training model trained in step 24 to the retraining model with more linear layers;

step 33: setting the initial learning rate of the retraining model to be 0.001, the batch size to be 64 and the iteration number to be 100;

step 34: using the trained retraining model as a classifier for classifying network traffic.

Step 31 converts the 90 key statistical features of each labeled training sample into a labeled data vector matrix of size 2×45, which facilitates input into the retraining model. Because the retraining model is a one-dimensional convolutional neural network model that is tested while it is trained, the labeled data is divided into a training set and a test set at a ratio of 7:3: the training set is used as input for training, and the test set is used to test the whole retraining model, thereby ensuring the reliability and accuracy of the retraining model.
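The 7:3 division can be sketched as a shuffled split. The random shuffling, the fixed seed, and the function name `split_7_3` are assumptions; the text specifies only the ratio.

```python
import numpy as np


def split_7_3(X, y, seed=0):
    """Shuffle the labeled samples and split them 70% / 30%."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    cut = (7 * len(X)) // 10  # integer arithmetic avoids rounding surprises
    train, test = order[:cut], order[cut:]
    return X[train], y[train], X[test], y[test]
```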

Referring to fig. 4, in step 32 two fully connected layers are added to the retraining model to reduce the dimensionality of the statistical features, and one Softmax layer is added to obtain the final classification result.
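The migration in step 32 can be illustrated with a small NumPy sketch: the saved convolution parameters are copied into a new model, and two fully connected layers plus a Softmax are added on top. Everything concrete here is hypothetical, including the single convolution layer, the channel count, the kernel size, and the hidden width; the actual layer structure is given in Figs. 3 and 4 and is not reproduced. The default of 4 classes matches the four Video19 categories.

```python
import numpy as np


def init_pretrained(rng, in_ch=2, out_ch=8, kernel=3):
    """Stand-in for the parameters saved in step 24 (shapes are hypothetical)."""
    return {"conv_w": rng.normal(0.0, 0.1, (out_ch, in_ch, kernel)),
            "conv_b": np.zeros(out_ch)}


def build_retraining_model(pretrained, rng, hidden=32, num_classes=4, length=45):
    """Copy the pre-trained layers, then add two new fully connected layers."""
    out_ch, _, kernel = pretrained["conv_w"].shape
    flat = out_ch * (length - kernel + 1)
    model = {name: value.copy() for name, value in pretrained.items()}
    model["fc1_w"] = rng.normal(0.0, 0.1, (flat, hidden))
    model["fc1_b"] = np.zeros(hidden)
    model["fc2_w"] = rng.normal(0.0, 0.1, (hidden, num_classes))
    model["fc2_b"] = np.zeros(num_classes)
    return model


def forward(model, x):
    """x: (batch, 2, 45) -> class probabilities of shape (batch, num_classes)."""
    w, b = model["conv_w"], model["conv_b"]
    # Windows over the length axis: (batch, in_ch, out_len, kernel).
    windows = np.lib.stride_tricks.sliding_window_view(x, w.shape[2], axis=2)
    h = np.einsum("bclk,ock->bol", windows, w) + b[None, :, None]
    h = np.maximum(h, 0.0).reshape(len(x), -1)                 # ReLU, flatten
    h = np.maximum(h @ model["fc1_w"] + model["fc1_b"], 0.0)   # new layer 1
    logits = h @ model["fc2_w"] + model["fc2_b"]               # new layer 2
    logits = logits - logits.max(axis=1, keepdims=True)        # stable Softmax
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```

Only the forward pass is shown; retraining would then update the parameters on the labeled training set using the settings of step 33.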

In order to illustrate the effects of the above-described aspects of the present invention, a description will be given below with reference to specific examples.

Fig. 5 shows the traffic classification process. First, network traffic is collected with the Wireshark tool, and a pcap file is generated and stored. With real-time classification in mind, only the key information of the first 45 data packets of the network traffic data is extracted, namely packet size, packet arrival time, source IP address, destination IP address, and transport protocol information; the statistical features calculated from these first 45 data packets are used to represent the statistical features of the whole network traffic data. The 116 calculated statistical features comprise two parts. The first part covers 4 sequences (the packet arrival time sequence, the packet size sequence, the packet difference sequence, and the timestamp sequence); the 17 statistical features listed in Table 1 are computed for each sequence, giving 68 statistical features in total. The second part consists of 48 statistical features obtained by dividing the whole network traffic data by packet direction into uplink traffic, downlink traffic, and all packets; the specific features are listed in Table 2. Features are correlated with one another: the higher the correlation between two features, the more similar they are, and the higher the probability of category confusion during classification. The correlation between features is therefore calculated using the Pearson correlation coefficient, and the 90 statistical features whose correlation is less than 0.9 are retained. The purpose of this step is to reduce the feature dimension and the computational complexity while improving classification accuracy.

Table 1: The 17 statistical features

Table 2: The 48 statistical features
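The feature counts stated in this embodiment can be checked with a short calculation, which also shows why the 90 retained features fit a 2×45 matrix:

```latex
\begin{align*}
\underbrace{4 \times 17}_{\text{sequence statistics}} + \underbrace{48}_{\text{direction statistics}} &= 68 + 48 = 116 \\
116 - 90 &= 26 \quad \text{(features removed by the correlation screening)} \\
90 &= 2 \times 45 \quad \text{(size of the input matrix)}
\end{align*}
```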

Next, the statistical features of a large number of unlabeled old samples are used as input to the pre-training model, which is a one-dimensional convolutional neural network model. Once trained, the pre-training model can predict the feature distribution of the whole network traffic data. After training, the pre-training model is migrated to a new model with more linear layers, namely the retraining model. The new model is then quickly retrained using the statistical features of a small number of labeled new samples, after which it can perform the network traffic classification task.

The invention is verified on the ISCX-nonVPN data set, the 2013 NJUPT video data set (Video13), and the 2019 NJUPT video data set (Video19). The ISCX-nonVPN data set is a public data set containing traffic data of six application classes: email, file transfer, streaming, text chat, voice telephony, and P2P. The Video13 and Video19 data sets are network video data sets collected on the campus network of Nanjing University of Posts and Telecommunications (NJUPT) in 2013 and 2019 respectively, and their feature distributions differ. The categories of the Video13 data set are: VoD super-definition, VoD high definition, VoD standard definition, live video, conversational video, and P2P video. The categories of the Video19 data set are: on-demand 480P, on-demand 720P, live 480P, and live 720P. Because the two data sets were collected 6 years apart, Video13 can to some extent be regarded as an outdated old data set. The verification results are as follows:

As shown in Table 3, without pre-training, the Video19 data set reached an accuracy of 95.91% using a total of 16574 samples; with pre-training on the ISCX and Video13 data sets, Video19 reached accuracies of 95.6% and 98.31% respectively using a total of 8332 samples. Thus, when the invention is used for retraining, Video19 needs only about 50% of the samples for its overall accuracy to approach or exceed that of fully supervised training on all samples, which verifies the effectiveness of the invention.

Table 3: Performance of the invention on different pre-training data sets
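The "50% of the samples" figure and the accuracy differences follow directly from the numbers reported for Table 3:

```latex
\begin{align*}
\frac{8332}{16574} &\approx 0.503 \quad \text{(share of Video19 samples used with pre-training)} \\
98.31\% - 95.91\% &= +2.40 \text{ percentage points} \quad \text{(pre-training on Video13)} \\
95.6\% - 95.91\% &= -0.31 \text{ percentage points} \quad \text{(pre-training on ISCX)}
\end{align*}
```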

The Video19 data set was mixed into Video13 at ratios of 20%, 40%, and 60% and then verified; the results are shown in Table 4. As the proportion of new data in the pre-training data set increases, the model has more new data to learn from during pre-training, so the overall accuracy increases. Then, to simulate concept drift, non-video traffic data from the ISCX data set (the file transfer and email classes) was added to the Video19 data set to be identified. The experiment shows that the invention makes better use of the pre-training model to improve classification when concept drift occurs. Therefore, the new model obtained by migrating the pre-training model classifies new data with higher accuracy.

Table 4: Overall accuracy of the invention versus different methods on a mixed pre-training data set

In summary, the invention extracts statistical features of the network traffic data, screens them using the Pearson correlation coefficient, trains a pre-training model and saves its parameters, and finally retrains with a retraining model. It thereby solves the problems that, in existing network traffic classification methods, large data sets are difficult to collect and label, outdated data sets are easily wasted, and model classification accuracy drops when the samples undergo concept drift.

The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention.

Claims

1. A semi-supervised network traffic classification method based on transfer learning is characterized by comprising the following steps:

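step 1: extracting statistical features of network traffic data and screening them using the Pearson correlation coefficient; specifically comprising the following steps:

step 11: capturing network traffic data by software, the network traffic data comprising unlabeled data and labeled data, the number of unlabeled data being greater than the number of labeled data;

step 12: extracting key information of each data packet in the network traffic data;

step 13: calculating statistical features from the key information of the first 45 data packets of each flow in the network traffic data;

step 14: using the Pearson correlation coefficient to screen, from the statistical features of the flows, key statistical features capable of effectively identifying network traffic categories, and dividing the key statistical features into two parts, namely labeled and unlabeled;

step 2: inputting the screened statistical features into a pre-training model and saving the parameters learned by the pre-training model; specifically comprising the following steps:

step 21: converting the key statistical features of the unlabeled data into an unlabeled data vector matrix as input of the pre-training model;

step 22: building the pre-training model, setting the initial learning rate of the pre-training model to be 0.001, setting the batch size to be 32 and setting the iteration number to be 150;

step 23: inputting the unlabeled data vector matrix in the step 21 into the pre-training model for pre-training;

step 24: after the pre-training is finished, saving the parameters of each layer of the pre-training model;

step 3: migrating the pre-training model to a retraining model with more linear layers, and retraining the retraining model to obtain a classifier for classifying network traffic; specifically comprising the following steps:

step 31: dividing the labeled data into a training set and a test set at a ratio of 7:3, and converting the key statistical features of the training set into a training set vector matrix serving as input of the retraining model;

step 32: building the retraining model, and migrating the pre-training model trained in the step 24 to the retraining model with more linear layers;

step 33: setting the initial learning rate of the retraining model to be 0.001, the batch size to be 64 and the iteration number to be 100;

step 34: using the trained retraining model as a classifier for classifying network traffic.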

2. The semi-supervised network traffic classification method based on transfer learning as set forth in claim 1, wherein: there are 116 statistical features described in step 1.

3. The semi-supervised network traffic classification method based on transfer learning as recited in claim 2, wherein: 90 statistical features whose correlation is less than 0.9 are selected using the Pearson correlation coefficient.

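4. The semi-supervised network traffic classification method based on transfer learning as recited in claim 3, wherein: the Pearson correlation coefficient is calculated as follows:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$

wherein $X$ and $Y$ represent the sequences of two different statistical features across the samples, $\bar{X}$ and $\bar{Y}$ are their means, and $n$ is the number of samples.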

5. The semi-supervised network traffic classification method based on transfer learning as set forth in claim 1, wherein: the key information in step 12 includes packet arrival time, packet protocol, packet source IP address, packet destination IP address, and packet size.

6. The semi-supervised network traffic classification method based on transfer learning as set forth in claim 1, wherein: the pre-training model does not contain a Softmax layer for classification, and the statistical features are finally output by the pre-training model.

7. The semi-supervised network traffic classification method based on transfer learning as set forth in claim 1, wherein: two fully connected layers are added to the retraining model to reduce the dimensionality of the statistical features, and a Softmax layer is used to obtain the final classification result.