CN114615093B

CN114615093B - Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning

Info

Publication number: CN114615093B
Application number: CN202210506848.6A
Authority: CN
Inventors: 肖滕龙; 翟江涛; 许成程
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2022-05-11
Filing date: 2022-05-11
Publication date: 2022-07-26
Anticipated expiration: 2042-05-11
Also published as: CN114615093A

Abstract

The invention discloses an anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning. The method comprises the following steps: collecting raw network traffic, pre-screening it, and removing non-Tor traffic; reconstructing the pre-screened traffic and converting it into grayscale feature maps; processing the reconstructed feature maps with a convolutional neural network model and a recurrent neural network model to extract an interaction information feature vector, a packet spatial feature vector and a flow temporal feature vector, and fusing the three feature vectors; inputting the fused features into a multi-classifier for application classification, wherein the multi-classifier updates its parameters through an inheritance learning mechanism when a new traffic class is detected; and determining the application to which the traffic belongs by majority rule. The invention simplifies feature design, makes the features more comprehensive, meets the requirement of updating model parameters online, lets the model retain what it learned in past training, and requires only small-scale training each time a new category is added.

Description

Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning

Technical Field

The invention relates to network traffic identification and network application classification, in particular to an anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning.

Background

With the continuous development of the Internet, network traffic has become increasingly complex, and new types of applications keep emerging. Applications generate large amounts of network traffic, and different types of traffic exhibit different characteristics. The goal of traffic classification is to identify the class of traffic from its distinguishing characteristics, which is essential to network operators. From the perspective of service quality, traffic classification is the first step in guaranteeing quality of service and a prerequisite for providing differentiated services according to the requirements of different service types; from the perspective of security, it is the first step in detecting abnormal network traffic, so that the network can be better protected. In recent years, with users' growing demand for privacy protection and the continuous development of anonymization and encryption technology, more and more traffic is specially processed, which poses new challenges to network traffic classification.

Classification methods in the field of traffic identification have gone through several changes. Conventional traffic classification methods fall mainly into two categories. The first is the port-based method, which identifies traffic by the protocol associated with its port number; with the advent of port obfuscation techniques in anonymous networks, this method is becoming ineffective. The second is the identification method based on Deep Packet Inspection (DPI), which matches packet payloads against per-category regular expressions to determine the category; this method is no longer feasible now that traffic anonymization and encryption technology has matured. As the traditional methods lost effectiveness, researchers began to look for new traffic classification methods, and machine learning, which has progressed rapidly in recent years, has received considerable attention. Compared with the traditional methods, machine learning is more intelligent and convenient, and because it classifies according to the statistical characteristics of the traffic, it can effectively avoid the influence of traffic encryption. Researchers have therefore proposed traffic classification algorithms based on machine learning; widely used algorithms include support vector machines, decision trees, random forests and XGBoost. These methods achieve good classification accuracy and are widely accepted. However, machine-learning-based traffic classification relies on expert experience to extract and screen traffic features, which is time-consuming and laborious and still leaves the features insufficiently comprehensive; the methods place high demands on how representative the features are, and classification accuracy is low when the features are not representative enough.

Models based on deep learning have become a research hotspot, and end-to-end models are favored by researchers. In actual deployment, however, the model must be retrained whenever a new traffic identification scenario is encountered, which consumes a large amount of time. This is the difficulty currently faced in classifying anonymous network traffic by application.

Disclosure of Invention

Object of the invention: the invention aims to provide an anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning that at least partially solve the problems described in the background art.

Technical solution: an anonymous network traffic identification method based on traffic reconstruction and inheritance learning comprises the following steps:

collecting original network flow, primarily screening the flow, and removing non-Tor flow;

reconstructing the pre-screened traffic and converting it into grayscale feature maps, comprising: raw byte feature reconstruction: taking L as the standard number of bytes, zero-padding data packets shorter than L bytes, truncating data packets longer than L bytes, and normalizing the result to generate an i*i packet byte matrix, thereby converting the packet byte matrix into a grayscale image; and uplink and downlink interaction behavior feature reconstruction: constructing the horizontal and vertical axes from the size and direction of the data packets and the time intervals, and taking the number of data packets in each time interval as the gray value of a pixel to form a feature map that models the uplink and downlink interaction behavior;

taking a data packet as the unit, inputting the corresponding uplink and downlink interaction behavior feature map into a convolutional neural network to extract an interaction information feature vector V_s, inputting the raw byte feature map into the convolutional neural network to extract a packet spatial feature vector V_n, grouping the packet spatial feature vectors and inputting them into a recurrent neural network to extract a flow temporal feature vector V_m, and fusing the three feature vectors;

inputting the fusion characteristics into a multi-classifier for application classification, wherein the multi-classifier updates classifier parameters through an inheritance learning mechanism when detecting a new flow category;

determining the application to which the traffic belongs by majority rule.

The invention also provides an anonymous network flow identification device based on flow reconstruction and inheritance learning, which comprises the following components:

the data acquisition and filtering module is used for acquiring original network flow, primarily screening the flow and eliminating non-Tor flow;

the traffic reconstruction module, which reconstructs the pre-screened traffic and converts it into grayscale feature maps, and which comprises: a raw byte feature reconstruction unit: taking L as the standard number of bytes, zero-padding data packets shorter than L bytes, truncating data packets longer than L bytes, and normalizing the result to generate an i*i packet byte matrix, thereby converting the packet byte matrix into a grayscale image; and an uplink and downlink interaction behavior feature reconstruction unit: constructing the horizontal and vertical axes from the size and direction of the data packets and the time intervals, and taking the number of data packets in each time interval as the gray value of a pixel to form a feature map that models the uplink and downlink interaction behavior;

the feature extraction and fusion module, which takes a data packet as the unit, inputs the corresponding uplink and downlink interaction behavior feature map into the convolutional neural network to extract an interaction information feature vector V_s, inputs the raw byte feature map into the convolutional neural network to extract a packet spatial feature vector V_n, inputs a group of packet spatial feature vectors into a recurrent neural network to extract a flow temporal feature vector V_m, and fuses the three feature vectors;

the application classification module is used for inputting the fusion characteristics into a multi-classifier for application classification, and the multi-classifier updates classifier parameters through an inheritance learning mechanism when detecting a new flow category;

and the class judgment module is used for determining the attribution application of the flow based on a majority principle.

The present invention also provides a computer apparatus comprising:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured for execution by the one or more processors, which when executed by the processors, implement the steps of the anonymous network traffic identification method based on traffic reconstruction and inheritance learning as described above.

The present invention also provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, implements the steps of the anonymous network traffic identification method based on traffic reconstruction and inheritance learning as described above.

Beneficial effects: by reconstructing traffic feature maps, the invention extracts feature vectors of different dimensions that carry interaction information, packet-level spatial information and flow-level temporal information, and performs application classification on them. This solves the problem of low classification accuracy when the features are not representative enough, simplifies feature design, makes the features more comprehensive, and meets the requirement of updating model parameters online. In addition, the inheritance learning mechanism lets the classifier model retain what it learned in past training, so that only small-scale training is needed each time a new category is added. The method can classify anonymous network traffic by application efficiently, accurately and at low cost.

Drawings

FIG. 1 is a general flow diagram of a Tor traffic identification method of the present invention;

FIG. 2 is a flowchart of an embodiment of a Tor traffic application identification method of the present invention;

FIG. 3 is a schematic diagram of interactive behavior traffic reconstruction in accordance with the present invention;

FIG. 4 is a schematic diagram of a convolutional neural network structure employed in the present invention;

FIG. 5 is a schematic diagram of a recurrent neural network architecture employed in the present invention;

FIG. 6 is a schematic diagram of an online updating method for inherited learning mechanism parameters in the present invention;

FIG. 7 is a schematic diagram of determining the category of a flow by majority rule in the present invention.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings.

Referring to fig. 1 and fig. 2, the anonymous network traffic identification method based on traffic reconstruction and inheritance learning provided by the present invention includes the following steps:

Step 1: collecting raw network traffic, pre-screening it, and removing non-Tor traffic.

According to an embodiment of the invention, a traffic detector is deployed in the network, accounts are created for various applications, and the Tor (The Onion Router) network is used to simulate users' behavior in those applications, generating Tor traffic, i.e. anonymous network traffic. The traffic is captured with Wireshark and stored in PCAP format, and the raw traffic is split into bidirectional flows according to the five-tuple {SrcIP, SrcPort, DstIP, DstPort, Protocol} and then stored. In the five-tuple, SrcIP is the source IP address, SrcPort is the source port, DstIP is the destination IP address, DstPort is the destination port, and Protocol is the protocol type. A set of packets sharing the same five-tuple is regarded as a unidirectional flow, whereas in a bidirectional flow the source and destination IP addresses and the source and destination ports may be swapped simultaneously. For example, a flow containing only packets from A to B is unidirectional, and a flow containing packets from A to B and from B to A is bidirectional. The network carries mainly two types of protocol flows, TCP flows and UDP flows: a TCP flow uses the SYN flag to mark the start of transmission and the FIN flag to mark its end, while a UDP flow is delimited by the time interval between packets. The invention uses the DPKT library to parse and split the PCAP files and retains the information of all layers of each session flow.
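The five-tuple grouping described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `flow_key` and `group_bidirectional` and the dictionary packet layout are assumptions made for the example, and real input would come from parsed PCAP records rather than hand-written dictionaries.

```python
from collections import defaultdict

def flow_key(src_ip, src_port, dst_ip, dst_port, protocol):
    """Return a direction-independent key for a five-tuple (hypothetical helper)."""
    endpoint_a = (src_ip, src_port)
    endpoint_b = (dst_ip, dst_port)
    # Sorting the two endpoints makes A->B and B->A map to the same key.
    low, high = sorted([endpoint_a, endpoint_b])
    return (low, high, protocol)

def group_bidirectional(packets):
    """Group packet dictionaries into bidirectional flows."""
    flows = defaultdict(list)
    for p in packets:
        key = flow_key(p["src_ip"], p["src_port"],
                       p["dst_ip"], p["dst_port"], p["protocol"])
        flows[key].append(p)
    return dict(flows)

# Illustrative packets: the first two are opposite directions of one TCP flow.
packets = [
    {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "10.0.0.2", "dst_port": 443, "protocol": "TCP"},
    {"src_ip": "10.0.0.2", "src_port": 443, "dst_ip": "10.0.0.1", "dst_port": 5000, "protocol": "TCP"},
    {"src_ip": "10.0.0.1", "src_port": 5001, "dst_ip": "10.0.0.2", "dst_port": 443, "protocol": "UDP"},
]
flows = group_bidirectional(packets)
```

Splitting each group further by SYN/FIN flags (TCP) or by an idle timeout (UDP) would be layered on top of this grouping.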

Features are extracted from the split network flows with the feature extraction tool CICFlowMeter, discretized by equal-depth histogram binning, and input into an extreme gradient boosting (XGBoost) decision tree. Using an objective function composed of a loss function and a regularization penalty term, the values of each feature are traversed in turn to find the split points that minimize the objective function, thereby filtering out non-Tor traffic and reducing the subsequent workload. The objective function $Obj^{(t)}$ is shown in formula (1), where $l$ is the loss function and $\Omega$ is the penalty function:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \qquad (1)$$

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) \qquad (2)$$

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2} \qquad (3)$$

In the formulas, $l$ in formula (1) describes the difference between the true value and the predicted value, $f_t(x_i)$ is the decision tree model generated for sample $i$ in the $t$-th round of fitting, $g_i$ is the first derivative of $l$, and $h_i$ is the second derivative of $l$. $T$ is the number of leaves of the decision tree model, $\gamma$ is the learning rate, $\hat{y}_i$ is the decision tree's prediction for the input sample, $\lambda$ is a constant that controls the size of the penalty term, and $w_j$ is the predicted value of the $j$-th leaf node of the decision tree. The objective function represents the error between the predicted value and the true value. After the training samples are input, the decision tree evaluates the different values of each feature of Tor and non-Tor traffic and measures, by computing the loss function, how much each value influences whether a sample is judged to be Tor or non-Tor traffic. When samples with a certain feature value are always judged to be non-Tor, that value is taken as a split point, so that non-Tor traffic can be filtered out.
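As a rough illustration of how the first and second derivatives g_i and h_i drive the choice of a split point, the sketch below evaluates one candidate split with the standard second-order gain used in gradient-boosted trees. Several things here are assumptions rather than details from the description: the loss is taken to be the logistic loss, `lam` and `gamma` are hypothetical names for the penalty constants, and the four samples are made up.

```python
import math

def grad_hess(y_true, raw_score):
    """First and second derivative of the logistic loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)

def leaf_weight(g_sum, h_sum, lam):
    """Leaf value that minimizes the second-order objective with an L2 penalty."""
    return -g_sum / (h_sum + lam)

def split_gain(left, right, lam, gamma):
    """Objective reduction from splitting a node into `left` and `right`.

    `left` and `right` are (sum of g_i, sum of h_i) over the samples on each side.
    """
    def score(g, h):
        return g * g / (h + lam)
    (gl, hl), (gr, hr) = left, right
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

# Four samples with label 1 = Tor, 0 = non-Tor, all starting from raw score 0.
labels = [1, 1, 0, 0]
stats = [grad_hess(y, 0.0) for y in labels]
# Candidate split on some feature value that separates the two Tor samples.
left = (sum(g for g, _ in stats[:2]), sum(h for _, h in stats[:2]))
right = (sum(g for g, _ in stats[2:]), sum(h for _, h in stats[2:]))
gain = split_gain(left, right, lam=1.0, gamma=0.0)
w_left = leaf_weight(left[0], left[1], lam=1.0)
```

A feature value whose split yields a large positive gain is a good candidate split point; a split that isolates only non-Tor samples on one side is the kind of split the description uses to discard non-Tor traffic.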

Step 2: reconstructing the pre-screened traffic and converting it into grayscale feature maps.

The flow reconstruction of the invention comprises two parts of a data packet original byte characteristic diagram and an uplink and downlink interactive behavior characteristic diagram.

For raw byte feature reconstruction, L is taken as the standard number of bytes: data packets shorter than L bytes are zero-padded, data packets longer than L bytes are truncated, and after standardization/normalization an i*i packet byte matrix is generated and converted into a grayscale image, where i is determined by the packet size distribution. For example, in the invention the packet sizes are distributed within 1400 bytes, so the standard packet size is set to 1444 bytes and a 38 × 38 packet byte matrix is generated (38 × 38 = 1444), from which a grayscale image is obtained. Each image is input individually into the convolutional neural network to obtain a spatial feature vector, and the resulting packet spatial vectors are grouped by flow and passed through the recurrent neural network to obtain a temporal feature vector.
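The padding, truncation and reshaping described above can be sketched in a few lines. The function name `packet_to_gray_matrix` is hypothetical, and dividing each byte by 255 is an assumed normalization; the description only states that the bytes are normalized.

```python
def packet_to_gray_matrix(payload: bytes, side: int = 38):
    """Convert raw packet bytes into a side x side matrix of values in [0, 1]."""
    standard_len = side * side                      # 38 * 38 = 1444 bytes
    truncated = payload[:standard_len]              # cut packets longer than L
    padded = truncated + bytes(standard_len - len(truncated))  # zero-pad short ones
    normalized = [b / 255.0 for b in padded]        # assumed normalization
    return [normalized[r * side:(r + 1) * side] for r in range(side)]

short_matrix = packet_to_gray_matrix(b"\xff" * 10)
long_matrix = packet_to_gray_matrix(bytes([128]) * 2000)
```

Each returned matrix can be treated as a single-channel grayscale image and passed to the packet-level CNN.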

For uplink and downlink interaction behavior feature reconstruction, the size and direction of the packets of a network flow and their arrival time intervals form a three-dimensional feature map, and the number of packets in each time interval is used as the gray value of a pixel to form a grayscale map that models the uplink and downlink interaction information. As shown in FIG. 3, the abscissa of the grayscale map is the packet size: the maximum and minimum packet sizes in the flow sample are found and used as the two ends of the abscissa, and all packet sizes are normalized onto this axis. The ordinate is divided into two equal parts, which hold the arrival times of uplink packets and downlink packets respectively, and the intensity of the pixel at each intersection of abscissa and ordinate represents the number of packets. In this description, uplink and downlink refer to the two directions of transmission between two network nodes, and uplink and downlink interaction behavior information refers to paired packets whose source and destination IP addresses are swapped. For example, when A and B transmit to each other, the direction of the first packet generated is the uplink, and the opposite direction is the downlink.
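One possible reading of this construction is a two-dimensional histogram whose columns are packet-size bins and whose rows are time bins, with the upper half reserved for uplink packets and the lower half for downlink packets. The sketch below follows that reading; the function name `interaction_map`, the equal-width binning, the grid size and the sample packets are all assumptions made for illustration.

```python
def interaction_map(packets, width=8, half_height=4):
    """Build a (2 * half_height) x width count grid from (time, size, direction).

    direction > 0 marks an uplink packet, direction < 0 a downlink packet.
    """
    sizes = [size for _, size, _ in packets]
    times = [t for t, _, _ in packets]
    s_min, s_max = min(sizes), max(sizes)
    t_min, t_max = min(times), max(times)

    def bin_index(value, low, high, n_bins):
        if high == low:
            return 0
        # Clamp so that the maximum value falls into the last bin.
        return min(int((value - low) / (high - low) * n_bins), n_bins - 1)

    grid = [[0] * width for _ in range(2 * half_height)]
    for t, size, direction in packets:
        col = bin_index(size, s_min, s_max, width)
        row = bin_index(t, t_min, t_max, half_height)
        if direction < 0:
            row += half_height          # downlink packets use the lower half
        grid[row][col] += 1             # pixel value = number of packets
    return grid

sample = [(0.0, 100, +1), (0.1, 1500, -1), (0.2, 100, +1), (1.0, 1500, -1)]
grid = interaction_map(sample)
```

Scaling the counts to the 0 to 255 range would turn the grid into a grayscale image of the kind shown in FIG. 3.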

Step 3: constructing the neural network models and extracting features.

A Convolutional Neural Network (CNN) is a multi-layer supervised learning neural network whose convolutional and pooling layers are the core of feature extraction. The weight parameters of the network are adjusted backward, layer by layer, with a gradient descent algorithm to minimize a loss function, and the accuracy of the network is improved through repeated iterative training. A convolutional neural network consists of alternating convolutional and pooling layers, followed by fully connected layers and a classifier such as a Softmax layer. The input of the first fully connected layer is the feature map extracted by the convolutional and pooling layers. A Recurrent Neural Network (RNN) is a neural network structure in which the current output of a sequence also depends on previous outputs: the network memorizes earlier information and applies it to the computation of the current output.

As shown in FIG. 4, the convolutional neural network model constructed by the invention has the structure input layer - convolutional layer (CONV1) - pooling layer (POOL1) - convolutional layer (CONV2) - pooling layer (POOL2) - convolutional layer (CONV3) - fully connected layer (FC1) - fully connected layer (FC2); FC3 in the figure is the feature simplification step described below. The input is a 38 × 38 × 1 grayscale image. After convolution by CONV1 the number of channels is 32 and the dimension is 38 × 38 × 32; after 2 × 2 down-sampling by POOL1 the dimension is 19 × 19 × 32; after convolution by CONV2 the number of channels is 64; and after 2 × 2 down-sampling by POOL2 the output dimension is 10 × 10 × 64. After convolution by CONV3, a Flatten function reshapes the output into one dimension so that it can be input into the fully connected layers. Finally, the number of neurons n of the fully connected layer FC2 is set to 64, i.e. FC2 outputs a 1 × 64 feature vector.
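The layer sizes quoted above (38, 19 and 10 per side, with 32 and 64 channels) can be checked with simple shape arithmetic. The sketch assumes stride-1 convolutions with "same" padding and 2 × 2 pooling with stride 2 that rounds odd sizes up; these padding and rounding choices are not stated in the description, but they are what reproduces the step from 19 to 10. The helper names are hypothetical, and CONV3 is left out because its channel count is not given.

```python
import math

def conv_same(shape, out_channels):
    """Convolution with 'same' padding and stride 1: spatial size is unchanged."""
    height, width, _ = shape
    return (height, width, out_channels)

def pool_2x2(shape):
    """2 x 2 pooling with stride 2; odd sizes are rounded up (19 -> 10)."""
    height, width, channels = shape
    return (math.ceil(height / 2), math.ceil(width / 2), channels)

def trace_shapes(input_shape=(38, 38, 1)):
    """Return (layer name, output shape) pairs from the input up to POOL2."""
    shape = input_shape
    trace = [("INPUT", shape)]
    for name, layer in [
        ("CONV1", lambda s: conv_same(s, 32)),
        ("POOL1", pool_2x2),
        ("CONV2", lambda s: conv_same(s, 64)),
        ("POOL2", pool_2x2),
    ]:
        shape = layer(shape)
        trace.append((name, shape))
    return trace

for name, shape in trace_shapes():
    print(name, shape)
```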

As shown in FIG. 5, the recurrent neural network model constructed by the invention has the structure BiGRU layer (BiGRU1) - BiGRU layer (BiGRU2) - fully connected layer (FC4); FC5 in the figure is the feature simplification step described below. Its input is the grouped packet spatial features. The packet spatial features obtained by passing several packet feature maps through the CNN model are taken as one group, and the group of packet feature vectors is input into the BiGRU (bidirectional gated recurrent unit) layers to extract a high-level temporal feature vector. The number of neurons of the fully connected layer FC4 is m, and m is set to 64 in the invention, i.e. the flow temporal feature vector based on the grouped packets has dimension 1 × 64.

The specific process of extracting the features is as follows:

(a) extracting characteristic vectors of uplink and downlink interactive information;

The uplink and downlink interaction behavior reconstruction map is input into the convolutional neural network. The first two convolutional layers and pooling layers extract a spatial feature map; after the third convolutional layer, a Flatten function converts the feature map into a one-dimensional vector so that it can be input into the fully connected layers; and a 1 × s uplink and downlink interaction behavior feature vector is extracted from the fully connected layer FC2. Since FC2 has 64 neurons, a 1 × 64 one-dimensional feature vector is obtained. The interaction behavior feature vector is saved.

(b) Extracting packet-level spatial feature vectors;

The invention feeds the CNN one data packet at a time to extract packet-level spatial features, i.e. the spatial features of a single data packet. Each packet is truncated or zero-padded to k standard bytes, each byte is converted by one-hot encoding into an l-dimensional vector, and the k bytes form an l*k grayscale image. In an embodiment of the invention, the grayscale image set is divided into a training set and a test set at a ratio of 9:1. The convolutional neural network is trained with a batch size of 64, a cross-entropy loss function, the stochastic gradient descent algorithm as the optimization method, 200 training rounds, a learning rate of 0.001, the Tanh function as the activation function, and max pooling as the pooling operation. After the grayscale image generated from a data packet is input, a 1 × n packet-level feature vector is extracted from the fully connected layer; since the fully connected layer FC2 has 64 neurons, the extracted packet-level feature vector has dimension 1 × 64. Here n = s; separate symbols are used to indicate that the two are features of a different nature. The feature vector of each data packet extracted by the CNN model is saved.

(c) Extracting the flow-level time sequence characteristics of the grouped data packets;

The packet feature vectors are grouped by flow and input into the recurrent neural network for training. As shown in FIG. 5, with a group size of 10 packets, the 10 packet feature vectors form a 1 × 320 input, and, with the required training parameters set, the BiGRU layers of the recurrent neural network process it to obtain the flow-level temporal features. The temporal features extracted by the BiGRU layers are converted into a one-dimensional vector by a Flatten function and input into the fully connected layer FC4, from which a 1 × m one-dimensional feature vector is extracted. Since FC4 has 64 neurons, the extracted flow-level temporal feature vector has dimension 1 × 64. The temporal feature vector is saved.

Referring to FIGS. 2 to 4, the invention performs feature fusion per data packet after feature simplification. Feature simplification means that the 1 × n spatial features, the 1 × m temporal features and the 1 × s interaction behavior features are each converted into a one-dimensional feature vector of lower dimension by adding a fully connected layer to the original model. In the invention, the 1 × 64 packet-level spatial features and the 1 × 64 flow-level temporal features are reduced to 1 × 32 through the fully connected layers FC3 and FC5 respectively, and the 1 × 64 uplink and downlink interaction behavior features are reduced to 1 × 26 through a fully connected layer FC3 with 26 neurons. Feature simplification makes subsequent processing easier and also allows the weights of the three types of features to be adjusted. Feature fusion per data packet means that, with each single packet as a sample, the simplified interaction behavior features, packet spatial features and flow temporal features of the flow are fused into a feature vector of dimension 1 × 90 (26 + 32 + 32), which is passed as input to the multi-classifier.
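A minimal sketch of per-packet fusion is given below, assuming fusion is plain concatenation and that the flow's single interaction vector and single temporal vector are shared by all packets of that flow. The function name `fuse_per_packet`, the concatenation order and the placeholder values are assumptions; the description fixes only the reduced dimensions.

```python
def fuse_per_packet(interaction_vec, packet_vecs, flow_vec):
    """Concatenate the reduced features into one 90-dimensional sample per packet."""
    if len(interaction_vec) != 26 or len(flow_vec) != 32:
        raise ValueError("expected a 26-dim interaction vector and a 32-dim flow vector")
    fused = []
    for packet_vec in packet_vecs:
        if len(packet_vec) != 32:
            raise ValueError("expected 32-dim packet spatial vectors")
        # Assumed order: interaction, packet spatial, flow temporal.
        fused.append(list(interaction_vec) + list(packet_vec) + list(flow_vec))
    return fused

interaction = [0.1] * 26                                  # placeholder values
packet_features = [[0.01 * i] * 32 for i in range(10)]    # 10 packets of one flow
flow_feature = [0.5] * 32
samples = fuse_per_packet(interaction, packet_features, flow_feature)
```

Each element of `samples` is one 1 × 90 input for the multi-classifier.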

Step 4: classifying the traffic by application with the multi-classifier.

The multi-classifier adopts a one-dimensional convolutional neural network with the structure first convolutional layer - first pooling layer - second convolutional layer - second pooling layer - first fully connected layer - second fully connected layer - Softmax layer. The input layer takes the fused feature of dimension 90 × 1. After the first convolutional layer the number of channels is 32 and the output dimension is 90 × 32; after the first pooling layer the dimension is reduced to 30 × 32; after the second convolutional layer the number of channels is 64 and the output dimension is 30 × 64; after the second pooling layer the dimension is reduced to 10 × 64. The two fully connected layers have 128 and c neurons respectively, where c is the desired number of categories. The training parameters are a mini-batch size of 50, a cross-entropy loss function, the stochastic gradient descent algorithm as the optimization method, a learning rate of 0.001, and 40 epochs. The fused features of each of the first N data packets of a flow are classified by application.

In the embodiment of the invention, when a new flow class appears, the parameters of the multiple classifiers are updated through an inheritance learning mechanism. The specific process is as follows:

(1) sample data preprocessing;

The new-class samples and a small number of old-class samples are divided into a training set, a validation set and a test set at a ratio of 9:1:1. The original classifier predicts the samples and outputs the normalized classification result vector V_a from its final fully connected layer; the new classifier predicts the samples and yields the normalized classification result vector V_b; and the true class label vector is denoted V_c.

(2) Using an inheritance loss function to learn the parameters of the original model and adapt to the new category at the same time;

Referring to FIG. 6, the inheritance loss function is defined as a weighted sum of the true loss function and the difference loss function, as shown in formula (4). The true loss function describes how well the new classifier's prediction V_b fits the true class V_c during training, which corresponds to learning new knowledge; the invention uses the cross-entropy loss function, as shown in formula (5). The difference loss function describes the degree of difference between the original classifier's normalized prediction vector V_a and the new classifier's normalized prediction vector V_b, which corresponds to retaining the previously learned weight information, so that the update of the classifier can be completed more quickly; the invention uses the KL divergence loss function, which describes the difference between two probability distributions, as shown in formula (6). The weight $\alpha$ of the difference loss function is given by the ratio of old to new classes, and $1-\alpha$ is the weight of the true loss function; in the invention they are 0.375 and 0.625 respectively. $p(x_i)$ and $q(x_i)$ are the two probability distributions of the prediction results for the random variable sample $x$.

$$L = \alpha L_{d} + (1-\alpha) L_{c} \qquad (4)$$

$$L_{c} = -\sum_{i} V_{c,i} \log V_{b,i} \qquad (5)$$

$$L_{d} = \sum_{i} p(x_i) \log \frac{p(x_i)}{q(x_i)} \qquad (6)$$
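The weighted loss described above can be sketched in plain Python as follows. Several points are assumptions rather than details confirmed by the description: that 0.375 weights the KL term and 0.625 the cross-entropy term, that the KL divergence is taken from the original classifier's output to the new classifier's output, and that both vectors are compared over the same class positions. The function names and the small `eps` stabilizer are also hypothetical.

```python
import math

def cross_entropy(v_c, v_b, eps=1e-12):
    """True loss: cross entropy between the one-hot label v_c and prediction v_b."""
    return -sum(c * math.log(b + eps) for c, b in zip(v_c, v_b))

def kl_divergence(p, q, eps=1e-12):
    """Difference loss: KL divergence between distributions p and q."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def inheritance_loss(v_a, v_b, v_c, diff_weight=0.375):
    """Weighted sum of the difference loss and the true loss (assumed weighting)."""
    return (diff_weight * kl_divergence(v_a, v_b)
            + (1.0 - diff_weight) * cross_entropy(v_c, v_b))

v_a = [0.5, 0.25, 0.25]   # original classifier output (illustrative)
v_b = [0.5, 0.25, 0.25]   # new classifier output (illustrative)
v_c = [1.0, 0.0, 0.0]     # true label
loss = inheritance_loss(v_a, v_b, v_c)
```

When the new classifier reproduces the original classifier's output exactly, the KL term vanishes and only the weighted cross-entropy term remains.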

(3) Defining retention coefficients to control the learning degree of the parameters of the original classifier;

The output of the original classifier not only gives the classification result of a predicted sample but also expresses how similar to or different from the other classes it is. A retention coefficient assigns different importance to learning the normalized classification result vector; it is set between 0 and 1 according to the required degree of learning. The retention coefficient is increased when the original classifier extracts sufficiently detailed features; otherwise a smaller retention coefficient is used.

(4) Using linear mapping to balance the classification preferences of different classes at the fully connected layer;

When predicting a sample, the parameters of the classifier's fully connected layer always fit the most recent categories best. To balance the degree of fitting between new and old categories, a linear mapping model is defined for the outputs of the new categories and applied to the classification result vector of the new classes, as shown in formula (7). The two parameters a and b of the linear mapping model are determined on the validation set with a cross-entropy loss function, and the parameters are stored as a weight file.

$$out' = a \cdot out + b \qquad (7)$$

where $out$ is the probability given by the classifier.
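A minimal sketch of applying such a linear mapping to the new-class outputs only is shown below. The function name, the index set and the values of `a` and `b` are illustrative assumptions; in the description the two parameters are fitted on the validation set, which is not reproduced here, and no renormalization step is assumed.

```python
def correct_new_class_outputs(outputs, new_class_indices, a, b):
    """Apply out' = a * out + b to new-class entries; leave old classes unchanged."""
    return [a * out + b if i in new_class_indices else out
            for i, out in enumerate(outputs)]

outputs = [0.2, 0.1, 0.7]     # classifier probabilities; class 2 is the new class
corrected = correct_new_class_outputs(outputs, {2}, a=0.8, b=-0.1)
```

With a below 1 and b negative, the new-class score is pulled down, which counteracts the classifier's preference for the most recently learned categories.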

Step 5: determining the application to which the traffic finally belongs by majority rule.

Determining the flow's class by majority rule means that the classification results of the first N data packets of the flow are obtained and then put to a vote: if most of the N packet classification results fall into a certain application, the flow is determined to be traffic of that application. As shown in FIG. 7, in the invention, if several categories receive an equal number of packets among the N data packets, their probability sums are compared, and the category with the larger probability sum is taken as the final category of the data flow.
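The voting rule with its probability-sum tie-break can be sketched as follows. The function name `majority_vote` and the (label, probability) input format are assumptions made for the example.

```python
from collections import Counter, defaultdict

def majority_vote(packet_predictions):
    """Pick the flow's class from (label, probability) pairs of its first N packets."""
    counts = Counter(label for label, _ in packet_predictions)
    prob_sums = defaultdict(float)
    for label, prob in packet_predictions:
        prob_sums[label] += prob
    top_count = max(counts.values())
    tied = [label for label, n in counts.items() if n == top_count]
    # With a single winner this returns it; on a tie the larger probability sum wins.
    return max(tied, key=lambda label: prob_sums[label])

clear_case = majority_vote([("web", 0.9), ("web", 0.8), ("chat", 0.6)])
tie_case = majority_vote([("web", 0.6), ("chat", 0.9), ("web", 0.5), ("chat", 0.7)])
```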

Based on the same technical concept as the method embodiment, the invention also provides an anonymous network traffic identification device based on traffic reconstruction and inheritance learning, which comprises:

a traffic reconstruction module, used for reconstructing the preliminarily screened traffic and converting it into grayscale feature maps, comprising: an original byte feature reconstruction unit, which takes L as the standard byte length, zero-pads data packets shorter than L bytes, truncates data packets longer than L bytes, and after normalization generates an i*i packet byte matrix, thereby converting the packet byte matrix into a grayscale image; and an uplink and downlink interactive behavior feature reconstruction unit, which constructs horizontal and vertical coordinates according to the size, direction and time interval of the data packets and takes the number of data packets in each time interval as the gray value of the pixel, forming a feature map that simulates uplink and downlink interactive behavior;

a feature extraction and fusion module, which, taking the data packet as the unit, inputs the corresponding uplink and downlink interactive behavior feature map into a convolutional neural network to extract an interactive information feature vector V_s, inputs the original byte feature map into the convolutional neural network to extract a packet space feature vector V_n, inputs a group of packet space feature vectors into a recurrent neural network to extract a stream time sequence feature vector V_m, and fuses the three feature vectors;
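The two reconstruction units described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names, the 32x32 default map size, the +1/-1 direction encoding, and the choice to leave pixel values as raw packet counts are all hypothetical. The layout of the interaction map (packet size on the abscissa, the ordinate split into an uplink half and a downlink half indexed by arrival time) follows the description of the coordinates given in the claims.

```python
import math
import numpy as np

def packet_to_gray(payload: bytes, L: int) -> np.ndarray:
    """Zero-pad or truncate a packet to L bytes and return an i*i
    matrix with values normalized to [0, 1] (L must equal i*i)."""
    i = math.isqrt(L)
    if i * i != L:
        raise ValueError("L must be a perfect square")
    buf = payload[:L].ljust(L, b"\x00")   # truncate, then zero-pad
    arr = np.frombuffer(buf, dtype=np.uint8).astype(np.float32) / 255.0
    return arr.reshape(i, i)

def _bins(values: np.ndarray, n: int) -> np.ndarray:
    # Map values linearly from [min, max] onto bin indices 0..n-1.
    span = values.max() - values.min()
    if span == 0:
        return np.zeros(len(values), dtype=int)
    idx = ((values - values.min()) / span * n).astype(int)
    return np.minimum(idx, n - 1)

def interaction_map(sizes, directions, times, width=32, height=32):
    """Count packets per (arrival-time bin, size bin); the top half holds
    uplink packets (direction > 0), the bottom half downlink packets."""
    sizes = np.asarray(sizes, dtype=float)
    times = np.asarray(times, dtype=float)
    directions = np.asarray(directions)
    half = height // 2
    img = np.zeros((2 * half, width), dtype=np.float32)
    if len(sizes) == 0:
        return img
    x = _bins(sizes, width)
    t = _bins(times, half)
    y = np.where(directions > 0, t, half + t)
    np.add.at(img, (y, x), 1)             # pixel value = packet count
    return img
```

Both functions return arrays that could be fed to a convolutional network as single-channel images; rescaling the counts in `interaction_map` to a fixed gray range would be a further design choice left open here.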

It should be understood that the anonymous network traffic identification apparatus provided in this embodiment may implement all technical solutions of the anonymous network traffic identification method, functions of each functional module of the anonymous network traffic identification apparatus may be implemented according to the method in the foregoing method embodiment, and a specific implementation process may refer to relevant descriptions in the foregoing embodiment, which is not described herein again.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims

1. An anonymous network traffic identification method based on traffic reconstruction and inheritance learning, characterized in that the method comprises the following steps:

Collect raw network traffic, and perform preliminary traffic screening to eliminate non-Tor traffic;

Reconstruct the preliminarily screened traffic and convert it into grayscale feature maps, including: original byte feature reconstruction: taking L as the standard byte length, zero-padding data packets shorter than L bytes, truncating data packets longer than L bytes, and, after normalization, generating an i*i packet byte matrix that is converted into a grayscale image; and uplink and downlink interactive behavior feature reconstruction: constructing horizontal and vertical coordinates according to the size, direction and time interval of the data packets, and using the number of data packets in each time interval as the gray value of the pixel, to form a feature map that simulates uplink and downlink interactive behavior;

Taking the data packet as the unit, input the corresponding uplink and downlink interactive behavior feature map into the convolutional neural network to extract the interactive information feature vector, input the original byte feature map into the convolutional neural network to extract the packet space feature vector, input the packet space feature vectors into the recurrent neural network to extract the stream time sequence feature vector, and fuse the three feature vectors;

Input the fusion feature into a multi-classifier for application classification, and the multi-classifier updates the classifier parameters through an inheritance learning mechanism when a new traffic category is detected;

The attribution application of the traffic is determined based on the majority rule.

2. The anonymous network traffic identification method based on traffic reconstruction and inheritance learning according to claim 1, characterized in that collecting the original network traffic and performing the preliminary traffic screening comprises:

Use network traffic collection tools to capture the original traffic, and divide the original traffic in the form of quintuple;

Use feature extraction tools to extract features from the divided network streams, perform deep discretization processing such as histograms on the features, and input them into the extreme gradient boosting decision tree. The values of the features are traversed and calculated in turn to find the feature points that minimize the objective function, thereby filtering out non-Tor traffic.

3. The anonymous network traffic identification method based on traffic reconstruction and inheritance learning according to claim 1, characterized in that constructing the abscissa and ordinate according to packet size, direction and time interval comprises: taking packet size as the abscissa, finding the maximum and minimum packet sizes in the flow sample as the start and end positions of the abscissa, and normalizing the sizes of all data packets onto the entire abscissa; and dividing the ordinate into two equal parts, which are the arrival times of uplink packets and of downlink packets respectively, wherein the depth of shade at the intersection of the horizontal and vertical coordinates represents the number of data packets.

4. The anonymous network traffic identification method based on traffic reconstruction and inheritance learning according to claim 1, characterized in that the structure of the convolutional neural network is: input layer - convolutional layer CONV1 - pooling layer POOL1 - convolutional layer CONV2 - pooling layer POOL2 - convolutional layer CONV3 - fully connected layer FC1 - fully connected layer FC2;

The interactive information feature vector is obtained as follows: the uplink and downlink interactive behavior feature map is input into the convolutional neural network, a spatial feature map is extracted by the operations of the first two convolutional layers and pooling layers, the feature map is converted into a one-dimensional vector by the Flatten function of convolutional layer CONV3 and input to the fully connected layers, and a one-dimensional feature vector V_s of size 1*s is extracted from fully connected layer FC2, where s is the number of neurons in fully connected layer FC2;

The packet space feature vector is obtained as follows: the grayscale image converted from the original bytes of the packet is input into the convolutional neural network model for training, a spatial feature map is extracted by the operations of the first two convolutional layers and pooling layers, the feature map is converted into a one-dimensional vector by the Flatten function of convolutional layer CONV3 and input to the fully connected layers, and a one-dimensional feature vector V_n of size 1*n is extracted from fully connected layer FC2, where n = s.

5. The anonymous network traffic identification method based on traffic reconstruction and inheritance learning according to claim 1, characterized in that the structure of the recurrent neural network model is: BiGRU layer BiGRU1 - BiGRU layer BiGRU2 - fully connected layer FC4;

The stream time sequence feature vector is obtained as follows: the grayscale images of the grouped data packets are input in batches into the recurrent neural network model for training, the BiGRU layers operate on them to obtain a time sequence feature map, the feature map is converted by the Flatten function into a one-dimensional vector that is input to the fully connected layer, and a one-dimensional feature vector of size 1*m is extracted from fully connected layer FC4, where m is the number of neurons in fully connected layer FC4.

6. The anonymous network traffic identification method based on traffic reconstruction and inheritance learning according to claim 1, characterized in that fusing the three feature vectors comprises: performing feature fusion in units of data packets, wherein the 1*s uplink and downlink interactive behavior feature vector, the 1*n spatial feature vector and the 1*m time sequence feature vector are each converted by a fully connected layer into a lower-dimensional one-dimensional feature vector, and the three lower-dimensional feature vectors are then fused to obtain the fused feature.

7. The anonymous network traffic identification method based on traffic reconstruction and inheritance learning according to claim 1, characterized in that the multi-classifier adopts a one-dimensional convolutional neural network comprising convolutional layer - pooling layer - Flatten layer - fully connected layer - Softmax layer, and performs application classification on the fusion feature of each of the first N data packets of the traffic;

When the multi-classifier detects a new traffic category, updating the classifier parameters through the inheritance learning mechanism includes: retaining part of the feature parameters learned during pre-training of the classifier while learning the new traffic category samples; using the inheritance loss function to calculate the difference in classifier parameters before and after learning, and combining it with the loss function of the new traffic category samples to update the classifier parameters; using the retention coefficient to determine the degree of parameter learning; and, at the final fully connected layer, using linear mapping to balance the classification preferences of different categories.

8. An anonymous network traffic identification device based on traffic reconstruction and inheritance learning, characterized in that, comprising:

The data collection and filtering module collects the original network traffic, and conducts a preliminary screening of the traffic to eliminate non-Tor traffic;

The traffic reconstruction module reconstructs the preliminarily screened traffic and converts it into grayscale feature maps, including: an original byte feature reconstruction unit, which takes L as the standard byte length, zero-pads data packets shorter than L bytes, truncates data packets longer than L bytes, and after normalization generates an i*i packet byte matrix that is converted into a grayscale image; and an uplink and downlink interactive behavior feature reconstruction unit, which constructs horizontal and vertical coordinates according to the size, direction and time interval of the data packets and uses the number of data packets in each time interval as the gray value of the pixel, to form a feature map that simulates uplink and downlink interactive behavior;

The feature extraction and fusion module takes the data packet as the unit, inputs the corresponding uplink and downlink interactive behavior feature map into the convolutional neural network to extract the interactive information feature vector, inputs the original byte feature map into the convolutional neural network to extract the packet space feature vector, inputs a group of packet space feature vectors into the recurrent neural network to extract the stream time sequence feature vector, and fuses the three feature vectors;

The application classification module inputs the fusion feature into a multi-classifier for application classification, and the multi-classifier updates the parameters of the classifier through an inheritance learning mechanism when a new type of traffic is detected;

The category determination module determines the attribution application of the traffic based on the majority principle.

9. A computer device, characterized in that it comprises:

one or more processors;

memory; and

one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the one or more processors, and the programs, when executed by the processors, implement the steps of the anonymous network traffic identification method based on traffic reconstruction and inheritance learning according to any one of claims 1-7.

10. A computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the steps of the anonymous network traffic identification method based on traffic reconstruction and inheritance learning according to any one of claims 1-7 are implemented.