CN114615093B - Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning - Google Patents

Info

Publication number
CN114615093B
Authority
CN
China
Prior art keywords
flow
traffic
learning
inheritance
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210506848.6A
Other languages
Chinese (zh)
Other versions
CN114615093A (en)
Inventor
肖滕龙
翟江涛
许成程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202210506848.6A priority Critical patent/CN114615093B/en
Publication of CN114615093A publication Critical patent/CN114615093A/en
Application granted granted Critical
Publication of CN114615093B publication Critical patent/CN114615093B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/20 Network architectures or network communication protocols for network security for managing network security; network security policies in general

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Computer Hardware Design (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning. The method comprises the following steps: collecting original network traffic, pre-screening it, and removing non-Tor traffic; reconstructing the pre-screened traffic and converting it into grayscale feature maps; processing the reconstructed feature maps with a convolutional neural network model and a recurrent neural network model to extract an interaction information feature vector, a packet spatial feature vector and a flow time-series feature vector, and fusing the three feature vectors; inputting the fused features into a multi-classifier for application classification, wherein the multi-classifier updates its parameters through an inheritance learning mechanism when a new traffic class is detected; and determining the application to which the traffic belongs based on the majority principle. The invention simplifies feature design, makes the features more comprehensive, supports online updating of model parameters, lets the model retain the memory of past training, and requires only small-scale training each time a new category is added.

Description

Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning
Technical Field
The invention relates to network traffic identification and network application classification, in particular to an anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning.
Background
With the continuous development of the Internet, the types of network traffic have become increasingly complex, and new types of applications keep emerging. Applications generate large amounts of network traffic, and different types of traffic exhibit different characteristics. The goal of traffic classification is to identify the class of traffic from its distinguishing characteristics, which is essential to network operators. From the perspective of quality of service, traffic classification is the first step in guaranteeing service quality and a prerequisite for providing differentiated services according to the requirements of different service types; from the perspective of security, it is the first step in detecting abnormal network traffic, so that network security can be better protected. In recent years, with users' growing demand for privacy protection and the continuous development of anonymization and encryption technology, more and more traffic is specially processed, which presents new challenges to network traffic classification.
Classification methods in the field of traffic identification have undergone several changes. Conventional traffic classification methods fall mainly into two categories. The first is the port-number-based method, which identifies traffic by the protocol associated with a port number; with the advent of port obfuscation techniques in anonymous networks, this method is becoming ineffective. The second is identification based on Deep Packet Inspection (DPI), which matches packet payloads against regular expressions of the different categories to determine the category; this method is no longer feasible now that traffic anonymization and encryption technology has matured. As the traditional methods lost effectiveness, researchers began to look for new traffic classification methods, and the rapidly progressing machine learning methods have received considerable attention. Compared with the traditional methods, machine learning is more intelligent and convenient, and by classifying on the statistical characteristics of traffic it can effectively avoid the influence of traffic encryption. Researchers have therefore proposed machine-learning-based traffic classification algorithms; widely used algorithms include support vector machines, decision trees, random forests and XGBoost. These methods achieve good classification accuracy and are widely accepted. However, machine-learning-based traffic classification requires expert experience to extract and select traffic features, which is time-consuming and labor-intensive and still leaves the features insufficiently comprehensive; the methods depend heavily on how representative the features are, and classification accuracy is low when the features are not representative.
Deep-learning-based models have become a research hotspot, and end-to-end models are favored by researchers. In actual deployment, however, the model must be retrained whenever a new traffic identification scenario is encountered, which consumes a large amount of time. This is the difficulty currently faced in classifying anonymous network traffic by application.
Disclosure of Invention
Object of the invention: the invention aims to provide an anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning, which at least partially solve the problems described in the background art.
Technical solution: an anonymous network traffic identification method based on traffic reconstruction and inheritance learning comprises the following steps:
collecting original network traffic, pre-screening it, and removing non-Tor traffic;
reconstructing the pre-screened traffic and converting it into grayscale feature maps, comprising: original byte feature reconstruction: taking the standard length as L bytes, zero-padding data packets shorter than L bytes, truncating data packets longer than L bytes, and normalizing to generate an i*i packet byte matrix, thereby converting the packet byte matrix into a grayscale image; and uplink and downlink interaction behavior feature reconstruction: constructing horizontal and vertical coordinates from the size, direction and time intervals of the data packets, and forming a feature map simulating uplink and downlink interaction behavior by taking the number of data packets in each time interval as the gray value of a pixel;
taking a data packet as the unit, inputting the corresponding uplink and downlink interaction behavior feature map into a convolutional neural network to extract an interaction information feature vector V_s, inputting the original byte feature map into the convolutional neural network to extract a packet spatial feature vector V_n, grouping the packet spatial feature vectors and inputting each group into a recurrent neural network to extract a flow time-series feature vector V_m, and fusing the three feature vectors;
inputting the fused features into a multi-classifier for application classification, wherein the multi-classifier updates its parameters through an inheritance learning mechanism when a new traffic class is detected;
determining the application to which the traffic belongs based on the majority principle.
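The final step can be illustrated with a minimal sketch, assuming that the majority principle means taking the most frequent per-packet prediction within a flow. The function name and the application labels below are hypothetical and are not taken from the patent.

```python
from collections import Counter


def majority_label(packet_predictions):
    """Return the label predicted most often among a flow's per-packet predictions."""
    if not packet_predictions:
        raise ValueError("need at least one per-packet prediction")
    # most_common(1) returns [(label, count)]; on a tie the label seen first wins.
    label, _count = Counter(packet_predictions).most_common(1)[0]
    return label


# Hypothetical per-packet predictions for the first packets of one flow.
flow_label = majority_label(["browsing", "chat", "browsing", "browsing", "video"])
print(flow_label)
```

Voting over several packets makes the flow-level decision tolerant of a few misclassified packets.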
The invention also provides an anonymous network traffic identification device based on traffic reconstruction and inheritance learning, comprising:
a data acquisition and filtering module, which collects original network traffic, pre-screens it, and removes non-Tor traffic;
a traffic reconstruction module, which reconstructs the pre-screened traffic and converts it into grayscale feature maps, comprising: an original byte feature reconstruction unit: taking the standard length as L bytes, zero-padding data packets shorter than L bytes, truncating data packets longer than L bytes, and normalizing to generate an i*i packet byte matrix, thereby converting the packet byte matrix into a grayscale image; and an uplink and downlink interaction behavior feature reconstruction unit: constructing horizontal and vertical coordinates from the size, direction and time intervals of the data packets, and forming a feature map simulating uplink and downlink interaction behavior by taking the number of data packets in each time interval as the gray value of a pixel;
a feature extraction and fusion module, which, taking a data packet as the unit, inputs the corresponding uplink and downlink interaction behavior feature map into the convolutional neural network to extract an interaction information feature vector V_s, inputs the original byte feature map into the convolutional neural network to extract a packet spatial feature vector V_n, inputs a group of packet spatial feature vectors into a recurrent neural network to extract a flow time-series feature vector V_m, and fuses the three feature vectors;
an application classification module, which inputs the fused features into a multi-classifier for application classification, wherein the multi-classifier updates its parameters through an inheritance learning mechanism when a new traffic class is detected;
and a class judgment module, which determines the application to which the traffic belongs based on the majority principle.
The present invention also provides a computer apparatus comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured for execution by the one or more processors, which when executed by the processors, implement the steps of the anonymous network traffic identification method based on traffic reconstruction and inheritance learning as described above.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, implements the steps of the anonymous network traffic identification method based on traffic reconstruction and inheritance learning as described above.
Beneficial effects: by reconstructing traffic feature maps, the invention extracts feature vectors of different dimensions containing interaction information, packet-level spatial information and flow-level time-series information, and performs application classification on them. This addresses the low classification accuracy that results from insufficiently representative features, simplifies feature design, makes the features more comprehensive, and supports online updating of model parameters. Meanwhile, the inheritance learning mechanism lets the classifier model retain the memory of past training, so that only small-scale training is needed each time a new category is added. The method can classify anonymous network traffic by application efficiently, accurately and at low cost.
Drawings
FIG. 1 is a general flow diagram of a Tor traffic identification method of the present invention;
FIG. 2 is a flowchart of an embodiment of a Tor traffic application identification method of the present invention;
FIG. 3 is a schematic diagram of interactive behavior traffic reconstruction in accordance with the present invention;
FIG. 4 is a schematic diagram of a convolutional neural network structure employed in the present invention;
FIG. 5 is a schematic diagram of a recurrent neural network architecture employed in the present invention;
FIG. 6 is a schematic diagram of an online updating method for inherited learning mechanism parameters in the present invention;
FIG. 7 is a schematic diagram of determining the traffic attribution category by the majority principle in the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings.
Referring to fig. 1 and fig. 2, the anonymous network traffic identification method based on traffic reconstruction and inheritance learning provided by the present invention includes the following steps:
Step 1: Collect original network traffic, pre-screen it, and remove non-Tor traffic.
According to the embodiment of the invention, a traffic detector is deployed in the network, accounts are created for various applications, and the Tor (The Onion Router) network is used while simulating the behavior of users of those applications, generating Tor traffic, that is, anonymous network traffic. The traffic is captured with Wireshark and stored in PCAP format; the original traffic is divided into bidirectional flows according to the five-tuple {SrcIP, SrcPort, DstIP, DstPort, Protocol} and then stored. In the five-tuple, SrcIP is the source IP address, SrcPort is the source port, DstIP is the destination IP address, DstPort is the destination port, and Protocol is the protocol type. Packets sharing the same five-tuple form a unidirectional flow, whereas in a bidirectional flow the source and destination IP addresses and ports may be interchanged together. For example, a flow containing only packets from A to B is unidirectional, and a flow containing packets from A to B and from B to A is bidirectional. The network mainly carries two types of protocol flows, TCP flows and UDP flows: a TCP flow uses the SYN flag to mark the start of transmission and the FIN flag to mark its end, while for a UDP flow the time interval between packets is used as the criterion. The invention uses the dpkt library to parse and split the PCAP files, retaining the information of all layers of each session flow.
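The bidirectional grouping described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes packets have already been parsed into plain tuples, it ignores the SYN/FIN and UDP time-interval boundaries, and the function names and sample addresses are hypothetical.

```python
from collections import defaultdict


def bidirectional_key(src_ip, src_port, dst_ip, dst_port, protocol):
    """Canonical five-tuple key that is identical for A->B and B->A packets."""
    # Sorting the two endpoints removes the direction from the key.
    low, high = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (low, high, protocol)


def group_bidirectional(packets):
    """Group (src_ip, src_port, dst_ip, dst_port, protocol, size) tuples into flows."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[bidirectional_key(*pkt[:5])].append(pkt)
    return dict(flows)


# Hypothetical parsed packets: two directions of one session plus another session.
packets = [
    ("10.0.0.1", 50000, "10.0.0.2", 9001, "TCP", 512),
    ("10.0.0.2", 9001, "10.0.0.1", 50000, "TCP", 1400),
    ("10.0.0.1", 50001, "10.0.0.3", 443, "TCP", 60),
]
flows = group_bidirectional(packets)
print(len(flows))
```

A unidirectional split would simply use the five-tuple itself as the key, without sorting the endpoints.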
Features are extracted from the divided network flows with the feature extraction tool CICFlowMeter, discretized by equal-depth histogram binning, and input into an extreme gradient boosting (XGBoost) decision tree. Using an objective function composed of a loss function and a regularization penalty term, the values of each feature are traversed in turn to find the split points that minimize the objective function, thereby filtering out non-Tor traffic and reducing the subsequent workload. The objective function is given by formula (1), which combines a loss function and a penalty function; formulas (1) to (3) appear as images in the original publication and are not reproduced in this text.
In the formulas, the per-sample loss inside the loss function describes the difference between the true value and the predicted value; the formulas also involve the decision tree model generated for sample i in the t-th round of fitting; g_i is the first derivative of the per-sample loss, and h_i is its second derivative. T is the number of leaves of the decision tree model. The remaining symbols denote the learning rate, the decision tree's prediction for the input sample, the constant parameter controlling the size of the penalty term, and the predicted value of the j-th leaf node of the decision tree. The objective function represents the error between the predicted value and the true value. After the training samples are input, the decision tree evaluates the different values of each feature of Tor and non-Tor traffic and measures, by computing the loss function, how much each feature value influences the judgment of a sample as Tor or non-Tor. When a sample is always judged as non-Tor at a certain value of a feature, that value is taken as a split point, so that non-Tor traffic can be filtered out.
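Because the formula images are not available in this text, the block below gives, for orientation only, the commonly published form of the XGBoost objective with its second-order expansion and regularization term. It is an assumption that the patent's formulas (1) to (3) follow this form; the symbols \gamma, \lambda, w_j and f_t are the usual notation from the XGBoost literature and are not confirmed by the source.

```latex
% Assumed standard XGBoost form, not recovered from the patent images.
\begin{align}
\mathrm{Obj}^{(t)} &= \sum_{i=1}^{n} l\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \\
\mathrm{Obj}^{(t)} &\approx \sum_{i=1}^{n} \left[\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \Omega(f_t) \\
\Omega(f_t) &= \gamma T + \tfrac{1}{2}\, \lambda \sum_{j=1}^{T} w_j^{2}
\end{align}
% g_i and h_i are the first and second derivatives of l with respect to
% \hat{y}_i^{(t-1)}; T is the number of leaves; w_j is the value of leaf j.
```

In this form, g_i, h_i and T play the roles the text assigns to them; how the patent's "learning rate" symbol enters its formulas cannot be recovered from the text.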
Step 2: Reconstruct the pre-screened traffic and convert it into grayscale feature maps.
The traffic reconstruction of the invention comprises two parts: the original byte feature map of the data packets and the uplink and downlink interaction behavior feature map.
For original byte feature reconstruction, the standard length is taken as L bytes; data packets shorter than L bytes are zero-padded, data packets longer than L bytes are truncated, and after standardization/normalization an i*i packet byte matrix is generated and converted into a grayscale image, where i is determined by the packet size distribution. For example, in the invention the packet sizes are distributed within 1400 bytes, so the standard packet size is taken as 1444 bytes and a 38 × 38 packet byte matrix is generated, giving a grayscale image. Each image is input individually into the convolutional neural network to obtain a spatial feature vector; the packet spatial vectors are then grouped by flow and passed through the recurrent neural network to obtain a time-series feature vector.
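The padding, truncation and reshaping can be sketched as below. The sketch assumes that normalization means dividing each byte by 255, which the source does not specify; the function name is hypothetical.

```python
def packet_to_gray_image(payload: bytes, side: int = 38):
    """Zero-pad or truncate a packet to side*side bytes and reshape to a square matrix."""
    length = side * side  # standard length L, 1444 bytes for a 38 x 38 image
    kept = payload[:length]
    fixed = kept + bytes(length - len(kept))  # bytes(n) yields n zero bytes
    # Assumed normalization: scale each byte value into [0, 1].
    pixels = [b / 255.0 for b in fixed]
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


image = packet_to_gray_image(b"\xff" * 10)
print(len(image), len(image[0]))
```

Short packets end in a block of zero-valued pixels, and packets longer than 1444 bytes lose their tail.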
For uplink and downlink interaction behavior feature reconstruction, the size, direction and arrival time interval of the data packets of a network flow form a three-dimensional feature map, and the number of data packets in each time interval is used as the gray value of a pixel, forming a grayscale map that simulates the uplink and downlink interaction information. As shown in fig. 3, the abscissa of the grayscale map is the packet size: the minimum and maximum packet sizes in the flow sample are found and used as the start and end positions of the abscissa, and all packet sizes are normalized onto the whole abscissa. The ordinate is divided equally into two parts, which are the arrival times of the uplink packets and the downlink packets respectively, and the gray depth of the pixel at each intersection of abscissa and ordinate represents the number of data packets. In the description of the invention, uplink and downlink refer to the two directions of transmission between two network nodes, and uplink and downlink interaction behavior information refers to paired data packets whose source and destination IP addresses are reversed. For example, when A and B transmit to each other, the direction of the first data packet generated is the uplink, and the opposite direction is the downlink.
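One possible reading of this construction is sketched below. The grid size, the encoding of direction as +1 (uplink) and -1 (downlink), and the binning of arrival times over the flow duration are all assumptions made for illustration; the source fixes only the axes and the use of packet counts as gray values.

```python
def scale_to_bin(value, low, high, bins):
    """Map value in [low, high] to an integer bin index in [0, bins - 1]."""
    if high == low:
        return 0
    return min(int((value - low) / (high - low) * bins), bins - 1)


def interaction_map(packets, width=32, height=32):
    """packets: list of (arrival_time, size, direction); direction > 0 means uplink."""
    if not packets:
        raise ValueError("a flow needs at least one packet")
    half = height // 2  # upper half: uplink rows, lower half: downlink rows
    times = [t for t, _, _ in packets]
    sizes = [s for _, s, _ in packets]
    t_min, t_max = min(times), max(times)
    s_min, s_max = min(sizes), max(sizes)
    grid = [[0] * width for _ in range(height)]
    for t, size, direction in packets:
        x = scale_to_bin(size, s_min, s_max, width)
        row = scale_to_bin(t, t_min, t_max, half)
        y = row if direction > 0 else half + row
        grid[y][x] += 1  # pixel value = number of packets in this cell
    return grid


# Hypothetical flow: two uplink packets and two identical downlink packets.
demo = interaction_map(
    [(0.0, 100, 1), (0.1, 1400, -1), (0.1, 1400, -1), (1.0, 100, 1)],
    width=4, height=4,
)
for line in demo:
    print(line)
```

Before being fed to a CNN the counts would typically be rescaled to a fixed gray range; that step is omitted here.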
Step 3: Construct the neural network models and extract features.
A Convolutional Neural Network (CNN) is a multi-layer supervised learning neural network whose convolutional and pooling layers are the core of feature extraction. The weight parameters of the network are adjusted backwards layer by layer with a gradient descent algorithm to minimize a loss function, and the accuracy of the network is improved through repeated iterative training. A convolutional neural network consists of alternating convolutional and pooling layers, followed by fully connected layers and a classifier such as a Softmax layer. The input of the first fully connected layer is the feature map obtained by feature extraction in the convolutional and pooling layers. A Recurrent Neural Network (RNN) is a neural network structure in which the current output of a sequence also depends on the previous outputs: the network memorizes earlier information and applies it to the computation of the current output.
As shown in fig. 4, the convolutional neural network model constructed by the invention has the structure input layer - convolutional layer (CONV1) - pooling layer (POOL1) - convolutional layer (CONV2) - pooling layer (POOL2) - convolutional layer (CONV3) - fully connected layer (FC1) - fully connected layer (FC2); FC3 in the figure is a feature simplification step, described below. The input is a 38 × 38 × 1 grayscale image. After convolution by CONV1 the number of channels is 32 and the dimension is 38 × 38 × 32; after 2 × 2 sampling by POOL1 the dimension is 19 × 19 × 32; after convolution by CONV2 the number of channels is 64; after 2 × 2 sampling by POOL2 the output dimension is 10 × 10 × 64. After convolution by CONV3, the feature map is flattened to one dimension by a Flatten function so that it can be input into the fully connected layers. Finally, the number of neurons n of fully connected layer FC2 is set to 64, that is, FC2 outputs a 1 × 64 feature vector.
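The layer dimensions can be checked with a small shape trace, reading each dimension as height × width × channels for a 38 × 38 single-channel input. The sketch assumes "same" padding in the convolutions and pooling that rounds odd sizes up, which is what makes 19 become 10. It stops at POOL2 because the channel count of CONV3 is not stated in the source.

```python
import math


def conv_same(shape, out_channels):
    """Convolution with 'same' padding: spatial size unchanged, channels replaced."""
    height, width, _ = shape
    return (height, width, out_channels)


def pool(shape, kernel=2):
    """Pooling with stride equal to the kernel size, rounding odd sizes up."""
    height, width, channels = shape
    return (math.ceil(height / kernel), math.ceil(width / kernel), channels)


def trace_cnn(input_shape=(38, 38, 1)):
    """Return (layer name, output shape) pairs from the input through POOL2."""
    trace = [("input", input_shape)]
    shape = conv_same(input_shape, 32)
    trace.append(("CONV1", shape))
    shape = pool(shape)
    trace.append(("POOL1", shape))
    shape = conv_same(shape, 64)
    trace.append(("CONV2", shape))
    shape = pool(shape)
    trace.append(("POOL2", shape))
    return trace


for name, shape in trace_cnn():
    print(name, shape)
```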
As shown in fig. 5, the recurrent neural network model constructed by the invention has the structure BiGRU layer (BiGRU1) - BiGRU layer (BiGRU2) - fully connected layer (FC4); FC5 in the figure is a feature simplification step, described below. The model processes the packet spatial features in groups. The packet spatial features obtained by passing several packet feature maps through the CNN model form a group, and the group of packet feature vectors is input into the BiGRU (bidirectional gated recurrent unit) layers to extract high-level time-series feature vectors. The number of neurons of fully connected layer FC4 is m, set to 64 in the invention, so the flow time-series feature vector based on the grouped data packets has dimension 1 × 64.
The specific process of extracting the features is as follows:
(a) Extracting the uplink and downlink interaction feature vector;
The uplink and downlink interaction behavior reconstruction map is input into the convolutional neural network. The first two convolutional layers and pooling layers extract spatial feature maps; after the third convolutional layer, a Flatten function converts the feature map into a one-dimensional vector so that it can be input into the fully connected layers; a 1 × s uplink and downlink interaction behavior feature vector is extracted from fully connected layer FC2. FC2 has 64 neurons, so a 1 × 64 one-dimensional feature vector is obtained. The interaction behavior feature vector is saved.
(b) Extracting packet-level spatial feature vectors;
The invention inputs data into the CNN one data packet at a time to extract packet-level spatial features, that is, the features are extracted from a single data packet. The data packet is truncated or zero-padded to k standard bytes; each byte is converted by one-hot encoding into an l-dimensional vector, so k bytes form an l*k grayscale image. In an embodiment of the invention, the grayscale image set is divided into a training set and a test set at a ratio of 9:1. The convolutional neural network is trained with a batch size of 64, the cross-entropy function as the loss function, stochastic gradient descent as the optimization method, 200 training rounds, a learning rate of 0.001, Tanh as the activation function, and max pooling for the pooling operation. After the grayscale image generated from a data packet is input, a 1 × n packet-level feature vector is extracted from the fully connected layer; fully connected layer FC2 has 64 neurons, so the extracted packet-level feature vector has dimension 1 × 64. Here n = s; n and s are distinguished to indicate that the two are features of different nature. The feature vector of each data packet extracted by the CNN model is saved.
(c) Extracting flow-level time-series features of the grouped data packets;
The packet feature vectors are grouped by flow and input into the recurrent neural network for training. As shown in fig. 5, with a group size of 10 data packets, the 10 packet feature vectors form a 1 × 320 input; with the required training parameters set, the BiGRU layers of the recurrent neural network produce the flow-level time-series features. The time-series features extracted by the BiGRU layers are converted into a one-dimensional vector by a Flatten function and input into fully connected layer FC4, from which a 1 × m one-dimensional feature vector is extracted. FC4 has 64 neurons, so the extracted flow-level time-series feature vector has dimension 1 × 64. The time-series feature vector is saved.
Referring to fig. 2 to 4, the invention performs feature fusion in units of data packets after feature simplification. Feature simplification means that the 1 × n spatial features, the 1 × m time-series features and the 1 × s interaction behavior features are each converted into a one-dimensional feature vector of lower dimension than the original by adding a fully connected layer to the original model. In the invention, the 1 × 64 packet-level spatial features and the 1 × 64 flow-level time-series features are simplified to 1 × 32 dimensions through fully connected layers FC3 and FC5 respectively, and the 1 × 64 uplink and downlink interaction behavior features are simplified to 1 × 26 dimensions through a fully connected layer FC3 with 26 neurons. Feature simplification facilitates subsequent processing and also allows the relative weights of the three types of features to be adjusted. Feature fusion in units of data packets means that, taking a single data packet as a sample, the simplified interaction behavior features, packet spatial features and flow time-series features of its flow are fused into a feature vector of dimension 1 × 90, which is passed as input to the multi-classifier.
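The dimension arithmetic of the fusion (32 + 32 + 26 = 90) can be illustrated with a short sketch. It assumes that fusion is plain concatenation, which matches the stated output size but is not spelled out in the source; the order of the three parts and the function name are hypothetical.

```python
def fuse_features(packet_spatial, flow_temporal, interaction):
    """Concatenate the three simplified feature vectors into one 90-value vector."""
    expected = (32, 32, 26)
    actual = (len(packet_spatial), len(flow_temporal), len(interaction))
    if actual != expected:
        raise ValueError(f"expected lengths {expected}, got {actual}")
    # Assumed order: packet spatial, then flow time-series, then interaction.
    return list(packet_spatial) + list(flow_temporal) + list(interaction)


fused = fuse_features([0.1] * 32, [0.2] * 32, [0.3] * 26)
print(len(fused))  # 90
```

Because every packet of a flow shares the same flow-level and interaction vectors, only the packet spatial part changes from one sample to the next within a flow.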
Step 4: classify the traffic by application using the multi-classifier.
The multi-classifier adopts a one-dimensional convolutional neural network with the structure: first convolutional layer, first pooling layer, second convolutional layer, second pooling layer, first fully connected layer, second fully connected layer and Softmax layer. The input layer takes the fused feature vector of dimension 90 × 1. After the first convolutional layer the number of channels is 32 and the output dimension is 90 × 32; the first pooling layer reduces the dimension to 30 × 32. After the second convolutional layer the number of channels is 64 and the output dimension is 30 × 64; the second pooling layer reduces the dimension to 10 × 64. The two fully connected layers have 128 and c neurons, respectively, where c is the desired number of categories. The training parameters are: mini-batch size 50, cross-entropy loss function, stochastic gradient descent as the optimization method, learning rate 0.001 and 40 epochs. The fused features of each of the first N data packets of the flow are classified by application.
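The layer dimensions quoted above can be checked with a short shape trace. This is a sketch under stated assumptions: the convolutions are taken to be stride-1 with "same" padding, the pooling window and stride are both taken to be 3 (inferred from 90 → 30 → 10), and the second fully connected layer is taken to have one neuron per class. The function names are hypothetical.

```python
def conv1d_same(shape, out_channels):
    """Stride-1 convolution with 'same' padding: length is kept, channels change."""
    length, _ = shape
    return (length, out_channels)


def pool1d(shape, size):
    """Non-overlapping pooling (window == stride == size)."""
    length, channels = shape
    return (length // size, channels)


def trace_shapes(num_classes, pool_size=3):
    """Return (layer name, output shape) pairs for the multi-classifier."""
    trace = [("input", (90, 1))]
    shape = conv1d_same((90, 1), 32)
    trace.append(("conv1", shape))
    shape = pool1d(shape, pool_size)
    trace.append(("pool1", shape))
    shape = conv1d_same(shape, 64)
    trace.append(("conv2", shape))
    shape = pool1d(shape, pool_size)
    trace.append(("pool2", shape))
    trace.append(("flatten", (shape[0] * shape[1],)))
    trace.append(("fc1", (128,)))
    trace.append(("fc2_softmax", (num_classes,)))
    return trace


# Training settings quoted in the text.
TRAINING = {"batch_size": 50, "loss": "cross_entropy", "optimizer": "sgd",
            "learning_rate": 0.001, "epochs": 40}

for name, shape in trace_shapes(num_classes=8):  # 8 classes is an arbitrary example
    print(name, shape)
```

Under these assumptions the flattened vector entering the first fully connected layer has 10 × 64 = 640 elements.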
In the embodiment of the invention, when a new traffic class appears, the parameters of the multi-classifier are updated through an inheritance learning mechanism. The specific process is as follows:
(1) sample data preprocessing;
The new-class samples and a small number of old-class samples are divided into a training set, a validation set and a test set in the ratio 9:1:1. The original classifier predicts the samples, and its final fully connected layer outputs the normalized classification result vector V_a; the new classifier predicts the samples to obtain the normalized classification result vector V_b; the true class label vector is denoted V_c.
(2) Using an inheritance loss function to learn the parameters of the original model and adapt to the new category at the same time;
Referring to FIG. 6, the inheritance loss function is defined as a weighted sum of the true loss function and the difference loss function, as shown in the equations below. The true loss function describes how well the new classifier's prediction V_b fits the true class V_c during training, which corresponds to learning new knowledge; the present invention uses the cross-entropy loss function, as shown in equation (5). The difference loss function describes the degree of difference between the normalized prediction vector V_a of the original classifier and the normalized prediction vector V_b of the new classifier, which corresponds to retaining the previously learned weight information, so that the update of the classifier can be completed more quickly. The present invention uses the KL divergence loss function to describe the difference between the two probability distributions. The ratio of old to new classes is used as the weight of the difference loss function, with a second coefficient as the weight of the true loss function; in the present invention the two weights are 0.375 and 0.625, respectively. p(x_i) and q(x_i) are the two probability distributions of the prediction result for the random-variable sample x.
[Equation (4): image not reproduced]
[Equation (5): image not reproduced]
[Equation (6): image not reproduced]
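Because the equation images are not available, the following is only a sketch of how such an inheritance loss could be computed, not the patent's exact formulas. It assumes that V_a has been zero-padded to the length of V_b for the new classes, that the KL divergence is taken from the original classifier's distribution to the new one, and that 0.375 weights the difference loss and 0.625 the true loss; all function names are hypothetical.

```python
import math


def cross_entropy(true_onehot, predicted, eps=1e-12):
    """True loss: how well the new prediction V_b fits the label vector V_c."""
    return -sum(t * math.log(p + eps) for t, p in zip(true_onehot, predicted))


def kl_divergence(p, q, eps=1e-12):
    """Difference loss: KL(p || q) between two discrete distributions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )


def inheritance_loss(v_a, v_b, v_c, diff_weight=0.375):
    """Weighted sum of the difference loss and the true loss.

    v_a: normalized output of the original classifier (assumed zero-padded)
    v_b: normalized output of the new classifier
    v_c: one-hot true label vector
    """
    true_loss = cross_entropy(v_c, v_b)
    diff_loss = kl_divergence(v_a, v_b)
    return diff_weight * diff_loss + (1.0 - diff_weight) * true_loss


# Two old classes plus one new class; the sample belongs to the new class.
v_a = [0.7, 0.3, 0.0]
v_b = [0.2, 0.1, 0.7]
v_c = [0.0, 0.0, 1.0]
print(inheritance_loss(v_a, v_b, v_c))
```

When V_a and V_b coincide, the difference term vanishes and only the weighted cross-entropy remains, which matches the reading that the difference loss penalizes drifting away from what the original classifier had learned.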
(3) Defining retention coefficients to control the learning degree of the parameters of the original classifier;
The output of the original classifier represents not only the classification result of a predicted sample but also the degree to which the sample resembles or differs from the other classes. A retention coefficient is used to assign different importance to learning the normalized classification result vector. The retention coefficient is set between 0 and 1 according to the required degree of learning: it is increased when the original classifier has extracted sufficiently detailed features; otherwise a smaller retention coefficient is used.
(4) Using linear mapping to balance the classification preferences of different classes at the fully connected layer;
When predicting samples, the parameters of the classifier's fully connected layer always fit the most recent class best. To balance the degree of fit between new and old classes, a linear mapping model is defined for the outputs of the new classes and is applied to the classification result vectors of the new classes. The two parameters a and b of the linear mapping model are determined using the validation set, with the cross-entropy loss function as the loss function, and the parameters are stored as a weight file.
[Equation (7): image not reproduced]
where out is the probability given by the classifier.
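The effect of such a linear mapping can be sketched as follows. The form a · out + b applied only to new-class outputs is an assumption based on the description of two parameters a and b; the equation image itself is not available. The function names and the example values of a and b are hypothetical, and in practice a and b would be fitted on the validation set rather than chosen by hand.

```python
def correct_new_class_outputs(outputs, new_class_indices, a, b):
    """Apply out -> a * out + b to new-class outputs; old-class outputs are unchanged."""
    return [
        a * out + b if index in new_class_indices else out
        for index, out in enumerate(outputs)
    ]


def predict(outputs):
    """Index of the largest output."""
    return max(range(len(outputs)), key=outputs.__getitem__)


# Classes 0 and 1 are old, class 2 is new and is over-preferred by the classifier.
outputs = [0.30, 0.25, 0.45]
corrected = correct_new_class_outputs(outputs, {2}, a=0.5, b=0.0)

print(predict(outputs))    # 2
print(predict(corrected))  # 0
```

With a below 1 the new-class score is scaled down, so a sample that was only marginally assigned to the new class falls back to an old class.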
Step 5: determine the application to which the traffic finally belongs based on the majority principle.
Determining the traffic class by the majority principle means that, after the classification results of the first N data packets of a flow are obtained, a vote is taken: if most of the N packets are classified as a certain application, the flow is determined to be traffic of that application. As shown in FIG. 7, if several categories receive the same number of data packets, their probability sums are compared, and the category with the larger probability sum is taken as the final category of the data flow.
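The voting rule with its tie-break can be sketched as below. The text does not say over which packets the probability sum is taken; this sketch assumes it is summed over all N packets, and the function name `decide_flow_class` is hypothetical.

```python
from collections import defaultdict


def decide_flow_class(packet_probabilities):
    """Majority vote over per-packet predictions, ties broken by probability sum.

    packet_probabilities: one probability vector per packet (first N packets).
    """
    votes = defaultdict(int)
    prob_sums = defaultdict(float)
    for probs in packet_probabilities:
        predicted = max(range(len(probs)), key=probs.__getitem__)
        votes[predicted] += 1
        for cls, p in enumerate(probs):
            prob_sums[cls] += p  # assumption: sum over all N packets

    top = max(votes.values())
    tied = [cls for cls, count in votes.items() if count == top]
    return max(tied, key=lambda cls: prob_sums[cls])


# Clear majority: two of three packets vote for class 1.
print(decide_flow_class([[0.1, 0.9], [0.2, 0.8], [0.7, 0.3]]))  # 1

# Tie (2 votes each): class 0 has the larger probability sum (2.15 vs 1.85).
print(decide_flow_class([[0.6, 0.4], [0.45, 0.55], [0.9, 0.1], [0.2, 0.8]]))  # 0
```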
Based on the same technical concept as the method embodiment, the invention also provides an anonymous network traffic identification device based on traffic reconstruction and inheritance learning, which comprises:
the data acquisition and filtering module is used for acquiring original network flow, primarily screening the flow and eliminating non-Tor flow;
the traffic reconstruction module, used for reconstructing the preliminarily screened traffic and converting it into grayscale feature maps, comprising: an original byte feature reconstruction unit: taking L as the standard byte length, zero-padding data packets shorter than L bytes and truncating data packets longer than L bytes, and generating an i*i matrix after normalization, thereby converting the packet byte matrix into a grayscale image; and an uplink and downlink interaction behavior feature reconstruction unit: constructing the horizontal and vertical coordinates according to the size, direction and time interval of the data packets, and taking the number of data packets in each time interval as the gray value of the pixel, to form a feature map that models the uplink and downlink interaction behavior;
the feature extraction and fusion module, used for, in units of data packets, inputting the corresponding uplink and downlink interaction behavior feature map into a convolutional neural network to extract an interaction information feature vector V_s, inputting the original byte feature map into the convolutional neural network to extract a packet spatial feature vector V_n, inputting a group of packet spatial feature vectors into a recurrent neural network to extract a flow timing feature vector V_m, and fusing the three feature vectors;
the application classification module is used for inputting the fusion characteristics into a multi-classifier for application classification, and the multi-classifier updates classifier parameters through an inheritance learning mechanism when detecting a new flow category;
and the class judgment module is used for determining the attribution application of the flow based on a majority principle.
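The two reconstruction units of the traffic reconstruction module can be illustrated with a minimal sketch. It assumes L = i × i, normalization by dividing byte values by 255, and a fixed number of time bins per direction; the abscissa and ordinate layout follows the description in claim 3. The function names, the `("up" | "down")` direction encoding and the bin parameters are hypothetical.

```python
def bytes_to_gray_image(packet, side):
    """Zero-pad or truncate a packet to L = side * side bytes, normalize to [0, 1],
    and reshape into a side x side matrix."""
    length = side * side
    padded = packet[:length] + bytes(max(0, length - len(packet)))
    values = [b / 255.0 for b in padded]
    return [values[r * side:(r + 1) * side] for r in range(side)]


def interaction_feature_map(packets, width, time_bins, interval):
    """Build a (2 * time_bins) x width grid of packet counts.

    packets: list of (size, direction, arrival_time), direction "up" or "down".
    Columns: packet size normalized between the minimum and maximum size.
    Rows: upper half for uplink arrival times, lower half for downlink.
    """
    sizes = [size for size, _, _ in packets]
    lo, hi = min(sizes), max(sizes)
    span = max(hi - lo, 1)  # avoid division by zero when all sizes are equal
    grid = [[0] * width for _ in range(2 * time_bins)]
    for size, direction, arrival in packets:
        col = min(int((size - lo) / span * (width - 1)), width - 1)
        row = min(int(arrival // interval), time_bins - 1)
        if direction == "down":
            row += time_bins
        grid[row][col] += 1  # pixel depth = number of packets
    return grid


image = bytes_to_gray_image(b"\xff\x00\x80", side=2)
grid = interaction_feature_map(
    [(100, "up", 0.0), (1500, "down", 0.15), (1500, "down", 0.12)],
    width=4, time_bins=2, interval=0.1,
)
print(image)
print(grid)
```

Each pixel of the second map counts the packets that share a size column and a time-interval row, so bursts of same-sized packets in one direction show up as darker cells.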
It should be understood that the anonymous network traffic identification apparatus provided in this embodiment may implement all technical solutions of the anonymous network traffic identification method, functions of each functional module of the anonymous network traffic identification apparatus may be implemented according to the method in the foregoing method embodiment, and a specific implementation process may refer to relevant descriptions in the foregoing embodiment, which is not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (10)

1. An anonymous network traffic identification method based on traffic reconstruction and inheritance learning is characterized by comprising the following steps:
collecting original network flow, carrying out flow primary screening, and removing non-Tor flow;
reconstructing the preliminarily screened traffic and converting it into grayscale feature maps, comprising: original byte feature reconstruction: taking L as the standard byte length, zero-padding data packets shorter than L bytes and truncating data packets longer than L bytes, and generating an i*i matrix after normalization, thereby converting the packet byte matrix into a grayscale image; and uplink and downlink interaction behavior feature reconstruction: constructing the horizontal and vertical coordinates according to the size, direction and time interval of the data packets, and taking the number of data packets in each time interval as the gray value of the pixel, to form a feature map that models the uplink and downlink interaction behavior;
inputting the corresponding uplink and downlink interaction behavior feature map into a convolutional neural network to extract an interaction information feature vector, inputting the original byte feature map into the convolutional neural network to extract a packet spatial feature vector, inputting a group of packet spatial feature vectors into a recurrent neural network to extract a flow timing feature vector, and fusing the three feature vectors;
inputting the fusion characteristics into a multi-classifier for application classification, wherein the multi-classifier updates classifier parameters through an inheritance learning mechanism when detecting a new flow class;
determining the application to which the traffic belongs based on the majority principle.
2. The anonymous network traffic identification method based on traffic reconstruction and inheritance learning according to claim 1, wherein the collecting of the original network traffic and the preliminary screening of the traffic comprise:
capturing original flow by using a network flow acquisition tool, and dividing the original flow according to a quintuple form;
performing feature extraction on the divided network flows using a feature extraction tool, applying equal-depth histogram discretization to the features, inputting them into an extreme gradient boosting (XGBoost) decision tree, and traversing the features in turn to compute the value of each feature under an objective function consisting of a loss function and a regularization penalty term, so as to find the feature points that minimize the objective function, thereby filtering out non-Tor traffic.
3. The anonymous network traffic identification method based on traffic reconstruction and inheritance learning according to claim 1, wherein constructing the horizontal and vertical coordinates according to the size, direction and time interval of the data packets comprises: taking the packet size as the abscissa, finding the maximum and minimum packet sizes in the traffic sample to serve as the end points of the abscissa, and normalizing all packet sizes onto the whole abscissa; dividing the ordinate equally into two parts, which represent the arrival times of uplink packets and of downlink packets, respectively; the depth of the pixel at each intersection of abscissa and ordinate represents the number of data packets.
4. The anonymous network traffic identification method based on traffic reconstruction and inheritance learning as claimed in claim 1, wherein the structure of the convolutional neural network is: input layer - convolutional layer CONV1 - pooling layer POOL1 - convolutional layer CONV2 - pooling layer POOL2 - convolutional layer CONV3 - fully connected layer FC1 - fully connected layer FC2;
the interaction information feature vector is obtained as follows: inputting the uplink and downlink interaction behavior feature map into the convolutional neural network, extracting a spatial feature map through the first two convolutional layers and pooling layers, converting the feature map into a one-dimensional vector through convolutional layer CONV3 and a Flatten function so that it can be input to the fully connected layers, and extracting a 1 × s one-dimensional feature vector V_s from the fully connected layer FC2, where s is the number of neurons of the fully connected layer FC2;
the packet spatial feature vector is obtained as follows: inputting the grayscale image obtained by processing the original bytes of the packets into the convolutional neural network model for training, extracting a spatial feature map through the first two convolutional layers and pooling layers, converting the feature map into a one-dimensional vector through convolutional layer CONV3 and a Flatten function so that it can be input to the fully connected layers, and extracting a 1 × n one-dimensional feature vector V_n from the fully connected layer FC2, where n = s.
5. The anonymous network traffic identification method based on traffic reconstruction and inheritance learning as claimed in claim 1, wherein the structure of the recurrent neural network model is: BiGRU layer BiGRU1 - BiGRU layer BiGRU2 - fully connected layer FC4;
the flow timing feature vector is obtained as follows: inputting the grayscale images of the grouped data packets into the recurrent neural network model in batches for training, computing a timing feature map through the BiGRU layers, converting it into a one-dimensional vector by a Flatten function, inputting it to the fully connected layer, and extracting a 1 × m one-dimensional feature vector from the fully connected layer FC4, where m is the number of neurons of the fully connected layer FC4.
6. The anonymous network traffic identification method based on traffic reconstruction and inheritance learning according to claim 1, wherein fusing the three feature vectors comprises: performing feature fusion in units of data packets, wherein the 1 × s uplink and downlink interaction behavior feature vector, the 1 × n packet spatial feature vector and the 1 × m timing feature vector are each converted into a one-dimensional feature vector of lower dimension by a fully connected layer, and the three lower-dimensional one-dimensional feature vectors are then fused to obtain the fused features.
7. The anonymous network traffic identification method based on traffic reconstruction and inheritance learning as claimed in claim 1, wherein the multi-classifier adopts a one-dimensional convolutional neural network comprising convolutional layer - pooling layer - Flatten layer - fully connected layer - Softmax layer, and performs application classification on the fused features of each of the first N data packets of the traffic;
the updating of the classifier parameters by the inheritance learning mechanism when the multi-classifier detects a new traffic class comprises the following steps: the method comprises the steps of reserving part of characteristic parameters learned during pre-training of a classifier, learning new flow class samples, calculating parameter differences before and after learning of the classifier by using an inheritance loss function, updating parameters of the classifier jointly by combining a loss function of the new flow class samples, determining parameter learning degree by using a reservation coefficient, and balancing classification preferences of different classes by using linear mapping at a last full-connection layer.
8. An anonymous network traffic identification device based on traffic reconstruction and inheritance learning, comprising:
the data acquisition and filtering module is used for acquiring original network flow, primarily screening the flow and eliminating non-Tor flow;
the traffic reconstruction module, which reconstructs the preliminarily screened traffic and converts it into grayscale feature maps, comprising: an original byte feature reconstruction unit: taking L as the standard byte length, zero-padding data packets shorter than L bytes and truncating data packets longer than L bytes, and generating an i*i matrix after normalization, thereby converting the packet byte matrix into a grayscale image; and an uplink and downlink interaction behavior feature reconstruction unit: constructing the horizontal and vertical coordinates according to the size, direction and time interval of the data packets, and taking the number of data packets in each time interval as the gray value of the pixel, to form a feature map that models the uplink and downlink interaction behavior;
the feature extraction and fusion module, used for, in units of data packets, inputting the corresponding uplink and downlink interaction behavior feature map into a convolutional neural network to extract an interaction information feature vector, inputting the original byte feature map into the convolutional neural network to extract a packet spatial feature vector, inputting a group of packet spatial feature vectors into a recurrent neural network to extract a flow timing feature vector, and fusing the three feature vectors;
the application classification module is used for inputting the fusion characteristics into a multi-classifier to perform application classification, and the multi-classifier updates classifier parameters through an inheritance learning mechanism when detecting a new flow class;
and the class judgment module is used for determining the attribution application of the flow based on a majority principle.
9. A computer device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured for execution by the one or more processors to perform the steps of the anonymous network traffic identification method based on traffic reconstruction and inheritance learning of any one of claims 1-7 when executed by the processors.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for anonymous network traffic identification based on traffic reconstruction and inheritance learning according to any one of claims 1 to 7.
CN202210506848.6A 2022-05-11 2022-05-11 Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning Active CN114615093B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210506848.6A CN114615093B (en) 2022-05-11 2022-05-11 Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210506848.6A CN114615093B (en) 2022-05-11 2022-05-11 Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning

Publications (2)

Publication Number Publication Date
CN114615093A CN114615093A (en) 2022-06-10
CN114615093B true CN114615093B (en) 2022-07-26

Family

ID=81870459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210506848.6A Active CN114615093B (en) 2022-05-11 2022-05-11 Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning

Country Status (1)

Country Link
CN (1) CN114615093B (en)

Also Published As

Publication number Publication date
CN114615093A (en) 2022-06-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant