CN114615093B - Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning - Google Patents
- Publication number
- CN114615093B (application number CN202210506848.6A)
- Authority
- CN
- China
- Prior art keywords
- traffic
- feature
- layer
- reconstruction
- inheritance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/16—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/20—Network architectures or network communication protocols for network security for managing network security; network security policies in general
Abstract
The invention discloses an anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning. The method comprises the following steps: collecting original network traffic, preliminarily screening it, and removing non-Tor traffic; reconstructing the screened traffic and converting it into grayscale feature maps; processing the reconstructed feature maps with a convolutional neural network model and a recurrent neural network model to extract an interactive information feature vector, a packet spatial feature vector and a flow timing feature vector, and fusing the three feature vectors; inputting the fused features into a multi-class classifier for application classification, wherein the classifier updates its parameters through an inheritance learning mechanism when a new traffic class is detected; and determining the application to which the traffic belongs by the majority principle. The invention simplifies feature design, makes the features more comprehensive, and supports online updating of model parameters: the model retains what it learned in past training, so only small-scale training is needed each time a new category is added.
Description
Technical Field
The invention relates to network traffic identification and network application classification, in particular to an anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning.
Background
With the continuous development of the Internet, network traffic has become increasingly complex and new types of applications keep emerging. Applications generate large amounts of network traffic, and different types of traffic exhibit different characteristics. The goal of traffic classification is to identify the class of traffic from its distinguishing characteristics, which is essential to network operators. From the perspective of quality of service, traffic classification is the first step in guaranteeing service quality and a prerequisite for providing differentiated services according to the requirements of different service types. From the perspective of security, traffic classification is the first step in detecting abnormal network traffic, so that the network can be better protected. In recent years, with users' growing demand for privacy protection and the continuous development of anonymization and encryption technology, more and more traffic is specially processed, which poses new challenges for network traffic classification.
Classification methods in the field of traffic identification have gone through several changes. Conventional traffic classification methods fall mainly into two categories. The first is port-based: traffic is identified by the protocol associated with its port number, but with the advent of port obfuscation in anonymous networks this method is becoming ineffective. The second is based on Deep Packet Inspection (DPI): packet payloads are matched against regular expressions defined for each category to determine the class. This method, however, becomes infeasible as traffic anonymization and encryption technology matures. As the traditional methods lost effectiveness, researchers began looking for new traffic classification methods, and the rapidly advancing field of machine learning has received considerable attention. Compared with traditional classification, machine learning classifies traffic by its statistical characteristics and can therefore largely avoid the influence of encryption. Researchers have accordingly proposed machine-learning-based traffic classification algorithms; widely used ones include support vector machines, decision trees, random forests and XGBoost. These methods achieve good classification accuracy and are widely accepted. However, they rely on expert experience to extract and screen traffic features, which is time-consuming and laborious; the resulting features are often not comprehensive enough, and because the methods depend heavily on how representative the features are, classification accuracy drops when the features are insufficiently representative.
Deep-learning-based models are currently a research hotspot, and end-to-end models are favored by researchers. In actual deployment, however, the model must be retrained whenever a new traffic identification scenario is encountered, which consumes a large amount of time. This is a difficulty currently faced in classifying anonymous network traffic by application.
Disclosure of Invention
Purpose of the invention: to provide an anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning that at least partially solve the problems described in the background.
Technical solution: an anonymous network traffic identification method based on traffic reconstruction and inheritance learning comprises the following steps:
collecting original network traffic, preliminarily screening it, and removing non-Tor traffic;
reconstructing the preliminarily screened traffic and converting it into grayscale feature maps, comprising: original byte feature reconstruction: taking L bytes as the standard length, zero-padding data packets shorter than L bytes, truncating data packets longer than L bytes, and normalizing to generate an i*i packet byte matrix, thereby converting the packet byte matrix into a grayscale image; and uplink and downlink interactive behavior feature reconstruction: constructing the horizontal and vertical coordinates from the size, direction and time interval of the data packets, and forming a feature map that models uplink and downlink interaction behavior by taking the number of data packets in each time interval as the gray value of a pixel;
taking a data packet as the unit, inputting the corresponding uplink and downlink interactive behavior feature map into a convolutional neural network to extract the interactive information feature vector V_s, inputting the original byte feature map into the convolutional neural network to extract the packet spatial feature vector V_n, grouping the packet spatial feature vectors and inputting each group into a recurrent neural network to extract the flow timing feature vector V_m, and fusing the three feature vectors;
inputting the fused features into a multi-class classifier for application classification, wherein the classifier updates its parameters through an inheritance learning mechanism when a new traffic category is detected;
determining the application to which the traffic belongs based on the majority principle.
The invention also provides an anonymous network traffic identification device based on traffic reconstruction and inheritance learning, which comprises the following components:
a data acquisition and filtering module, used for collecting original network traffic, preliminarily screening it, and removing non-Tor traffic;
a traffic reconstruction module, which reconstructs the preliminarily screened traffic and converts it into grayscale feature maps, and comprises: an original byte feature reconstruction unit: taking L bytes as the standard length, zero-padding data packets shorter than L bytes, truncating data packets longer than L bytes, and normalizing to generate an i*i packet byte matrix, thereby converting the packet byte matrix into a grayscale image; and an uplink and downlink interactive behavior feature reconstruction unit: constructing the horizontal and vertical coordinates from the size, direction and time interval of the data packets, and forming a feature map that models uplink and downlink interaction behavior by taking the number of data packets in each time interval as the gray value of a pixel;
a feature extraction and fusion module, which, taking a data packet as the unit, inputs the corresponding uplink and downlink interactive behavior feature map into the convolutional neural network to extract the interactive information feature vector V_s, inputs the original byte feature map into the convolutional neural network to extract the packet spatial feature vector V_n, inputs a group of packet spatial feature vectors into a recurrent neural network to extract the flow timing feature vector V_m, and fuses the three feature vectors;
an application classification module, used for inputting the fused features into a multi-class classifier for application classification, wherein the classifier updates its parameters through an inheritance learning mechanism when a new traffic category is detected;
and a class judgment module, used for determining the application to which the traffic belongs based on the majority principle.
The present invention also provides a computer apparatus comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured for execution by the one or more processors, which when executed by the processors, implement the steps of the anonymous network traffic identification method based on traffic reconstruction and inheritance learning as described above.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, implements the steps of the anonymous network traffic identification method based on traffic reconstruction and inheritance learning as described above.
Beneficial effects: by reconstructing traffic into feature maps, the method extracts feature vectors of different dimensions that carry interaction information, packet-level spatial information and flow-level timing information, and uses them for application classification. This addresses the problem of low classification accuracy when features are insufficiently representative, simplifies feature design, makes the features more comprehensive, and meets the requirement of online updating of model parameters. In addition, the inheritance learning mechanism lets the classifier model retain the memory of past training, so that only small-scale training is needed each time a new category is added. The method can classify anonymous network traffic by application efficiently, accurately and at low cost.
Drawings
FIG. 1 is a general flow diagram of a Tor traffic identification method of the present invention;
FIG. 2 is a flowchart of an embodiment of a Tor traffic application identification method of the present invention;
FIG. 3 is a schematic diagram of interactive behavior traffic reconstruction in accordance with the present invention;
FIG. 4 is a schematic diagram of a convolutional neural network structure employed in the present invention;
FIG. 5 is a schematic diagram of a recurrent neural network architecture employed in the present invention;
FIG. 6 is a schematic diagram of an online updating method for inherited learning mechanism parameters in the present invention;
FIG. 7 is a schematic diagram of determining the category to which a flow belongs by the majority principle in the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings.
Referring to fig. 1 and fig. 2, the anonymous network traffic identification method based on traffic reconstruction and inheritance learning provided by the present invention includes the following steps:
Step 1: collecting original network traffic, preliminarily screening it, and removing non-Tor traffic.
In the embodiment of the invention, a traffic detector is deployed in the network, accounts are created for various applications, and the Tor (The Onion Router) network is used to simulate users' behavior in those applications, generating Tor traffic, i.e. anonymous network traffic. The traffic is captured with Wireshark and stored in PCAP format; the raw traffic is then split into bidirectional flows by the five-tuple {SrcIP, SrcPort, DstIP, DstPort, Protocol} and saved. In the five-tuple, SrcIP is the source IP address, SrcPort is the source port, DstIP is the destination IP address, DstPort is the destination port, and Protocol is the protocol type. Packets sharing the same five-tuple form a unidirectional flow, whereas in a bidirectional flow the source and destination IP addresses and the source and destination ports may be swapped simultaneously. For example, a flow containing only packets from A to B is unidirectional, and a flow containing packets from A to B and from B to A is bidirectional. The network carries mainly two types of protocol flows, TCP and UDP: a TCP flow uses the SYN flag to mark the start of transmission and the FIN flag to mark its end, while a UDP flow is delimited by the time interval between data packets. The invention uses the dpkt library to parse and split the PCAP file and retains the information of all layers of each session flow.
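The bidirectional splitting described above can be illustrated with a minimal sketch. It assumes packets have already been parsed into simple records (the `Packet` type and its field names are hypothetical, not part of the patent or of dpkt) and groups them under a direction-independent five-tuple key, so that A to B and B to A land in the same flow. TCP SYN/FIN handling and UDP timeouts are omitted.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical parsed-packet record; field names are illustrative only.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    timestamp: float
    size: int


def flow_key(pkt):
    """Direction-independent five-tuple: both endpoints are sorted so that
    A->B and B->A map to the same key."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    lo, hi = sorted((a, b))
    return (lo, hi, pkt.protocol)


def split_bidirectional_flows(packets):
    """Group packets into bidirectional flows, ordered by arrival time."""
    flows = defaultdict(list)
    for pkt in sorted(packets, key=lambda p: p.timestamp):
        flows[flow_key(pkt)].append(pkt)
    return dict(flows)


packets = [
    Packet("10.0.0.1", 50000, "10.0.0.2", 443, "TCP", 0.00, 120),
    Packet("10.0.0.2", 443, "10.0.0.1", 50000, "TCP", 0.05, 1400),
    Packet("10.0.0.1", 50001, "10.0.0.3", 9001, "TCP", 0.10, 586),
]
flows = split_bidirectional_flows(packets)
print(len(flows))  # two flows: one bidirectional, one with a single packet
```

Sorting the two endpoints is only one way to make the key symmetric; a real splitter would also need to close TCP flows on FIN and expire UDP flows after an idle interval, as the text notes.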
Features are extracted from the split network flows with the feature extraction tool CICFlowMeter and discretized by equal-depth histogram binning. The discretized features are input into an extreme gradient boosting (XGBoost) decision tree, which traverses the values of each feature in turn and evaluates an objective function, composed of a loss function and a regularization penalty term, to find the split points that minimize the objective. Non-Tor traffic is thereby filtered out, which reduces the complexity of the subsequent work. The objective function $Obj^{(t)}$ is shown in formula (1), where $L$ is the loss function and $\Omega$ is the penalty function:

$$Obj^{(t)} \approx \sum_{i}\left[L\left(y_i,\hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i)\right] + \Omega(f_t), \qquad \Omega(f_t) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^{2} \tag{1}$$

In the formula, $L$ describes the difference between the true value and the predicted value; $f_t(x_i)$ is the decision tree model generated for sample $i$ in the $t$-th round of fitting; $g_i$ is the first derivative of $L$ and $h_i$ is its second derivative; $T$ is the number of leaves of the decision tree model; $\gamma$ is the coefficient applied to the number of leaves (referred to in this description as the learning rate); $\hat{y}_i$ is the decision tree's prediction for the input sample; $\lambda$ is a constant parameter controlling the size of the penalty term; and $w_j$ is the predicted value of the $j$-th leaf node of the decision tree. The objective function represents the error between the predicted and true values. After the training samples are input, the decision tree evaluates the different values of each feature for Tor and non-Tor traffic and measures how each value affects the judgment of a sample as Tor or non-Tor, i.e. it computes the loss function. When samples with a certain feature value are always judged to be non-Tor, that value is taken as a split point. Non-Tor traffic can thus be filtered out.
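As an illustration of the equal-depth discretization applied to the flow features before they reach the boosted tree, the sketch below picks cut points so that each bin receives roughly the same number of samples. The function names and the sample values are assumptions made for the example; the patent does not specify the number of bins.

```python
import bisect


def equal_depth_edges(values, n_bins):
    """Return n_bins - 1 cut points so that each bin holds about the same
    number of samples (equal depth, also called equal frequency)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // n_bins] for k in range(1, n_bins)]


def discretize(value, edges):
    """Index of the bin that value falls into, from 0 to len(edges)."""
    return bisect.bisect_right(edges, value)


# Hypothetical heavy-tailed feature, e.g. flow durations in milliseconds.
durations = [1, 2, 3, 4, 100, 200, 300, 4000]
edges = equal_depth_edges(durations, n_bins=4)
bins = [discretize(v, edges) for v in durations]
print(edges)  # [3, 100, 300]
print(bins)   # [0, 0, 1, 1, 2, 2, 3, 3]
```

Unlike equal-width bins, which would place almost every value of this heavy-tailed feature in the first bin, equal-depth bins keep two samples in each bin.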
Step 2: reconstructing the preliminarily screened traffic and converting it into grayscale feature maps.
The traffic reconstruction of the invention comprises two parts: the original byte feature map of the data packet and the uplink and downlink interactive behavior feature map.
For original byte feature reconstruction, a standard length of L bytes is taken: data packets shorter than L bytes are zero-padded, data packets longer than L bytes are truncated, and after standardization/normalization an i*i packet byte matrix is generated and converted into a grayscale image, where i is determined by the packet size distribution. For example, in the invention packet sizes are distributed within 1400 bytes, so the standard packet size is set to 1444 bytes and a 38 × 38 packet byte matrix is generated, yielding a grayscale image. Each image is input individually into the convolutional neural network to obtain a spatial feature vector; the resulting packet spatial vectors are then grouped by flow and passed through the recurrent neural network to obtain a timing feature vector.
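The padding, truncation and reshaping just described can be sketched in a few lines of plain Python. The function name is hypothetical, and scaling each byte to [0, 1] by dividing by 255 is an assumed normalization; the text only says the bytes are standardized/normalized.

```python
STANDARD_LEN = 1444  # L in the text; 1444 = 38 * 38
SIDE = 38            # i in the text


def packet_to_gray_image(raw: bytes, length: int = STANDARD_LEN, side: int = SIDE):
    """Zero-pad or truncate a packet to `length` bytes and reshape it into a
    side x side grayscale matrix with values scaled to [0, 1]."""
    if side * side != length:
        raise ValueError("length must equal side * side")
    fixed = raw[:length].ljust(length, b"\x00")  # truncate, then zero-pad
    pixels = [b / 255.0 for b in fixed]          # assumed normalization
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


short_img = packet_to_gray_image(b"\xff" * 10)   # shorter than L: padded
long_img = packet_to_gray_image(b"\x80" * 2000)  # longer than L: truncated
print(len(short_img), len(short_img[0]))         # 38 38
```

Both inputs yield a 38 × 38 matrix, so packets of any size can be fed to the same network input layer.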
For uplink and downlink interactive behavior feature reconstruction, the size, direction and arrival time interval of the flow's data packets form a three-dimensional feature map, and the number of data packets in each time interval is used as the gray value of a pixel, forming a grayscale map that models the uplink and downlink interaction information. As shown in FIG. 3, the abscissa of the grayscale map is the packet size: the minimum and maximum packet sizes in the flow sample are taken as the start and end of the abscissa, and all packet sizes are normalized onto this axis. The ordinate is divided into two equal parts, which represent the arrival times of uplink and downlink packets respectively, and the gray level of the pixel at each intersection of abscissa and ordinate represents the number of data packets. In this description, uplink and downlink refer to the two directions of transmission between two network nodes, and uplink and downlink interactive behavior information refers to paired data packets whose source and destination IP addresses are swapped. For example, when A and B transmit to each other, the direction of the first data packet generated is uplink and the opposite direction is downlink.
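A minimal sketch of this interaction map follows, under stated assumptions: packets are given as `(size, direction, arrival_time)` tuples with direction +1 for uplink and -1 for downlink, the upper half of the rows holds uplink packets and the lower half downlink packets, and the grid size is a free parameter. None of these encodings is specified by the patent.

```python
def interaction_map(packets, width=8, height=8):
    """Build a count grid: columns are normalized packet sizes, the upper
    half of the rows are uplink time bins, the lower half downlink time bins.
    Each cell holds the number of packets that fall into it."""
    half = height // 2
    sizes = [p[0] for p in packets]
    times = [p[2] for p in packets]
    s_min, s_max = min(sizes), max(sizes)
    t_min, t_max = min(times), max(times)

    def bucket(v, lo, hi, n):
        # Map v in [lo, hi] onto an integer bin in 0 .. n-1.
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * n), n - 1)

    grid = [[0] * width for _ in range(height)]
    for size, direction, t in packets:
        col = bucket(size, s_min, s_max, width)
        row = bucket(t, t_min, t_max, half)
        if direction < 0:  # downlink packets go to the lower half
            row += half
        grid[row][col] += 1
    return grid


# (size in bytes, direction, arrival time in seconds); illustrative values.
pkts = [(100, +1, 0.0), (1400, -1, 0.1), (1400, -1, 0.1), (100, +1, 1.0)]
grid = interaction_map(pkts, width=4, height=4)
for row in grid:
    print(row)
```

The two identical downlink packets accumulate in one cell, which is how the packet count becomes the gray value of a pixel.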
Step 3: constructing the neural network models and extracting features.
Convolutional neural networks (CNNs) are multi-layer supervised learning neural networks whose convolutional and pooling layers form the core of feature extraction. The weight parameters are adjusted backward layer by layer with a gradient descent algorithm to minimize the loss function, and the accuracy of the network is improved through repeated iterative training. A convolutional neural network consists of alternating convolutional and pooling layers, followed by fully connected layers and a logistic-regression-type classifier such as a Softmax layer. The input of the first fully connected layer is the feature map produced by the convolutional and pooling layers. A recurrent neural network (RNN) is a neural network structure in which the current output of a sequence also depends on previous outputs: the network memorizes earlier information and applies it when computing the current output.
As shown in FIG. 4, the convolutional neural network model constructed by the invention has the structure input layer - convolutional layer (CONV1) - pooling layer (POOL1) - convolutional layer (CONV2) - pooling layer (POOL2) - convolutional layer (CONV3) - fully connected layer (FC1) - fully connected layer (FC2); FC3 in the figure is a feature simplification step, which is described below. The input is a 38 × 38 × 1 grayscale image. After convolution by CONV1 the number of channels is 32 and the dimension is 38 × 38 × 32; after 2 × 2 down-sampling by POOL1 the dimension is 19 × 19 × 32; after convolution by CONV2 the number of channels is 64; after 2 × 2 down-sampling by POOL2 the output dimension is 10 × 10 × 64. After the CONV3 convolution, the feature map is flattened to one dimension by a Flatten function so that it can be input into the fully connected layers. Finally, the number of neurons of the fully connected layer FC2 is n, with n set to 64, i.e. FC2 outputs a 1 × 64 feature vector.
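The layer dimensions quoted above can be checked with a short shape trace. The sketch assumes square 38 × 38 inputs, convolutions that preserve height and width ("same" padding), and 2 × 2 pooling that rounds up, which is what makes 19 halve to 10; these padding choices are inferred from the stated sizes rather than given in the text. CONV3 is left out because its channel count is not stated.

```python
import math


def conv_same(shape, channels):
    """Convolution assumed to keep height and width and set the channel count."""
    h, w, _ = shape
    return (h, w, channels)


def pool_2x2(shape):
    """2 x 2 pooling with stride 2, rounding up on odd sizes (assumed)."""
    h, w, c = shape
    return (math.ceil(h / 2), math.ceil(w / 2), c)


layers = [
    ("CONV1", lambda s: conv_same(s, 32)),
    ("POOL1", pool_2x2),
    ("CONV2", lambda s: conv_same(s, 64)),
    ("POOL2", pool_2x2),
]

shape = (38, 38, 1)  # grayscale packet image
trace = [("input", shape)]
for name, op in layers:
    shape = op(shape)
    trace.append((name, shape))

for name, s in trace:
    print(name, s)
```

The trace reproduces the progression 38 × 38 × 32, 19 × 19 × 32 and 10 × 10 × 64 before the feature map is flattened for the fully connected layers.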
As shown in FIG. 5, the recurrent neural network model constructed by the invention has the structure BiGRU layer (BiGRU1) - BiGRU layer (BiGRU2) - fully connected layer (FC4); FC5 in the figure is a feature simplification step, which is described below. The model processes the packet spatial features, which are input in groups: the packet spatial features obtained by passing several data packet feature maps through the CNN model form one group, and the group of packet feature vectors is input into the BiGRU (bidirectional gated recurrent unit) layers to extract high-level timing feature vectors. The number of neurons of the fully connected layer FC4 is m, and m is set to 64 in the invention, i.e. the flow timing feature vector based on grouped data packets has dimension 1 × 64.
The specific process of extracting the features is as follows:
(a) extracting characteristic vectors of uplink and downlink interactive information;
The uplink and downlink interactive behavior reconstruction map is input into the convolutional neural network. The first two convolutional layers and pooling layers extract a spatial feature map, which after the third convolutional layer is converted into a one-dimensional vector by a Flatten function so that it can be input into the fully connected layers. A 1 × s uplink and downlink interactive behavior feature vector is extracted from the fully connected layer FC2; FC2 has 64 neurons, so a 1 × 64 one-dimensional feature vector is obtained. The interactive behavior feature vector is saved.
(b) Extracting packet-level spatial feature vectors;
The invention inputs data into the CNN one data packet at a time to extract packet-level spatial features, i.e. the packet-level spatial features of a single data packet are extracted. The data packet is truncated or zero-padded to k standard bytes; each byte is converted by one-hot encoding into an l-dimensional vector, so the k bytes form an l*k grayscale image. In an embodiment of the invention, the grayscale image set is split into a training set and a test set at a ratio of 9:1. The convolutional neural network is trained with a batch size of 64, the cross-entropy function as the loss function, stochastic gradient descent as the optimization method, 200 training rounds, a learning rate of 0.001, Tanh as the activation function, and max pooling as the pooling operation. After the grayscale map generated from the data packet is input, a 1 × n packet-level feature vector is extracted from the fully connected layer; the number of neurons in the fully connected layer FC2 is 64, so the extracted packet-level feature vector has dimension 1 × 64. Here n = s; n and s are distinguished to indicate that the two are features of different nature. The feature vector of each data packet extracted by the CNN model is saved.
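The one-hot variant of the packet image mentioned above can be sketched as follows. The value l = 256 is an assumption (one row per possible byte value); the text does not give l or k, and the function name is hypothetical.

```python
def one_hot_packet(raw: bytes, k: int, l: int = 256):
    """Truncate or zero-pad a packet to k bytes and one-hot encode each byte
    as a column of an l x k binary image (l = 256 is assumed here)."""
    fixed = raw[:k].ljust(k, b"\x00")
    image = [[0] * k for _ in range(l)]
    for col, byte in enumerate(fixed):
        image[byte][col] = 1  # row index is the byte value
    return image


img = one_hot_packet(b"\x01\xff", k=4)
print(len(img), len(img[0]))  # 256 4
```

Every column contains exactly one 1, so padding bytes appear as a mark in row 0 rather than as an empty column.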
(c) Extracting the flow-level time sequence characteristics of the grouped data packets;
The packet feature vectors are grouped by flow and input into the recurrent neural network for training. As shown in FIG. 5, 10 packets are taken per flow, so the 10 packet feature vectors form a 1 × 320 input. With the required training parameters set, the BiGRU layer of the recurrent neural network computes the flow-level temporal features. The temporal features extracted by the BiGRU layer are converted into a one-dimensional vector by a Flatten function and input into the fully connected layer FC4, from which a 1 × m one-dimensional feature vector is extracted. Since FC4 has 64 neurons, the extracted flow-level temporal feature vector has dimension 1 × 64. The temporal feature vector is saved.
Referring to FIGS. 2 to 4, the invention performs feature fusion in units of data packets after feature simplification. Feature simplification means converting the 1 × n spatial features, the 1 × m temporal features and the 1 × s interaction behavior features into one-dimensional feature vectors of lower dimension by adding a fully connected layer to the original model. In the invention, the 1 × 64 packet-level spatial features and the 1 × 64 flow-level temporal features are reduced to 1 × 32 dimensions by the fully connected layers FC3 and FC5, respectively, and the 1 × 64 uplink-downlink interaction behavior features are reduced to 1 × 26 dimensions by a fully connected layer FC3 with 26 neurons. Feature simplification makes it possible to adjust the relative weights of the three types of features while easing subsequent processing. Feature fusion in units of data packets means that, with a single data packet as the sample, the simplified interaction behavior features, packet spatial features and flow temporal features of the flow are fused (26 + 32 + 32) into a feature vector of dimension 1 × 90, which is passed as input to the multi-classifier.
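As a non-limiting illustration, the dimension bookkeeping of feature simplification and fusion can be sketched as follows. The helper names `dense` and `simplify_and_fuse`, the random weights, the use of Tanh in the reduction layers and the concatenation order are all assumptions of this sketch; only the dimensions 64, 32, 32, 26 and 90 come from the description.

```python
import numpy as np

def dense(x, out_dim, rng):
    """Hypothetical fully connected layer with random weights and Tanh."""
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.1
    b = np.zeros(out_dim)
    return np.tanh(x @ w + b)

def simplify_and_fuse(v_s, v_n, v_m, rng):
    """Reduce the three 1 x 64 feature vectors and concatenate them."""
    s_small = dense(v_s, 26, rng)   # interaction behavior features -> 26
    n_small = dense(v_n, 32, rng)   # packet-level spatial features -> 32
    m_small = dense(v_m, 32, rng)   # flow-level temporal features  -> 32
    return np.concatenate([s_small, n_small, m_small])

rng = np.random.default_rng(0)
# Placeholder feature vectors standing in for the CNN and BiGRU outputs.
v_s, v_n, v_m = (rng.standard_normal(64) for _ in range(3))
fused = simplify_and_fuse(v_s, v_n, v_m, rng)
print(fused.shape)  # (90,)
```

In a trained model the reduction layers would be learned jointly with the classifier; the random weights here only make the shapes concrete.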
Step 4: classify the traffic by application using a multi-classifier.
The multi-classifier is a one-dimensional convolutional neural network consisting of a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a first fully connected layer, a second fully connected layer and a Softmax layer. The input is the fused multi-dimensional feature of dimension 90 × 1. The first convolutional layer outputs 32 channels with dimension 90 × 32, which the first pooling layer reduces to 30 × 32; the second convolutional layer outputs 64 channels with dimension 30 × 64, which the second pooling layer reduces to 10 × 64. The two fully connected layers have 128 and c neurons, where c is the desired number of categories. Training uses a mini-batch size of 50, a cross-entropy loss function, stochastic gradient descent as the optimization method, a learning rate of 0.001 and 40 epochs. The fused features of each of the first N data packets of a flow are classified by application.
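As a non-limiting illustration, the layer dimensions of the multi-classifier can be traced with an untrained NumPy forward pass. The kernel size of 3, the pooling window of 3 (inferred from the reductions 90 to 30 and 30 to 10), the Tanh activations, the random weights, the assumption that the last fully connected layer has one neuron per class, and all function names are assumptions of this sketch.

```python
import numpy as np

def conv1d_same(x, out_channels, kernel_size, rng):
    """1-D convolution with zero 'same' padding; x is (length, in_channels)."""
    length, in_channels = x.shape
    w = rng.standard_normal((kernel_size, in_channels, out_channels)) * 0.1
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([
        np.tensordot(xp[i:i + kernel_size], w, axes=([0, 1], [0, 1]))
        for i in range(length)
    ])
    return np.tanh(out)

def max_pool1d(x, size):
    """Non-overlapping max pooling along the length axis."""
    length, channels = x.shape
    return x.reshape(length // size, size, channels).max(axis=1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(fused, num_classes, rng):
    """Untrained forward pass that mirrors the stated layer dimensions."""
    x = fused.reshape(90, 1)
    x = conv1d_same(x, 32, 3, rng)   # (90, 32)
    x = max_pool1d(x, 3)             # (30, 32)
    x = conv1d_same(x, 64, 3, rng)   # (30, 64)
    x = max_pool1d(x, 3)             # (10, 64)
    x = x.reshape(-1)                # 640 values
    x = np.tanh(x @ (rng.standard_normal((640, 128)) * 0.05))
    logits = x @ (rng.standard_normal((128, num_classes)) * 0.05)
    return softmax(logits)

rng = np.random.default_rng(0)
probs = classify(rng.standard_normal(90), num_classes=8, rng=rng)
print(probs.shape, round(float(probs.sum()), 6))
```

The value `num_classes=8` is a placeholder for c; the weights are random, so the output probabilities carry no meaning beyond their shape.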
In the embodiment of the invention, when a new traffic class appears, the parameters of the multi-classifier are updated through an inheritance learning mechanism. The specific process is as follows:
(1) Sample data preprocessing;
The new-class samples and a small number of old-class samples are divided into a training set, a validation set and a test set at a ratio of 9:1:1. The original classifier predicts the samples and outputs the normalized classification result vector V_a from its final fully connected layer; the new classifier produces the normalized classification result vector V_b for the predicted samples; and the true class label vector is denoted V_c.
(2) Using an inheritance loss function to learn the parameters of the original model and adapt to the new category at the same time;
Referring to FIG. 6, the inheritance loss function is defined as a weighted sum of the true loss function and the difference loss function, as shown in equation (5). The true loss function describes how well the new classifier's prediction V_b fits the true class V_c during training, which corresponds to learning new knowledge; the invention uses the cross-entropy loss function for it, as shown in formula (5). The difference loss function describes the degree of difference between the normalized prediction vector V_a of the original classifier and the normalized prediction vector V_b of the new classifier, which corresponds to retaining the previously learned weight information so that the classifier update completes more quickly. The invention uses the KL divergence loss function to describe the difference between the two probability distributions. The ratio of old to new classes serves as the weight of the difference loss function, and the remainder serves as the weight of the true loss function; in the invention these are 0.375 and 0.625, respectively. p(x_i) and q(x_i) are the two probability distributions of the prediction results for the random-variable sample x.
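As a non-limiting illustration, the weighted combination of cross-entropy and KL divergence can be sketched as follows. The function names, the parameter name `diff_weight`, the direction of the KL divergence (from V_a to V_b) and the assumption that V_a and V_b are expressed over the same set of classes are all assumptions of this sketch; the exact form of the patent's equations is not reproduced here.

```python
import numpy as np

def cross_entropy(v_c, v_b, eps=1e-12):
    """True loss: cross-entropy between labels v_c and prediction v_b."""
    return float(-np.sum(v_c * np.log(v_b + eps)))

def kl_divergence(p, q, eps=1e-12):
    """Difference loss: KL divergence between distributions p and q."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def inheritance_loss(v_a, v_b, v_c, diff_weight=0.375):
    """Weighted sum of the difference loss and the true loss.

    diff_weight=0.375 weights the difference loss and 1 - diff_weight
    (0.625) weights the true loss, following the values stated above.
    """
    true_loss = cross_entropy(v_c, v_b)
    diff_loss = kl_divergence(v_a, v_b)
    return diff_weight * diff_loss + (1.0 - diff_weight) * true_loss

# Hypothetical three-class example.
v_a = np.array([0.6, 0.3, 0.1])   # original classifier output
v_b = np.array([0.7, 0.2, 0.1])   # new classifier output
v_c = np.array([1.0, 0.0, 0.0])   # one-hot true label
print(inheritance_loss(v_a, v_b, v_c))
```

When the new classifier reproduces the original classifier exactly, the difference term vanishes and only the weighted cross-entropy remains.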
(3) Defining retention coefficients to control the learning degree of the parameters of the original classifier;
The output of the original classifier represents not only the classification result for the predicted sample but also the degree to which the sample resembles or differs from the other classes. A retention coefficient assigns different importance to learning the normalized classification result vector. It is set between 0 and 1 according to the required degree of learning: the retention coefficient is increased when the original classifier has extracted sufficiently detailed features, and a smaller retention coefficient is used otherwise.
(4) Using linear mapping to balance the classification preferences of different classes at the fully connected layer;
When predicting a sample, the parameters of the classifier's fully connected layer always fit the most recent categories best. To balance the degree of fit between new and old categories, a linear mapping model is defined for the outputs of the new categories and applied to their classification result vectors. The two parameters a and b of the linear mapping model are determined using the validation set with a cross-entropy loss function, and the parameters are saved as a weight file. In the mapping, out denotes the probability given by the classifier.
Step 5: determine the final application to which the traffic belongs based on the majority principle.
Determining the flow classification by the majority principle means voting once the classification results of the first N data packets of the flow have been obtained: if most of the N packets are classified into a certain application, the flow is determined to be traffic of that application. As shown in FIG. 7, if several categories receive an equal number of the N data packets, their probability sums are compared and the category with the larger probability sum is taken as the final category of the data flow.
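As a non-limiting illustration, the majority vote with a probability-sum tie-break can be sketched as follows. The function name `vote` is hypothetical, and summing each tied class's probability over all N packets is one possible reading of "probability sum" that this sketch assumes.

```python
import numpy as np

def vote(packet_probs):
    """Return the flow's class from per-packet class probabilities (N x c)."""
    probs = np.asarray(packet_probs, dtype=float)
    labels = probs.argmax(axis=1)                       # per-packet decision
    counts = np.bincount(labels, minlength=probs.shape[1])
    tied = np.flatnonzero(counts == counts.max())       # classes with most votes
    if len(tied) == 1:
        return int(tied[0])
    prob_sums = probs[:, tied].sum(axis=0)              # tie-break on probability sum
    return int(tied[prob_sums.argmax()])

# Hypothetical flow of four packets over three classes: classes 0 and 1 tie 2-2.
flow = [[0.6, 0.3, 0.1],
        [0.7, 0.2, 0.1],
        [0.2, 0.5, 0.3],
        [0.1, 0.8, 0.1]]
print(vote(flow))
```

In this example the probability sums are 1.6 for class 0 and 1.8 for class 1, so the tie is resolved in favor of class 1.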
Based on the same technical concept as the method embodiment, the invention also provides an anonymous network traffic identification device based on traffic reconstruction and inheritance learning, comprising:
the data acquisition and filtering module is used for acquiring original network flow, primarily screening the flow and eliminating non-Tor flow;
the traffic reconstruction module is used for reconstructing the preliminarily screened traffic and converting it into grayscale feature maps, and comprises: an original byte feature reconstruction unit, which takes L as the standard number of bytes, zero-pads data packets shorter than L bytes, truncates data packets longer than L bytes, and normalizes the result to generate an i × i matrix, thereby converting the packet byte matrix into a grayscale image; and an uplink-downlink interaction behavior feature reconstruction unit, which constructs horizontal and vertical coordinates from the size, direction and time interval of the data packets and takes the number of data packets in each time interval as the gray value of the pixel, forming a feature map that models uplink-downlink interaction behavior;
a feature extraction and fusion module, which, in units of data packets, inputs the corresponding uplink-downlink interaction behavior feature map into a convolutional neural network to extract the interaction information feature vector V_s, inputs the original byte feature map into the convolutional neural network to extract the packet spatial feature vector V_n, inputs a group of packet spatial feature vectors into a recurrent neural network to extract the flow temporal feature vector V_m, and fuses the three feature vectors;
the application classification module is used for inputting the fusion characteristics into a multi-classifier for application classification, and the multi-classifier updates classifier parameters through an inheritance learning mechanism when detecting a new flow category;
and the class judgment module is used for determining the attribution application of the flow based on a majority principle.
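As a non-limiting illustration, the original byte feature reconstruction unit listed above can be sketched as follows. The function name `bytes_to_gray_image`, the side length i = 28, the relation L = i × i and normalization by dividing by 255 are assumptions of this sketch.

```python
import numpy as np

def bytes_to_gray_image(payload: bytes, side: int = 28) -> np.ndarray:
    """Pad or truncate a packet to L = side * side bytes and reshape it.

    Returns a side x side float array with values normalized to [0, 1].
    """
    length = side * side                                # standard byte count L (assumed i * i)
    fixed = payload[:length].ljust(length, b"\x00")     # truncate, then zero-pad
    pixels = np.frombuffer(fixed, dtype=np.uint8).astype(np.float32) / 255.0
    return pixels.reshape(side, side)

# Hypothetical payloads: one shorter and one longer than L bytes.
short_img = bytes_to_gray_image(b"\xff\x00\x80")
long_img = bytes_to_gray_image(bytes([7]) * 2000)
print(short_img.shape, long_img.shape)
```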
It should be understood that the anonymous network traffic identification apparatus provided in this embodiment may implement all technical solutions of the anonymous network traffic identification method, functions of each functional module of the anonymous network traffic identification apparatus may be implemented according to the method in the foregoing method embodiment, and a specific implementation process may refer to relevant descriptions in the foregoing embodiment, which is not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210506848.6A CN114615093B (en) | 2022-05-11 | 2022-05-11 | Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114615093A (en) | 2022-06-10 |
CN114615093B (en) | 2022-07-26 |
Family
ID=81870459
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210506848.6A Active CN114615093B (en) | 2022-05-11 | 2022-05-11 | Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114615093B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108200006A (en) * | 2017-11-21 | 2018-06-22 | 中国科学院声学研究所 | A kind of net flow assorted method and device based on the study of stratification space-time characteristic |
CN112367334A (en) * | 2020-11-23 | 2021-02-12 | 中国科学院信息工程研究所 | Network traffic identification method and device, electronic equipment and storage medium |
CN112910853A (en) * | 2021-01-18 | 2021-06-04 | 南京信息工程大学 | Encryption flow classification method based on mixed characteristics |
CN113037730A (en) * | 2021-02-27 | 2021-06-25 | 中国人民解放军战略支援部队信息工程大学 | Network encryption traffic classification method and system based on multi-feature learning |
CN113162908A (en) * | 2021-03-04 | 2021-07-23 | 中国科学院信息工程研究所 | Encrypted flow detection method and system based on deep learning |
CN114301636A (en) * | 2021-12-10 | 2022-04-08 | 南京理工大学 | VPN communication behavior analysis method based on multi-scale spatiotemporal feature fusion of traffic |
Non-Patent Citations (1)
Title |
---|
VoIP Traffic Detection in Tunneled and Anonymous Networks Using Deep Learning; Faiz Ul Islam et al.; IEEE Access; 2021-04-19; full text * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||