CN114500396A - MFD chromatographic characteristic extraction method and system for distinguishing anonymous Tor application flow - Google Patents

MFD chromatographic characteristic extraction method and system for distinguishing anonymous Tor application flow

Info

Publication number
CN114500396A
Authority
CN
China
Prior art keywords
mfd
size
flow
tor
anonymous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210120936.2A
Other languages
Chinese (zh)
Other versions
CN114500396B (en)
Inventor
王良民
何刘坤
傅涛
冯霞
周强
言洪萍
徐伊凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202210120936.2A priority Critical patent/CN114500396B/en
Publication of CN114500396A publication Critical patent/CN114500396A/en
Application granted granted Critical
Publication of CN114500396B publication Critical patent/CN114500396B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00: Traffic control in data switching networks
    • H04L47/10: Flow control; Congestion control
    • H04L47/24: Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441: Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Signal Processing (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an MFD chromatographic characteristic extraction method and system for distinguishing anonymous Tor application flow. The MFD features capture the packet-size distribution of anonymous traffic, the frequency distribution of packets of different sizes, the sending direction of those packets, and other traffic characteristics. The MFD chromatographic feature is a visualization of these features, obtained by mapping them through a grid diagram into the RGB color space.

Description

MFD chromatographic characteristic extraction method and system for distinguishing anonymous Tor application flow
Technical Field
The invention relates to a network security technology, in particular to an MFD chromatographic characteristic extraction method and system for distinguishing anonymous Tor application flow.
Background
Network traffic consists of data packets belonging to different applications. With the widespread use of mobile devices such as notebook computers and smartphones, users can install applications from app stores or the Internet at any time and place, so the types of network traffic are becoming increasingly diverse. Network security administrators and operators therefore rely on traffic classification to identify the traffic generated by user devices, so that traffic of applications popular in the current network can be given higher priority for a better user experience, while unvetted application traffic is prevented from passing through the network.
Traffic classification means classifying and identifying network traffic, where the result is an application, a service type, or the like; it has strong potential in fields such as network security management, network optimization and traffic engineering. However, with the growing use of encryption and other evasion techniques, traditional port-based and payload-based techniques are becoming less effective. In recent years, researchers have proposed a series of techniques and schemes for classifying encrypted network traffic. Most of them extract raw features or semantically rich statistical features from preprocessed encrypted traffic and combine them with a machine learning method to train an efficient classifier, and they exhibit better performance.
However, as users pay increasing attention to privacy, more and more of them use anonymous networks such as Tor and I2P to protect it. Tor is the most typical anonymous communication system based on onion routing: traffic is forwarded through multiple layers of proxies, which hides the identities of senders and recipients. The Tor network also uses multiplexing, so traffic originally bound for different destinations is carried over the same links. As a result, the traffic patterns generated by anonymous Tor applications are more complex and overlap with one another, and models trained with the features extracted by existing schemes as classifier input cannot effectively distinguish the Tor traffic of anonymous applications.
There is therefore a need for a more efficient traffic feature extraction method that distinguishes anonymous Tor applications.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to solve the defects in the prior art and provides an MFD chromatographic feature extraction method and system for distinguishing anonymous Tor application flow.
The technical scheme is as follows: the invention discloses an MFD chromatographic characteristic extraction method for distinguishing anonymous Tor application flow, which comprises the following steps:
Step (1): collect the network traffic generated by the target applications in the Tor network to form a target traffic set; divide the target traffic set into a plurality of streams using a sliding time window with a fixed step length; remove inactive streams and noise using a dual-threshold preprocessing strategy; and then extract the packet size sequence S and the packet direction sequence D of each stream I. The sliding time window T is typically set to 10-60 s; to ensure a sufficient number of training samples, the shorter the originally collected traffic, the smaller T should be set, and vice versa.
Step (2): group the packet size sequence and the packet direction sequence obtained in step (1) by packet size (that is, the directions of packets with the same size are placed in the same group), and, from the window length of the sliding time window and the total number of packets sent in the window, calculate the number-proportion distribution M_size, the frequency distribution F_size and the direction distribution D_size of packets of different sizes;
Step (3): fuse the three distributions obtained in step (2) into MFD features, cluster the MFD features using a spectral clustering algorithm, and select the optimal number of clusters according to how MFD features of different types and of the same type are distributed across the clusters; then, for each cluster, randomly delete MFD features of other types so that the most common type in the cluster accounts for at least a certain proportion of the cluster (this can be set to 90% to ensure accuracy);
Step (4): map the MFD features processed in step (3) into the RGB color space using a grid-diagram-based visualization method, then perform image compression and format conversion and store the result as the MFD chromatographic feature, which visually displays the differentiated patterns of anonymous-application Tor traffic and is used for subsequent classification.
Further, the specific process of the step (1) is as follows:
first, remove from the target traffic NT all packets that use a non-TCP protocol and all packets whose size is not within the interval [1, 1500];
then set a splitting threshold δ and a denoising threshold τ, where δ specifies the minimum number of packets a stream must contain and τ specifies the maximum proportion of background-traffic packets a stream may contain, which gives the method a degree of robustness to background traffic. To increase the number of samples in the data set and prevent overfitting, split NT into a plurality of streams of T seconds each using a sliding time window T with a fixed step length. For each stream, check whether its number of packets exceeds δ and delete the stream if it does not; otherwise check whether its proportion of background-traffic packets exceeds τ and delete the stream if it does;
finally, for each stream $I = [p_1, p_2, \ldots, p_i, \ldots, p_{N-1}, p_N]$ of length N, extract the packet size sequence $S = [s_1, s_2, \ldots, s_i, \ldots, s_{N-1}, s_N]$ and the packet direction sequence $D = [d_1, d_2, \ldots, d_{N-1}, d_N]$, where $p_i$ denotes the i-th packet in stream I and $i \in [1, N]$.
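As an illustration only, the preprocessing of step (1) could be sketched in Python as follows. The `Packet` record, its `is_background` flag, the 1/0 direction encoding and the function name `split_and_filter` are assumptions introduced for this sketch; the disclosure does not specify how background packets are recognized or how directions are encoded.

```python
from typing import List, NamedTuple, Tuple


class Packet(NamedTuple):
    ts: float                    # capture timestamp in seconds
    size: int                    # packet size in bytes
    direction: int               # assumed encoding: 1 = sent, 0 = received
    is_tcp: bool = True
    is_background: bool = False  # hypothetical marker for background traffic


def split_and_filter(packets: List[Packet], window: float, step: float,
                     delta: int, tau: float) -> List[Tuple[List[int], List[int]]]:
    """Return one (S, D) pair per stream that survives both thresholds."""
    # Keep only TCP packets whose size lies in [1, 1500].
    kept = sorted((p for p in packets if p.is_tcp and 1 <= p.size <= 1500),
                  key=lambda p: p.ts)
    if not kept:
        return []
    flows = []
    start, end = kept[0].ts, kept[-1].ts
    while start <= end:
        flow = [p for p in kept if start <= p.ts < start + window]
        n = len(flow)
        if n > delta:  # first threshold: enough packets in the stream
            bg_ratio = sum(p.is_background for p in flow) / n
            if bg_ratio <= tau:  # second threshold: not too much background
                flows.append(([p.size for p in flow],
                              [p.direction for p in flow]))
        start += step  # slide the window by the fixed step length
    return flows
```

With `step` smaller than `window` the streams overlap, which is one way to obtain more samples from a short capture; whether the original method uses overlapping windows is not stated.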
Further, the specific process of the step (2) is as follows:
first, create a packet size-to-count map $Map_{sn}$ and a packet size-to-average-direction map $Map_{sd}$, and traverse the packet size sequence S and the packet direction sequence D extracted in step (1): if a key $s_i$ exists in $Map_{sn}$ and $Map_{sd}$, add 1 to the value of key $s_i$ in $Map_{sn}$ and, depending on whether the packet is sent or received, either leave the value of key $s_i$ in $Map_{sd}$ unchanged or add 1 to it; if no key $s_i$ exists in $Map_{sn}$ and $Map_{sd}$, initialize both values to 0, i.e.:
Figure BDA0003498063770000031
Figure BDA0003498063770000032
where $s_i$ denotes the i-th element of the packet size sequence $S = [s_1, s_2, \ldots, s_{N-1}, s_N]$, i.e. the size of the i-th packet $p_i$ in stream I;
then, from the total number N of packets sent in the window, the window size T, and the maps $Map_{sn}$ and $Map_{sd}$, calculate the number-proportion distribution of packets of different sizes
Figure BDA0003498063770000033
the frequency distribution
Figure BDA0003498063770000034
and the direction distribution
Figure BDA0003498063770000035
which together represent the differentiated patterns of the Tor traffic of different anonymous applications.
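The grouping in step (2) can be sketched in Python as below. Because the equation images are not reproduced in this text, the closed forms used here are assumptions inferred from the names of the distributions: the proportion is the count divided by N, the frequency is the count divided by T, and the direction value is the average direction per size. The choice that a sent packet (direction 1) adds 1 to the direction map is likewise an assumption, and `mfd_distributions` is a hypothetical name.

```python
from collections import defaultdict


def mfd_distributions(sizes, directions, window_seconds):
    """Return (M_size, F_size, D_size) as dicts keyed by packet size."""
    n = len(sizes)
    map_sn = defaultdict(int)  # packet size -> number of packets
    map_sd = defaultdict(int)  # packet size -> number of sent packets
    for s, d in zip(sizes, directions):
        map_sn[s] += 1
        map_sd[s] += d  # assumed: d = 1 (sent) adds 1, d = 0 leaves it unchanged
    m_size = {s: c / n for s, c in map_sn.items()}               # share of all packets
    f_size = {s: c / window_seconds for s, c in map_sn.items()}  # packets per second
    d_size = {s: map_sd[s] / map_sn[s] for s in map_sn}          # average direction
    return m_size, f_size, d_size


# Example: four packets observed in a 2-second window.
m, f, d = mfd_distributions([1448, 1448, 1448, 52], [1, 1, 0, 0], 2.0)
```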
Further, the specific process of step (3) is as follows:
first, concatenate M_size, F_size and D_size into the MFD feature. Taking the MFD features as nodes and the similarity between MFD features as weighted undirected edges, construct the adjacency matrix E and the Laplacian matrix $L = G - E$, and normalize L to obtain $L_{nor}$:
Figure BDA0003498063770000036
where the similarity is calculated with a Gaussian kernel function
Figure BDA0003498063770000037
in which $MFD_a$ and $MFD_b$ are the MFD feature vectors of two streams, σ is the bandwidth, which controls the local range of influence of Sim, and G is the degree matrix, whose diagonal elements are the sums of the corresponding rows of the matrix E;
then compute the eigenvectors $F = \{F_1, F_2, \ldots, F_f\}$ corresponding to the f smallest eigenvalues of $L_{nor}$; for each cluster number from 2 to $K_{max}$, cluster F using the K-means algorithm to obtain a probability matrix $B_{Q \times K}$, where Q and K denote the number of samples and the number of clusters respectively, and $b_{q,k}$ is the probability that the q-th sample belongs to the k-th cluster,
Figure BDA0003498063770000038
Figure BDA0003498063770000041
and selecting the optimal cluster number according to the distribution of different types of MFD characteristics and the distribution of the same type of MFD characteristics in each cluster, namely a maximization formula:
Figure BDA0003498063770000042
where $\omega_q$ denotes the proportion, in the training set, of the type to which sample q belongs, and $\max(b_{q,k})$ is the maximum value of the q-th row of matrix B, i.e. the maximum probability that sample q belongs to any one cluster;
and finally, for each cluster, randomly delete MFD features of other types so that the most common type of MFD feature in the cluster stays above a certain proportion.
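The graph construction at the start of step (3) can be sketched with the Python standard library as follows. The equation images are not reproduced in this text, so three details are assumptions: the Gaussian kernel is written in its common form exp(-||a - b||² / (2σ²)), the normalization is the symmetric form G^(-1/2) L G^(-1/2), and the diagonal of E is set to zero (no self-loops). The eigen-decomposition and K-means stages are omitted.

```python
import math


def gaussian_sim(a, b, sigma):
    """Gaussian-kernel similarity between two MFD feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def normalized_laplacian(features, sigma):
    """Build E, G and L = G - E, then return the normalized Laplacian."""
    q = len(features)
    # Adjacency matrix E of pairwise similarities (assumed zero diagonal).
    E = [[0.0 if a == b else gaussian_sim(features[a], features[b], sigma)
          for b in range(q)] for a in range(q)]
    # Degree of each node: the sum of the corresponding row of E.
    deg = [sum(row) for row in E]
    # Unnormalized Laplacian L = G - E, with G the diagonal degree matrix.
    L = [[(deg[a] if a == b else 0.0) - E[a][b] for b in range(q)]
         for a in range(q)]
    inv_sqrt = [1.0 / math.sqrt(g) if g > 0 else 0.0 for g in deg]
    return [[inv_sqrt[a] * L[a][b] * inv_sqrt[b] for b in range(q)]
            for a in range(q)]
```

In practice the eigenvectors of the returned matrix would be computed with a numerical library and passed to K-means; those calls are left out here to keep the sketch dependency-free.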
Further, the specific process of the step (4) is as follows:
for the three distributions M_size, F_size and D_size extracted in each time window and processed in step (3), to bring all three onto a common scale and to keep a few excessively large values from dominating the normalization, first apply a robust normalization method that linearly transforms the original distributions so that the results fall within the interval [0, 255], that is:
Figure BDA0003498063770000043
Figure BDA0003498063770000044
Figure BDA0003498063770000045
where $M\_size_h$, $F\_size_h$ and $D\_size_h$ denote the h-th values of the three distributions, median denotes the median of a distribution, and IQR denotes the range between the first quartile (25%) and the third quartile (75%) of the distribution;
then create a packet size-color dictionary $SC\_dic = \{sc_1, sc_2, \ldots, sc_{M-1}, sc_M\}$, which stores the mapping from packet size to color (one color for each packet size, $sc_m = s_m : color_m$); the dictionary has the same number of entries as the distributions;
then create a square picture containing 1500 grids and number the grids sequentially from left to right and from top to bottom, so that each number j corresponds one-to-one to a packet size, with $j \in [1, 1500]$. Traverse SC_dic: if key j exists in SC_dic, color the grid with that number using the color code stored in SC_dic(j); if key j does not exist in SC_dic, use a gentle coloring scheme and take the color at the position corresponding to j from a default color array, i.e. SC_dic(j) = hex(crest[ceil(j/1499 × 255)]), until all grids are colored;
where the crest array contains the RGB values of 256 colors of different shades, and ceil denotes rounding up;
and finally, resize the picture to a suitable size according to the remaining storage space of the device and store it as the MFD chromatographic feature of the anonymous-application Tor traffic.
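The robust normalization at the start of step (4) could look like the following Python sketch. The text only states that values are centered on the median, scaled by the IQR and mapped linearly into [0, 255]; the exact linear map is in equation images that are not reproduced here. The clipping bound `clip` and the function name `robust_scale_to_255` are therefore assumptions made for illustration, not the patented formula.

```python
import statistics


def robust_scale_to_255(values, clip=2.0):
    """Median/IQR scaling followed by a linear map of [-clip, clip] to [0, 255]."""
    if len(values) < 2:
        return [0 for _ in values]  # quartiles need at least two data points
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # avoid division by zero for constant data
    scaled = []
    for v in values:
        z = (v - med) / iqr            # robust z-score
        z = max(-clip, min(clip, z))   # assumed clipping limits outlier influence
        scaled.append(round((z + clip) / (2 * clip) * 255))
    return scaled


# An extreme value (100) saturates at 255 instead of compressing the rest.
channel = robust_scale_to_255([0, 1, 2, 3, 4, 100])
```

Clipping before the linear map is what keeps a single large value from squeezing all other values toward zero, which a plain min-max scaling would do.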
The invention also discloses an MFD chromatographic characteristic extraction system for distinguishing the anonymous Tor application flow, which comprises a flow acquisition module, a flow distribution calculation module, an MFD chromatographic characteristic extraction module and an anonymous application Tor flow classification module;
the flow acquisition module collects and aggregates traffic by port mirroring, both in the experimental environment and at each key node of the data center network equipment; after preprocessing, the traffic from the experimental environment is assigned anonymous Tor application labels and is stored on physical devices together with the data center traffic;
the flow distribution calculation module splits the original traffic according to the packet timestamps, extracts the packet size sequence and the packet direction sequence of each split stream, and from these sequences calculates the number-proportion distribution, the frequency distribution and the direction distribution of packets of different sizes;
the MFD chromatographic characteristic extraction module fills the three distributions obtained by the flow distribution calculation module, as the three RGB color channels, into a square picture containing 1500 grids, and names the picture with the anonymous-application Tor traffic label;
and the anonymous application Tor flow classification module takes the extracted MFD chromatographic features as the input of a machine learning model, tunes the model structure to the optimum for these features, and finally uses the model to classify anonymous-application Tor traffic.
The above steps are performed on the target traffic data from the experimental environment, and a classification model is obtained by deep learning; the actual target traffic data from the data center network equipment is then processed to obtain the corresponding MFD features, which are input into the trained classification model to output the actual classification.
The MFD features can be used with any machine learning model, for example a random forest (RF): the processed MFD features are used as the RF input, subsets of the samples and of the MFD features are selected at random, attributes are split according to an information-gain strategy, a plurality of decision trees are built to form the random forest, and the model structure is tuned to the optimum by cross-validation.
The invention also discloses a computer storage medium, wherein an MFD chromatographic characteristic extraction program for distinguishing the anonymous Tor application flow is stored in the computer storage medium, and the MFD chromatographic characteristic extraction method for distinguishing the anonymous Tor application flow is realized when the program is executed.
Beneficial effects: targeting the differences in data distribution among the Tor traffic of different anonymous applications, and taking into account the multiplexing characteristic of the Tor network and the need for effective input features in fine-grained Tor traffic classification, the invention provides an MFD chromatographic characteristic extraction method for distinguishing anonymous Tor application flow. By observing and analyzing the packet size and direction sequences within a fixed time window of anonymous-application Tor traffic, the method extracts the number-proportion distribution, the frequency distribution and the direction distribution of packets of different sizes and fuses them into MFD features. It clusters and filters the MFD features with a spectral clustering algorithm, and the distribution of the features within and between clusters shows how well the MFD features separate different types of anonymous-application Tor traffic. Finally, a grid-diagram-based visualization method maps the MFD features into the RGB color space to obtain the MFD chromatographic features, which visually show the differentiated patterns of anonymous-application Tor traffic and make it possible to distinguish such traffic.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a schematic diagram of a system in an embodiment;
FIG. 3 is a flowchart of the MFD chromatographic feature extraction module of the embodiment;
FIG. 4 is a flow diagram of the operation of the anonymous application Tor traffic classification module in an embodiment.
Detailed Description
The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.
The method takes into account the multiplexing characteristic of the Tor network and the need for effective input features in fine-grained Tor traffic classification, and extracts differentiated features from the anonymous-application Tor traffic generated by network users. First, a target Tor traffic set is collected in an experimental environment and, after preprocessing, split into a plurality of traffic slices according to a fixed time window. For each slice, the packet size and direction sequences are extracted, and the number-proportion distribution, the frequency distribution and the direction distribution of packets of different sizes are calculated and fused into MFD features. The MFD features are clustered and filtered with a spectral clustering algorithm, and the distribution of the features within and between clusters shows how well they separate different types of anonymous-application Tor traffic. Finally, a grid-diagram-based visualization method maps the MFD features into the RGB color space to obtain the MFD chromatographic features, which visually show the differentiated patterns of anonymous-application Tor traffic. This overcomes the inability of the original features to distinguish the Tor traffic of different anonymous applications effectively, as well as their low interpretability, and yields an effective feature representation of anonymous-application Tor traffic.
Example 1:
As shown in FIG. 1, the detailed process of the MFD chromatographic feature extraction method for distinguishing anonymous Tor application flow according to this embodiment is as follows:
S101: collect the network traffic generated by the target applications in the Tor network to form a target traffic set, divide the target traffic set into a plurality of streams using a sliding time window with a fixed step length, remove inactive streams and noise using a dual-threshold preprocessing strategy, and extract the packet size sequence and the packet direction sequence of each stream, specifically as follows:
first, remove from the target traffic NT all packets that use a non-TCP protocol and all packets whose size is not within the interval [1, 1500]. Set a splitting threshold δ and a denoising threshold τ, where δ specifies the minimum number of packets a stream must contain and τ specifies the maximum proportion of background-traffic packets a stream may contain. Split NT into a plurality of streams of T seconds each using a sliding time window with a fixed step length. For each stream, check whether its number of packets exceeds δ and delete the stream if it does not; otherwise check whether its proportion of background-traffic packets exceeds τ and delete the stream if it does. Finally, for each stream $I = [p_1, p_2, \ldots, p_i, \ldots, p_{N-1}, p_N]$ of length N, extract the packet size sequence $S = [s_1, s_2, \ldots, s_i, \ldots, s_{N-1}, s_N]$ and the packet direction sequence $D = [d_1, d_2, \ldots, d_{N-1}, d_N]$, where $p_i$ denotes the i-th packet in stream I;
S102: group the packet size sequence and the packet direction sequence from S101 by packet size, and calculate the number-proportion distribution, the frequency distribution and the direction distribution of packets of different sizes from the window length and the total number of packets sent in the window, specifically as follows:
create a packet size-to-count map $Map_{sn}$ and a packet size-to-average-direction map $Map_{sd}$, and traverse the packet size sequence S and the packet direction sequence D extracted in S101: if a key $s_i$ exists in $Map_{sn}$ and $Map_{sd}$, add 1 to the value of key $s_i$ in $Map_{sn}$ and, depending on whether the packet is sent or received, either leave the value of key $s_i$ in $Map_{sd}$ unchanged or add 1 to it; if no key $s_i$ exists, initialize both values to 0, i.e.:
Figure BDA0003498063770000071
Figure BDA0003498063770000072
then, from the total number N of packets sent in the window, the window size T, and the maps $Map_{sn}$ and $Map_{sd}$, calculate the number-proportion distribution of packets of different sizes
Figure BDA0003498063770000073
the frequency distribution
Figure BDA0003498063770000074
and the direction distribution
Figure BDA0003498063770000075
which together represent the differentiated patterns of the Tor traffic of different anonymous applications.
S103: fuse the three distributions obtained in S102 into MFD features, cluster the MFD features using a spectral clustering algorithm, and select the optimal number of clusters according to how MFD features of different types and of the same type are distributed across the clusters. For each cluster, randomly delete MFD features of other types so that the most common type in the cluster stays above a certain proportion, specifically as follows:
first, concatenate M_size, F_size and D_size into the MFD feature. Taking the MFD features as nodes and the similarity between MFD features as weighted undirected edges, construct the adjacency matrix E and the Laplacian matrix $L = G - E$, and normalize L to obtain $L_{nor}$:
Figure BDA0003498063770000081
where the similarity is calculated with a Gaussian kernel function
Figure BDA0003498063770000082
in which $MFD_a$ and $MFD_b$ are the MFD feature vectors (1 row, u columns) of two streams, σ is the bandwidth, which controls the local range of influence of Sim, and G is the degree matrix, whose diagonal elements are the sums of the corresponding rows of the matrix E;
then compute the eigenvectors $F = \{F_1, F_2, \ldots, F_f\}$ corresponding to the f smallest eigenvalues of $L_{nor}$; for each cluster number from 2 to $K_{max}$, cluster F using the K-means algorithm to obtain a probability matrix $B_{Q \times K}$, where Q and K denote the number of samples and the number of clusters respectively, and $b_{q,k}$ is the probability that the q-th sample belongs to the k-th cluster,
Figure BDA0003498063770000083
Figure BDA0003498063770000084
then select the optimal number of clusters according to how MFD features of different types and of the same type are distributed across the clusters, i.e. by maximizing
Figure BDA0003498063770000085
where $\omega_q$ denotes the proportion, in the training set, of the type to which sample q belongs, and $\max(b_{q,k})$ is the maximum value of the q-th row of matrix B, i.e. the maximum probability that sample q belongs to any one cluster;
and finally, for each cluster, randomly delete MFD features of other types so that the most common type of MFD feature in the cluster stays above a certain proportion.
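The final filtering of S103 can be sketched as follows for a single cluster. The function name `filter_cluster`, the use of type labels as input, and the rule for how many minority samples may remain (the largest count that keeps the dominant type at or above `min_ratio`) are assumptions for illustration; the text only requires that the dominant type stay above a chosen proportion, such as 90%.

```python
import random
from collections import Counter


def filter_cluster(labels, min_ratio=0.9, seed=None):
    """Return the indices of the samples kept in one cluster."""
    if not labels:
        return []
    rng = random.Random(seed)
    dominant, n_dom = Counter(labels).most_common(1)[0]
    # Largest k such that n_dom / (n_dom + k) >= min_ratio; the small epsilon
    # guards against floating-point error when the bound is an exact integer.
    max_other = int(n_dom * (1.0 - min_ratio) / min_ratio + 1e-9)
    others = [i for i, lab in enumerate(labels) if lab != dominant]
    # Randomly choose which minority samples survive; the rest are deleted.
    keep_others = set(rng.sample(others, min(len(others), max_other)))
    return [i for i, lab in enumerate(labels)
            if lab == dominant or i in keep_others]


# 18 samples of type "a" allow at most 2 samples of other types at 90%.
kept = filter_cluster(["a"] * 18 + ["b"] * 5 + ["c"] * 3, min_ratio=0.9, seed=0)
```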
S104: map the MFD features obtained in S103 into the RGB color space using a grid-diagram-based visualization method, then perform image compression and format conversion and store the result as the MFD chromatographic feature, which visually shows the differentiated patterns of anonymous-application Tor traffic and is used for subsequent classification, specifically as follows:
for the three distributions M_size, F_size and D_size extracted in each time window in S103, to bring all three onto a common scale and to keep a few excessively large values from dominating the normalization, first apply a robust normalization method that linearly transforms the original distributions so that the results fall within the interval [0, 255], i.e.:
Figure BDA0003498063770000086
Figure BDA0003498063770000087
where $M\_size_h$, $F\_size_h$ and $D\_size_h$ denote the h-th values of the three distributions, median denotes the median of a distribution, and IQR denotes the range between the first quartile (25%) and the third quartile (75%) of the distribution.
Then a packet size-color dictionary $SC\_dic = \{sc_1, sc_2, \ldots, sc_{M-1}, sc_M\}$ is created to store the mapping from packet size to color; the dictionary has the same number of entries as the distributions.
Then a square picture containing 1500 grids is created and the grids are numbered sequentially from left to right and from top to bottom, so that each number j corresponds one-to-one to a packet size, with $j \in [1, 1500]$. SC_dic is then traversed: if key j exists in SC_dic, the grid with that number is colored using the color code stored in SC_dic(j); if key j does not exist in SC_dic, a gentle coloring scheme is used and the color at the position corresponding to j is taken from a default color array, i.e. SC_dic(j) = hex(crest[ceil(j/1499 × 255)]), until all grids are colored. Here the crest array contains the RGB values of 256 colors of different shades, and ceil denotes rounding up.
Finally, the picture is resized to a suitable size according to the remaining storage space of the device and stored as the MFD chromatographic feature of the anonymous-application Tor traffic.
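The grid-coloring loop can be sketched as below. The palette returned by `default_palette` is a hypothetical stand-in for the crest array, whose actual 256 RGB values are not listed in the text, and the function names are illustrative. Note that ceil(j/1499 × 255) evaluates to 256 for j = 1500, which is outside a 256-entry array indexed from 0, so this sketch clamps the index to 255; how the original implementation handles that case is not stated.

```python
import math


def default_palette():
    """Hypothetical 256-entry palette ordered from light to dark shades."""
    return [(255 - i, 255 - i // 2, 255 - i // 4) for i in range(256)]


def fallback_index(j):
    """Palette position for a packet size j that has no dictionary entry."""
    return min(255, math.ceil(j / 1499 * 255))  # clamp: j = 1500 would give 256


def color_grid(sc_dic, palette=None):
    """Return the RGB color of each of the 1500 grids, in numbering order."""
    palette = palette or default_palette()
    return [sc_dic[j] if j in sc_dic else palette[fallback_index(j)]
            for j in range(1, 1501)]


# A dictionary holding one observed packet size; every other grid falls back.
grid = color_grid({1448: (10, 20, 30)})
```

The resulting list can be laid out row by row and written to an image file with any imaging library; that step is omitted because the exact picture dimensions are not given in this passage.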
The detailed process of the data packet size-color dictionary generation algorithm comprises the following steps:
Figure BDA0003498063770000091
Example 2:
As shown in FIG. 2, the system of this embodiment, which implements the MFD chromatographic feature extraction method for distinguishing anonymous Tor application flow, includes a flow collection module 100, a flow distribution calculation module 200, an MFD chromatographic feature extraction module 300, and an anonymous application Tor traffic classification module 400.
The flow collection module 100 collects and aggregates traffic by port mirroring, both in the experimental environment and at each key node of the data center network equipment; after preprocessing, the traffic from the experimental environment is assigned anonymous Tor application labels and is stored on physical devices together with the data center traffic.
The flow distribution calculation module 200 splits the original traffic according to the packet timestamps, extracts the packet size and direction sequences of each split stream, and from these sequences calculates the number-proportion distribution, the frequency distribution and the direction distribution of packets of different sizes.
The MFD chromatographic feature extraction module 300 fills the three distributions obtained by the flow distribution calculation module, as the three RGB color channels, into a square picture containing 1500 grids, and names the picture with the anonymous-application Tor traffic label.
The anonymous application Tor traffic classification module 400 takes the extracted MFD chromatographic features as the input of a machine learning model, tunes the model structure to the optimum for these features, and finally uses the model to classify anonymous-application Tor traffic.
Example 3:
On the basis of embodiment 2, as shown in fig. 3, the MFD chromatographic feature extraction module 301 of this embodiment first applies a robust normalization method to the number-ratio distribution, frequency distribution and direction distribution of packets of different sizes, which removes their dimensions, weakens the influence of boundary values, and scales them into the [0,255] interval. The three distributions are then converted into color distributions and stored, together with the corresponding packet sizes, in the packet size-color dictionary. Next, a square picture containing 1500 grids is created and the grids are numbered 1-1500; for each number, the color at the corresponding position in the packet size-color dictionary is looked up and used to color the grid. If the dictionary holds no color for a number, a gentle coloring scheme is used: the color at position ceil(number/1499 × 255) of the default array creet, which holds 256 colors of different depths, is selected as the fill color. This continues until all grids are colored. Finally, according to the remaining storage space of the device, the picture is resized to 224 × 224 × 3 and stored as the MFD chromatographic feature of the anonymous-application Tor traffic.
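The robust normalization step can be illustrated with a minimal sketch. The patent gives the exact formula only in its figures, so the form used here is an assumption: values are centered on the median, divided by the interquartile range, clipped to a fixed band to damp boundary values, and mapped linearly onto [0, 255]. The function name `robust_scale_to_255` and the `clip` parameter are hypothetical.

```python
import numpy as np

def robust_scale_to_255(values, clip=2.0):
    """Scale a 1-D distribution into [0, 255] (assumed median/IQR form)."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # degenerate distribution: avoid division by zero
    z = (x - median) / iqr
    # Assumption: clipping to +/- `clip` IQRs is what weakens boundary values.
    z = np.clip(z, -clip, clip)
    scaled = (z + clip) / (2.0 * clip) * 255.0
    return np.rint(scaled).astype(np.uint8)
```

With this form a single very large value saturates at 255 instead of compressing all other values toward 0, which is the effect the text attributes to robust normalization.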
Example 4:
On the basis of embodiment 2, as shown in fig. 4, the anonymous-application Tor traffic classification module 401 of this embodiment works in two stages. In the first stage, Tor traffic of 21 anonymous applications on the Android platform (e.g., WeChat, bilibili) is collected in the experimental environment and divided into streams using a time window of 15 s. The packets in each stream are traversed and grouped by packet size, and the three distributions M_size, F_size and D_size of packets of different sizes are calculated (e.g., 0.88, 148.8 and 0.99 for bilibili packets of size 1448). The MFD chromatographic feature of each stream is extracted, assigned an application label, and input into the pretrained convolutional neural network model ResNet50 to extract deeper features. The model parameters are fine-tuned in this process using the mini-batch gradient descent algorithm (MBGD): the classification loss is computed and the parameters are updated by back propagation until the model converges. The last fully connected layer and the softmax classification layer are then removed and the model is saved as M_base; M_base retains only the convolutional and pooling layers and is used for feature extraction. In the second stage, 200 traffic samples are randomly drawn each time from the dataset collected in the experiment and 1800 traffic samples from the dataset collected in the data center network, and the model M_base trained in the first stage is used to extract their feature vectors. A graph G(Node, Edge) is defined, where Node denotes the vertex set: each vertex represents the feature vector of one stream, every feature vector is numbered, and Node is the set of all numbers. Edge denotes the edge set and records which pairs of vertices are associated. Cosine similarity is used to decide whether two vertices are connected, namely:
$$\mathrm{cos\_sim}(X_A, X_B) = \frac{X_A \cdot X_B}{\lVert X_A \rVert \, \lVert X_B \rVert}$$
where X_A and X_B denote the feature vectors of vertex A and vertex B respectively. Let ε be the accuracy of the ResNet50 model; if the cos_sim of two nodes is greater than ε, a connection is judged to exist between them, and the number pair of the two nodes is stored in the edge set Edge. An attention mechanism is then used to obtain the attention score of each pair of vertices; for each vertex, the feature vectors of all its first-order neighbors are aggregated, with the attention scores as weights, to update the feature vector of that vertex. Finally, the aggregated feature vectors are classified with a machine learning classifier.
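The graph construction and neighbor aggregation described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and because the text does not specify the attention mechanism, the attention scores here are assumed to be softmax-normalized dot products between a vertex and its neighbors.

```python
import numpy as np

def build_edges(features, eps):
    """Return index pairs (a, b), a < b, whose cosine similarity exceeds eps."""
    x = np.asarray(features, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sim = unit @ unit.T                     # pairwise cosine similarity
    n = len(x)
    return [(a, b) for a in range(n) for b in range(a + 1, n) if sim[a, b] > eps]

def aggregate_neighbors(features, edges):
    """Replace each vertex by an attention-weighted sum of its first-order neighbors.

    Assumption: attention score = softmax over neighbors of the dot product
    with the vertex itself; the patent does not state the scoring function.
    """
    x = np.asarray(features, dtype=float)
    neighbors = {i: [] for i in range(len(x))}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    out = x.copy()
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue  # an isolated vertex keeps its own feature vector
        scores = x[nbrs] @ x[i]
        weights = np.exp(scores - scores.max())  # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ x[nbrs]
    return out
```

In the patent the input vectors would come from M_base and eps would be the ResNet50 accuracy; the aggregated vectors would then be passed to a separate classifier.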
Experiments show the following:
The classification accuracy of the present invention is 90.9% on the UNB ISCXtor dataset and 88.9% on the self-collected dataset, exceeding existing solutions (e.g., those proposed by Petagna E et al. and Shapira T et al.).
In other words, the features used by existing traffic classification methods cannot effectively distinguish different anonymous Tor applications and have low interpretability. To address these problems, the invention extracts the MFD chromatographic features of anonymous Tor traffic, which visually show how the anonymous Tor traffic of different terminal applications differs and overcome the inability of existing features to express Tor traffic patterns effectively. These features are then used as the input of a traffic classification model to distinguish different anonymous Tor applications effectively.

Claims (7)

1. An MFD chromatographic feature extraction method for distinguishing anonymous Tor application traffic, characterized in that the method comprises the following steps:
step (1), collecting the network traffic generated by target applications in the Tor network to form a target traffic set; dividing the target traffic set into a plurality of streams using a sliding time window with a fixed step length; removing inactive streams and noise from the streams using a double-threshold preprocessing strategy; and then extracting the packet size sequence S and the packet direction sequence D of each stream I;
step (2), grouping the packet size sequence and packet direction sequence obtained in step (1) by packet size, and calculating the number-ratio distribution M_size, the frequency distribution F_size and the direction distribution D_size of packets of different sizes according to the window length of the sliding time window and the total number of packets sent in the window;
step (3), fusing the three distributions obtained in step (2) into MFD features and clustering the MFD features with a spectral clustering algorithm, while selecting the optimal number of clusters according to how MFD features of different types and of the same type are distributed across the clusters; for each cluster, randomly deleting MFD features of other types so that the proportion of the most frequent type of MFD features in the cluster stays above a certain ratio;
step (4), mapping the MFD features processed in step (3) to the RGB color space using a grid-diagram-based visualization method, then performing image compression and format conversion and storing the result as the MFD chromatographic feature, so as to visually display the distinguishing patterns of anonymous-application Tor traffic and to serve subsequent classification.
2. The MFD chromatographic feature extraction method for differentiating anonymous Tor application traffic as recited in claim 1, wherein: the specific process of the step (1) is as follows:
first, removing from the target traffic NT the packets that use a non-TCP protocol and the packets whose size is not within the interval [1,1500];
then setting a splitting threshold δ and a denoising threshold τ, where δ is the minimum number of packets a stream must contain and τ is the maximum proportion of background-traffic packets a stream may contain;
dividing NT into a plurality of streams, one every T seconds, using a sliding time window T with a fixed step length; for each stream, judging whether its number of packets exceeds δ and deleting the stream if it does not; if it does, further judging whether its proportion of background-traffic packets exceeds τ and deleting the stream if it does;
finally, for each stream I = [p_1, p_2, ..., p_i, ..., p_{N-1}, p_N] of length N, extracting the packet size sequence S = [s_1, s_2, ..., s_i, ..., s_{N-1}, s_N] and the packet direction sequence D = [d_1, d_2, ..., d_{N-1}, d_N], where p_i denotes the i-th packet in stream I and i ∈ [1, N].
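The double-threshold preprocessing of step (1) can be sketched as follows. The `Packet` record, its field names, and the function name are hypothetical. The sketch also assumes that the window step equals the window length (non-overlapping windows), that direction is encoded as 1 for sent and 0 for received, and that background packets have already been flagged; the claim does not specify any of these details.

```python
from collections import namedtuple

# Hypothetical packet record; the patent does not define a concrete layout.
Packet = namedtuple("Packet", "timestamp size direction is_tcp is_background")

def split_and_filter(packets, window, delta, tau):
    """Split traffic into `window`-second streams and apply both thresholds.

    Returns a list of (size_sequence, direction_sequence) pairs, one per kept stream.
    """
    # Drop non-TCP packets and packets whose size is outside [1, 1500].
    kept = [p for p in packets if p.is_tcp and 1 <= p.size <= 1500]
    if not kept:
        return []
    kept.sort(key=lambda p: p.timestamp)
    start = kept[0].timestamp
    windows = {}
    for p in kept:
        windows.setdefault(int((p.timestamp - start) // window), []).append(p)
    streams = []
    for _, stream in sorted(windows.items()):
        if len(stream) <= delta:
            continue  # inactive stream: packet count does not exceed delta
        background_ratio = sum(p.is_background for p in stream) / len(stream)
        if background_ratio > tau:
            continue  # noisy stream: too much background traffic
        streams.append(([p.size for p in stream], [p.direction for p in stream]))
    return streams
```

Each returned pair corresponds to the sequences S and D of one stream I.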
3. The MFD chromatographic feature extraction method for differentiating anonymous Tor application traffic as recited in claim 1, wherein: the specific process of the step (2) is as follows:
creating a packet size-to-count map Map_sn and a packet size-to-average-direction map Map_sd, and traversing the packet size sequence S and the packet direction sequence D extracted in step (1): if a key of size s_i exists in Map_sn and Map_sd, adding 1 to the value of key s_i in Map_sn, and either keeping the value of key s_i in Map_sd unchanged or adding 1 to it, depending on whether the packet is sent or received; if the key s_i exists in neither Map_sn nor Map_sd, initializing both values to 0, namely:
Figure FDA0003498063760000021
Figure FDA0003498063760000022
where s_i denotes the i-th element of the packet size sequence S = [s_1, s_2, ..., s_{N-1}, s_N], i.e. the size of the i-th packet p_i in stream I;
then, according to the total number N of packets sent in the window, the window size T, and the maps Map_sn and Map_sd, calculating the number-ratio distribution of packets of different sizes
Figure FDA0003498063760000023
the frequency distribution
Figure FDA0003498063760000024
and the direction distribution
Figure FDA0003498063760000025
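A minimal sketch of the three distributions follows. The defining formulas appear only as figures, so the forms used here are assumptions: M_size is the share of packets of a given size among all N packets, F_size is the number of packets of that size per second of window, and D_size is the fraction of packets of that size that were sent. Direction is assumed to be encoded as 1 for sent and 0 for received, and the function name is hypothetical.

```python
from collections import defaultdict

def mfd_distributions(sizes, directions, window_seconds):
    """Return (M_size, F_size, D_size) as dicts keyed by packet size.

    Assumed forms: M = count / N, F = count / T, D = sent_count / count.
    """
    n_total = len(sizes)
    map_sn = defaultdict(int)  # packet size -> number of packets of that size
    map_sd = defaultdict(int)  # packet size -> number of sent packets of that size
    for size, direction in zip(sizes, directions):
        map_sn[size] += 1
        if direction == 1:  # assumption: 1 = sent, 0 = received
            map_sd[size] += 1
    m_size = {s: c / n_total for s, c in map_sn.items()}
    f_size = {s: c / window_seconds for s, c in map_sn.items()}
    d_size = {s: map_sd[s] / c for s, c in map_sn.items()}
    return m_size, f_size, d_size
```

These assumed forms are at least consistent with the magnitudes quoted in Example 4 for a 15 s window, where the frequency value is far larger than the two ratio values.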
4. The MFD chromatographic feature extraction method for differentiating anonymous Tor application traffic as recited in claim 1, wherein: the specific process of the step (3) is as follows:
first, concatenating M_size, F_size and D_size into the MFD feature; taking the MFD features as nodes and the similarity between MFD features as weighted undirected edges, constructing an adjacency matrix E and a Laplacian matrix L = G - E, and normalizing L to obtain L_nor
Figure FDA0003498063760000026
where the similarity is calculated with a Gaussian kernel function
Figure FDA0003498063760000027
MFD_a and MFD_b are the MFD feature vectors (1 row, u columns) of two streams; σ is the bandwidth, which controls the local range of action of Sim; G is the degree matrix, whose diagonal elements are the sums of the corresponding rows of the matrix E;
then computing the eigenvectors F = {F_1, F_2, ..., F_f} corresponding to the f smallest eigenvalues of L_nor, and, for each cluster number from 2 to K_max, clustering F with the K-means algorithm to obtain a probability matrix B_{Q×K}, where Q and K denote the numbers of samples and clusters respectively and b_{q,k} is the probability that the q-th sample belongs to the k-th cluster,
Figure FDA0003498063760000031
Figure FDA0003498063760000032
and selecting the optimal cluster number according to the distribution of different types of MFD characteristics and the distribution of the same type of MFD characteristics in each cluster, namely a maximization formula:
Figure FDA0003498063760000033
where ω_q denotes the proportion of the type of sample q in the training set, and max(b_{q,k}) is the maximum value of the q-th row of the matrix B, i.e. the maximum probability that sample q belongs to any one cluster;
and finally, in each cluster, randomly deleting the MFD features of other types so that the proportion of the most frequent type of MFD features remains above a certain ratio.
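The spectral embedding at the heart of step (3) can be sketched as follows. The kernel and normalization formulas are given only as figures, so this sketch assumes the common forms exp(-||a - b||^2 / (2σ^2)) for the Gaussian kernel and G^(-1/2) L G^(-1/2) for the normalized Laplacian, and it zeroes the diagonal of E. The function name is hypothetical; the K-means step and the cluster-number selection are not shown.

```python
import numpy as np

def spectral_embedding(mfd, sigma, f):
    """Return the eigenvectors for the f smallest eigenvalues of L_nor.

    Rows of `mfd` are MFD feature vectors (one per stream).
    """
    x = np.asarray(mfd, dtype=float)
    sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    e = np.exp(-sq_dist / (2.0 * sigma ** 2))  # adjacency matrix E (assumed kernel form)
    np.fill_diagonal(e, 0.0)                   # assumption: no self-loops
    degree = e.sum(axis=1)                     # diagonal of the degree matrix G
    lap = np.diag(degree) - e                  # L = G - E
    inv_sqrt = 1.0 / np.sqrt(np.clip(degree, 1e-12, None))
    l_nor = inv_sqrt[:, None] * lap * inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(l_nor)   # eigenvalues in ascending order
    return eigvecs[:, :f]
```

The rows of the returned matrix are the low-dimensional points that would be passed to K-means for each candidate cluster number from 2 to K_max.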
5. The MFD chromatographic feature extraction method for differentiating anonymous Tor application traffic as recited in claim 1, wherein: the specific process of the step (4) is as follows:
for the three distributions M _ size, F _ size, and D _ size extracted in each time window from step (3), first, the original distribution is linearly transformed using a robust normalization method, and the result falls into the [0,255] interval, that is:
Figure FDA0003498063760000034
Figure FDA0003498063760000035
Figure FDA0003498063760000036
where M_size_h, F_size_h and D_size_h denote the h-th values of the three distributions, mean denotes the median of a distribution, and IQR denotes the range between the 1st quartile and the 3rd quartile of a distribution;
then, creating a packet size-color dictionary SC_dic = {sc_1, sc_2, ..., sc_{M-1}, sc_M}, where SC_dic stores the mapping from packet size to color and the dictionary has the same size as the distributions;
then, creating a square picture containing 1500 grids and numbering the grids in sequence from left to right and from top to bottom, the number j corresponding one-to-one to a packet size, j ∈ [1,1500]; then traversing SC_dic: if the key j exists in SC_dic, coloring the grid with that number using the color code stored in SC_dic(j); if the key j does not exist in SC_dic, using a gentle coloring scheme that takes the color corresponding to j from the default color array, namely SC_dic(j) = hex(creet[ceil(j/1499 × 255)]), until all grids are colored;
where the creet array contains the RGB values of 256 colors of different depths, and ceil denotes rounding up;
and finally, resizing the picture to a suitable size according to the remaining storage space of the device and storing it as the MFD chromatographic feature of the anonymous-application Tor traffic.
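The grid coloring of step (4) can be sketched as follows. The contents of the creet array are not given in the text, so a 256-level gray ramp stands in for it here as an assumption; the function names are hypothetical. Because ceil(j/1499 × 255) ranges over 1..256 for j in 1..1500, the position is treated as 1-based.

```python
import math

def default_palette():
    """256 gray levels, a stand-in for the default color array (assumption)."""
    return [(v, v, v) for v in range(256)]

def color_grid(sc_dic, palette=None, cells=1500):
    """Return one RGB tuple per grid number 1..cells.

    `sc_dic` maps a packet size (the grid number) to an (R, G, B) tuple built
    from the scaled M_size, F_size and D_size values.
    """
    palette = palette or default_palette()
    grid = []
    for j in range(1, cells + 1):
        if j in sc_dic:
            grid.append(sc_dic[j])
        else:
            # Gentle coloring: 1-based position ceil(j / 1499 * 255), in 1..256.
            pos = math.ceil(j / 1499 * 255)
            grid.append(palette[pos - 1])
    return grid
```

The returned list follows the left-to-right, top-to-bottom numbering; how the 1500 cells are arranged into rows and how the picture is resized afterwards is left open here, since the claim does not fix the layout.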
6. An MFD chromatographic feature extraction system for distinguishing anonymous Tor application traffic, characterized in that the system comprises a traffic collection module, a traffic distribution calculation module, an MFD chromatographic feature extraction module and an anonymous-application Tor traffic classification module;
the traffic collection module collects and aggregates traffic by port mirroring, both in the experimental environment and at each key node of the data center network equipment; after preprocessing, it assigns anonymous Tor application labels to the traffic from the experimental environment and stores that traffic on a physical device together with the data center traffic;
the traffic distribution calculation module splits the original traffic according to the packet timestamps, extracts the packet size sequence and packet direction sequence of each resulting stream, and from these sequences calculates the number-ratio distribution, frequency distribution and direction distribution of packets of different sizes;
the MFD chromatographic feature extraction module fills the three distributions obtained by the traffic distribution calculation module, as the three RGB colors, into a square picture containing 1500 grids, and names the picture with the anonymous-application Tor traffic label;
and the anonymous-application Tor traffic classification module takes the extracted MFD chromatographic features as the input of a machine learning model, tunes the model structure to be optimal for these features, and finally uses the model to classify the anonymous-application Tor traffic.
7. A computer storage medium, characterized in that: the computer storage medium stores an MFD chromatographic feature extraction program for distinguishing anonymous Tor application flow, and the MFD chromatographic feature extraction program is executed to realize the MFD chromatographic feature extraction method for distinguishing anonymous Tor application flow according to any one of claims 1 to 5.
CN202210120936.2A 2022-02-09 2022-02-09 MFD chromatographic feature extraction method and system for distinguishing anonymous Torr application flow Active CN114500396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210120936.2A CN114500396B (en) 2022-02-09 2022-02-09 MFD chromatographic feature extraction method and system for distinguishing anonymous Torr application flow

Publications (2)

Publication Number Publication Date
CN114500396A true CN114500396A (en) 2022-05-13
CN114500396B CN114500396B (en) 2024-04-16

Family

ID=81478030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210120936.2A Active CN114500396B (en) 2022-02-09 2022-02-09 MFD chromatographic feature extraction method and system for distinguishing anonymous Torr application flow

Country Status (1)

Country Link
CN (1) CN114500396B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104135385A (en) * 2014-07-30 2014-11-05 南京市公安局 Method of application classification in Tor anonymous communication flow
CN106953837A (en) * 2015-11-03 2017-07-14 丛林网络公司 With the visual integrating security system of threat
US20190052567A1 (en) * 2018-06-29 2019-02-14 Intel Corporation Non-random flowlet-based routing
CN111901300A (en) * 2020-06-24 2020-11-06 武汉绿色网络信息服务有限责任公司 Method and device for classifying network traffic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何永忠;李响;陈美玲;王伟;: "基于云流量混淆的Tor匿名通信识别方法", 工程科学与技术, no. 02, 20 March 2017 (2017-03-20) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116068111A (en) * 2023-03-23 2023-05-05 华谱科仪(北京)科技有限公司 Chromatographic data analysis method, chromatographic data analysis device, chromatographic data analysis equipment and chromatographic data analysis computer medium
CN116068111B (en) * 2023-03-23 2023-05-30 华谱科仪(北京)科技有限公司 Chromatographic data analysis method, chromatographic data analysis device, chromatographic data analysis equipment and chromatographic data analysis computer medium
CN116319086A (en) * 2023-05-17 2023-06-23 南京信息工程大学 Flow association method and system for Torr anonymous network
CN116319086B (en) * 2023-05-17 2023-07-21 南京信息工程大学 Flow association method and system for Torr anonymous network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant