CN114500396A

CN114500396A - MFD chromatographic characteristic extraction method and system for distinguishing anonymous Tor application flow

Info

Publication number: CN114500396A
Application number: CN202210120936.2A
Authority: CN
Inventors: 王良民; 何刘坤; 傅涛; 冯霞; 周强; 言洪萍; 徐伊凡
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2022-02-09
Filing date: 2022-02-09
Publication date: 2022-05-13
Anticipated expiration: 2042-02-09
Also published as: CN114500396B

Abstract

The invention discloses an MFD chromatographic feature extraction method and system for distinguishing anonymous Tor application traffic. The MFD features capture the size distribution of anonymous traffic packets, the frequency distribution of different types of packets, the sending direction of different packets, and other traffic characteristics. The MFD chromatographic feature is a visualization of these traffic features in which a grid graph is mapped into the RGB color space.

Description

MFD chromatographic characteristic extraction method and system for distinguishing anonymous Tor application flow

Technical Field

The invention relates to a network security technology, in particular to an MFD chromatographic characteristic extraction method and system for distinguishing anonymous Tor application flow.

Background

Network traffic is composed of data packets belonging to different applications. With the wide use of mobile devices such as notebook computers and smartphones, users can install applications from application stores or the network at any time and place, so the types of network traffic are becoming increasingly diverse. Network security administrators and operators therefore need traffic classification technology to identify the network traffic generated by user devices, so that traffic from currently popular applications can be given higher priority for a better user experience while unexamined application traffic is prevented from passing through the network.

Traffic classification means classifying and identifying network traffic, with the result being an application, a service type, or the like; it therefore has strong potential in fields such as network security management, network optimization and traffic engineering. However, with the increasing use of encryption and other evasion techniques, traditional port-based and packet-payload-based techniques are becoming less effective. In recent years, to solve the classification of encrypted network traffic, researchers have proposed a series of techniques and schemes. Most of them combine raw features or semantically rich statistical features extracted from preprocessed encrypted traffic with machine learning methods to train an efficient classifier, and they show better performance.

However, as users' concern for privacy grows, more and more of them use anonymous networks such as Tor and I2P to protect their privacy. Tor is the most typical anonymous communication system based on onion routing: traffic is forwarded through multiple layers of proxies, which hides the identities of senders and recipients. The Tor network also uses multiplexing, so that traffic originally bound for different destinations is carried over the same links. As a result, the traffic patterns generated by anonymous Tor applications are more complex and overlapping, and models trained with the features extracted by existing schemes as classifier input cannot effectively distinguish anonymous application Tor traffic.

There is therefore a need for a more efficient traffic feature extraction method that distinguishes anonymous Tor applications.

Disclosure of Invention

The purpose of the invention is as follows: the invention aims to solve the defects in the prior art and provides an MFD chromatographic feature extraction method and system for distinguishing anonymous Tor application flow.

The technical scheme is as follows: the invention discloses an MFD chromatographic feature extraction method for distinguishing anonymous Tor application traffic, which comprises the following steps:

Step (1), collecting the network traffic generated by the target applications in a Tor network to form a target traffic set, dividing the target traffic set into a plurality of streams by using a sliding time window with a fixed step length, removing inactive streams and noise from the streams by a double-threshold preprocessing strategy, and then extracting the packet size sequence S and the packet direction sequence D of each stream I. The sliding time window T is typically set to 10-60 s; to ensure a sufficient number of training samples, the shorter the duration of the originally collected traffic, the smaller T should be set, and vice versa.

Step (2), grouping the packet size sequence and the packet direction sequence obtained in step (1) according to packet size (that is, the directions of packets with the same size are placed in the same group), and calculating the number-ratio distribution M_size, the frequency distribution F_size and the direction distribution D_size of packets of different sizes according to the window length of the sliding time window and the total number of packets sent in the window;

Step (3), fusing the three distributions obtained in step (2) into MFD features, clustering the MFD features with a spectral clustering algorithm, and selecting the optimal cluster number according to the distribution of different types of MFD features and of the same type of MFD features in each cluster; for each cluster, randomly deleting MFD features of other types so that the proportion of the dominant type of MFD features in the cluster stays above a certain threshold (which can be set to 90% to ensure accuracy);

Step (4), mapping the MFD features processed in step (3) into the RGB color space by a grid-graph-based visualization method, then performing image compression and format conversion and storing the result as the MFD chromatographic feature, so as to visually show the differentiated patterns of anonymous application Tor traffic and to serve subsequent classification.

Further, the specific process of the step (1) is as follows:

First, packets in the target traffic NT that use a non-TCP protocol, and packets whose size is not within the [1, 1500] interval, are removed.

Next, a splitting threshold δ and a denoising threshold τ are set. δ specifies the minimum number of packets a stream must contain, and τ specifies the maximum proportion of background traffic packets a stream may contain, which gives the method a certain robustness to background traffic. To increase the number of samples in the data set and prevent overfitting, NT is split into a plurality of streams every T seconds by a sliding time window T with a fixed step length. For each stream, it is judged whether the number of packets exceeds δ; if not, the stream is deleted. If so, it is further judged whether the proportion of background traffic packets exceeds τ, and if it does, the stream is deleted.

Finally, for each stream $I = [p^1, p^2, \ldots, p^i, \ldots, p^{N-1}, p^N]$ of length N, the packet size sequence $S = [s^1, s^2, \ldots, s^i, \ldots, s^{N-1}, s^N]$ and the packet direction sequence $D = [d^1, d^2, \ldots, d^{N-1}, d^N]$ are extracted, where $p^i$ denotes the i-th packet in stream I and $i \in [1, N]$.
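The preprocessing of step (1) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the packet tuple layout, the function name `split_and_filter`, and the separate `window` and `step` parameters are assumptions introduced here, and the direction is encoded as a hypothetical 0/1 flag.

```python
from typing import List, Tuple

# Hypothetical packet record (not defined in the source):
# (timestamp_s, size_bytes, direction_flag, is_tcp, is_background)
Packet = Tuple[float, int, int, bool, bool]


def split_and_filter(packets: List[Packet], window: float, step: float,
                     delta: int, tau: float) -> List[Tuple[List[int], List[int]]]:
    """Return one (S, D) pair for every stream that passes both thresholds."""
    # Keep only TCP packets whose size lies in [1, 1500], ordered by time.
    kept = sorted(p for p in packets if p[3] and 1 <= p[1] <= 1500)
    if not kept:
        return []
    t0, t_end = kept[0][0], kept[-1][0]
    streams = []
    k = 0
    while t0 + k * step <= t_end:
        start = t0 + k * step
        stream = [p for p in kept if start <= p[0] < start + window]
        # Threshold 1 (delta): drop inactive streams with too few packets.
        if len(stream) > delta:
            ratio = sum(1 for p in stream if p[4]) / len(stream)
            # Threshold 2 (tau): drop streams dominated by background traffic.
            if ratio <= tau:
                sizes = [p[1] for p in stream]
                directions = [p[2] for p in stream]
                streams.append((sizes, directions))
        k += 1
    return streams
```

Setting `step` equal to `window` gives non-overlapping slices; a smaller `step` yields overlapping windows and therefore more samples.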

Further, the specific process of the step (2) is as follows:

A packet size-to-number map Map_sn and a packet size-to-average-direction map Map_sd are created, and the packet size sequence S and the packet direction sequence D extracted in step (1) are traversed. If a key of size $s^i$ exists in Map_sn and Map_sd, the value of key $s^i$ in Map_sn is increased by 1, and the value of key $s^i$ in Map_sd either remains unchanged or is increased by 1, depending on whether the packet is sent or received. If key $s^i$ exists in neither Map_sn nor Map_sd, both values are initialized to 0.

Here $s^i$ denotes the i-th element of the packet size sequence $S = [s^1, s^2, \ldots, s^{N-1}, s^N]$, that is, the size of the i-th packet $p^i$ in stream I.

Then, from the total number N of packets sent in the window, the window size T, and the maps Map_sn and Map_sd, the number-ratio distribution M_size, the frequency distribution F_size and the direction distribution D_size of packets of different sizes are calculated. These three distributions represent the differentiated patterns of the Tor traffic of different anonymous applications.
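The map construction and the three distributions can be illustrated with a short sketch. The formulas for the distributions are not reproduced in this text, so the ones below are assumptions inferred from the description: M_size as count divided by N, F_size as count divided by T, and D_size as the share of packets of that size whose direction flag is 1. The function name and the 0/1 direction encoding are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def mfd_distributions(sizes: List[int], directions: List[int], window: float
                      ) -> Tuple[Dict[int, float], Dict[int, float], Dict[int, float]]:
    """Compute assumed M_size, F_size and D_size for one stream."""
    map_sn = defaultdict(int)  # packet size -> number of packets
    map_sd = defaultdict(int)  # packet size -> packets whose direction flag is 1
    for s, d in zip(sizes, directions):
        map_sn[s] += 1
        map_sd[s] += d  # unchanged for flag 0, incremented for flag 1
    n = len(sizes)
    # Assumed definitions; the source formulas are not available here.
    m_size = {s: c / n for s, c in map_sn.items()}          # share of all packets
    f_size = {s: c / window for s, c in map_sn.items()}     # packets per second
    d_size = {s: map_sd[s] / c for s, c in map_sn.items()}  # average direction
    return m_size, f_size, d_size
```

For example, a 2-second window with sizes `[100, 1448, 1448, 1448]` and direction flags `[1, 0, 1, 1]` gives, for size 1448, a number ratio of 0.75 and a frequency of 1.5 packets per second under these assumed definitions.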

Further, the specific process of step (3) is as follows:

First, M_size, F_size and D_size are serially combined into the MFD feature. Taking the MFD features as nodes and the similarity between MFD features as weighted undirected edges, an adjacency matrix E and a Laplacian matrix L = G - E are constructed, and L is normalized to obtain L_nor.

The similarity Sim is calculated with a Gaussian kernel function, where MFD_a and MFD_b are the MFD feature vectors of two streams and σ is the bandwidth, which controls the local range of influence of Sim. G is the degree matrix, and each diagonal element of G is the sum of the corresponding row of matrix E;
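A minimal pure-Python sketch of the matrix construction follows. It makes several assumptions that the text does not confirm: the Gaussian kernel is taken in its common form exp(-||a - b||^2 / (2σ^2)), self-similarities are excluded from E, and the normalization is the symmetric one, G^(-1/2) L G^(-1/2). The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def gaussian_similarity(a: List[float], b: List[float], sigma: float) -> float:
    """Assumed Gaussian kernel: exp(-||a - b||^2 / (2 * sigma^2))."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def normalized_laplacian(features: List[List[float]], sigma: float) -> Matrix:
    q = len(features)
    # Adjacency matrix E: pairwise similarities, no self-loops (assumption).
    e = [[0.0 if i == j else gaussian_similarity(features[i], features[j], sigma)
          for j in range(q)] for i in range(q)]
    # Degree matrix G, stored as its diagonal: row sums of E.
    g = [sum(row) for row in e]
    # L = G - E, then L_nor = G^(-1/2) L G^(-1/2) (assumed normalization).
    l_nor = [[0.0] * q for _ in range(q)]
    for i in range(q):
        for j in range(q):
            l_ij = (g[i] if i == j else 0.0) - e[i][j]
            if g[i] > 0 and g[j] > 0:
                l_nor[i][j] = l_ij / math.sqrt(g[i] * g[j])
    return l_nor
```

The eigenvectors of the resulting matrix would normally be obtained with a numerical library; that step is omitted here to keep the sketch dependency-free.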

Then the eigenvectors $F = \{F_1, F_2, \ldots, F_f\}$ corresponding to the f smallest eigenvalues of L_nor are calculated. For each cluster number from 2 to K_max, F is clustered with the K-means algorithm to obtain a probability matrix $B^{Q \times K}$, where Q and K denote the number of samples and the number of clusters, respectively, and $b_{q,k}$ is the probability that the q-th sample belongs to the k-th cluster.

The optimal cluster number is then selected according to the distribution of different types of MFD features and of the same type of MFD features in each cluster, namely by a maximization formula in which $\omega_q$ represents the proportion of samples of type q in the training set and $\max(b_{q,k})$ is the maximum value of the q-th row of matrix B, that is, the maximum probability that sample q belongs to a certain cluster.

Finally, the MFD features of other types in each cluster are randomly deleted so that the proportion of the dominant type of MFD features stays above a certain threshold.
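The final filtering of a cluster can be sketched as below. The text only states that other types are deleted at random until the dominant type exceeds a set proportion; the function name, the `seed` parameter and the choice to delete as few samples as possible are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import List


def filter_cluster(labels: List[str], min_ratio: float = 0.9, seed: int = 0) -> List[int]:
    """Return the indices of the samples kept in one cluster."""
    dominant, dom_count = Counter(labels).most_common(1)[0]
    others = [i for i, lab in enumerate(labels) if lab != dominant]
    # Find the largest number of minority samples that still keeps the
    # dominant share at or above min_ratio.
    keep = len(others)
    while keep > 0 and dom_count / (dom_count + keep) < min_ratio:
        keep -= 1
    rng = random.Random(seed)
    rng.shuffle(others)  # random choice of which minority samples survive
    kept_others = set(others[:keep])
    return [i for i, lab in enumerate(labels)
            if lab == dominant or i in kept_others]
```

With nine samples of one type and three of another, a 90% threshold keeps all nine dominant samples and exactly one of the others.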

Further, the specific process of the step (4) is as follows:

For the three distributions M_size, F_size and D_size extracted in each time window in step (3), to bring all three onto the same scale and to prevent a few excessively large values from dominating the normalization, the original distributions are first linearly transformed with a robust normalization method so that the results fall into the [0, 255] interval.

In this transformation, M_size_h, F_size_h and D_size_h denote the h-th value of the three distributions, median denotes the median of a distribution, and IQR denotes the range between the 1st quartile (25%) and the 3rd quartile (75%) of a distribution;
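As a rough illustration of robust normalization, the sketch below centers each value on the median and divides by the IQR. How the patent maps the result onto [0, 255] is not reproduced in this text, so the clipping range `clip` and the final linear rescaling are assumptions of this sketch, as is the function name.

```python
import statistics
from typing import List


def robust_to_byte(values: List[float], clip: float = 2.0) -> List[int]:
    """Map values to integers in [0, 255] via a median/IQR robust score."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # 25%, 50%, 75% cut points
    iqr = q3 - q1
    if iqr == 0:
        # Degenerate distribution: every value maps to the middle of the range.
        return [128 for _ in values]
    out = []
    for v in values:
        z = (v - med) / iqr            # robust score
        z = max(-clip, min(clip, z))   # assumed clipping of extreme scores
        out.append(round((z + clip) / (2 * clip) * 255))
    return out
```

Because the median and IQR are insensitive to a few extreme values, a single very large entry shifts the other outputs far less than it would under min-max scaling.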

Then a packet size-color dictionary $SC\_dic = \{sc^1, sc^2, \ldots, sc^{M-1}, sc^M\}$ is created. SC_dic stores the mapping from packet size to color (one color for each packet size, $sc^m = s^m : color^m$), and the dictionary has the same size as the distributions;

Next, a square picture containing 1500 grids is created and the grids are numbered sequentially from left to right and from top to bottom. The number j corresponds one-to-one to a packet size, with j ∈ [1, 1500]. SC_dic is then traversed: if key j exists in SC_dic, the grid with that number is colored with the color code stored in SC_dic(j); if key j does not exist in SC_dic, a gentle coloring scheme is used and the color for j is taken from a default color array, namely SC_dic(j) = hex(crest[ceil(j/1499 × 255)]), until all grids are colored;

Here the crest array contains the RGB values of 256 colors of different depths, and ceil denotes rounding up;

and finally, adjusting the picture to be in a proper size according to the residual condition of the storage space of the equipment and storing the picture as the MFD chromatographic characteristic of the anonymous application Tor flow.
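The grid coloring can be sketched as follows. The text does not say how 1500 grids are arranged in the square picture, so the 30 x 50 layout is an assumption, and a gray ramp stands in for the default 256-color array, whose actual values are not given. The index is clamped because ceil(1500/1499 × 255) evaluates to 256, one past the end of a 256-entry array.

```python
import math
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]
ROWS, COLS = 30, 50  # hypothetical layout: 30 x 50 = 1500 cells

# Placeholder for the default 256-color array; a gray ramp is assumed here.
DEFAULT_COLORS: List[RGB] = [(v, v, v) for v in range(256)]


def build_grid(sc_dic: Dict[int, RGB]) -> List[List[RGB]]:
    """Color cells 1..1500 from sc_dic, falling back to the default array."""
    grid = []
    for row in range(ROWS):
        cells = []
        for col in range(COLS):
            j = row * COLS + col + 1  # numbered left to right, top to bottom
            if j in sc_dic:
                cells.append(sc_dic[j])
            else:
                # Gentle fallback color; clamp because j = 1500 yields 256.
                idx = min(255, math.ceil(j / 1499 * 255))
                cells.append(DEFAULT_COLORS[idx])
        grid.append(cells)
    return grid
```

In this layout the cell for packet size 1448 sits at row 28, column 47 (zero-based), and its color is the (R, G, B) triple formed from the three normalized distribution values for that size.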

The invention also discloses an MFD chromatographic characteristic extraction system for distinguishing the anonymous Tor application flow, which comprises a flow acquisition module, a flow distribution calculation module, an MFD chromatographic characteristic extraction module and an anonymous application Tor flow classification module;

the traffic acquisition module collects and aggregates traffic by port mirroring at each key node of the experimental environment and of the data center network equipment; after preprocessing, anonymous Tor application labels are assigned to the traffic from the experimental environment, which is stored on the physical device together with the data center traffic;

the flow distribution calculation module splits the original flow according to the data packet time stamp, extracts the data packet size sequence and the data packet direction sequence of the split flow, and calculates the number proportion distribution, the frequency distribution and the direction distribution of the data packets with different sizes according to the sequence;

the MFD chromatographic characteristic extraction module fills three distributions obtained by the flow distribution calculation module as RGB three colors into a square picture containing 1500 grids, and names the picture by using an anonymous application Tor flow label;

and the anonymous application Tor traffic classification module takes the extracted MFD chromatographic features as the input of a machine learning model, tunes the model structure to the optimum according to the MFD chromatographic features, and finally classifies the anonymous application Tor traffic with the model.

A classification model is obtained by performing the above steps on the target traffic data from the experimental environment and training a deep learning classifier on the result. The actual target traffic data from the data center network equipment is then processed to obtain the corresponding MFD features, which are input into the trained classification model to output the actual classification.

The MFD features are not tied to deep learning and can be used with any machine learning model. With a random forest (RF), for example, the processed MFD features are used as the RF input, sample subsets and MFD feature subsets are randomly selected, attributes are split according to an information gain strategy, multiple decision trees are built to form the random forest, and the model structure is tuned to the optimum by cross-validation.

The invention also discloses a computer storage medium, wherein an MFD chromatographic characteristic extraction program for distinguishing the anonymous Tor application flow is stored in the computer storage medium, and the MFD chromatographic characteristic extraction method for distinguishing the anonymous Tor application flow is realized when the program is executed.

Beneficial effects: targeting the differences in data distribution among the Tor traffic of different anonymous applications, and combining the multiplexing characteristic of the Tor network with the need for effective input features in fine-grained Tor traffic classification, the invention provides an MFD chromatographic feature extraction method for distinguishing anonymous Tor application traffic. By observing and analyzing the packet size and direction sequences of anonymous application Tor traffic within a fixed time window, the method extracts the number-ratio distribution, the frequency distribution and the direction distribution of packets of different sizes and fuses them into MFD features. A spectral clustering algorithm is used to cluster and filter the MFD features, and the distribution of the features within and between clusters shows the separability of the MFD features for different types of anonymous application Tor traffic. Finally, a grid-graph-based visualization method maps the MFD features into the RGB color space to obtain the MFD chromatographic features, which visually show the differentiated patterns of anonymous application Tor traffic and make it possible to distinguish that traffic.

Drawings

FIG. 1 is a schematic overall flow diagram of the present invention;

FIG. 2 is a schematic diagram of a system in an embodiment;

FIG. 3 is a flowchart of the MFD chromatographic feature extraction module of the embodiment;

FIG. 4 is a flow diagram of the operation of the anonymous application Tor traffic classification module in an embodiment.

Detailed Description

The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.

The method combines the multiplexing characteristic of the Tor network with the need for effective input features in fine-grained Tor traffic classification, and extracts differentiated features from the anonymous application Tor traffic generated by network users. First, a target Tor traffic set is collected in an experimental environment and, after preprocessing, split into a plurality of traffic slices by a fixed time window. For each slice, the packet size and direction sequences are extracted, and the number-ratio distribution, frequency distribution and direction distribution of packets of different sizes are calculated and fused into MFD features. The MFD features are then clustered and filtered with a spectral clustering algorithm, and their distribution within and between clusters shows their separability for different types of anonymous application Tor traffic. Finally, the MFD features are mapped into the RGB color space by a grid-graph-based visualization method to obtain the MFD chromatographic features, which visually show the differentiated patterns of anonymous application Tor traffic. This overcomes the inability of the original features to distinguish different anonymous application Tor traffic and their low interpretability, and achieves an effective feature representation of anonymous application Tor traffic.

Example 1:

as shown in fig. 1, the detailed process of the MFD chromatographic feature extraction method for distinguishing anonymous Tor application flow according to the present embodiment is as follows:

S101: Network traffic generated by the target applications in the Tor network is collected to form a target traffic set, which is divided into a plurality of streams by a sliding time window with a fixed step length. A double-threshold preprocessing strategy is used to discard inactive streams and noise, and the packet size sequence and packet direction sequence of each stream are extracted. The specific steps are as follows:

First, packets in the target traffic NT that use a non-TCP protocol, and packets whose size is not in the [1, 1500] interval, are removed. A splitting threshold δ and a denoising threshold τ are set, where δ specifies the minimum number of packets a stream must contain and τ specifies the maximum proportion of background traffic packets a stream may contain. NT is divided into a plurality of streams every T seconds by a sliding time window with a fixed step length. For each stream, it is judged whether the number of packets exceeds δ; if not, the stream is deleted. If so, it is further judged whether the proportion of background traffic packets exceeds τ, and if it does, the stream is deleted. Finally, for each stream $I = [p^1, p^2, \ldots, p^i, \ldots, p^{N-1}, p^N]$ of length N, the packet size sequence $S = [s^1, s^2, \ldots, s^i, \ldots, s^{N-1}, s^N]$ and the packet direction sequence $D = [d^1, d^2, \ldots, d^{N-1}, d^N]$ are extracted, where $p^i$ denotes the i-th packet in stream I;

S102: The packet size sequence and the packet direction sequence from S101 are grouped by size, and the number-ratio distribution, frequency distribution and direction distribution of packets of different sizes are calculated according to the window length and the total number of packets sent in the window, specifically as follows:

A packet size-to-number map Map_sn and a packet size-to-average-direction map Map_sd are created, and the packet size sequence S and the packet direction sequence D extracted in S101 are traversed. If a key of size $s^i$ exists in Map_sn and Map_sd, the value of key $s^i$ in Map_sn is increased by 1, while the value of key $s^i$ in Map_sd either remains unchanged or is increased by 1, depending on whether the direction is sending or receiving. If key $s^i$ does not exist, both values are initialized to 0.

Then, from the total number N of packets sent in the window, the window size T, and the maps Map_sn and Map_sd, the number-ratio distribution M_size, the frequency distribution F_size and the direction distribution D_size of packets of different sizes are calculated.

S103: The three distributions obtained in S102 are fused into MFD features, the MFD features are clustered with a spectral clustering algorithm, and the optimal cluster number is selected according to the distribution of different types of MFD features and of the same type of MFD features in each cluster. For each cluster, the MFD features of other types are randomly deleted so that the proportion of the dominant type of MFD features in the cluster stays above a certain threshold, specifically as follows:

The similarity Sim is calculated with a Gaussian kernel function, where MFD_a and MFD_b are the 1-row, u-column MFD feature vectors of two streams and σ is the bandwidth, which controls the local range of influence of Sim. G is the degree matrix, and each diagonal element of G is the sum of the corresponding row of matrix E;

The optimal cluster number is selected according to the distribution of different types of MFD features and of the same type of MFD features in each cluster, namely by a maximization formula in which $\omega_q$ represents the proportion of samples of type q in the training set and $\max(b_{q,k})$ is the maximum value of the q-th row of matrix B, that is, the maximum probability that sample q belongs to a certain cluster;

S104: the MFD characteristics obtained in the step S103 are mapped into an RGB color space by adopting a grid diagram-based visualization method, and then are stored as MFD chromatographic characteristics after image compression and format conversion, so that a differentiation mode of anonymously applying Tor flow is visually shown and used for subsequent classification, and the method specifically comprises the following steps:

For the three distributions M_size, F_size and D_size extracted in each time window in S103, to bring all three onto the same scale and to prevent a few excessively large values from dominating the normalization, the original distributions are first linearly transformed with a robust normalization method so that the results fall into the [0, 255] interval.

In this transformation, M_size_h, F_size_h and D_size_h denote the h-th value of the three distributions, median denotes the median of a distribution, and IQR denotes the range between the 1st quartile (25%) and the 3rd quartile (75%) of a distribution.

Then a packet size-color dictionary $SC\_dic = \{sc^1, sc^2, \ldots, sc^{M-1}, sc^M\}$ is created to store the mapping between packet size and color; the dictionary has the same size as the distributions.

Next, a square picture containing 1500 grids is created and the grids are numbered sequentially from left to right and from top to bottom. The number j corresponds one-to-one to a packet size, with j ∈ [1, 1500]. SC_dic is then traversed: if key j exists in SC_dic, the grid with that number is colored with the color code stored in SC_dic(j); if key j does not exist in SC_dic, a gentle coloring scheme is used and the color at the position corresponding to j is taken from a default color array to color the grid, namely SC_dic(j) = hex(crest[ceil(j/1499 × 255)]), until all grids are colored. Here the crest array contains the RGB values of 256 colors of different depths, and ceil denotes rounding up.

And finally, adjusting the picture to a proper size according to the residual condition of the storage space of the equipment and storing the picture as the MFD chromatographic characteristic of the anonymous application Tor flow.

The detailed process of the data packet size-color dictionary generation algorithm comprises the following steps:

Example 2:

as shown in fig. 2, the system for implementing the MFD chromatographic feature extraction method for distinguishing anonymous Tor application traffic in this embodiment includes a traffic collection module 100, a traffic distribution calculation module 200, an MFD chromatographic feature extraction module 300, and an anonymous application Tor traffic classification module 400.

The traffic collection module 100 collects and aggregates traffic by port mirroring at each key node of the experimental environment and of the data center network equipment. After preprocessing, anonymous Tor application labels are assigned to the traffic from the experimental environment, which is stored on the physical device together with the data center traffic.

The flow distribution calculation module 200 splits the original flow according to the data packet timestamp, extracts the size and direction sequence of the data packets of the split flow, and calculates the number proportion distribution, frequency distribution and direction distribution of the data packets with different sizes according to the sequence.

The MFD chromatographic feature extraction module 300 fills the three distributions obtained by the flow distribution calculation module as RGB three colors into a square picture including 1500 grids, and names the picture by anonymously applying the Tor flow label.

The anonymous application Tor traffic classification module 400 takes the extracted MFD chromatographic features as the input of a machine learning model, tunes the model structure to the optimum according to the MFD chromatographic features, and finally classifies the anonymous application Tor traffic with the model.

Example 3:

Based on Embodiment 2, as shown in FIG. 3, the MFD chromatographic feature extraction module 301 of this embodiment first uses a robust normalization method to remove the scale differences among the number-ratio distribution, the frequency distribution and the direction distribution of packets of different sizes, weaken the influence of extreme values, and scale the values into the [0, 255] interval. It then converts the three distributions into color distributions and stores them, together with the corresponding packet sizes, in the packet size-color dictionary. Finally, a square picture containing 1500 grids is created and the grids are numbered 1-1500. For each number, the color at the corresponding position in the packet size-color dictionary is looked up and used to color the grid. If the dictionary holds no color for a number, a gentle coloring scheme is used, that is, the color at position ceil(number/1499 × 255) of the default array crest of 256 colors of different depths is selected as the fill color. Once all grids are colored, the picture is resized to 224 × 224 × 3 according to the remaining storage space of the device and saved as the MFD chromatographic feature of the anonymous application Tor traffic.

Example 4:

On the basis of Embodiment 2, as shown in FIG. 4, the anonymous application Tor traffic classification module 401 of this embodiment works in two stages. In the first stage, Tor traffic of 21 anonymous applications on the Android platform (e.g., WeChat, bilibili) is collected in an experimental environment and divided by a time window of 15 s. The packets in each stream are traversed and grouped by packet size, and the three distributions M_size, F_size and D_size of packets of different sizes are calculated (e.g., 0.88, 148.8 and 0.99 for bilibili packets of size 1448). The MFD chromatographic feature of each stream is extracted and assigned an application label, and is then input into a pretrained convolutional neural network model, ResNet50, to extract deeper features. The model parameters are fine-tuned in the process: the classification loss is minimized with the mini-batch gradient descent algorithm (MBGD), and the parameters are updated by back-propagation until the model converges. The last fully connected layer and the softmax classification layer are then removed and the model is saved as M_base, which retains only the convolutional and pooling layers for feature extraction.

In the second stage, 200 traffic samples are randomly drawn each time from the data set collected in the experiment and 1800 traffic samples from the data set collected in the data center network, and the model M_base trained in the first stage is used to extract their feature vectors. A graph G = (Node, Edge) is defined, where Node is the vertex set: each vertex represents the feature vector of one stream, every feature vector is numbered, and Node is the set of all numbers. Edge is the edge set, which records that two vertices are associated. Whether two vertices are connected is determined by the cosine similarity cos_sim of their feature vectors, where $X^A$ and $X^B$ denote the feature vectors of vertex A and vertex B, respectively.

Let ε be the accuracy of the ResNet50 model. If the cos_sim of two nodes is greater than ε, a connection is judged to exist between them and the pair of their numbers is stored in the edge set Edge. An attention mechanism is then used to obtain the attention score of every pair of vertices; for each vertex, the feature vectors of all its first-order neighbors are aggregated, with the attention scores as weights, to update the feature vector of that vertex. Finally, the aggregated feature vectors are classified with a machine learning classifier.
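The edge construction of the second stage can be sketched as follows, using the standard definition of cosine similarity (dot product divided by the product of the vector norms). The function names and the use of list indices as vertex numbers are assumptions of this sketch; the attention-based aggregation is not shown.

```python
import math
from typing import List, Set, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(vectors: List[List[float]], epsilon: float) -> Set[Tuple[int, int]]:
    """Connect every pair of vertices whose similarity is greater than epsilon."""
    edges = set()
    for a in range(len(vectors)):
        for b in range(a + 1, len(vectors)):
            if cosine_similarity(vectors[a], vectors[b]) > epsilon:
                edges.add((a, b))  # store the pair of vertex numbers
    return edges
```

A higher ε yields a sparser graph, so tying ε to the accuracy of the feature extractor means a more accurate extractor demands closer agreement before two streams are linked.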

The related experiments prove that:

the classification accuracy of the present invention is 90.9% and 88.9% on the UNB ISCXtor data set and the self-collected data set, respectively, which exceeds existing solutions (e.g., those proposed by Petagna E et al. and Shapira T et al.).

That is to say, to address the problems that the features used by existing traffic classification methods cannot effectively distinguish different anonymous Tor applications and have low interpretability, the invention extracts the MFD chromatographic features of anonymous Tor traffic. These features visually show how anonymous Tor traffic differs between terminal applications, overcome the inability of existing features to express Tor traffic patterns effectively, and serve as the input of a traffic classification model that effectively distinguishes different anonymous Tor applications.

Claims

1. An MFD chromatographic feature extraction method for distinguishing anonymous Tor application flow, characterized in that the method comprises the following steps:

step (1), collecting the network traffic generated by target applications in a Tor network to form a target traffic set, dividing the target traffic set into a plurality of streams by using a sliding time window with a fixed step length, removing inactive streams and noise from the streams by adopting a double-threshold preprocessing strategy, and then extracting a data packet size sequence S and a data packet direction sequence D of each stream I;

step (2), grouping the size sequence and the direction sequence of the data packets obtained in step (1) according to the size of the data packets, and calculating the number-proportion distribution M_size, the frequency distribution F_size and the direction distribution D_size of data packets of different sizes according to the window length of the sliding time window and the total number of data packets sent in the window;

step (3), fusing the three distributions obtained in step (2) into MFD features, clustering the MFD features by using a spectral clustering algorithm, and selecting the optimal cluster number according to how MFD features of different classes and of the same class are distributed across the clusters; for each cluster, randomly deleting MFD features of other classes so that the MFD features of the largest class in the cluster stay above a certain proportion;

2. The MFD chromatographic feature extraction method for differentiating anonymous Tor application traffic as recited in claim 1, wherein: the specific process of the step (1) is as follows:

first, removing from the target traffic NT all data packets that do not use the TCP protocol and all data packets whose size is not within the interval [1, 1500];

then setting a splitting threshold δ, which is the minimum number of data packets a stream must contain, and a denoising threshold τ, which is the maximum proportion of background-traffic data packets a stream may contain;

dividing NT into a plurality of streams, one every T seconds, by using a sliding time window T with a fixed step length; for each stream, judging whether its number of data packets exceeds δ and deleting the stream if it does not; otherwise judging whether its proportion of background-traffic data packets exceeds τ and deleting the stream if it does;

finally, for each stream $I = [p^1, p^2, \ldots, p^i, \ldots, p^{N-1}, p^N]$ of length N, extracting the data packet size sequence $S = [s^1, s^2, \ldots, s^i, \ldots, s^{N-1}, s^N]$ and the data packet direction sequence $D = [d^1, d^2, \ldots, d^{N-1}, d^N]$, where $p^i$ denotes the i-th packet in stream I and $i \in [1, N]$.
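The preprocessing of this claim can be sketched as follows. This is a minimal sketch under stated assumptions: the packet record layout (keys `ts`, `size`, `dir`, `proto`, `background`), the function name `split_and_filter`, the 0/1 direction encoding and the use of non-overlapping windows (step length equal to the window length) are all hypothetical choices made for illustration.

```python
def split_and_filter(packets, window, delta, tau):
    """Dual-threshold preprocessing sketch.

    packets: list of dicts with keys 'ts' (seconds), 'size', 'dir',
             'proto' and 'background' (a hypothetical record layout).
    Returns one (S, D) pair per surviving stream.
    """
    # Keep only TCP packets whose size lies in [1, 1500].
    kept = [p for p in packets if p["proto"] == "TCP" and 1 <= p["size"] <= 1500]
    if not kept:
        return []
    kept.sort(key=lambda p: p["ts"])
    start = kept[0]["ts"]

    # Assumption: non-overlapping windows, one stream per `window` seconds.
    streams = {}
    for p in kept:
        idx = int((p["ts"] - start) // window)
        streams.setdefault(idx, []).append(p)

    result = []
    for idx in sorted(streams):
        stream = streams[idx]
        if len(stream) <= delta:  # packet count does not exceed delta: drop
            continue
        bg_ratio = sum(1 for p in stream if p["background"]) / len(stream)
        if bg_ratio > tau:  # too much background traffic: drop
            continue
        sizes = [p["size"] for p in stream]
        directions = [p["dir"] for p in stream]
        result.append((sizes, directions))
    return result


# Hypothetical capture: three TCP packets in the first window, one UDP packet
# that is filtered out, and a lone TCP packet in the second window.
packets = [
    {"ts": 0.0, "size": 1448, "dir": 1, "proto": "TCP", "background": False},
    {"ts": 1.0, "size": 60, "dir": 0, "proto": "TCP", "background": False},
    {"ts": 2.0, "size": 1448, "dir": 1, "proto": "TCP", "background": False},
    {"ts": 3.0, "size": 500, "dir": 1, "proto": "UDP", "background": False},
    {"ts": 16.0, "size": 60, "dir": 0, "proto": "TCP", "background": False},
]
streams = split_and_filter(packets, window=15, delta=2, tau=0.5)
```

In this toy capture the second window holds a single packet, which does not exceed delta, so only the first stream survives.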

3. The MFD chromatographic feature extraction method for differentiating anonymous Tor application traffic as recited in claim 1, wherein: the specific process of the step (2) is as follows:

creating a packet size-to-number map $Map_{sn}$ and a packet size-to-average-direction map $Map_{sd}$, and traversing the data packet size sequence S and the data packet direction sequence D extracted in step (1): if $Map_{sn}$ and $Map_{sd}$ contain a key of size $s^i$, the value of key $s^i$ in $Map_{sn}$ is incremented by 1, and the value of key $s^i$ in $Map_{sd}$ is either left unchanged or incremented by 1, depending on whether the packet is sent or received; if neither $Map_{sn}$ nor $Map_{sd}$ contains the key $s^i$, both values are initialized to 0, i.e.:

then, from the total number N of data packets sent in the window, the window size T, and the maps $Map_{sn}$ and $Map_{sd}$, calculating the number-proportion distribution M_size, the frequency distribution F_size and the direction distribution D_size of data packets of different sizes.
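The formulas for the three distributions are not reproduced in this text, so the following Python sketch rests on assumed definitions: for each packet size s, M_size is taken as count(s) / N, F_size as count(s) / T, and D_size as the fraction of packets of size s whose direction flag is 1. The function name `mfd_distributions` and the 0/1 direction encoding are likewise assumptions made for illustration.

```python
def mfd_distributions(sizes, directions, window):
    """Compute (M_size, F_size, D_size) as dicts keyed by packet size.

    sizes:      packet size sequence S of one stream
    directions: packet direction sequence D, encoded as 0 or 1 (assumed encoding)
    window:     window length T in seconds
    """
    map_sn = {}  # packet size -> number of packets of that size
    map_sd = {}  # packet size -> number of packets of that size with direction 1
    for size, direction in zip(sizes, directions):
        if size not in map_sn:
            map_sn[size] = 0
            map_sd[size] = 0
        map_sn[size] += 1
        map_sd[size] += direction  # unchanged for 0, incremented for 1

    n = len(sizes)
    m_size = {s: c / n for s, c in map_sn.items()}        # assumed: share of all packets
    f_size = {s: c / window for s, c in map_sn.items()}   # assumed: packets per second
    d_size = {s: map_sd[s] / map_sn[s] for s in map_sn}   # assumed: average direction
    return m_size, f_size, d_size


# Hypothetical stream with four packets in a 15 s window.
m_size, f_size, d_size = mfd_distributions(
    [1448, 60, 1448, 1448], [1, 0, 1, 0], window=15
)
```

Keeping the three results as dictionaries keyed by packet size makes it straightforward to line them up per size when they are later fused into one MFD feature.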

4. The MFD chromatographic feature extraction method for differentiating anonymous Tor application traffic as recited in claim 1, wherein: the specific process of the step (3) is as follows:

Wherein the similarity calculation formula is a Gaussian kernel function

then calculating the eigenvectors $F = \{F_1, F_2, \ldots, F_f\}$ corresponding to the f smallest eigenvalues of $L_{nor}$; for each cluster number from 2 to $K_{max}$, clustering F by using the K-means algorithm to obtain a probability matrix $B^{Q \times K}$, where Q and K denote the number of samples and the number of clusters, respectively, and $b_{q,k}$ is the probability that the q-th sample belongs to the k-th cluster,

5. The MFD chromatographic feature extraction method for differentiating anonymous Tor application traffic as recited in claim 1, wherein: the specific process of the step (4) is as follows:

for the three distributions M_size, F_size and D_size extracted in each time window from step (3), first linearly transforming the original distributions by using a robust normalization method so that the results fall into the interval [0, 255], that is:

where M_size_h, F_size_h and D_size_h denote the h-th values of the three distributions, median denotes the median of a distribution, and IQR denotes the range between the 1st quartile and the 3rd quartile of the distribution;

then creating a packet size-color dictionary $SC\_dic = \{sc^1, sc^2, \ldots, sc^{M-1}, sc^M\}$, where SC_dic stores the mapping between data packet size and color, and the dictionary has the same size as the distributions;
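The normalization formula is not reproduced in this text, so the Python sketch below is only one plausible reading of this claim. It assumes that each value is scaled as (x - median) / IQR, that the scaled value is clipped to a hypothetical bound of ±2 and mapped linearly onto [0, 255], and that the normalized M_size, F_size and D_size values supply the R, G and B channels in that order. The names `robust_to_byte_range`, `size_color_dict` and `CLIP` are illustrative.

```python
import statistics

CLIP = 2.0  # hypothetical clipping bound for the robust-scaled values


def robust_to_byte_range(values, clip=CLIP):
    """Robust-scale values with median and IQR, then map them into [0, 255].

    Needs at least two values, because statistics.quantiles may reject
    shorter inputs.
    """
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = (q3 - q1) or 1.0  # avoid division by zero for constant input
    out = []
    for v in values:
        z = (v - med) / iqr
        z = max(-clip, min(clip, z))  # assumption: outliers saturate
        out.append(round(255 * (z + clip) / (2 * clip)))
    return out


def size_color_dict(m_size, f_size, d_size):
    """Build a packet size -> (R, G, B) dictionary (assumed channel order)."""
    sizes = sorted(m_size)
    red = robust_to_byte_range([m_size[s] for s in sizes])
    green = robust_to_byte_range([f_size[s] for s in sizes])
    blue = robust_to_byte_range([d_size[s] for s in sizes])
    return {s: (red[i], green[i], blue[i]) for i, s in enumerate(sizes)}


# Hypothetical distributions for three packet sizes.
sc_dic = size_color_dict(
    {60: 0.25, 500: 0.25, 1448: 0.5},
    {60: 1.0, 500: 1.0, 1448: 2.0},
    {60: 0.0, 500: 0.5, 1448: 1.0},
)
```

Clipping before the linear map is what makes the median and IQR matter here: a plain min-max rescale after robust scaling would cancel the robust step, whereas with clipping a single extreme value saturates at 0 or 255 without compressing the remaining values.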

6. An MFD chromatographic feature extraction system for distinguishing anonymous Tor application flow, characterized in that: the system comprises a flow acquisition module, a flow distribution calculation module, an MFD chromatographic characteristic extraction module and an anonymous application Tor flow classification module;

7. A computer storage medium, characterized in that: the computer storage medium stores an MFD chromatographic feature extraction program for distinguishing anonymous Tor application flow, and the MFD chromatographic feature extraction program is executed to realize the MFD chromatographic feature extraction method for distinguishing anonymous Tor application flow according to any one of claims 1 to 5.