CN114710310B

CN114710310B - Method and system for recognizing Tor user access website based on network traffic frequency domain fingerprint

Info

Publication number: CN114710310B
Application number: CN202210056538.9A
Authority: CN
Inventors: 罗向阳; 孙玉宸; 王菡; 马照瑞; 李玲玲; 刘粉林
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2022-01-18
Filing date: 2022-01-18
Publication date: 2023-06-09
Anticipated expiration: 2042-01-18
Also published as: CN114710310A

Abstract

The invention belongs to the technical field of the anonymous communication system Tor, and particularly relates to a method and system for identifying the websites accessed by Tor users based on network traffic frequency domain fingerprints. The method extracts the direction and length features of the time-domain cell sequence in the access traffic and combines them into a cell feature sequence; it then applies a discrete wavelet transform to the cell feature sequence, keeps the low-frequency part of the resulting cell frequency domain feature sequence, and removes the high-frequency part, which contains noise. Using the discrete wavelet transform as the frequency domain processing method effectively reduces the influence of the noise generated while the user accesses the website. A deep learning classification model combining CNN, FC and Self-Attention layers discovers the intrinsic relations within the frequency domain feature sequences and efficiently identifies and classifies the access traffic, and several regularization techniques are used in the model to prevent overfitting during training. The invention can improve the accuracy of website fingerprint identification to a certain extent.

Description

Method and system for recognizing Tor user access website based on network traffic frequency domain fingerprint

Technical Field

The invention belongs to the technical field of the anonymous communication system Tor, and particularly relates to a method and system for identifying the websites accessed by Tor users based on network traffic frequency domain fingerprints.

Background

Driven by profit, a large number of network intrusions take place on the Internet. Tor is currently the most popular anonymous communication system, providing privacy services to more than 2 million users every day. Tor protects the anonymity of user access by establishing an encrypted circuit through three relays. These relays are selected at random, and the circuits are replaced periodically while the client accesses the server. Although it is very difficult to break the Tor anonymous communication system directly, previous studies have shown that network traffic analysis, in particular the website fingerprinting (WF) attack, can affect the security of Tor. A user generates different network traffic characteristics when accessing each website, such as different numbers of packets and different traffic burst patterns. In a WF attack, law enforcement intercepts traffic and extracts features of the packets in the encrypted connection between the monitored user and the Tor entry node. A classifier then determines whether the intercepted traffic corresponds to a website of interest; a match indicates that the monitored user is visiting that website. WF attacks enable law enforcement to determine whether a monitored user is browsing an illegal website, particularly one that conducts black-market transactions, which is of great importance in fighting crime.

To make the Tor network more secure, researchers have proposed defenses against WF attacks. Their basic principle is to manipulate the packet traffic (for example by adding, dropping, or delaying packets) so as to obfuscate the traffic features.

Tor was originally designed to provide users with anonymity during data communication. To preserve its security, Tor needs to resist WF attacks as far as possible, which is why defenses against WF attacks have been proposed. For law enforcement, however, a large amount of illegal activity takes place on Tor, so monitoring offenders and illegal websites is necessary, and WF attacks against Tor with defenses deployed therefore require further study. First, these defenses largely eliminate the traffic bursts in the original Tor traffic and obfuscate it, which markedly reduces the effectiveness of WF attacks. It is therefore important to improve recognition accuracy against the defenses that Tor may adopt in the future. Second, onion services are the most secure services provided by Tor, and they host a large number of illicit transactions. WF attacks on Tor traffic that uses onion services are therefore also a concern. Accessing a website through an onion service requires a more complex circuit to be established and involves a more elaborate security authentication mechanism, so the access traffic contains a large amount of noise generated for authentication purposes. Existing methods do not achieve satisfactory fingerprint identification on Tor traffic that uses onion services. Although these methods can discover the behavior patterns of users accessing different websites from traffic features such as timing and direction, none of them reduces the influence of traffic noise on fingerprint identification.

Disclosure of Invention

Existing methods usually extract the traffic features of a user's website visits manually and build a machine learning or deep learning model to classify them, and they classify poorly on the Tor network when defenses are deployed or onion services are used. The invention therefore provides a method and system for identifying the websites accessed by Tor users based on network traffic frequency domain fingerprints, which improves the accuracy of website fingerprint identification to a certain extent.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

The invention provides a method for identifying the websites accessed by Tor users based on network traffic frequency domain fingerprints, which comprises the following steps:

capturing background traffic in the process of accessing a website by a user, and generating an original traffic data packet;

extracting the direction and length information of the cell sequence in the original flow data packet, and combining the direction and length information to form a cell characteristic sequence;

converting the cell characteristic sequence into a cell frequency domain characteristic sequence through discrete wavelet transformation, and reserving a low-frequency sequence generated after the discrete wavelet transformation;

storing the cell frequency domain feature sequence and the corresponding website label into a database;

extracting a cell frequency domain feature sequence and a corresponding website label thereof from a database according to model training requirements, and generating a training sequence matrix and a training label matrix;

constructing a deep learning classification model according to the data type and the characteristics of the traffic;

training the deep learning classification model using the training sequence matrix and the training label matrix, and selecting suitable hyperparameters through training;

extracting a cell frequency domain characteristic sequence to be tested from a database to generate a test sequence matrix;

and predicting the test sequence matrix by using a deep learning classification model, obtaining a website label corresponding to the cell frequency domain feature sequence to be detected, completing the identification of unknown flow, and associating the flow with a website.

Further, the extracting the direction and length information of the cell sequence in the original flow data packet, and combining them to form a cell feature sequence includes:

mapping the original cell sequence onto the value range {+1, -1}, where the direction of data flowing into law enforcement is defined as "+1" and the direction of data flowing out of law enforcement is defined as "-1", and constructing the cell direction sequence Seq_dir;

the client and the server interact through the TCP protocol, so cells that do not contain the TCP protocol are first filtered out, and then the lengths of the TCP-layer cells are extracted to form the cell length sequence Seq_len;

combining the cell direction sequence and the cell length sequence, and constructing the cell feature sequence Seq_mix by multiplying the two sequences element by element, as shown in formula (1):

Seq_mix = Seq_len × Seq_dir (1).

Further, the discrete wavelet transform uses band-pass filters to perform a one-level decomposition of the cell feature sequence, with the downsampling factor Q set to 2, and the sequence is decomposed as shown in formula (2) and formula (3):

where L(k) denotes the low-pass filter, H(k) denotes the high-pass filter, n indexes the cell features in the time domain, k indexes the cell features in the frequency domain, and n and k are variables; processing the cell feature sequence in the frequency domain according to formula (2) and formula (3) yields a low-frequency sequence x_{1,L}(n) and a high-frequency sequence x_{1,H}(n); the low-frequency sequence x_{1,L}(n) contains the slowly varying part of the cell feature sequence, which is the basic frame of the sequence and constitutes its approximation information, while the high-frequency sequence x_{1,H}(n) contains the rapidly varying part of the cell feature sequence, which constitutes its detail information and contains noise; therefore the low-frequency sequence x_{1,L}(n) is retained and the high-frequency sequence x_{1,H}(n) is removed.

Further, the deep learning classification model comprises a basic module layer, a fully connected layer and a self-attention mechanism layer; the basic module layer comprises, in order, Conv Layer, Pad, Batch Normalization, ELU or ReLU, Max Pooling, Pad and Dropout; the fully connected layer comprises, in order, FC Layer, Batch Normalization, ReLU and Dropout; the self-attention mechanism layer comprises, in order, Embedding, Self-Attention Layer, Batch Normalization, ReLU, Dropout and Label Smoothing.

Further, Dropout, Batch Normalization and Label Smoothing are regularization techniques that prevent overfitting during model training; Dropout reduces the interaction among hidden nodes and improves the generalization of the model by making each neuron stop working with a certain probability; Batch Normalization normalizes the output so that it follows a standard normal distribution; Label Smoothing drives the classification probabilities produced by the softmax activation function in the neural network toward the correct class.

Further, because the one-dimensional cell frequency domain feature sequences differ in length, a fixed length threshold is set; sequences shorter than the threshold are padded with 0, sequences longer than the threshold are truncated, and all the processed cell frequency domain feature sequences are combined to form the input matrix of the deep learning classification model.

Further, the hyperparameters include Wavelet, Base Model, Number of FC Layers, FC, Self-Attention, Optimizer, Batch Size, and Dropout [Base Model, Self-Attention, FC].

Further, a value range is defined for each hyperparameter; hyperparameters with small value ranges are selected by traversing every candidate value, and hyperparameters with large value ranges are selected by bisection.
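The two selection strategies named above can be sketched as follows. This is only an illustration under stated assumptions: the function names `traverse` and `bisect_search` are hypothetical, the `score` callables are toy stand-ins for a validation metric, and the bisection variant shown assumes the score has a single peak over an integer range, which the text does not state.

```python
def traverse(candidates, score):
    """Try every candidate value and keep the best-scoring one."""
    return max(candidates, key=score)


def bisect_search(low, high, score):
    """Bisection over an integer range [low, high].

    Assumes `score` rises to a single peak and then falls (an assumption of
    this sketch). Each step compares two neighbouring values at the midpoint
    and discards the half that cannot contain the peak.
    """
    while low < high:
        mid = (low + high) // 2
        if score(mid) < score(mid + 1):
            low = mid + 1   # still climbing: the peak lies to the right
        else:
            high = mid      # flat or falling: the peak is at mid or to the left
    return low


# Toy scores (not measurements): small range traversed, large range bisected.
best_batch = traverse([32, 64, 128], score=lambda b: -abs(b - 64))
best_units = bisect_search(1, 1024, score=lambda u: -(u - 96) ** 2)
print(best_batch, best_units)
```

In a real search each call to `score` would train and validate a model, so caching the results would be worthwhile.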

The invention also provides a system for identifying the websites accessed by Tor users based on network traffic frequency domain fingerprints, which comprises:

the flow data packet capturing module is used for capturing background flow in the process of accessing the website by the user and generating an original flow data packet;

the cell characteristic sequence extracting module is used for extracting the direction and length information of the cell sequence in the original flow data packet and combining the direction and length information to form a cell characteristic sequence;

the cell frequency domain feature sequence generation module is used for converting the cell feature sequence into a cell frequency domain feature sequence through discrete wavelet transformation and reserving a low-frequency sequence generated after the discrete wavelet transformation;

the database module is used for storing the cell frequency domain feature sequence and the corresponding website label thereof into a database;

the training set generation module is used for extracting the cell frequency domain feature sequences and the corresponding website labels thereof from the database according to the model training requirements and generating a training sequence matrix and a training label matrix;

the model construction module is used for constructing a deep learning classification model according to the data type and the characteristics of the flow;

the model training module is used for training the deep learning classification model using the training sequence matrix and the training label matrix, and selecting suitable hyperparameters through training;

the test set generation module is used for extracting the cell frequency domain feature sequences to be tested from the database and generating a test sequence matrix;

and the model classification module is used for predicting the test sequence matrix by using the deep learning classification model, acquiring a website label corresponding to the cell frequency domain feature sequence to be detected, completing the identification of unknown flow, and associating the flow with a website.

Compared with the prior art, the invention has the following advantages:

In the method of the invention for identifying the websites accessed by Tor users based on network traffic frequency domain fingerprints (FDF), the direction and length features of the time-domain cell sequence in the access traffic are extracted and combined into a cell feature sequence; a discrete wavelet transform is applied to the cell feature sequence, the low-frequency part of the cell frequency domain feature sequence is kept, and the high-frequency part, which contains noise, is removed. Using the discrete wavelet transform as the frequency domain processing method effectively reduces the influence on fingerprint identification of the noise generated while a user accesses a website, and improves fingerprint identification accuracy. A deep learning classification model combining a convolutional neural network (CNN), a fully connected layer (FC) and a self-attention mechanism layer (Self-Attention) discovers the intrinsic relations within the frequency domain feature sequences and efficiently identifies and classifies the access traffic, and several regularization techniques are used in the model to prevent overfitting during training, which further improves classification accuracy. In particular, in scenarios with defenses or onion services, the additional security mechanisms make noise affect fingerprint identification more strongly, and in these environments the website fingerprint identification accuracy of the invention is markedly better than that of existing methods.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a conventional website fingerprint identification method;

FIG. 2 is a flow chart of the method for identifying the websites accessed by Tor users based on network traffic frequency domain fingerprints according to an embodiment of the present invention;

FIG. 3 is a diagram of the discrete wavelet transform decomposition architecture according to an embodiment of the present invention;

FIG. 4 is a decomposition diagram of a cell feature sequence according to an embodiment of the present invention;

FIG. 5 is a diagram of the deep learning classification model architecture of an embodiment of the present invention;

FIG. 6 is a graph of the training time of different models for embodiments of the present invention;

FIG. 7 is a Precision-Recall graph of an attack in the open world of an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Tor is a worldwide volunteer overlay network consisting of thousands of relays that route Internet traffic. While a user accesses a website, the access traffic is encrypted in multiple layers, so an attacker cannot tell which website the user is visiting. Many illegal websites appear as onion services, where users can log in and complete transactions without being tracked. Determining which website a user is visiting when the user's identity is known is therefore a valuable problem. Although Tor effectively protects user privacy, traffic analysis can reduce user anonymity. When a user accesses a website, a series of associated traffic is generated, and its pattern remains relatively stable over a certain period of time. That is, users in the same region who access the same website within a certain time window receive similar data packets. The website a user is visiting can therefore be identified by analyzing the user's access traffic. As shown in FIG. 1, a law enforcement agent is deployed locally and collects the network traffic between the client and the server to identify the website the user is visiting. The law enforcement agent may be a router, an Internet service provider (ISP), an autonomous system (AS), or the like, and can collect the encrypted traffic between clients and entry nodes at will. The law enforcement agent is passive: it cannot drop, modify, insert, or delay packets. If the traffic were tampered with while the user accesses the website, the page returned to the user could show errors or anomalies. This would not only disrupt the user's browsing but also alert the user that his or her privacy may be compromised. For illegal users in particular, it would make evidence of their crimes harder to collect.

We call the websites of interest to law enforcement monitored websites, and all other websites non-monitored websites. In the website fingerprinting task, law enforcement aims to identify the monitored websites. To do so, law enforcement needs to build a classifier and, in addition, to visit the monitored websites repeatedly through the Tor network and collect the traffic generated during each visit. After collection, the traffic features are extracted manually and all the processed traffic data are assembled into a traffic matrix used to train the classifier. Once the classifier is trained, law enforcement passively collects the encrypted traffic generated while the monitored user accesses the server, processes it in the same way as the training set, and then uses the classifier to determine whether the website the monitored user is accessing is a monitored website.

In existing website fingerprinting methods, the main factor affecting the classification of Tor traffic fingerprints is the noise in the traffic, which effectively obscures the features of the original Tor traffic and thereby reduces classification accuracy. The influence of noise on classification can be reduced by a frequency domain transform. On this basis, this embodiment proposes a method for identifying the websites accessed by Tor users based on network traffic frequency domain fingerprints which, as shown in FIG. 2, comprises the following steps:

Step S1, capturing background traffic while the user accesses a website, and generating original traffic data packets.

Step S2, extracting the direction and length information of the cell sequence in the original traffic data packets, and combining them to form a cell feature sequence.

Step S3, converting the cell feature sequence into a cell frequency domain feature sequence through the discrete wavelet transform and keeping the low-frequency sequence it produces; the frequency domain transform increases the difference between the traffic patterns of different websites and yields a better classification result.

Step S4, storing the cell frequency domain feature sequence and its corresponding website label in a database.

Step S5, extracting a large number of cell frequency domain feature sequences and their corresponding website labels from the database according to the model training requirements, and generating a training sequence matrix and a training label matrix.

Step S6, constructing a deep learning classification model according to the data type and the characteristics of the traffic, and improving the classification accuracy of the model with a series of methods that prevent overfitting.

Step S7, training the deep learning classification model using the training sequence matrix and the training label matrix, and selecting suitable hyperparameters through training.

Step S8, extracting the cell frequency domain feature sequences to be tested from the database and generating a test sequence matrix.

Step S9, predicting on the test sequence matrix with the deep learning classification model, obtaining the website label corresponding to each cell frequency domain feature sequence to be tested, completing the identification of unknown traffic, and associating the traffic with a website.

Specifically, the step S2 of extracting the cell feature sequence includes:

The data packet cell sequence of a website can be obtained by capturing the data packets generated while the user accesses the website. By analyzing this sequence, various features of the cell sequence can be extracted, such as direction, length, timing and bursts. We select the direction and length of the cell sequence as the key features to extract.

The original cell sequence is mapped onto the value range {+1, -1}. Law enforcement typically monitors in front of the entry relay; the direction of data flowing into law enforcement is defined as "+1" and the direction of data flowing out of law enforcement as "-1", and the cell direction sequence Seq_dir is constructed in this way.

Each cell in the cell sequence is transmitted after protocol encapsulation, and the client and the server interact through the TCP protocol. Cells that do not contain the TCP protocol are therefore filtered out first, and then the lengths of the TCP-layer cells are extracted to form the cell length sequence Seq_len.

The cell direction sequence and the cell length sequence are then combined, and the cell feature sequence Seq_mix is constructed by multiplying the two sequences element by element, as shown in formula (1):

Seq_mix = Seq_len × Seq_dir (1).

In previous studies, researchers showed experimentally that using the length of the cell sequence does not significantly improve attack accuracy, and that a good attack can be achieved using only the direction of the cell sequence. In our method, however, the length of the cell sequence is necessary. Any time series can be regarded as a superposition of infinitely many sine waves of different frequencies. Amplitude is the most basic feature of a sine wave, and if only the direction of the cell sequence is used, the full information of these sine waves cannot be represented. We therefore consider that combining the length and direction of the cell sequence yields better results in the frequency domain transform.
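Formula (1) can be illustrated with a minimal sketch. The function name, the `(direction, length)` input layout and the example lengths are assumptions made for illustration only; they are not taken from the patent.

```python
def build_cell_feature_sequence(cells):
    """Build Seq_mix = Seq_len * Seq_dir, element by element (formula (1)).

    `cells` is a hypothetical list of (direction, length) pairs, with
    direction equal to +1 or -1 and length the TCP-layer length of the cell.
    """
    seq_dir = [direction for direction, _ in cells]
    seq_len = [length for _, length in cells]
    for direction in seq_dir:
        if direction not in (1, -1):
            raise ValueError("direction must be +1 or -1")
    # The sign carries the direction, the magnitude carries the length.
    return [length * direction for length, direction in zip(seq_len, seq_dir)]


# Made-up example values, used only to show the shape of the result.
example_cells = [(1, 565), (-1, 1448), (-1, 1448), (1, 52)]
seq_mix = build_cell_feature_sequence(example_cells)
print(seq_mix)
```

Each element of the result keeps the amplitude information that a direction-only sequence of +1 and -1 values would discard.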

Specifically, the step S3 of generating a cell frequency domain feature sequence includes:

A time-ordered sequence of data packet cells can be understood as a signal that varies over time. Analyzing the cell sequence in the frequency domain yields more useful information: by analyzing the frequency composition of the sequence, or more precisely by decomposing it into several sub-sequences, the internal relations among the cells in the sequence are exposed, which helps the subsequent neural network training achieve better results.

The discrete wavelet transform (DWT) discretizes the scale and translation of the basic wavelet. It can analyze frequency domain characteristics within a local time window and is therefore well suited to analyzing non-stationary processes. The discrete wavelet transform uses band-pass filters to decompose the cell feature sequence into several frequency domain components, which greatly reduces the interference of noise and makes the representation more intuitive. The architecture of the discrete wavelet transform decomposition of a discrete sequence is shown in FIG. 3.

In FIG. 3, L(n) and H(n) denote the low-pass filter and the high-pass filter, respectively, and Q denotes downsampling by a factor of Q. The sequences obtained by decomposition at layer α of the architecture are given by relations (2) and (3). The high-frequency component is extracted at each layer, while the low-frequency component is passed to the next layer for further decomposition. Since each layer downsamples by a factor of Q, if the length of the input cell feature sequence is L, then at layer α the sequences x_{α,L}(i) and x_{α,H}(i) both have length L/Q^α.
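As a hedged reconstruction of relations (2) and (3), which the paragraph above references but does not spell out, a conventional filter-bank form consistent with the symbols defined here (filters L(k) and H(k), downsampling factor Q, layer index α) would be the following. The exact summation limits and index convention are assumptions, and x_{0,L} is taken to be the input cell feature sequence.

```latex
x_{\alpha,L}(n) = \sum_{k} x_{\alpha-1,L}(Qn - k)\, L(k) \qquad (2)

x_{\alpha,H}(n) = \sum_{k} x_{\alpha-1,L}(Qn - k)\, H(k) \qquad (3)
```

Under this form, setting α = 1 and Q = 2 gives the one-level decomposition used in the embodiment, and each layer shortens the sequence by a factor of Q, which matches the stated length L/Q^α.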

In this embodiment, we perform a one-level decomposition of the data packet cell feature sequence and set the downsampling factor to 2, so the sequence is decomposed as shown in formula (4) and formula (5):

where L(k) denotes the low-pass filter, H(k) denotes the high-pass filter, n indexes the cell features in the time domain, k indexes the cell features in the frequency domain, and n and k are variables. As shown in FIG. 4, processing the cell feature sequence in the frequency domain according to formula (4) and formula (5) yields a low-frequency sequence x_{1,L}(n) and a high-frequency sequence x_{1,H}(n). The low-frequency sequence x_{1,L}(n) contains the slowly varying part of the cell feature sequence, which is the basic frame of the sequence and constitutes its approximation information, while the high-frequency sequence x_{1,H}(n) contains the rapidly varying part of the cell feature sequence, which constitutes its detail information and contains noise. Therefore the low-frequency sequence x_{1,L}(n), which represents the overall profile of the sequence, is retained for model training, and the high-frequency sequence x_{1,H}(n) is removed, which reduces the interference of noise in the sequence with fingerprint identification.
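The one-level decomposition can be sketched in plain Python. The Haar wavelet is used here purely as an assumption for illustration, since the text treats the wavelet as a hyperparameter and does not fix it; the function name and the zero-padding of odd-length inputs are likewise choices of this sketch.

```python
import math


def haar_dwt_level1(seq):
    """One-level Haar DWT with downsampling by 2.

    Returns (low, high): the approximation (low-frequency) and detail
    (high-frequency) coefficient sequences, each half the input length.
    """
    x = [float(v) for v in seq]
    if len(x) % 2 == 1:
        x.append(0.0)  # pad odd-length input with one zero (sketch choice)
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high


# Made-up cell feature sequence; only the low-frequency part is kept.
seq_mix = [565, -1448, -1448, 52, 565, 565]
low, high = haar_dwt_level1(seq_mix)
frequency_domain_features = low
print(len(seq_mix), len(frequency_domain_features))
```

The retained sequence is half as long as the input, which is the length halving discussed in the text for each level of decomposition.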

The feature frequency domain processing performs only one level of DWT decomposition for the following reason. In other applications of the DWT, several levels of wavelet decomposition are often needed to obtain good results. Each DWT decomposition produces two components of equal length, a high-frequency one and a low-frequency one, and our method uses only the low-frequency component each time. That is, each level of DWT decomposition halves the length of the cell feature sequence. For example, the input length of the cell feature sequence in the FDF model is 5000, so two levels of DWT would require an original cell feature sequence of length 20000. Statistical analysis shows that all original cell feature sequences are shorter than 10000, so two levels would require a large amount of padding. This would strongly distort the original sequences and reduce classification accuracy. We therefore perform a single level of DWT decomposition on the cell feature sequence.
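The length argument above is simple arithmetic and can be checked with a short sketch. The helper names are hypothetical; the figures 5000, 10000 and 20000 are the ones given in the text.

```python
def required_original_length(model_input_length, levels, q=2):
    """Original sequence length needed so that `levels` DWT levels,
    each downsampling by q, leave `model_input_length` samples."""
    return model_input_length * q ** levels


def zero_padding_fraction(original_length, model_input_length, levels, q=2):
    """Fraction of the pre-DWT sequence that would have to be padding."""
    needed = required_original_length(model_input_length, levels, q)
    return max(0, needed - original_length) / needed


one_level = required_original_length(5000, levels=1)   # 10000
two_levels = required_original_length(5000, levels=2)  # 20000
# With sequences no longer than 10000, two levels need at least 50% padding.
padding_two_levels = zero_padding_fraction(10000, 5000, levels=2)
print(one_level, two_levels, padding_two_levels)
```

With one level, a sequence just under 10000 cells needs almost no padding, whereas with two levels at least half of every padded sequence would be zeros.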

Specifically, the step S6 of constructing a deep learning classification model includes:

Identifying website fingerprints on Tor is a supervised classification problem. Since the Deep Fingerprinting (DF) attack, deep learning has achieved good results on website fingerprinting. In DF, two convolutional layers are used before each Max Pooling. The researchers reasoned that adding more convolutional layers to each Base Model produces a deeper network and extracts features more effectively. In our model, only one convolutional layer is used before each Max Pooling, which effectively reduces the complexity of the neural network. After the Base Model we add a Self-Attention layer, because a CNN only considers the information within its receptive field and therefore acts only on a local region, whereas Self-Attention considers information across the entire cell frequency domain feature sequence and thus covers a wider range. We therefore extract the local features of the cell frequency domain feature sequence with the CNN and then extract the global features with Self-Attention, forming a complete model. This reduces the complexity of the neural network without impairing feature extraction. The architecture of the deep learning classification model is shown in FIG. 5; it comprises a basic module layer, a fully connected layer and a self-attention mechanism layer. The basic module layer comprises, in order, Conv Layer (convolutional layer), Pad (padding layer), Batch Normalization (regularization), ELU or ReLU (activation function), Max Pooling (max pooling layer), Pad (padding layer) and Dropout (regularization); the fully connected layer comprises, in order, FC Layer (fully connected layer), Batch Normalization (regularization), ReLU (activation function) and Dropout (regularization); the self-attention mechanism layer comprises, in order, Embedding (embedding layer), Self-Attention Layer, Batch Normalization (regularization), ReLU (activation function), Dropout (regularization) and Label Smoothing (regularization).
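The contrast between the local view of a convolution and the global view of Self-Attention can be illustrated with a small scaled dot-product self-attention sketch. It is a simplification under stated assumptions: the query, key and value projections are taken to be the identity (a trained layer would learn them), and the feature vectors are made-up values.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Returns (outputs, weights); weights[i][j] is how strongly position i
    attends to position j.
    """
    dim = len(sequence[0])
    outputs, all_weights = [], []
    for query in sequence:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in sequence]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([sum(w * value[j] for w, value in zip(weights, sequence))
                        for j in range(dim)])
    return outputs, all_weights


# Made-up feature vectors standing in for CNN outputs at four positions.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(features)
```

Every row of `weights` is strictly positive, so each output position depends on all positions of the sequence, not just on a local window as in a convolution.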

Because the neural network requires a fixed input size and the one-dimensional cell frequency domain feature sequences differ in length, a fixed length threshold must be set. Sequences shorter than the threshold are padded with 0, sequences longer than the threshold are truncated, and all the processed cell frequency domain feature sequences are combined to form the input matrix of the deep learning classification model.
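The padding and truncation rule can be sketched as follows. The function names and the tiny threshold are illustrative assumptions; the embodiment itself uses an input length of 5000.

```python
def pad_or_truncate(seq, threshold):
    """Return `seq` cut to `threshold` items, or right-padded with zeros."""
    clipped = list(seq)[:threshold]
    return clipped + [0.0] * (threshold - len(clipped))


def build_input_matrix(sequences, threshold):
    """Stack fixed-length sequences into a matrix (a list of rows)."""
    return [pad_or_truncate(seq, threshold) for seq in sequences]


# Two made-up sequences: one shorter and one longer than the threshold.
matrix = build_input_matrix([[1.5, -2.0], [3.0, 4.0, 5.0, 6.0, 7.0]], threshold=4)
print(matrix)
```

Every row of the resulting matrix has the same length, which is what a fixed-size network input requires.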

Many useless data packets are generated when users access websites through Tor, owing to network congestion, identity verification and the like. As a result, the same user may generate different traffic when accessing the same website several times at adjacent points in time, and these noisy data packets may cause over-fitting during neural network training. To address over-fitting we use regularization techniques such as Dropout, Batch Normalization (BN) and Label Smoothing. Dropout reduces interactions between hidden nodes and improves the generalization of the model by making each neuron stop working with a certain probability. BN normalizes the output so that it follows a standard normal distribution and reduces internal covariate shift (ICS), which not only helps the network converge faster but also reduces over-fitting. Because a convolutional layer has few parameters, Dropout is rarely used directly after it and BN is generally used instead. Thus, in our model, BN follows each convolutional layer immediately, and Dropout is used after Max Pooling to prevent over-fitting. The FC and Self-Attention parts have more parameters, so BN and Dropout are used together there.
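A minimal sketch of the two operations follows, under stated assumptions. The normalization is shown for a single feature over one batch and omits the learnable scale and shift that framework BN layers normally have. The dropout uses the common "inverted" scaling by 1/(1 - p), which the text does not specify. Function names are hypothetical.

```python
import math
import random


def batch_norm(values, eps=1e-5):
    """Normalize one feature across a batch to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def dropout(values, p, rng):
    """Zero each value with probability p and rescale the survivors (p < 1)."""
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
dropped = dropout(normalized, 0.5, random.Random(0))
```

At inference time dropout is switched off, which corresponds to calling `dropout` with `p = 0.0`.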

To make the predicted probability distribution approach the true distribution, it is common practice to encode the true label with the one-hot method. This encoding may make the model less adaptable and its predictions over-confident, which leads to over-fitting. Label Smoothing adds a smoothing coefficient that reduces the gap between the largest predicted value and the average of the other categories, softening the one-hot encoding and reducing over-fitting. Because Label Smoothing acts on the classification probabilities produced by the Softmax activation function, driving them toward the correct class, it is placed in the final part of the model.
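As an illustration, the sketch below assumes the widely used uniform form of Label Smoothing, in which a fraction epsilon of the probability mass is spread evenly over all classes. The text does not give the exact formula or coefficient, so both the form and the value 0.1 are assumptions, and the function names are hypothetical.

```python
import math


def smooth_labels(label, num_classes, epsilon=0.1):
    """Soften a one-hot target: (1 - epsilon) * one_hot + epsilon / num_classes."""
    target = [epsilon / num_classes] * num_classes
    target[label] += 1.0 - epsilon
    return target


def cross_entropy(probabilities, target):
    """Cross-entropy between Softmax output probabilities and a soft target."""
    return -sum(t * math.log(p) for p, t in zip(probabilities, target))


# With 4 classes and epsilon = 0.1 the true class gets 0.925, the others 0.025.
soft_target = smooth_labels(2, 4)
loss = cross_entropy([0.1, 0.1, 0.7, 0.1], soft_target)
```

Because the soft target never assigns probability 1 to a single class, the loss no longer rewards pushing the Softmax output toward an extreme, fully confident prediction.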

To select the hyperparameters of the different modules, a value range is first defined for each hyperparameter based on experience. Hyperparameters with a small value range are traversed exhaustively. For hyperparameters with a large value range, the bisection method (dichotomy) is used to choose their values. During model construction the hyperparameters are screened block by block, and the optimal hyperparameter combination is finally obtained. The selection of the hyperparameters in the model is shown in Table 1, which lists the value range of each hyperparameter and the best value obtained. The tuning experiments were carried out on the collected undefended closed-world data set, and the other data sets were used for verification; good results were achieved.
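The two search strategies can be sketched as follows. The text does not spell out how the dichotomy is applied, so the interval-halving routine below is only one plausible reading: it is a heuristic that keeps the half of the range whose midpoint scores higher, and it can miss the best value when the score is not well behaved. The `score` callback, standing for a train-and-validate run, and all names are hypothetical.

```python
def traverse(candidates, score):
    """Exhaustively evaluate a small set of candidate values."""
    return max(candidates, key=score)


def halving_search(low, high, score):
    """Narrow an integer range by repeatedly keeping the better half."""
    while high - low > 1:
        mid = (low + high) // 2
        left_probe = (low + mid) // 2
        right_probe = (mid + high + 1) // 2
        if score(left_probe) >= score(right_probe):
            high = mid  # keep the lower half
        else:
            low = mid  # keep the upper half
    return max((low, high), key=score)


def toy_score(value):
    """Stand-in for validation accuracy, peaking at 37 (illustration only)."""
    return -(value - 37) ** 2


best_small = traverse([16, 32, 64, 128], toy_score)
best_large = halving_search(1, 512, toy_score)
```

Exhaustive traversal costs one evaluation per candidate, while halving needs only about two evaluations per halving of the range, which is why it suits the larger ranges.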

Table 1 Hyperparameter selection of FDF

Corresponding to the above method for identifying the website accessed by a Tor user based on network traffic frequency domain fingerprints, this embodiment also provides a system for identifying the website accessed by a Tor user based on network traffic frequency domain fingerprints, which comprises:

The invention will be better illustrated by experiments and analyses.

To verify the performance of the proposed FDF method, a series of experiments was performed on the Undefended, WTF-PAD and Onion Sites datasets.

(1) Data set

Closed world data set: we cyclically crawled the home pages of the top 100 Alexa-ranked websites through the Tor network, crawling each of them 1000 times in total. We deployed this work in LXD containers on ten VPS servers located in different countries.

Open world data set: since visiting every site on the Internet is impractical, we chose a subset of sites to simulate the open world. The open world website set is much larger than the closed world website set. We visited the top 40000 Alexa-ranked websites in order. These are non-monitored websites and exclude the 100 monitored websites collected for the closed world experiment. We deployed this work on the same ten VPS servers.

Onion service data set: Overdorf et al. collected onion domain names and fingerprinted the corresponding websites. They published the dataset used for their experiments on the Internet in the form of tshark logs. Since collecting a large number of onion domain names is a difficult task, we chose to use this dataset for our experiments.

Defended data set: we carried out evaluation tests on the WTF-PAD defense method. For the WTF-PAD defense, we used the script code published by researchers on GitHub to modify the original traffic we had collected, simulating the traffic produced by padding according to the defense protocol during access in a real environment.

A total of 5 datasets were used in our experiments. In the closed world scenario, data were collected for three settings, namely undefended, WTF-PAD defense and Onion services, generating the Undefended (CW), WTF-PAD (CW) and Onion Sites (CW) datasets, respectively. In the open world scenario, data were collected for the undefended and WTF-PAD defense settings, generating the Undefended (OW) and WTF-PAD (OW) datasets, respectively. Table 2 shows the number of website categories in each dataset and the number of instances of websites visited. We randomly divide each dataset into three parts: a training set, a validation set and a test set. Because the datasets are large, we split them in an 8:1:1 ratio.
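A random 8:1:1 split can be sketched as below. The fixed seed and the function name are illustrative assumptions; the text does not say whether the split was stratified by website, and this sketch is not.

```python
import random


def split_dataset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle samples and split them into training, validation and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# 1000 hypothetical traffic instances become 800 / 100 / 100.
train_set, val_set, test_set = split_dataset(range(1000))
```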

Table 2 Number of classes and instances in each dataset

(2) Website fingerprint identification experiment on closed world dataset

The core of the FDF method is to apply frequency domain processing to the cell sequence before performing deep learning fingerprint recognition on it. Besides the DWT, the discrete Fourier transform (DFT) and the discrete cosine transform (DCT) are also mainstream frequency domain processing methods and are widely applied in image processing. We compared these three frequency domain processing methods experimentally in the closed world setting. The FFT is an efficient, fast algorithm for computing the DFT that shortens the computation time, so we use the FFT in place of the DFT in the experiments.
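The frequency domain step can be illustrated with a one-level decomposition that keeps only the low-frequency part. The sketch assumes the Haar wavelet purely for simplicity, since the wavelet is one of the tuned hyperparameters and the value chosen is not stated here, and it repeats the last sample when the length is odd, which is also an assumption. The cell feature sequence is formed as the element-wise product of length and direction, as in formula (1) of the claims. Function names and input values are hypothetical.

```python
import math


def cell_feature_sequence(lengths, directions):
    """Combine cell length and direction (+1 or -1) element-wise, as in formula (1)."""
    return [length * direction for length, direction in zip(lengths, directions)]


def haar_dwt_level1(x):
    """One-level Haar DWT with downsampling by 2.

    Returns (low, high): the approximation and detail coefficients.
    """
    x = list(x)
    if len(x) % 2 == 1:
        x.append(x[-1])  # assumption: repeat the last sample for odd lengths
    scale = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / scale for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / scale for i in range(0, len(x), 2)]
    return low, high


# Illustrative values only; real inputs come from captured Tor traffic.
mixed = cell_feature_sequence([512, 512, 512, 512], [1, 1, -1, -1])
low_part, high_part = haar_dwt_level1(mixed)
frequency_feature = low_part  # the high-frequency part is discarded
```

The retained sequence is half as long as the input, which also shortens the input the classifier has to process.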

Table 3 shows the fingerprint identification accuracy on the different datasets after processing with the three frequency domain methods. The accuracies of the DCT and FFT methods are relatively close. This is because the DCT is a special form of the DFT, a subset of the DFT whose Fourier series contains only cosine terms. The DWT, meanwhile, is significantly better than the other two methods on all three datasets: the cell sequence is a non-stationary signal, the DWT is better suited to non-stationary signals, and the FFT is more suitable for stationary signals.

Table 3 Comparison of attack accuracy for three frequency domain processing methods in the closed world

To demonstrate the attack performance of FDF, we compared it with the k-NN, k-FP, CUMUL, DF and Tik-Tok attacks.

Table 4 Comparison of attack accuracy of FDF and other methods in the closed world

Table 4 shows the accuracy of the different attack methods in the three settings of the closed world scenario. FDF outperforms the other attacks on all of the Undefended, WTF-PAD and Onion Sites datasets. None of the attacks performs well on the Onion Sites dataset. We attribute this to the dataset itself: it has 539 categories with only 77 traffic instances each, far less data than the other datasets, which reduces accuracy.

(3) Training time consumption comparison

To evaluate training time consumption, we measured training both with and without GPU acceleration. We found that the accuracy of FDF becomes essentially stable after 30 epochs, so we take the time needed to complete 30 epochs as the training time of FDF. We ran this experiment on an NVIDIA GTX 1080Ti and used PyTorch as the framework for the FDF deep learning model. As shown in Fig. 6, the training time of FDF was 22 minutes with GPU acceleration. In the same environment and on the same dataset, DF took 5 hours to complete training and Tik-Tok took 5.5 hours. Without GPU acceleration, FDF took about 2 hours, DF about 28.5 hours, and Tik-Tok about 29 hours. In summary, the training time of the FDF attack is far shorter than that of the other models, particularly with GPU acceleration, so FDF has a significant advantage in training time.

(4) Website fingerprint identification experiment on open world dataset

To simulate a real environment, we performed experiments in the more realistic open world scenario. In the open world, law enforcement first determines whether traffic belongs to a monitored website or to a non-monitored website, and then classifies all traffic belonging to monitored websites against the limited set of monitored websites.

For open world scenarios, Precision and Recall are used to evaluate classifiers. The True Positive Rate (TPR) and False Positive Rate (FPR) can lead to a misleading interpretation of attack performance because of the large difference in size between the set of monitored sites and the set of non-monitored sites.

In the performance evaluation we use the standard model proposed in the DF attack. In the open world scenario, the monitored website data are trained in the same manner as in the closed world, and the non-monitored website data are trained as one additional class. We evaluated the Undefended and WTF-PAD datasets in the open world scenario, with the attack tuned for Precision and for Recall, respectively. Precision and Recall are given by relations (6) and (7). The attack tuned for Precision increases the proportion of true monitored website traffic among all traffic identified as monitored. The attack tuned for Recall increases the proportion of monitored website traffic that is correctly identified as monitored.
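The effect of tuning can be illustrated with the standard definitions Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The sketch assumes that tuning is done by moving a confidence threshold on the classifier output, as is common for open world evaluation; the text does not state the mechanism. The scores and labels are made up for illustration and the function name is hypothetical.

```python
def precision_recall(scores, is_monitored, threshold):
    """Compute Precision and Recall for the 'monitored' decision.

    scores: confidence that each trace belongs to a monitored site.
    is_monitored: ground-truth flags for the same traces.
    """
    tp = fp = fn = 0
    for score, truth in zip(scores, is_monitored):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


example_scores = [0.95, 0.9, 0.6, 0.55, 0.4, 0.8, 0.3]
example_truth = [True, True, True, False, True, False, False]

# A high threshold favours Precision, a low threshold favours Recall.
tuned_for_precision = precision_recall(example_scores, example_truth, 0.85)
tuned_for_recall = precision_recall(example_scores, example_truth, 0.35)
```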

Tables 5 and 6 show the results of the different methods, with attacks tuned for Precision and for Recall, on the two datasets in the open world scenario. Fig. 7 shows the Precision-Recall curves of the attacks in the open world. These tables and the figure show that DWT performs well on the Undefended dataset: when tuned for Precision, it achieves a Precision of 0.99 and a Recall of 0.94, and when tuned for Recall, it achieves a Precision of 0.93 and a Recall of 0.99. On the WTF-PAD dataset, the Precision and Recall of all methods decrease because of the added defense. DWT, the best performing method, reaches a Precision of 0.98 and a Recall of 0.76 when tuned for Precision, and a Precision of 0.75 and a Recall of 0.96 when tuned for Recall.

Table 5 Tuned for Precision and Tuned for Recall on the Undefended (OW) dataset in the open world

Table 6 Tuned for Precision and Tuned for Recall on the WTF-PAD (OW) dataset in the open world

The method for identifying the website accessed by a Tor user based on network traffic frequency domain fingerprints provided by this embodiment constructs the key features for traffic analysis by applying the DWT (discrete wavelet transform) to the length and direction features of the cell sequence. A neural network then learns and classifies the traffic frequency domain features. FDF was evaluated and verified in a closed world scenario and in a more realistic open world scenario. The closed world assumes that the monitored user visits only the websites of interest to us, which makes classifier performance easier to observe. The open world assumes that the monitored user can visit arbitrary websites, whether or not they are of interest to us, and therefore simulates a more realistic environment. In the closed world, FDF outperforms the other WF attacks on all of the Undefended, WTF-PAD and Onion Sites datasets. In the open world, FDF continues to perform well and achieves better Precision and Recall, indicating that the model is also effective in more realistic environments. Overall, our results demonstrate that transforming the cell sequence into the frequency domain before deep learning achieves good results.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Finally, it should be noted that: the foregoing description is only illustrative of the preferred embodiments of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. The method for identifying the Tor user access website based on the network traffic frequency domain fingerprint is characterized by comprising the following steps:

converting the cell characteristic sequence into a cell frequency domain characteristic sequence through discrete wavelet transformation, and reserving a low-frequency sequence generated after the discrete wavelet transformation; the method specifically comprises the following steps: the discrete wavelet transformation uses a band-pass filter to perform one-layer architecture decomposition on the cell characteristic sequence, and the multiple Q of a downsampling filter is set to be 2, and the sequence decomposition method is as shown in a formula (2) and a formula (3):

where L(k) denotes a low-pass filter, H(k) denotes a high-pass filter, n denotes a cell feature in the time domain, k denotes a cell feature in the frequency domain, and n and k are variables; the cell feature sequence is processed in the frequency domain by formula (2) and formula (3) to obtain a low-frequency sequence x_{1,L}(n) and a high-frequency sequence x_{1,H}(n); the low-frequency sequence x_{1,L}(n) contains the slowly varying part of the cell feature sequence, is the basic framework of the sequence and represents the approximate information of the sequence; the high-frequency sequence x_{1,H}(n) contains the rapidly varying part of the cell feature sequence, represents the detail information of the sequence and contains noise; therefore the low-frequency sequence x_{1,L}(n) is retained and the high-frequency sequence x_{1,H}(n) is removed;

constructing a deep learning classification model according to the data type and the characteristics of the traffic; the deep learning classification model comprises a basic module layer, a fully connected layer and a self-attention mechanism layer; the basic module layer comprises, in sequence, Conv Layer, Padding, Batch Normalization, ELU or ReLU, Max Pooling, Padding and Dropout; the fully connected layer comprises, in sequence, FC Layer, Batch Normalization, ReLU and Dropout; the self-attention mechanism layer comprises, in sequence, Embedding, Self-Attention Layer, Batch Normalization, ReLU, Dropout and Label Smoothing;

2. The method for identifying a website by Tor user access based on network traffic frequency domain fingerprint according to claim 1, wherein extracting direction and length information of a cell sequence in an original traffic data packet, combining them to form a cell feature sequence comprises:

Seq_mix = Seq_len × Seq_dir (1).

3. The method for identifying a website by a Tor user access based on network traffic frequency domain fingerprints of claim 1, wherein Dropout, Batch Normalization and Label Smoothing are regularization techniques used to prevent over-fitting during model training; the Dropout is used for reducing interaction among hidden nodes, and enhances generalization of the model by making each neuron stop working with a certain probability; the Batch Normalization is used for normalizing the output result so that the output conforms to a standard normal distribution; the Label Smoothing is used to cause the classification probability result after activation by the Softmax activation function in the neural network to approach the correct classification.

4. The method for recognizing the Tor user access website based on the network traffic frequency domain fingerprint according to claim 1, wherein a fixed length threshold is set for the one-dimensional cell frequency domain feature sequences of different lengths, sequences with lengths smaller than the threshold are filled with 0, sequences with lengths larger than the threshold are truncated, and all the processed cell frequency domain feature sequences are combined to form an input matrix of the deep learning classification model.

5. The method of claim 1, wherein the hyperparameters include Wavelet, Base Model, Number of FC Layers, FC, Self-Attention, Optimizer, Batch Size, and Dropout [Base Model, Self-Attention, FC].

6. The method for recognizing the access of the Tor user to the website based on the network traffic frequency domain fingerprint according to claim 5, wherein a value range is defined for each hyperparameter, the hyperparameters with smaller value ranges are traversed to take values, and the hyperparameters with larger value ranges take values by using the bisection method (dichotomy).

7. A Tor user access website identification system based on network traffic frequency domain fingerprints, comprising:

the cell frequency domain feature sequence generation module is used for converting the cell feature sequence into a cell frequency domain feature sequence through discrete wavelet transformation and reserving a low-frequency sequence generated after the discrete wavelet transformation; the method specifically comprises the following steps: the discrete wavelet transformation uses a band-pass filter to perform one-layer architecture decomposition on the cell characteristic sequence, and the multiple Q of a downsampling filter is set to be 2, and the sequence decomposition method is as shown in a formula (2) and a formula (3):

the model construction module is used for constructing a deep learning classification model according to the data type and the characteristics of the traffic; the deep learning classification model comprises a basic module layer, a fully connected layer and a self-attention mechanism layer; the basic module layer comprises, in sequence, Conv Layer, Padding, Batch Normalization, ELU or ReLU, Max Pooling, Padding and Dropout; the fully connected layer comprises, in sequence, FC Layer, Batch Normalization, ReLU and Dropout; the self-attention mechanism layer comprises, in sequence, Embedding, Self-Attention Layer, Batch Normalization, ReLU, Dropout and Label Smoothing;