CN112235254A

CN112235254A - Rapid identification method for Tor network bridge in high-speed backbone network

Info

Publication number: CN112235254A
Application number: CN202011003470.5A
Authority: CN
Inventors: 吴桦; 郭树一; 程光
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2020-09-22
Filing date: 2020-09-22
Publication date: 2021-01-15
Anticipated expiration: 2040-09-22
Also published as: CN112235254B

Abstract

The invention provides a method for rapidly identifying Tor bridges in a high-speed backbone network, comprising the following steps: selecting features that can be used for Tor bridge identification in a high-speed backbone network, and constructing a small-scale traffic training set for model training; sampling data packets in the high-speed backbone network, and using a multiple Count Bloom Filter algorithm to accumulate per-record packet statistics and extract feature values; and identifying and classifying the records of the sampled data packets with the trained model to obtain a bridge list. The invention can quickly and accurately identify Tor bridges in the backbone network, provide a bridge list to network administrators, and effectively improve the efficiency of network management. Because most of the selected features are ratio features, they can be extracted from sampled, incomplete flow data and used for identification and classification, and the storage cost of the features is reduced.

Description

Rapid identification method for Tor network bridge in high-speed backbone network

Technical Field

The invention belongs to the technical field of cyberspace security, and relates to a rapid identification method for Tor bridges in a high-speed backbone network.

Background

As the cyberspace security situation grows more severe, the supervision of cyberspace is becoming stricter. To evade surveillance, more and more lawbreakers choose to conduct illegal activities through the darknet. Tor, the second-generation onion router and the most widely used darknet technology, is the first choice for most lawbreakers because of its strong concealment and ease of use. Therefore, to maintain cyberspace security, identifying darknet usage has become one of the research hotspots in the network security field.

Tor is more widely used than other darknet technologies. To ensure anonymity and resist tracking, a host that uses Tor for network access first requests three onion routers with public addresses from the directory server to establish a communication circuit, and encrypts the transmission with TLS. On this basis, Tor also introduces bridges and obfuscation protocols: the host first connects to a bridge whose address is not public, and the communication circuit is then established from the bridge. As a result, the onion routers in the circuit cannot obtain the source address of the host, which makes network supervision even more difficult.

In recent years, research at home and abroad on identifying darknet usage has focused mainly on traffic identification, chiefly with machine learning methods. These studies essentially revolve around improving feature selection and machine learning algorithms, and the selected features achieve good recognition on complete flow data. However, the existing methods have the following main problems: (1) they are studied on complete flow data sets, and the selected features are suitable only for complete flow data; (2) to improve metrics such as identification accuracy, existing research selects a large number of features, which consume considerable resources during extraction and storage; (3) identification based on complete flow data is difficult to carry out on the large-scale traffic of a high-speed backbone network. Because of these problems, the existing methods cannot rapidly identify Tor bridges in a high-speed backbone network environment.

Therefore, to rapidly identify Tor bridges in a high-speed backbone network environment, the invention samples traffic at a high-speed backbone router and performs feature selection to choose identification features that remain applicable to sampled data packet records. To improve the efficiency of computing and storing the features, a multiple Count Bloom Filter algorithm is used to accumulate the packet record statistics and process the features.

Disclosure of Invention

To identify Tor bridges that may exist in a high-speed backbone network, the invention first performs feature selection on the traffic between the host and the bridge and selects identification features that remain applicable to sampled data packet records; it then samples traffic at a high-speed backbone router and, to improve the efficiency of computing and storing the features, uses the multiple Count Bloom Filter algorithm to accumulate packet statistics and calculate feature values; finally, it uses a random forest algorithm to identify bridges.

In order to achieve the above purpose, the invention provides the following technical scheme:

A rapid identification method for Tor bridges in a high-speed backbone network comprises the following steps:

(1) collecting and storing Tor flow data and normal flow data used for model training;

(2) extracting from the raw data the features that can be used to identify and classify complete flow data, performing feature selection, retaining the features that can also be used to identify and classify records, then extracting training data from the raw data and performing machine learning model training;

(3) sampling flow data at a high-speed backbone router, and then processing the sampled data packets with the multiple Count Bloom Filter algorithm to obtain records;

(4) inputting the records obtained in step (3) into the model trained in step (2) to identify bridges.

Further, the step (1) specifically includes the following substeps:

(1.1) installing Tor Browser software at a host end, and selecting to use a network bridge to establish a communication link;

(1.2) starting an application to start Tor flow data acquisition;

(1.3) performing network access using the Tor Browser;

(1.4) stopping collecting after the webpage is loaded, and storing the currently collected Tor flow data file between the host and the network bridge;

(1.5) starting an application to start common flow data acquisition;

(1.6) operating with common applications;

(1.7) stopping collecting after the operation is finished, and storing the currently collected common flow data file;

and (1.8) repeating the operations (1.2) to (1.7) until a sufficient amount of flow data is collected.

Further, the step (2) specifically includes the following sub-steps:

(2.1) first, extracting features and training models using the complete flow data acquired in step (1), and selecting the random forest algorithm, which achieves high accuracy;

(2.2) during feature selection, evaluating feature importance with the Gini-index-based method of the random forest algorithm, wherein the Gini index is calculated as follows:

$$GI = 1 - \sum_{k=1}^{K} p_k^2$$

where K is the number of classes and p_k represents the sample weight of class k;

then the importance of feature X_j at node m, i.e., the change of the Gini index before and after branching at node m, is:

$$GI_m - GI_l - GI_r$$

where GI_m represents the Gini index of the node before branching, and GI_l and GI_r respectively represent the Gini indexes of the two new nodes after branching;

(2.3) comprehensively considering feature importance and usability in records, and selecting suitable available features;

(2.4) taking the flow data collected in step (1) as raw data, extracting training data from the raw data through the preceding feature engineering, and performing model training with the random forest algorithm.
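The Gini-based importance of step (2.2) can be illustrated with a minimal sketch. It assumes the standard Gini definition (one minus the sum of squared class proportions) and the unweighted difference between the parent node and its two children as described in the text; the function names `gini_index` and `split_importance` are hypothetical and not part of the patent. Practical random forest libraries usually weight the child terms by their share of samples, which this sketch does not do.

```python
from collections import Counter
from typing import Sequence


def gini_index(labels: Sequence[int]) -> float:
    """Gini index 1 - sum_k p_k^2 of a node holding the given class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def split_importance(parent: Sequence[int],
                     left: Sequence[int],
                     right: Sequence[int]) -> float:
    """Change of the Gini index at a node: GI_m - GI_l - GI_r (unweighted)."""
    return gini_index(parent) - gini_index(left) - gini_index(right)


if __name__ == "__main__":
    # A node with two Tor samples (0) and two ordinary samples (1),
    # split perfectly by some feature into two pure children.
    print(split_importance([0, 0, 1, 1], [0, 0], [1, 1]))
```

Summing this quantity over every node where a feature is used, and over all trees, gives a per-feature score that can be ranked for selection.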

Further, suitable characteristics available in said step (2.3) are shown in the following table:

Feature	Meaning
F1	Whether more than half of the packets carry timestamps
F2	Ratio of non-empty packets sent by the client to the total number of packets sent by the client
F3	Ratio of non-empty packets sent by the server to the total number of packets sent by the server
F4	Ratio of empty packets sent by the client to non-empty packets sent by the server
F5	Ratio of empty packets sent by the server to non-empty packets sent by the client
F6	Ratio of non-empty packets sent by the client to the total number of data packets
F7	Ratio of non-empty packets sent by the server to the total number of data packets
F8	Proportion of PSH packets sent by the client to the total number of data packets
F9	Proportion of PSH packets sent by the server to the total number of data packets
F10	Proportion of packets of length 0-50 sent by the client to the total number of data packets
F11	Proportion of packets of length 50-200 sent by the client to the total number of data packets
F12	Proportion of packets of length greater than 1200 sent by the client to the total number of data packets
F13	Proportion of packets of length 50-200 sent by the server to the total number of data packets
F14	Proportion of packets of length greater than 1200 sent by the server to the total number of data packets

Further, the step (3) specifically includes the following sub-steps:

(3.1) setting a data packet sampling ratio at a high-speed backbone router and sampling the traffic;

(3.2) processing the sampled data packets with the multiple Count Bloom Filter (MCBF) algorithm to obtain statistical results.
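The sampling of step (3.1) can be sketched as follows. The text only states that a sampling ratio is set, so the systematic scheme used here (keep one packet out of every n) is an assumption, and `sample_one_in_n` is a hypothetical helper name.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def sample_one_in_n(packets: Iterable[T], n: int) -> Iterator[T]:
    """Yield one packet out of every n (systematic sampling, an assumed scheme)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for index, packet in enumerate(packets):
        if index % n == 0:
            yield packet


if __name__ == "__main__":
    # Integers stand in for packets; a 64:1 ratio keeps packets 0, 64, 128, 192.
    print(list(sample_one_in_n(range(256), 64)))
```

Random per-packet sampling with probability 1/n would be an equally plausible reading of the text; only the sampled packets are passed on to the MCBF stage in either case.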

Further, the step (3.2) specifically includes the following sub-steps:

(3.2.1) for each sampled data packet, taking the {source IP address, port number} and the {destination IP address, port number} of the packet as two separate inputs to the hash functions, each input yielding a plurality of outputs that map to corresponding positions in the MCBF;

(3.2.2) each mapped position holds a 12-byte data structure that stores the feature-related information of the data packets; if a data packet satisfies the condition of a stored item, the corresponding counter in the data structure is incremented by 1, and otherwise it is left unchanged;

(3.2.3) when the set threshold θ is reached, extracting the stored information for calculating the feature values;

(3.2.4) performing calculations on the extracted information to obtain the feature statistics of the record.

In the step (3.2.2), the information to be stored is shown in the following table:

Further, in the step (3.2.3), the information stored at the position that records the smallest number of packets sent by the client is taken as the extracted information.
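The update and extraction logic of steps (3.2.1) to (3.2.3) can be sketched as below. Several details are assumptions made for illustration only: the table size, the number of hash functions, the use of seeded SHA-256 as the hash family, unbounded Python integers in place of the 12-byte structure, and the convention that Counter 1 holds the number of packets sent by the client. The class name `MCBF` and its methods are hypothetical.

```python
import hashlib
from typing import Iterable, List, Tuple

NUM_COUNTERS = 12  # one counter per stored item (Counter 1 .. Counter 12)
Endpoint = Tuple[str, int]  # (IP address, port number)


class MCBF:
    """Illustrative multiple Count Bloom Filter with 12 counters per position."""

    def __init__(self, size: int = 1024, num_hashes: int = 3) -> None:
        self.size = size
        self.num_hashes = num_hashes
        self.table: List[List[int]] = [[0] * NUM_COUNTERS for _ in range(size)]

    def _positions(self, key: Endpoint) -> List[int]:
        """Map an endpoint to num_hashes positions (seeded SHA-256, assumed)."""
        ip, port = key
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}|{ip}|{port}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.size)
        return positions

    def update(self, key: Endpoint, counters_hit: Iterable[int]) -> None:
        """Add 1 to each listed counter (1-based) at every mapped position."""
        hit = list(counters_hit)
        for pos in set(self._positions(key)):  # avoid double counting one cell
            for counter in hit:
                self.table[pos][counter - 1] += 1

    def extract(self, key: Endpoint) -> List[int]:
        """Return the cell whose Counter 1 (assumed client count) is smallest."""
        cells = [self.table[pos] for pos in self._positions(key)]
        return list(min(cells, key=lambda cell: cell[0]))


if __name__ == "__main__":
    mcbf = MCBF(size=64, num_hashes=3)
    # A non-empty client packet is assumed to hit Counter 1 and Counter 2.
    mcbf.update(("192.0.2.10", 443), [1, 2])
    print(mcbf.extract(("192.0.2.10", 443)))
```

In use, a caller would check Counter 1 of the extracted cell against the threshold θ after each update and compute the feature values once the threshold is reached.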

In the step (3.2.4), the calculation correspondence between the information stored in each position and the characteristics is shown in the following table:

Feature	Calculation method
F1	1 if the value in Counter 12 is greater than θ/2, otherwise 0
F2	Counter 2/Counter 1
F3	Counter 4/Counter 3
F4	(Counter 1-Counter 2)/Counter 4
F5	(Counter 3-Counter 4)/Counter 2
F6	Counter 2/(Counter 1+Counter 3)
F7	Counter 4/(Counter 1+Counter 3)
F8	Counter 5/(Counter 1+Counter 3)
F9	Counter 6/(Counter 1+Counter 3)
F10	Counter 7/(Counter 1+Counter 3)
F11	Counter 8/(Counter 1+Counter 3)
F12	Counter 9/(Counter 1+Counter 3)
F13	Counter 10/(Counter 1+Counter 3)
F14	Counter 11/(Counter 1+Counter 3)

Here, the value of F1 is determined by Counter 12 and the threshold θ: if the value in Counter 12 is greater than θ/2, F1 for that record is set to 1; otherwise it is set to 0.
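The correspondence between counters and features can be written out as a short sketch. The formulas follow the table; returning 0.0 when a denominator is zero is an assumption, since the text does not say how empty counters are handled, and the names `ratio` and `features_from_counters` are hypothetical.

```python
from typing import Dict, Sequence


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator/denominator, or 0.0 if the denominator is zero (assumed)."""
    return numerator / denominator if denominator else 0.0


def features_from_counters(counters: Sequence[int], theta: int) -> Dict[str, float]:
    """Compute F1..F14 from Counter 1..12 (counters[0] is Counter 1)."""
    if len(counters) != 12:
        raise ValueError("expected exactly 12 counters")
    c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12 = counters
    total = c1 + c3  # Counter 1 + Counter 3, the shared denominator of F6..F14
    return {
        "F1": 1.0 if c12 > theta / 2 else 0.0,
        "F2": ratio(c2, c1),
        "F3": ratio(c4, c3),
        "F4": ratio(c1 - c2, c4),
        "F5": ratio(c3 - c4, c2),
        "F6": ratio(c2, total),
        "F7": ratio(c4, total),
        "F8": ratio(c5, total),
        "F9": ratio(c6, total),
        "F10": ratio(c7, total),
        "F11": ratio(c8, total),
        "F12": ratio(c9, total),
        "F13": ratio(c10, total),
        "F14": ratio(c11, total),
    }


if __name__ == "__main__":
    # Made-up counter values for illustration only, with theta = 100.
    example = [100, 40, 80, 60, 30, 50, 50, 30, 10, 20, 40, 70]
    for name, value in features_from_counters(example, theta=100).items():
        print(name, round(value, 4))
```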

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) the invention can quickly and accurately identify the Tor network bridge in the backbone network, provide a network bridge list for a network manager and effectively improve the efficiency of network management.

(2) The selected features are mostly proportional features, and can be extracted from the sampled incomplete flow data for identification and classification, so that the storage consumption of the features is reduced.

(3) The invention uses multiple Count Bloom Filter algorithm for statistical processing of the sampled data packet in the high-speed backbone network, thereby improving the efficiency of data packet processing.
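Why ratio features remain usable on sampled traffic can be seen from a short expectation argument. It assumes that every packet is kept independently with the same probability q = 1/N; the text does not specify the sampling scheme, and the symbols q, n_a, n_b are introduced here only for illustration.

```latex
% Assumption: each packet is kept independently with probability q = 1/N.
% n_a, n_b: true numbers of packets of two kinds in a flow
%           (e.g. non-empty client packets and all client packets).
% \hat{n}_a, \hat{n}_b: the numbers of those packets that survive sampling.
\begin{align}
  \mathbb{E}[\hat{n}_a] &= q\, n_a, \qquad \mathbb{E}[\hat{n}_b] = q\, n_b, \\
  \frac{\mathbb{E}[\hat{n}_a]}{\mathbb{E}[\hat{n}_b]}
    &= \frac{q\, n_a}{q\, n_b} = \frac{n_a}{n_b}.
\end{align}
```

The sampling probability cancels, so the ratio of the sampled counts approximates the ratio in the complete flow once enough packets have been observed, whereas absolute counts would shrink by the factor q.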

Drawings

FIG. 1 shows the framework of the method of the present invention for rapidly identifying Tor bridges in a high-speed backbone network.

FIG. 2 shows the accuracy of different machine learning algorithm models when identifying and classifying complete flow data.

FIG. 3 shows the accuracy of the trained model.

FIG. 4 is a diagram of the multiple Count Bloom Filter algorithm.

FIG. 5 shows the prediction result parameters under different thresholds with the sampling ratio fixed at 64:1.

Detailed Description

The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.

The invention provides a method for rapidly identifying Tor bridges in a high-speed backbone network. The identification framework is shown in FIG. 1 and comprises three parts. The first part is the construction of the training data set: the features that can be used for Tor bridge identification in a high-speed backbone network are extracted, a small-scale traffic training set is constructed, and a machine learning model is trained on it. The second part is the operation in the high-speed backbone network: data packets are sampled in the backbone network, and the multiple Count Bloom Filter algorithm is used to accumulate record statistics and calculate feature values for the sampled packets. The third part is bridge identification and output of the bridge list: the trained machine learning model identifies and classifies the records of the sampled packets, predicts bridges, and records them in a bridge list. In the second part, the data produced by sampling and multiple Count Bloom Filter processing are called records; each record contains a server IP address, a port, and the related feature values.

Specifically, the method of the invention comprises the following steps:

(1) Collecting and storing Tor flow data and normal flow data used for model training.

The specific process of the step is as follows:

(1.2) starting the Wireshark traffic capture application to begin collecting Tor flow data;

(1.3) performing network access using the Tor Browser;

(1.4) stopping the capture after the webpage has loaded, and saving the currently collected Tor flow data file (.pcap) between the host and the bridge;

(1.5) starting the Wireshark traffic capture application to begin collecting ordinary flow data;

(1.6) operating common applications, including but not limited to web access and chat;

(1.7) stopping the capture after the operation is finished, and saving the currently collected ordinary flow data file (.pcap);

(1.8) repeating operations (1.2) to (1.7) until approximately 10000 flows in total have been collected.

(2) Extracting from the raw data the features that can be used to identify and classify complete flow data, performing feature selection, retaining the features that can also be used to identify and classify records, extracting training data from the raw data, and training the machine learning model.

The specific process of this step is as follows:

(2.1) First, features are extracted and models are trained using the complete flow data acquired in step (1). By comparing metrics such as the accuracy of algorithm models including random forest, K-nearest neighbors, and naive Bayes, as shown in FIG. 2, the random forest algorithm, which has the highest accuracy, is selected.

(2.2) During feature selection, feature importance is evaluated with the Gini-index-based method of the random forest algorithm. The Gini index is calculated as follows:

$$GI = 1 - \sum_{k=1}^{K} p_k^2$$

where K is the number of classes and p_k represents the sample weight of class k.

The importance of feature X_j at node m, i.e., the change of the Gini index before and after branching at node m, is:

$$GI_m - GI_l - GI_r$$

where GI_m represents the Gini index of the node before branching, and GI_l and GI_r respectively represent the Gini indexes of the two new nodes after branching.

(2.3) the final selected features, after taking into account the importance scores of the features and the availability of the features in the records, are shown in Table 1:

TABLE 1 available characteristics

(2.4) The flow data collected in step (1) are taken as raw data. Feature extraction and selection are completed through steps (2.1) and (2.2), and the available features are finally determined in step (2.3). Training data are then extracted from the raw data according to the available features, and the model is trained with the random forest algorithm. The model accuracy is shown in FIG. 3, where category 1 represents ordinary traffic and category 0 represents Tor traffic.

(3) Sampling flow data at a high-speed backbone router, storing data packets according to the sampling ratio, and processing the sampled data packets with the multiple Count Bloom Filter algorithm to obtain records.

This step specifically comprises the following:

(3.1) acquiring a validation data set comprising two parts: one part is traffic generated by accessing the Tor network through the same bridge, and the other part is traffic data collected by the Japanese MAWI working group from 00:00 to 00:15 in the early morning of April 9, 2019. The validation data set is sampled at a sampling ratio of 128:1;

(3.2) processing the sampled data packets with the multiple Count Bloom Filter algorithm (MCBF for short) to obtain statistical results, wherein the algorithm structure is shown in FIG. 4 and the specific process is as follows:

(3.2.2) each mapped position holds a 12-byte data structure that stores the feature-related information of the data packets, the information to be stored being shown in Table 2;

TABLE 2 Stored information

If a data packet satisfies the condition of a stored item, the corresponding counter in the data structure is incremented by 1; otherwise, the data structure remains unchanged;

(3.2.3) when the set threshold is reached, i.e., when the number of data packets sent by the client reaches 100, the stored information is extracted and the feature values are then calculated. Because hash collisions may introduce errors into the counts when the number of data packets is large, the information stored at the position that records the smallest number of packets sent by the client is taken as the extracted information;
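The reason for taking the position with the smallest count can be stated as a short inequality. It assumes the usual counting Bloom filter behavior in which colliding endpoints can only add to a counter; the symbols h, n, c_i and e_i are introduced here for illustration and do not appear in the original text.

```latex
% h:   number of positions an endpoint is mapped to
% n:   true number of sampled client packets for this endpoint
% c_i: client-packet counter read at the i-th mapped position
% e_i: packets from other endpoints that collided at position i
\begin{align}
  c_i &= n + e_i, \qquad e_i \ge 0, \qquad i = 1, \dots, h, \\
  \min_{1 \le i \le h} c_i &= n + \min_{1 \le i \le h} e_i \;\ge\; n .
\end{align}
```

Every position overestimates the true count by its collision term, so the position with the minimum count carries the least contamination and is the best available estimate of the endpoint's own statistics.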

(3.2.4) performing calculations on the extracted information to obtain the feature statistics of the record, wherein the correspondence between the information stored at each position and the features is shown in Table 3.

TABLE 3 Correspondence between features and stored information

Feature	Calculation method
F1	1 if the value in Counter 12 is greater than θ/2, otherwise 0
F2	Counter 2/Counter 1
F3	Counter 4/Counter 3
F4	(Counter 1-Counter 2)/Counter 4
F5	(Counter 3-Counter 4)/Counter 2
F6	Counter 2/(Counter 1+Counter 3)
F7	Counter 4/(Counter 1+Counter 3)
F8	Counter 5/(Counter 1+Counter 3)
F9	Counter 6/(Counter 1+Counter 3)
F10	Counter 7/(Counter 1+Counter 3)
F11	Counter 8/(Counter 1+Counter 3)
F12	Counter 9/(Counter 1+Counter 3)
F13	Counter 10/(Counter 1+Counter 3)
F14	Counter 11/(Counter 1+Counter 3)

Partial statistics are shown in Table 4. When the value in Counter 12 is greater than half the threshold, i.e., 50, F1 is set to 1;

TABLE 4 Partial statistical results

(4) Identifying and classifying the records with the model trained in step (2) to identify bridges, and outputting a bridge list. Partial identification results are shown in Table 5, where category 0 indicates that the server is identified as a Tor bridge, and category 1 indicates that the server is identified as a normal server.
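Turning classified records into a bridge list can be sketched as below. The class labels follow the text (0 for Tor bridge, 1 for normal server); the record layout, the function name `bridge_list`, and the toy predictor used in the demonstration are hypothetical stand-ins for the trained random forest model.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple

# A record is assumed to be (server IP, port, feature values).
Record = Tuple[str, int, Sequence[float]]
Endpoint = Tuple[str, int]

TOR_BRIDGE = 0  # category 0: server identified as a Tor bridge
NORMAL = 1      # category 1: server identified as a normal server


def bridge_list(records: Iterable[Record],
                predict: Callable[[Sequence[float]], int]) -> List[Endpoint]:
    """Return the distinct (IP, port) endpoints that the model labels as bridges."""
    bridges: List[Endpoint] = []
    seen: Set[Endpoint] = set()
    for ip, port, features in records:
        endpoint = (ip, port)
        if predict(features) == TOR_BRIDGE and endpoint not in seen:
            seen.add(endpoint)
            bridges.append(endpoint)
    return bridges


if __name__ == "__main__":
    # Toy predictor for demonstration only; a trained model would go here.
    def toy_predict(features: Sequence[float]) -> int:
        return TOR_BRIDGE if features[0] == 1.0 else NORMAL

    demo = [
        ("192.0.2.10", 443, [1.0]),
        ("192.0.2.20", 80, [0.0]),
        ("192.0.2.10", 443, [1.0]),
    ]
    print(bridge_list(demo, toy_predict))
```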

TABLE 5 Partial recognition results

To verify the accuracy of the invention under different sampling ratios and thresholds, the experimental results for different thresholds with the sampling ratio fixed at 64:1 are shown in FIG. 5.

The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims

1. A rapid identification method for a Tor bridge in a high-speed backbone network is characterized by comprising the following steps:

2. The method for rapidly identifying Tor bridges in a high-speed backbone network according to claim 1, wherein said step (1) comprises the following sub-steps:

(1.2) starting an application to start Tor flow data acquisition;

(1.3) performing network access using the Tor Browser;

(1.5) starting an application to start common flow data acquisition;

(1.6) operating with common applications;

3. The method for rapidly identifying Tor bridges in a high-speed backbone network according to claim 1, wherein said step (2) comprises the following sub-steps:

where K is the number of classes and p_k represents the sample weight of class k;

4. The method for rapid identification of Tor bridges in a high speed backbone network according to claim 3, wherein the suitable available characteristics in step (2.3) are shown in the following table:

Feature	Meaning
F1	Whether more than half of the packets carry timestamps
F2	Ratio of non-empty packets sent by the client to the total number of packets sent by the client
F3	Ratio of non-empty packets sent by the server to the total number of packets sent by the server
F4	Ratio of empty packets sent by the client to non-empty packets sent by the server
F5	Ratio of empty packets sent by the server to non-empty packets sent by the client
F6	Ratio of non-empty packets sent by the client to the total number of data packets
F7	Ratio of non-empty packets sent by the server to the total number of data packets
F8	Proportion of PSH packets sent by the client to the total number of data packets
F9	Proportion of PSH packets sent by the server to the total number of data packets
F10	Proportion of packets of length 0-50 sent by the client to the total number of data packets
F11	Proportion of packets of length 50-200 sent by the client to the total number of data packets
F12	Proportion of packets of length greater than 1200 sent by the client to the total number of data packets
F13	Proportion of packets of length 50-200 sent by the server to the total number of data packets
F14	Proportion of packets of length greater than 1200 sent by the server to the total number of data packets

5. The method for rapidly identifying Tor bridges in a high-speed backbone network according to claim 1, wherein said step (3) comprises the following sub-steps:

6. The method for rapidly identifying Tor bridges in a high-speed backbone network according to claim 5, wherein said step (3.2) comprises the following sub-steps:

7. The method for fast identification of Tor bridges in a high speed backbone network according to claim 6, wherein in said step (3.2.2), the information needed to be stored is as shown in the following table:

8. The method for rapidly identifying Tor bridges in a high-speed backbone network according to claim 6, wherein in said step (3.2.3), the information stored at the position that records the smallest number of packets sent by the client is taken as the extracted information.

9. The method of claim 6, wherein in step (3.2.4), the computed correspondence of the information stored in each location to the characteristics is as shown in the following table:
