CN109194622B

CN109194622B - Encrypted flow analysis feature selection method based on feature efficiency

Info

Publication number: CN109194622B
Application number: CN201810896859.3A
Authority: CN
Inventors: 马小博; 安冰玉; 师马玮; 焦洪山; 赵延康; 李剑锋; 彭嘉豪
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2018-08-08
Filing date: 2018-08-08
Publication date: 2020-03-31
Anticipated expiration: 2038-08-08
Also published as: CN109194622A

Abstract

The invention discloses a feature selection method for encrypted traffic analysis based on feature efficiency. First, a feature efficiency measure F(f) is defined. Then the feature efficiency of each feature is calculated on two classes of samples, the feature with the largest feature efficiency is selected, the samples lying outside the overlap of the two classes' value ranges for that feature are removed, and the number of removed samples is recorded; these steps are repeated until all features have been processed. Features are then selected according to a predetermined number of features or a feature efficiency threshold. Given a maximal feature set, the invention can effectively calculate the feature efficiency of each feature and screen the features by a given efficiency threshold or a specified feature count, which helps improve the identification accuracy of website fingerprinting and reduces the time and space cost of building the classification model.

Description

Encrypted flow analysis feature selection method based on feature efficiency

Technical Field

The invention belongs to the field of network security and user privacy, and particularly relates to an encrypted flow analysis feature selection method based on feature efficiency.

Background

In recent years, with the rapid development of the Internet, networks have become tightly integrated into production and daily life, and network security has become a problem that cannot be ignored. Public awareness of network security is gradually improving, and more users and enterprises pay attention to protecting information and transmitting it safely. Network behavior identification based on encrypted traffic can be used for the security supervision of networks, in particular the supervision of illegal business and harmful information such as human trafficking, gambling, and arms trading. Encrypted traffic analysis can reveal a user's browsing behavior, and at present it relies mainly on website fingerprinting. Website fingerprinting is a machine-learning technique for identifying websites accessed over encrypted connections: features are extracted from network traffic and the websites are classified with a supervised classification technique. The key to the technique is building a classification model that can classify websites from the extracted features.

Selecting an efficient set of extracted features is very important, for two reasons. (1) Website fingerprinting, the key technology of encrypted traffic analysis, essentially uses a machine-learning classification algorithm to build a classifier that can classify websites, so an excessive number of features affects the training and testing time of the classifier and even its classification accuracy. (2) The classifier built during encrypted traffic analysis is trained on the extracted numerical features. The extracted features must therefore characterize a website accurately and must not be overly redundant or sparse.

At present, the only feature selection method in common use for encrypted traffic analysis, in China and abroad, is the feature importance evaluation built into the random forest construction process. The importance of each feature is computed during training and is relative to overall performance: the more important a feature is, the larger its influence on the classification result. Specifically, feature importance is assessed through the out-of-bag (OOB) error computed on the out-of-bag data while the classification model is trained. This method depends on the random forest model and does not optimize feature selection on the basis of the data itself.

Therefore, feature selection for encrypted traffic analysis has not yet been studied extensively, and the related techniques are not widely applied.

Disclosure of Invention

The invention aims to provide a feature selection method for encrypted traffic analysis based on feature efficiency, so as to solve the problems described above.

In order to achieve the purpose, the invention adopts the following technical scheme:

A feature selection method for encrypted traffic analysis based on feature efficiency comprises the following steps:

Step 1: Given sample sets of several classes and an extracted feature set of size n, where each class sample set contains a number of website samples and each sample is represented by n extracted feature values;

Step 2: Based on data complexity, examine the separability offered by each feature from the composition and distribution of its value space, and define the feature efficiency of each feature as

$$F(f) = 1 - \frac{\left|\{\,x \in C_j \cup C_k : \mathrm{MAXMIN}_f \le x_f \le \mathrm{MINMAX}_f\,\}\right|}{\left|C_{jf} \cup C_{kf}\right|}$$

where f denotes a feature; C_j denotes the full set of samples of class j; |C_jf ∪ C_kf| denotes the size of the union of class j and class k on feature f, that is, the union of the two classes is taken over the feature values and its size is then counted;

and the numerator denotes, for feature f, the number of samples of classes k and j whose value x_f of that feature lies between MAXMIN_f and MINMAX_f;

Step 3: Using the above calculation, choose any two classes of samples from the given classes, calculate F(f) for each feature on them, and select the feature f_max with the largest value;

Step 4: In the two selected classes, delete the samples lying outside the overlap of the value ranges of f_max, remove f_max from the feature set, and record the number num of deleted samples;

Step 5: Repeat steps 3 and 4 on the remaining feature set and sample set until the sample set or the feature set is empty;

Step 6: Finally, for each feature, the ratio of its number of removed samples num to the total number of samples is the importance f_value of that feature; the m most important features are selected according to f_value, where m is the size of the feature set after screening.
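The greedy loop of steps 3 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`efficiency`, `feature_importance`, `select_top`) and the toy data are hypothetical, the union size is assumed to be the number of samples in the classes being compared, the overlap bounds are assumed inclusive, ties are broken in favor of the first feature, and the loop is assumed to stop once fewer than two classes still contain samples. Under these assumptions intermediate counts may differ from figures obtained with another counting convention.

```python
from typing import Dict, List, Sequence, Tuple

Classes = Dict[str, List[Sequence[float]]]


def efficiency(classes: Classes, f: int) -> Tuple[float, float, float]:
    """Return (F(f), MAXMIN_f, MINMAX_f) over the non-empty classes."""
    groups = [[s[f] for s in samples] for samples in classes.values() if samples]
    maxmin = max(min(g) for g in groups)  # largest per-class minimum
    minmax = min(max(g) for g in groups)  # smallest per-class maximum
    inside = sum(1 for g in groups for v in g if maxmin <= v <= minmax)
    total = sum(len(g) for g in groups)  # ASSUMPTION: union size = sample count
    return 1.0 - inside / total, maxmin, minmax


def feature_importance(classes: Classes, n_features: int) -> Dict[int, float]:
    """Greedy loop of steps 3 to 6; returns f_value per feature index."""
    work = {name: list(samples) for name, samples in classes.items()}
    total = sum(len(samples) for samples in work.values())
    remaining = list(range(n_features))
    removed = {f: 0 for f in remaining}

    # ASSUMPTION: stop once fewer than two classes still have samples.
    while remaining and sum(1 for s in work.values() if s) >= 2:
        scores = {f: efficiency(work, f) for f in remaining}
        f_max = max(remaining, key=lambda f: scores[f][0])  # first maximum wins ties
        _, lo, hi = scores[f_max]
        for name, samples in work.items():
            kept = [s for s in samples if lo <= s[f_max] <= hi]
            removed[f_max] += len(samples) - len(kept)
            work[name] = kept
        remaining.remove(f_max)  # step 4: drop the selected feature

    return {f: count / total for f, count in removed.items()}


def select_top(importance: Dict[int, float], m: int) -> List[int]:
    """Indices of the m features with the largest f_value."""
    return sorted(importance, key=importance.get, reverse=True)[:m]


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
toy = {"site_a": [[0, 5], [1, 5]], "site_b": [[5, 5], [6, 5]]}
scores = feature_importance(toy, n_features=2)
print(scores, select_top(scores, m=1))
```

The sketch keeps full-length sample vectors and tracks the indices of the remaining features instead of physically deleting columns, which keeps feature indices stable across iterations.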

Further, in step 2, the feature efficiency of each feature is calculated as follows:

1) Choose any feature f and count the size NUM of the union of the class sample sets on feature f, that is, take the union of all classes over the feature values and then count its size;

2) Find the maximum value of feature f within each class of samples and record the smallest of these per-class maxima as MINMAX_f; then find the minimum value of feature f within each class of samples and record the largest of these per-class minima as MAXMIN_f;

3) With MAXMIN_f and MINMAX_f obtained, count the samples in all classes whose value of feature f lies between MAXMIN_f and MINMAX_f. The feature efficiency F(f) of feature f then equals 1 minus the ratio of this count to the union size NUM of feature f.
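As an illustration of sub-steps 1) to 3), the following Python sketch computes F(f) for one feature. The function name `feature_efficiency` and the toy data are hypothetical. Two points are assumptions rather than statements of the text: NUM is taken to be the number of samples across all classes, and the interval between MAXMIN_f and MINMAX_f is treated as inclusive. Every class is assumed to contain at least one sample.

```python
from typing import Dict, List, Sequence


def feature_efficiency(classes: Dict[str, List[Sequence[float]]], f: int) -> float:
    """Return F(f) = 1 - (samples inside [MAXMIN_f, MINMAX_f]) / NUM."""
    # Values of feature f, grouped by class.
    per_class = [[sample[f] for sample in samples] for samples in classes.values()]

    # MINMAX_f: the smallest of the per-class maxima.
    minmax = min(max(values) for values in per_class)
    # MAXMIN_f: the largest of the per-class minima.
    maxmin = max(min(values) for values in per_class)

    # Samples whose value lies in the overlap (bounds assumed inclusive).
    inside = sum(1 for values in per_class for v in values if maxmin <= v <= minmax)

    # ASSUMPTION: NUM is the number of samples across all classes.
    num = sum(len(values) for values in per_class)
    return 1.0 - inside / num


# Hypothetical toy data: one feature, two classes.
separable = {"site_a": [[0], [1]], "site_b": [[5], [6]]}
touching = {"site_a": [[0], [1], [2]], "site_b": [[2], [3], [4]]}
print(feature_efficiency(separable, 0))
print(round(feature_efficiency(touching, 0), 3))
```

When the per-class value ranges do not overlap at all, MAXMIN_f exceeds MINMAX_f, no sample falls inside the interval, and the efficiency reaches its maximum of 1; when the ranges coincide, every sample falls inside and the efficiency is 0.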

Further, the given original feature set size n and the screened feature set size m are set by the user.

Furthermore, each class sample set is a set of website samples that has undergone feature extraction; feature extraction derives representative features, including the number of data packets and the intervals between data packets, from the captured packet sequence according to the timing and size characteristics of the packets.
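A minimal Python sketch of this kind of feature extraction is shown below. It assumes a captured trace is available as a list of (timestamp, size) pairs; the function name `extract_features`, the feature names, and the choice of summary statistics (mean and maximum gap, total bytes) are hypothetical illustrations, since the text names only the packet count and the inter-packet interval.

```python
from typing import Dict, List, Tuple

Packet = Tuple[float, int]  # (timestamp in seconds, size in bytes)


def extract_features(packets: List[Packet]) -> Dict[str, float]:
    """Summarize one captured packet sequence as a small feature vector."""
    ordered = sorted(packets)  # order by timestamp
    # Interval between each packet and the next one.
    gaps = [later[0] - earlier[0] for earlier, later in zip(ordered, ordered[1:])]
    return {
        "packet_count": len(ordered),
        "total_bytes": sum(size for _, size in ordered),
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
        "max_gap": max(gaps, default=0.0),
    }


# Hypothetical trace of three packets.
trace = [(0.0, 100), (0.5, 200), (1.5, 300)]
print(extract_features(trace))
```

Each website visit yields one such vector, and the vectors of all visits to the same website form that website's class sample set.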

Compared with the prior art, the invention has the following technical effects:

Given a maximal feature set, the invention can effectively calculate the feature efficiency of each feature and screen the features by a given feature efficiency threshold or a specified number of features, which helps improve the identification accuracy of website fingerprinting and reduces the time and space cost of building the classification model. When the feature set extracted from network traffic is very large, the feature set can be reduced as far as possible while keeping the highest feature coverage, which saves training cost when the classification model is built and improves identification accuracy.

The invention is independent of any classification model. Starting from the data itself and from the viewpoint of data complexity, it evaluates the importance of each feature for separating samples, and it focuses on the problem of functional overlap among different features, so that features are screened effectively.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings:

Referring to FIG. 1, a feature selection method for encrypted traffic analysis based on feature efficiency comprises steps 1 to 6 as set out in the Disclosure of Invention above. In step 2, the feature efficiency F(f) is calculated by sub-steps 1) to 3) described there: F(f) equals 1 minus the ratio of the number of samples of classes k and j whose value of feature f lies between MAXMIN_f and MINMAX_f to the union size NUM of feature f.

The original feature set size n and the screened feature set size m are set by the user.

Each class sample set is a set of website samples that has undergone feature extraction; feature extraction derives representative features, including the number of data packets and the intervals between data packets, from the captured packet sequence according to the timing and size characteristics of the packets.

Example 1:

Step 1: Acquire the full feature set, including all samples and their class information.

Assume that the two classes and the samples they contain are as follows:

{{baidu.com}: [[1,2,3,0],[2,0,5,6],[1,3,4,6],[0,3,4,5]]}

{{google.com}: [[1,3,5,2],[3,4,3,7],[2,5,3,6],[2,5,4,8]]}

Here baidu and google denote two website classes, and each "[ ]" denotes one traffic sample. There are 4 features in total, that is, fea_num is 4; the four features are assumed to represent, respectively, packet-size statistics for the ranges 32, 64, 128, and 256. The number of samples n is 8. Suppose that two effective features are to be selected.

Step 2: Using the feature efficiency formula defined earlier, calculate F(f) for each feature on the two classes of samples and select the feature f_max with the largest value. For the class sample information of step 1, the calculation proceeds as follows.

First, the feature efficiency of the first feature is calculated. For this feature, the number of samples in the two classes whose feature value lies between 1 and 2 is 6, and |C_jf ∪ C_kf|, the size of the union of the two classes on this feature, is 7.

The feature efficiency of this feature is therefore 1 - 6/7 ≈ 0.143.

The feature efficiencies of the remaining three features are then calculated in the same way, giving 0.571, 0, and 0.429, respectively.

Finally, the feature with the largest value, the second feature, is selected as f_max.
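The calculation for the first feature can be reproduced with a short Python sketch. The variable names are hypothetical. The bounds and the in-range count are derived from the sample data of step 1, while the union size of 7 is taken directly from the text as given rather than derived, because the exact counting convention for the union is an open assumption here.

```python
baidu = [[1, 2, 3, 0], [2, 0, 5, 6], [1, 3, 4, 6], [0, 3, 4, 5]]
google = [[1, 3, 5, 2], [3, 4, 3, 7], [2, 5, 3, 6], [2, 5, 4, 8]]

f = 0  # first feature (0-based index)
a = [sample[f] for sample in baidu]
b = [sample[f] for sample in google]

maxmin = max(min(a), min(b))  # largest per-class minimum
minmax = min(max(a), max(b))  # smallest per-class maximum

# Samples of both classes whose value lies in [maxmin, minmax], bounds inclusive.
inside = sum(maxmin <= v <= minmax for v in a + b)

union_size = 7  # taken from the text, not computed (assumption)
efficiency = 1 - inside / union_size

print(maxmin, minmax, inside, round(efficiency, 3))  # 1 2 6 0.143
```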

Step 3: Remove the samples in the two classes that lie outside the overlap of the value ranges of f_max, remove feature f_max, and count the removed samples. The overlap of the value ranges of the feature f_max obtained in step 2 is {3}, so the samples whose value for this feature does not belong to that set are removed, namely [2,0,5,6], [3,4,3,7], [2,5,3,6], [2,5,4,8], and [1,2,3,0]; the number of removed samples is 5. The feature f_max is then deleted from the remaining samples, each of which now contains 3 features.
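The removal step can be checked with a short Python sketch on the sample data of step 1. The variable names and the helper `drop_feature` are hypothetical, and the overlap bounds are assumed to be inclusive.

```python
baidu = [[1, 2, 3, 0], [2, 0, 5, 6], [1, 3, 4, 6], [0, 3, 4, 5]]
google = [[1, 3, 5, 2], [3, 4, 3, 7], [2, 5, 3, 6], [2, 5, 4, 8]]

f_max = 1  # second feature (0-based index)

# Overlap of the two classes' value ranges on f_max.
lo = max(min(s[f_max] for s in baidu), min(s[f_max] for s in google))
hi = min(max(s[f_max] for s in baidu), max(s[f_max] for s in google))

# Keep only the samples whose value lies inside the overlap.
kept_baidu = [s for s in baidu if lo <= s[f_max] <= hi]
kept_google = [s for s in google if lo <= s[f_max] <= hi]
removed = len(baidu) + len(google) - len(kept_baidu) - len(kept_google)


def drop_feature(sample, index):
    """Return a copy of the sample without the feature at `index`."""
    return sample[:index] + sample[index + 1:]


remaining_baidu = [drop_feature(s, f_max) for s in kept_baidu]
remaining_google = [drop_feature(s, f_max) for s in kept_google]

print((lo, hi), removed)   # (3, 3) 5
print(remaining_baidu)     # [[1, 4, 6], [0, 4, 5]]
print(remaining_google)    # [[1, 5, 2]]
```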

Step 4: Repeat steps 2 and 3 on the remaining feature set and sample set until the sample set or the feature set is empty.

The sample information remaining after the first feature removal is:

{{baidu.com}: [[1,4,6],[0,4,5]]}

{{google.com}: [[1,5,2]]}

Steps 2 and 3 are applied again to this sample information to obtain the next removed feature and the number of samples removed with it. This is repeated until the sample set or the feature set is empty, recording the number of removed samples for every feature. The final results are as follows:

Samples removed by the 1st feature: 1

Samples removed by the 2nd feature: 5

Samples removed by the 3rd feature: 2

Samples removed by the 4th feature: 0

Step 5: The importance of each feature is calculated as the ratio of the number of samples removed by that feature to the total number of samples. The importances of the four features are 0.125, 0.625, 0.25, and 0, respectively. The two features finally selected are therefore the second and the third.
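The final ranking can be reproduced from the removal counts listed above with a few lines of Python. The variable names are hypothetical; the counts and the total of 8 samples are taken from the example.

```python
# Samples removed per feature (1-based feature numbers, as in the example).
removed = {1: 1, 2: 5, 3: 2, 4: 0}
total_samples = 8
m = 2  # number of features to keep

# Importance f_value = removed samples / total samples.
importance = {f: count / total_samples for f, count in removed.items()}

# Keep the m features with the largest importance.
selected = sorted(importance, key=importance.get, reverse=True)[:m]

print(importance)  # {1: 0.125, 2: 0.625, 3: 0.25, 4: 0.0}
print(selected)    # [2, 3]
```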

Claims

1. A method for selecting encrypted flow analysis characteristics based on characteristic efficiency is characterized by comprising the following steps:

2. The feature efficiency-based encrypted traffic analysis feature selection method according to claim 1, wherein in the step 2, the calculation method for determining the feature efficiency of each feature comprises the following steps:

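The feature efficiency F(f) of feature f equals 1 minus the ratio of the number of samples whose value of feature f lies between MAXMIN_f and MINMAX_f to the union size NUM of feature f.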

3. The method for selecting the encrypted flow analysis features based on the feature efficiency as claimed in claim 1, wherein the given size n of the original feature set and the size m of the feature set after being filtered are set by a user.

4. The feature selection method for encrypted traffic analysis based on feature efficiency as claimed in claim 1, wherein the class sample set is a website sample set subjected to feature extraction, and the feature extraction is to extract representative features from the captured data packet sequence according to time and size characteristics of the data packets, including the number of the data packets and the interval time between the data packets.