CN109194622B - Encrypted flow analysis feature selection method based on feature efficiency - Google Patents
- Publication number
- CN109194622B (application CN201810896859.3A)
- Authority
- CN
- China
- Prior art keywords
- feature
- samples
- efficiency
- characteristic
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0227—Filtering policies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a feature selection method for encrypted traffic analysis based on feature efficiency. A feature efficiency measure F(f) is first defined. The feature efficiency of each feature is then computed on two classes of samples, the feature with the largest efficiency is selected, the samples outside the overlapping part of that feature's value range are removed, and the number of removed samples is recorded; these steps are repeated until all features have been processed. Features are then selected according to a predetermined number of features or a feature efficiency threshold. Given the maximal feature set, the invention computes the feature efficiency of each feature and screens features by the given efficiency threshold or the specified number of features, which helps improve the identification accuracy of website fingerprinting and reduces the time and space cost of building the classification model.
Description
Technical Field
The invention belongs to the field of network security and user privacy, and particularly relates to an encrypted flow analysis feature selection method based on feature efficiency.
Background
In recent years, with the rapid development of the Internet, networks have become tightly integrated into production and daily life, and network security has become a problem that cannot be ignored. Public awareness of network security is gradually improving, and more users and enterprises pay attention to protecting information and transmitting it securely. Network behavior identification based on encrypted traffic can support security supervision of the network, in particular supervision of illegal business and harmful information such as human trafficking, gambling, and arms trading. Encrypted traffic analysis can analyze users' browsing behavior, and at present it relies mainly on website fingerprinting. Website fingerprinting identifies websites accessed over encrypted connections using machine learning: it extracts features from network traffic and classifies websites with a supervised classification technique. The key to the technique is building a classification model that can classify websites from the extracted features.
Selecting an efficient feature extraction method is very important, for two reasons: (1) Website fingerprinting, the key technique of encrypted traffic analysis, essentially uses a machine learning classification algorithm to build a classifier that can classify websites, so too many features increase the classifier's training and testing time and can even reduce classification accuracy. (2) The classifier built during encrypted traffic analysis is trained on the extracted numerical features, so the extracted features must characterize a website accurately and must not be overly redundant or sparse.
At present, the only feature selection method in common use for encrypted traffic analysis, in China and abroad, is feature importance evaluation built on the construction process of the random forest algorithm. The importance of each feature is computed during training and is relative to overall performance: the more important a feature, the larger its influence on the classification result. Specifically, feature importance is assessed through the out-of-bag (OOB) error computed on the out-of-bag data while the classification model is trained. This method depends on the random forest model and does not optimize feature extraction on the basis of the data itself.
Therefore, the feature selection problem related to encrypted traffic analysis has not been extensively studied, and the related technology has not been widely applied.
Disclosure of Invention
The invention aims to provide a feature selection method for encrypted traffic analysis based on feature efficiency, so as to solve the above problems.
To achieve this purpose, the invention adopts the following technical scheme:
A feature selection method for encrypted traffic analysis based on feature efficiency comprises the following steps:
Step 1: Given several classes of sample sets and an extracted feature set of size n, where each class's sample set contains multiple website samples and each sample is represented by the values of the n extracted features;
Step 2: Based on data complexity, assess how well each feature separates the classes from the composition and distribution of its value space, and define the feature efficiency of each feature as
$$F(f) = 1 - \frac{\left|\{x \in C_j \cup C_k : \mathrm{MAXMIN}_f \le x_f \le \mathrm{MINMAX}_f\}\right|}{\left|C_{jf} \cup C_{kf}\right|}$$
where f denotes a feature; C_j denotes the full sample set of class j; x_f is the value of feature f for sample x; |C_jf ∪ C_kf| is the size of the union of classes j and k on feature f, that is, the union of the two classes is taken on the feature value and its size is counted; and the numerator is the number of samples of classes j and k whose value for feature f lies between MAXMIN_f and MINMAX_f;
Step 3: Using this calculation, pick any two classes from the given sample classes, compute F(f) on them for each feature, and select the feature f_max with the largest value;
Step 4: From the two selected classes, delete the samples that fall outside the overlapping part of the f_max value range, remove the feature f_max, and record the number num of deleted samples;
Step 5: Repeat steps 3 and 4 on the remaining feature set and sample set until the sample set or the feature set is empty;
Step 6: Finally, for each feature, the ratio of its number of removed samples num to the total number of samples is the importance f_value of that feature; the m most important features are selected by f_value, where m is the size of the feature set after screening.
Further, in step 2, the calculation method for determining the feature efficiency of each feature includes the following steps:
1) Pick any feature f and count the size NUM of the union of the class sample sets on feature f, that is, take the union of all classes on the feature value and count its size;
2) Compute the maximum value of feature f within each class and record the smallest of these per-class maxima as MINMAX_f; then compute the minimum value of feature f within each class and record the largest of these per-class minima as MAXMIN_f;
3) With MAXMIN_f and MINMAX_f obtained, count the samples in all classes whose value for feature f lies between MAXMIN_f and MINMAX_f; the feature efficiency F(f) of feature f is then 1 minus the ratio of this sample count to the union size NUM of feature f.
Further, the given original feature set size n and the screened feature set size m are set by the user.
Furthermore, the class sample set is a website sample set subjected to feature extraction, and the feature extraction is to extract representative features including the number of data packets and the interval time between the data packets from the captured data packet sequence according to the time and size characteristics of the data packets.
Compared with the prior art, the invention has the following technical effects:
Given the maximal feature set, the invention can effectively compute the feature efficiency of each feature and screen features according to a given feature efficiency threshold or a specified number of features, which helps improve the identification accuracy of website fingerprinting and saves the time and space consumed in building the classification model. When the feature set extracted from network traffic is very large, the feature set can be reduced as far as possible while keeping feature coverage as high as possible, saving training cost when building the classification model and improving identification accuracy.
The invention is independent of the classification model. Starting from the data itself and from the viewpoint of data complexity, it evaluates the importance of each feature in distinguishing samples, with emphasis on resolving the functional overlap (redundancy) between different features noted above, and thereby screens features effectively.
Drawings
FIG. 1 is a flow chart of the present invention;
Detailed Description
The invention is further described below with reference to the accompanying drawings:
Referring to FIG. 1, a feature selection method for encrypted traffic analysis based on feature efficiency includes the following steps:
Step 1: Given several classes of sample sets and an extracted feature set of size n, where each class's sample set contains multiple website samples and each sample is represented by the values of the n extracted features;
Step 2: Based on data complexity, assess how well each feature separates the classes from the composition and distribution of its value space, and define the feature efficiency of each feature as
$$F(f) = 1 - \frac{\left|\{x \in C_j \cup C_k : \mathrm{MAXMIN}_f \le x_f \le \mathrm{MINMAX}_f\}\right|}{\left|C_{jf} \cup C_{kf}\right|}$$
where f denotes a feature; C_j denotes the full sample set of class j; x_f is the value of feature f for sample x; |C_jf ∪ C_kf| is the size of the union of classes j and k on feature f, that is, the union of the two classes is taken on the feature value and its size is counted; and the numerator is the number of samples of classes j and k whose value for feature f lies between MAXMIN_f and MINMAX_f;
Step 3: Using this calculation, pick any two classes from the given sample classes, compute F(f) on them for each feature, and select the feature f_max with the largest value;
Step 4: From the two selected classes, delete the samples that fall outside the overlapping part of the f_max value range, remove the feature f_max, and record the number num of deleted samples;
Step 5: Repeat steps 3 and 4 on the remaining feature set and sample set until the sample set or the feature set is empty;
Step 6: Finally, for each feature, the ratio of its number of removed samples num to the total number of samples is the importance f_value of that feature; the m most important features are selected by f_value, where m is the size of the feature set after screening.
In step 2, the calculation method for determining the feature efficiency of each feature comprises the following steps:
1) Pick any feature f and count the size NUM of the union of the class sample sets on feature f, that is, take the union of all classes on the feature value and count its size;
2) Compute the maximum value of feature f within each class and record the smallest of these per-class maxima as MINMAX_f; then compute the minimum value of feature f within each class and record the largest of these per-class minima as MAXMIN_f;
3) With MAXMIN_f and MINMAX_f obtained, count the samples in all classes whose value for feature f lies between MAXMIN_f and MINMAX_f; the feature efficiency F(f) of feature f is then 1 minus the ratio of this sample count to the union size NUM of feature f.
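The three sub-steps above can be sketched in Python for the two-class case. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `feature_efficiency` is hypothetical, the union size NUM is assumed to be the total number of samples in both classes, and the interval between MAXMIN_f and MINMAX_f is assumed to be closed at both ends.

```python
from typing import Sequence


def feature_efficiency(class_a: Sequence[float], class_b: Sequence[float]) -> float:
    """Feature efficiency F(f) of one feature over two classes.

    class_a and class_b hold the values of feature f for the samples of each class.
    Assumptions: NUM is the total sample count of both classes, and the
    overlap interval [MAXMIN_f, MINMAX_f] includes its end points.
    """
    if not class_a or not class_b:
        raise ValueError("both classes need at least one sample")
    minmax = min(max(class_a), max(class_b))  # smallest of the per-class maxima
    maxmin = max(min(class_a), min(class_b))  # largest of the per-class minima
    values = list(class_a) + list(class_b)
    # Samples inside the overlap; none qualify when maxmin > minmax.
    in_overlap = sum(1 for v in values if maxmin <= v <= minmax)
    return 1 - in_overlap / len(values)


# Illustrative values, not taken from the patent.
print(feature_efficiency([1, 2, 3, 4], [3, 4, 5, 6]))  # overlap [3, 4] holds 4 of 8 samples
print(feature_efficiency([1, 2], [5, 6]))              # disjoint ranges: nothing overlaps
```

Under these assumptions a feature whose class ranges do not overlap at all scores 1, and a feature whose ranges coincide scores 0, which matches the intent that higher efficiency means better separation.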
The given original characteristic set size n and the screened characteristic set size m are set by the user.
The class sample set is a website sample set subjected to feature extraction, and the feature extraction is to extract representative features including the number of data packets and the interval time between the data packets from the captured data packet sequence according to the time and size characteristics of the data packets.
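As a rough illustration of the kind of feature extraction described here, the sketch below turns one captured packet sequence into a numeric feature vector. The packet representation (timestamp, size), the function name `extract_features`, and the particular choice of four features are assumptions made for the example; the patent names only the number of packets and the inter-packet interval as representative features.

```python
from typing import List, Tuple

Packet = Tuple[float, int]  # (capture timestamp in seconds, packet size in bytes)


def extract_features(packets: List[Packet]) -> List[float]:
    """Return [packet count, total bytes, mean inter-packet gap, max inter-packet gap]."""
    if not packets:
        return [0.0, 0.0, 0.0, 0.0]
    times = sorted(t for t, _ in packets)
    sizes = [s for _, s in packets]
    # Time differences between consecutive packets.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    max_gap = max(gaps) if gaps else 0.0
    return [float(len(packets)), float(sum(sizes)), mean_gap, max_gap]


# Hypothetical three-packet trace.
trace = [(0.0, 60), (0.5, 1500), (2.0, 1500)]
print(extract_features(trace))  # [3.0, 3060.0, 1.0, 1.5]
```

Each website sample then becomes one such vector, and the selection method operates on the columns of the resulting table.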
Example 1:
Step 1: Obtain the full feature set, including all samples and their class labels.
Assume that the two categories and the sample information they include are as follows:
{{baidu.com}:[[1,2,3,0],[2,0,5,6],[1,3,4,6],[0,3,4,5]]}
{{google.com}: [[1,3,5,2],[3,4,3,7],[2,5,3,6],[2,5,4,8]]}
where baidu and google denote two website classes, and each inner "[ ]" represents one traffic sample. There are four features in total, i.e. fea_num = 4, and the four features are assumed to represent packet-size statistics for the ranges 32, 64, 128, and 256. The number of samples is 8. Suppose two effective features are to be selected.
Step 2: Using the predefined feature efficiency formula, compute F(f) for each feature on the two classes of samples and select the feature f_max with the largest value. For the class sample information from step 1, the calculation proceeds as follows:
First, the feature efficiency of the first feature is calculated.
For the first feature, the number of samples in the two classes whose value lies between 1 and 2 is 6, and |C_jf ∪ C_kf|, the size of the union of the two classes on this feature, is 7:
$$F(f_1) = 1 - \frac{6}{7} \approx 0.143$$
The resulting feature efficiency of this feature is 0.143.
The feature efficiencies of the remaining three features are then computed in the same way, giving 0.571, 0, and 0.429, respectively.
Finally, the feature with the largest value, f_max, is the second feature.
Step 3: Remove the samples that fall outside the overlapping part of the f_max value range in the two classes, remove the feature f_max, and count the removed samples. The overlapping part of the value range of the feature f_max obtained in step 2 is {3}, so the samples whose value for that feature is not in this set are removed first: [2,0,5,6], [3,4,3,7], [2,5,3,6], [2,5,4,8], and [1,2,3,0], for a count of 5 removed samples. The feature f_max is then deleted from the remaining samples, which now contain 3 features each.
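The removal in this step can be reproduced with a short Python sketch on the example data. The helper name `remove_outside_overlap` and the dictionary layout are assumptions for illustration; the overlap is taken as the closed interval between the largest per-class minimum and the smallest per-class maximum.

```python
from typing import Dict, List, Tuple

Samples = Dict[str, List[List[int]]]


def remove_outside_overlap(samples: Samples, feature: int) -> Tuple[Samples, int]:
    """Keep samples whose value for `feature` lies in the range shared by all
    classes, drop that feature column, and report how many samples were removed."""
    minmax = min(max(row[feature] for row in rows) for rows in samples.values())
    maxmin = max(min(row[feature] for row in rows) for rows in samples.values())
    kept: Samples = {}
    removed = 0
    for label, rows in samples.items():
        kept[label] = []
        for row in rows:
            if maxmin <= row[feature] <= minmax:
                # Drop the selected feature from the surviving sample.
                kept[label].append(row[:feature] + row[feature + 1:])
            else:
                removed += 1
    return kept, removed


samples = {
    "baidu.com": [[1, 2, 3, 0], [2, 0, 5, 6], [1, 3, 4, 6], [0, 3, 4, 5]],
    "google.com": [[1, 3, 5, 2], [3, 4, 3, 7], [2, 5, 3, 6], [2, 5, 4, 8]],
}
kept, removed = remove_outside_overlap(samples, feature=1)  # index 1 is the second feature
print(removed)  # 5
print(kept)     # {'baidu.com': [[1, 4, 6], [0, 4, 5]], 'google.com': [[1, 5, 2]]}
```

The five removed samples and the three remaining three-feature samples agree with the values stated in this step.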
Step 4: Repeat steps 2 and 3 on the remaining feature set and sample set until the sample set or the feature set is empty.
The sample information remaining after the first removal of the feature is:
{{baidu.com}:[ [1,4,6],[0,4,5]]}
{{google.com}: [[1,5,2]]}
Applying steps 2 and 3 again to this sample information yields the next removed feature and its removed-sample count. The process repeats until the sample set or the feature set is empty, recording the number of removed samples for every feature. The final results are as follows:
Number of samples removed for the 1st feature: 1
Number of samples removed for the 2nd feature: 5
Number of samples removed for the 3rd feature: 2
Number of samples removed for the 4th feature: 0
Step 5: The importance of each feature is the ratio of the number of samples removed for that feature to the total number of samples. The importances of the four features are 0.125, 0.625, 0.25, and 0, respectively. The two features finally selected are therefore the second and third features.
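The whole procedure of Example 1 can be sketched end to end in Python. This is an illustrative reading of the method, not the patented implementation, and it rests on several assumptions: the union size is taken as the current total number of samples, the overlap interval is closed, ties in feature efficiency go to the earliest feature, and the loop stops as soon as any class has no samples left. The names `feature_importance` and `select_top` are hypothetical. Because of these assumptions, the per-feature removal counts it produces are not guaranteed to equal the counts listed above, although on this data it selects the same two features.

```python
from typing import Dict, List, Tuple

Samples = Dict[str, List[List[float]]]


def overlap_bounds(samples: Samples, col: int) -> Tuple[float, float]:
    """Return (MAXMIN_f, MINMAX_f) for feature column `col`."""
    minmax = min(max(r[col] for r in rows) for rows in samples.values())
    maxmin = max(min(r[col] for r in rows) for rows in samples.values())
    return maxmin, minmax


def efficiency(samples: Samples, col: int) -> float:
    """F(f) = 1 - (samples inside the overlap) / (total samples, assumed union size)."""
    lo, hi = overlap_bounds(samples, col)
    total = sum(len(rows) for rows in samples.values())
    inside = sum(1 for rows in samples.values() for r in rows if lo <= r[col] <= hi)
    return 1 - inside / total


def feature_importance(samples: Samples, n_features: int) -> List[float]:
    """Share of all samples removed when each feature is selected."""
    total = sum(len(rows) for rows in samples.values())
    removed = [0] * n_features
    if total == 0:
        return [0.0] * n_features
    remaining = list(range(n_features))  # original feature indices still in play
    current = {label: list(rows) for label, rows in samples.items()}
    # Assumption: stop once the features run out or any class becomes empty.
    while remaining and all(current.values()):
        best = max(remaining, key=lambda col: efficiency(current, col))  # first maximum wins ties
        lo, hi = overlap_bounds(current, best)
        for label, rows in current.items():
            kept = [r for r in rows if lo <= r[best] <= hi]
            removed[best] += len(rows) - len(kept)
            current[label] = kept
        remaining.remove(best)
    return [count / total for count in removed]


def select_top(importance: List[float], m: int) -> List[int]:
    """Indices (0-based) of the m most important features, in ascending order."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:m])


samples = {
    "baidu.com": [[1, 2, 3, 0], [2, 0, 5, 6], [1, 3, 4, 6], [0, 3, 4, 5]],
    "google.com": [[1, 3, 5, 2], [3, 4, 3, 7], [2, 5, 3, 6], [2, 5, 4, 8]],
}
importance = feature_importance(samples, n_features=4)
print(importance)
print(select_top(importance, m=2))  # [1, 2], i.e. the second and third features
```

Rows keep their full width here and the selected feature is simply retired from `remaining`, which avoids re-indexing columns after each round.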
Claims (4)
1. A method for selecting encrypted flow analysis characteristics based on characteristic efficiency is characterized by comprising the following steps:
Step 1: Given several classes of sample sets and an extracted feature set of size n, where each class's sample set contains multiple website samples and each sample is represented by the values of the n extracted features;
Step 2: Based on data complexity, assess how well each feature separates the classes from the composition and distribution of its value space, and define the feature efficiency of each feature as
$$F(f) = 1 - \frac{\left|\{x \in C_j \cup C_k : \mathrm{MAXMIN}_f \le x_f \le \mathrm{MINMAX}_f\}\right|}{\left|C_{jf} \cup C_{kf}\right|}$$
where f denotes a feature; C_j denotes the full sample set of class j; x_f is the value of feature f for sample x; |C_jf ∪ C_kf| is the size of the union of classes j and k on feature f, that is, the union of the two classes is taken on the feature value and its size is counted; and the numerator is the number of samples of classes j and k whose value for feature f lies between MAXMIN_f and MINMAX_f;
Step 3: Using this calculation, pick any two classes from the given sample classes, compute F(f) on them for each feature, and select the feature f_max with the largest value;
Step 4: From the two selected classes, delete the samples that fall outside the overlapping part of the f_max value range, remove the feature f_max, and record the number num of deleted samples;
Step 5: Repeat steps 3 and 4 on the remaining feature set and sample set until the sample set or the feature set is empty;
Step 6: Finally, for each feature, the ratio of its number of removed samples num to the total number of samples is the importance f_value of that feature; the m most important features are selected by f_value, where m is the size of the feature set after screening.
2. The feature efficiency-based encrypted traffic analysis feature selection method according to claim 1, wherein in the step 2, the calculation method for determining the feature efficiency of each feature comprises the following steps:
1) Pick any feature f and count the size NUM of the union of the class sample sets on feature f, that is, take the union of all classes on the feature value and count its size;
2) Compute the maximum value of feature f within each class and record the smallest of these per-class maxima as MINMAX_f; then compute the minimum value of feature f within each class and record the largest of these per-class minima as MAXMIN_f;
3) With MAXMIN_f and MINMAX_f obtained, count the samples in all classes whose value for feature f lies between MAXMIN_f and MINMAX_f; the feature efficiency F(f) of feature f is then 1 minus the ratio of this sample count to the union size NUM of feature f.
3. The method for selecting the encrypted flow analysis features based on the feature efficiency as claimed in claim 1, wherein the given size n of the original feature set and the size m of the feature set after being filtered are set by a user.
4. The feature selection method for encrypted traffic analysis based on feature efficiency as claimed in claim 1, wherein the class sample set is a website sample set subjected to feature extraction, and the feature extraction is to extract representative features from the captured data packet sequence according to time and size characteristics of the data packets, including the number of the data packets and the interval time between the data packets.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810896859.3A CN109194622B (en) | 2018-08-08 | 2018-08-08 | Encrypted flow analysis feature selection method based on feature efficiency |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810896859.3A CN109194622B (en) | 2018-08-08 | 2018-08-08 | Encrypted flow analysis feature selection method based on feature efficiency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109194622A CN109194622A (en) | 2019-01-11 |
CN109194622B true CN109194622B (en) | 2020-03-31 |
Family
ID=64920584
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810896859.3A Active CN109194622B (en) | 2018-08-08 | 2018-08-08 | Encrypted flow analysis feature selection method based on feature efficiency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109194622B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110011931B (en) * | 2019-01-25 | 2020-10-16 | 中国科学院信息工程研究所 | Encrypted flow type detection method and system |
CN109981335B (en) * | 2019-01-28 | 2022-02-22 | 重庆邮电大学 | Feature selection method for combined type unbalanced flow classification |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103780501A (en) * | 2014-01-03 | 2014-05-07 | 濮阳职业技术学院 | Peer-to-peer network traffic identification method of inseparable-wavelet support vector machine |
CN105281973A (en) * | 2015-08-07 | 2016-01-27 | 南京邮电大学 | Webpage fingerprint identification method aiming at specific website category |
CN105554152A (en) * | 2015-12-30 | 2016-05-04 | 北京神州绿盟信息安全科技股份有限公司 | Method and device for extracting data features |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9141823B2 (en) * | 2013-03-15 | 2015-09-22 | Veridicom, Sa De Cv | Abstraction layer for default encryption with orthogonal encryption logic session object; and automated authentication, with a method for online litigation |
US20150281278A1 (en) * | 2014-03-28 | 2015-10-01 | Southern California Edison | System For Securing Electric Power Grid Operations From Cyber-Attack |
CN107070812A (en) * | 2017-05-02 | 2017-08-18 | 武汉绿色网络信息服务有限责任公司 | A kind of HTTPS protocal analysises method and its system |
-
2018
- 2018-08-08 CN CN201810896859.3A patent/CN109194622B/en active Active
Non-Patent Citations (1)
Title |
---|
AutoFlowLeaker: Circumventing Web Censorship through Automation Services;Shengtuo Hu, Xiaobo Ma, Muhui Jiang;《2017 IEEE 36th Symposium on Reliable Distributed Systems》;20171019;第214-223页 * |
Also Published As
Publication number | Publication date |
---|---|
CN109194622A (en) | 2019-01-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||