CN109194622A - Encrypted traffic analysis feature selection method based on feature efficiency - Google Patents
- Publication number
- CN109194622A (application number CN201810896859.3A)
- Authority
- CN
- China
- Prior art keywords
- feature
- sample
- samples
- efficiency
- value
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/0227: Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls; filtering policies
- H04L63/1441: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic; countermeasures against malicious traffic
Abstract
The invention discloses a feature selection method for encrypted traffic analysis based on feature efficiency. First, a calculation method for the feature efficiency F(f) is defined. Next, the feature efficiency of each feature is calculated on two classes of samples, the feature with the highest efficiency is selected, the samples that fall outside the overlapping part of that feature's value range are removed, and the number of removed samples is recorded; this is repeated until all features have been processed. Features are then selected according to a prespecified number of features or a feature efficiency threshold. Given a maximal feature set, the invention can effectively calculate the feature efficiency of each feature and screen features according to a given efficiency threshold or a specified number of features, which helps improve the recognition accuracy of website fingerprinting and saves the time and space consumed in building the classification model.
Description
Technical field
The invention belongs to the field of network security and user privacy, and in particular relates to a feature selection method for encrypted traffic analysis based on feature efficiency.
Background
In recent years, with the rapid development of the Internet, networks have become closely integrated into production and daily life, and network security has become a very important problem. People's awareness of network security is also gradually increasing, and more and more users and enterprises are starting to pay attention to information protection and secure transmission. Network behavior identification based on encrypted traffic can be used for network security management, in particular for the supervision of illegal traffic and harmful information, such as human trafficking, prostitution, gambling, and arms dealing. Encrypted traffic analysis can reveal a user's online behavior, and at present it relies mainly on website fingerprinting. Website fingerprinting is a machine-learning-based technique for identifying websites accessed over encrypted connections: features are extracted from network traffic and combined with supervised classification to classify websites. The key to this technique is the process of building a classification model for a set of websites from the extracted features.
Selecting an effective feature extraction method is an extremely important problem, for the following reasons: (1) Website fingerprinting, the key technique of encrypted traffic analysis, essentially uses machine learning classification algorithms to build a classifier that can classify websites, so an excessive number of features affects the training and testing time of the classifier and can even reduce classification accuracy. (2) The classifier built during encrypted traffic analysis is trained on the extracted numerical features. The extracted features must therefore accurately characterize a website, and must be neither excessively redundant nor too sparse.
At present, the feature selection approach mainly used for encrypted traffic analysis, both domestically and abroad, is the feature importance assessment built into the construction of random forests. It computes the importance of each feature during training, and this importance is relative to overall performance: the more important a feature, the greater its influence on the classification result. Its working principle is that, while the classification model is trained, feature importance is assessed by computing the out-of-bag error on out-of-bag (OOB) data. This method depends on the random forest model and does not optimize feature extraction on the basis of the data itself.
It can be seen that feature selection for encrypted traffic analysis has not yet been studied extensively or in depth, and the related techniques are not yet widely used.
Summary of the invention
The purpose of the present invention is to provide a feature selection method for encrypted traffic analysis based on feature efficiency, so as to solve the above problems.
To achieve the above object, the invention adopts the following technical solution:
A feature selection method for encrypted traffic analysis based on feature efficiency, comprising the following steps:
Step 1: Given several classes of sample sets and an extracted feature set of size n, where each class's sample set contains multiple website samples and each sample is represented by its n extracted feature values;
Step 2: Based on data complexity, examine the separability of each feature from the composition and distribution of its value space, and determine the calculation method of the feature efficiency of each feature:
$$F(f) = 1 - \frac{|MAXMIN_f \rightarrow MINMAX_f|}{|C_j^f \cup C_k^f|}$$
$$MINMAX_f = \min(\max(f, C_j), \max(f, C_k))$$
$$MAXMIN_f = \max(\min(f, C_j), \min(f, C_k))$$
where $f$ denotes a feature; $C_j$ denotes the full sample set of class j; $|C_j^f \cup C_k^f|$ denotes the size of the union of the samples of class j and class k on feature f, i.e. the union of the two classes is taken on this feature's values and its size is counted; and $|MAXMIN_f \rightarrow MINMAX_f|$ denotes, for feature f, the number of samples among all samples of classes j and k whose value lies between $MAXMIN_f$ and $MINMAX_f$;
Step 3: With the above calculation method, choose any two classes of samples from the given sample classes, calculate the F(f) value of each feature, and select the feature f_max with the largest value;
Step 4: Remove the samples of the two selected classes that lie outside the overlapping part of the value range of f_max, remove the feature f_max, and record the number num of removed samples;
Step 5: Repeat step 3 and step 4 on the remaining feature set and sample set until the sample set or the feature set is empty;
Step 6: Finally, the ratio of the number num of samples removed by each feature to the total number of samples is the importance f_value of that feature; the m most important features are selected according to f_value, where m is the size of the feature set after screening.
Further, in step 2, determining the calculation method of the feature efficiency of each feature comprises the following steps:
1) Choose any feature f and count the size NUM of the union of the sample sets of all classes on feature f, i.e. take the union of all classes on this feature's values and count its size;
2) Find the maximum value of feature f within each class and denote the minimum of these per-class maxima as MINMAX_f; then find the minimum value of feature f within each class and denote the maximum of these per-class minima as MAXMIN_f;
3) With these two indices, count, for feature f, the number of samples across all classes whose value lies between MAXMIN_f and MINMAX_f, denoted |MAXMIN_f→MINMAX_f|; the feature efficiency F(f) of feature f is then 1 minus the ratio of this number to the union size NUM of feature f.
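The three sub-steps above can be sketched for a single feature and two classes as follows. This is a minimal illustration, not the patented implementation: the function name `feature_efficiency`, the toy values, the closed interval [MAXMIN_f, MINMAX_f], and the reading of the union size NUM as the total number of samples in both classes are all assumptions.

```python
from typing import Sequence


def feature_efficiency(values_j: Sequence[float], values_k: Sequence[float]) -> float:
    """Return F(f) for one feature, given its values in class j and class k."""
    # MINMAX_f: the smaller of the two per-class maxima.
    minmax_f = min(max(values_j), max(values_k))
    # MAXMIN_f: the larger of the two per-class minima.
    maxmin_f = max(min(values_j), min(values_k))
    # Samples of either class whose value lies in [MAXMIN_f, MINMAX_f].
    overlap = sum(
        1 for v in list(values_j) + list(values_k) if maxmin_f <= v <= minmax_f
    )
    # Assumption: NUM is the total number of samples in both classes.
    num = len(values_j) + len(values_k)
    return 1.0 - overlap / num


# Hypothetical values of one feature in two classes.
class_j = [1, 2, 3, 4]
class_k = [3, 4, 5, 6, 7]
efficiency = feature_efficiency(class_j, class_k)   # 4 of 9 samples overlap
separable = feature_efficiency([1, 2], [5, 6])      # value ranges do not overlap
```

When the value ranges of the two classes do not overlap, MAXMIN_f exceeds MINMAX_f, no sample falls in the interval, and the sketch returns an efficiency of 1.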
Further, the original feature set size n and the feature set size m after screening are set by the user.
Further, each class's sample set is a set of website samples obtained by feature extraction; feature extraction derives representative features from the captured packet sequence according to the time and size attributes of the packets, including the number of packets and the interval time between packets.
Compared with the prior art, the present invention has the following technical effects:
Given a maximal feature set, the invention can effectively calculate the feature efficiency of each feature and screen features according to a given feature efficiency threshold or a specified number of features, which helps improve the recognition accuracy of website fingerprinting and saves the time and space consumed in building the classification model. When the feature set extracted from network traffic is very large, the feature set can be reduced as far as possible while preserving the highest feature coverage, which saves training cost during classification model construction and improves recognition accuracy.
The invention is independent of the classification model. Starting from the nature of the data, it assesses the importance of each feature for distinguishing samples from the perspective of data complexity, and focuses on resolving the aforementioned functional overlap between different features, so as to screen features effectively.
Brief description of the drawings
Fig. 1 is a flow chart of the invention.
Detailed description
The invention is further described below with reference to the accompanying drawing:
Referring to Fig. 1, a feature selection method for encrypted traffic analysis based on feature efficiency comprises the following steps:
Step 1: Given several classes of sample sets and an extracted feature set of size n, where each class's sample set contains multiple website samples and each sample is represented by its n extracted feature values;
Step 2: Based on data complexity, examine the separability of each feature from the composition and distribution of its value space, and determine the calculation method of the feature efficiency of each feature:
$$F(f) = 1 - \frac{|MAXMIN_f \rightarrow MINMAX_f|}{|C_j^f \cup C_k^f|}$$
$$MINMAX_f = \min(\max(f, C_j), \max(f, C_k))$$
$$MAXMIN_f = \max(\min(f, C_j), \min(f, C_k))$$
where $f$ denotes a feature; $C_j$ denotes the full sample set of class j; $|C_j^f \cup C_k^f|$ denotes the size of the union of the samples of class j and class k on feature f, i.e. the union of the two classes is taken on this feature's values and its size is counted; and $|MAXMIN_f \rightarrow MINMAX_f|$ denotes, for feature f, the number of samples among all samples of classes j and k whose value lies between $MAXMIN_f$ and $MINMAX_f$;
Step 3: With the above calculation method, choose any two classes of samples from the given sample classes, calculate the F(f) value of each feature, and select the feature f_max with the largest value;
Step 4: Remove the samples of the two selected classes that lie outside the overlapping part of the value range of f_max, remove the feature f_max, and record the number num of removed samples;
Step 5: Repeat step 3 and step 4 on the remaining feature set and sample set until the sample set or the feature set is empty;
Step 6: Finally, the ratio of the number num of samples removed by each feature to the total number of samples is the importance f_value of that feature; the m most important features are selected according to f_value, where m is the size of the feature set after screening.
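Steps 3 to 6 can be read as a greedy loop, sketched below for two classes. This is one possible interpretation rather than the patented implementation. It assumes that the union size is the total number of remaining samples in both classes, that the overlap interval is closed, that ties in F(f) are broken by the lowest feature index, that all remaining samples are removed when the value ranges do not overlap, and that the loop stops as soon as either class is empty. The names `efficiency` and `select_features` are hypothetical. Under these assumptions the per-feature removal counts need not match the figures reported in Embodiment 1, although on that embodiment's data the two top-ranked features come out the same.

```python
from typing import Dict, List, Tuple

Sample = List[float]


def efficiency(class_j: List[Sample], class_k: List[Sample], f: int) -> float:
    """F(f) for feature index f, with NUM assumed to be the total sample count."""
    vj = [s[f] for s in class_j]
    vk = [s[f] for s in class_k]
    minmax_f = min(max(vj), max(vk))
    maxmin_f = max(min(vj), min(vk))
    overlap = sum(1 for v in vj + vk if maxmin_f <= v <= minmax_f)
    return 1.0 - overlap / (len(vj) + len(vk))


def select_features(
    class_j: List[Sample], class_k: List[Sample], m: int
) -> Tuple[List[int], Dict[int, float]]:
    total = len(class_j) + len(class_k)
    remaining = list(range(len(class_j[0])))   # feature indices not yet used
    removed = {f: 0 for f in remaining}        # samples removed per feature
    cj, ck = list(class_j), list(class_k)
    while remaining and cj and ck:
        # Step 3: pick the feature with the largest efficiency.
        best = max(remaining, key=lambda f: efficiency(cj, ck, f))
        vj = [s[best] for s in cj]
        vk = [s[best] for s in ck]
        lo = max(min(vj), min(vk))
        hi = min(max(vj), max(vk))
        # Step 4: keep only samples inside the overlap; count the rest.
        keep_j = [s for s in cj if lo <= s[best] <= hi]
        keep_k = [s for s in ck if lo <= s[best] <= hi]
        removed[best] += (len(cj) - len(keep_j)) + (len(ck) - len(keep_k))
        cj, ck = keep_j, keep_k
        # Dropping the index is equivalent to deleting the feature column.
        remaining.remove(best)
    # Step 6: importance is the share of all samples removed by each feature.
    importance = {f: removed[f] / total for f in removed}
    ranked = sorted(importance, key=lambda f: -importance[f])
    return ranked[:m], importance


baidu = [[1, 2, 3, 0], [2, 0, 5, 6], [1, 3, 4, 6], [0, 3, 4, 5]]
google = [[1, 3, 5, 2], [3, 4, 3, 7], [2, 5, 3, 6], [2, 5, 4, 8]]
selected, importance = select_features(baidu, google, m=2)  # zero-based indices
```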
In step 2, determining the calculation method of the feature efficiency of each feature comprises the following steps:
1) Choose any feature f and count the size NUM of the union of the sample sets of all classes on feature f, i.e. take the union of all classes on this feature's values and count its size;
2) Find the maximum value of feature f within each class and denote the minimum of these per-class maxima as MINMAX_f; then find the minimum value of feature f within each class and denote the maximum of these per-class minima as MAXMIN_f;
3) With these two indices, count, for feature f, the number of samples across all classes whose value lies between MAXMIN_f and MINMAX_f, denoted |MAXMIN_f→MINMAX_f|; the feature efficiency F(f) of feature f is then 1 minus the ratio of this number to the union size NUM of feature f.
The original feature set size n and the feature set size m after screening are set by the user.
Each class's sample set is a set of website samples obtained by feature extraction; feature extraction derives representative features from the captured packet sequence according to the time and size attributes of the packets, including the number of packets and the interval time between packets.
Embodiment 1:
Step 1: Obtain the full feature set, including all samples and their class information.
Assume two classes with the following samples:
{ { baidu.com }: [[1,2,3,0], [2,0,5,6], [1,3,4,6], [0,3,4,5]] }
{{google.com}:[[1,3,5,2],[3,4,3,7],[2,5,3,6],[2,5,4,8]]}
Here baidu and google denote two website classes, and each "[]" denotes one traffic sample. There are 4 features, i.e. the number of features fea_num is 4; assume these four features represent the number of packets whose size falls in the 32, 64, 128, and 256 ranges, respectively. The number of samples n is 8. Suppose that two effective features are to be selected.
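To make the meaning of such a four-feature sample concrete, the sketch below counts packets per size range. The bin edges are read here as upper bounds, i.e. (0, 32], (32, 64], (64, 128], and (128, 256], and larger packets are ignored; this reading, the name `size_histogram`, and the example sizes are assumptions, since the text gives only the four range labels.

```python
from bisect import bisect_left
from typing import List

# Assumed upper bounds (in bytes) of the four size ranges.
BOUNDS = [32, 64, 128, 256]


def size_histogram(sizes: List[int]) -> List[int]:
    """Count packets per size range; packets above the last bound are skipped."""
    counts = [0] * len(BOUNDS)
    for size in sizes:
        # Index of the first bound that is >= size.
        i = bisect_left(BOUNDS, size)
        if i < len(BOUNDS):
            counts[i] += 1
    return counts


# Hypothetical packet sizes of one traffic sample.
sample = size_histogram([20, 32, 40, 64, 100, 200, 256, 300])
```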
Step 2: Using the predefined feature efficiency formula, calculate the F(f) value of each feature on the two classes of samples and select the feature f_max with the largest value. For the class sample information of step 1, the calculation is as follows:
First, calculate the feature efficiency of the first feature.
For the first feature, |MAXMIN_f→MINMAX_f| is the number of samples of the two classes whose feature value lies between 1 and 2, which is 6; |C_jf∪C_kf|, the size of the union of the two classes on this feature, is 7.
The feature efficiency of this feature is therefore 1 - 6/7 ≈ 0.143.
The feature efficiencies of the remaining three features, calculated in the same way, are 0.571, 0, and 0.429, respectively.
Finally, the feature f_max with the largest value is the second feature.
Step 3: Remove the samples of the two classes that lie outside the overlapping part of the value range of f_max, remove the feature f_max, and count the removed samples. The overlapping part of the value ranges of feature f_max obtained in step 2 is {3}, so the samples whose value of this feature does not belong to this set are removed first, i.e. [2,0,5,6], [3,4,3,7], [2,5,3,6], [2,5,4,8], and [1,2,3,0]; the number of removed samples is 5. Then feature f_max is deleted from the remaining samples, so that every remaining sample has 3 features.
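The removal in this step can be reproduced with a few lines of Python. The helper names `overlap_bounds` and `prune` are hypothetical, and the second feature is addressed by its zero-based index 1.

```python
from typing import List, Tuple

baidu = [[1, 2, 3, 0], [2, 0, 5, 6], [1, 3, 4, 6], [0, 3, 4, 5]]
google = [[1, 3, 5, 2], [3, 4, 3, 7], [2, 5, 3, 6], [2, 5, 4, 8]]
F_MAX = 1  # zero-based index of the second feature


def overlap_bounds(a: List[List[int]], b: List[List[int]], f: int) -> Tuple[int, int]:
    """Return (MAXMIN_f, MINMAX_f) for feature index f."""
    va = [s[f] for s in a]
    vb = [s[f] for s in b]
    return max(min(va), min(vb)), min(max(va), max(vb))


def prune(samples: List[List[int]], f: int, lo: int, hi: int) -> List[List[int]]:
    """Keep samples whose feature f lies in [lo, hi], then drop that feature."""
    return [s[:f] + s[f + 1:] for s in samples if lo <= s[f] <= hi]


lo, hi = overlap_bounds(baidu, google, F_MAX)        # overlap is the single value 3
baidu_left = prune(baidu, F_MAX, lo, hi)
google_left = prune(google, F_MAX, lo, hi)
removed = len(baidu) + len(google) - len(baidu_left) - len(google_left)
```

Three samples remain, each with three features, and five samples are counted as removed by the second feature.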
Step 4: Repeat steps 2 and 3 on the remaining feature set and sample set until the sample set or the feature set is empty. The samples remaining after the first feature removal are:
{ { baidu.com }: [[1,4,6], [0,4,5]] }
{{google.com}:[[1,5,2]]}
These samples are processed again by step 2 and step 3 to obtain the removed feature and the number of removed samples. This is repeated until the sample set or the feature set is empty, recording the number of samples removed by each feature. The final result is as follows:
Number of samples removed by the 1st feature: 1
Number of samples removed by the 2nd feature: 5
Number of samples removed by the 3rd feature: 2
Number of samples removed by the 4th feature: 0
Step 5: Calculate the ratio of the number of samples removed by each feature to the total number of samples; this ratio is the importance of the feature. The importances of the four features are 0.125, 0.625, 0.25, and 0, respectively. The two features finally selected are therefore the second and third features.
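The final ranking is simple arithmetic on the removal counts listed above. A short sketch, with variable names chosen for illustration:

```python
# Removal counts of features 1 to 4 as reported in the embodiment.
removed_counts = [1, 5, 2, 0]
total_samples = 8
m = 2  # number of features to keep

# Importance of each feature: share of all samples it removed.
importance = [count / total_samples for count in removed_counts]

# Rank feature indices by importance, highest first, and keep the top m.
ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
selected = sorted(i + 1 for i in ranked[:m])  # one-based feature numbers
```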
Claims (4)
1. A feature selection method for encrypted traffic analysis based on feature efficiency, characterized by comprising the following steps:
Step 1: Given several classes of sample sets and an extracted feature set of size n, where each class's sample set contains multiple website samples and each sample is represented by its n extracted feature values;
Step 2: Based on data complexity, examine the separability of each feature from the composition and distribution of its value space, and determine the calculation method of the feature efficiency of each feature:
$$F(f) = 1 - \frac{|MAXMIN_f \rightarrow MINMAX_f|}{|C_j^f \cup C_k^f|}$$
$$MINMAX_f = \min(\max(f, C_j), \max(f, C_k))$$
$$MAXMIN_f = \max(\min(f, C_j), \min(f, C_k))$$
where $f$ denotes a feature; $C_j$ denotes the full sample set of class j; $|C_j^f \cup C_k^f|$ denotes the size of the union of the samples of class j and class k on feature f, i.e. the union of the two classes is taken on this feature's values and its size is counted; and $|MAXMIN_f \rightarrow MINMAX_f|$ denotes, for feature f, the number of samples among all samples of classes j and k whose value lies between $MAXMIN_f$ and $MINMAX_f$;
Step 3: With the above calculation method, choose any two classes of samples from the given sample classes, calculate the F(f) value of each feature, and select the feature f_max with the largest value;
Step 4: Delete the samples of the two selected classes that lie outside the overlapping part of the value range of f_max, remove the feature f_max, and record the number num of removed samples;
Step 5: Repeat step 3 and step 4 on the remaining feature set and sample set until the sample set or the feature set is empty;
Step 6: Finally, the ratio of the number num of samples removed by each feature to the total number of samples is the importance f_value of that feature; the m most important features are selected according to f_value, where m is the size of the feature set after screening.
2. The feature selection method for encrypted traffic analysis based on feature efficiency according to claim 1, characterized in that, in step 2, determining the calculation method of the feature efficiency of each feature comprises the following steps:
1) Choose any feature f and count the size NUM of the union of the sample sets of all classes on feature f, i.e. take the union of all classes on this feature's values and count its size;
2) Find the maximum value of feature f within each class and denote the minimum of these per-class maxima as MINMAX_f; then find the minimum value of feature f within each class and denote the maximum of these per-class minima as MAXMIN_f;
3) With these two indices, count, for feature f, the number of samples across all classes whose value lies between MAXMIN_f and MINMAX_f, denoted |MAXMIN_f→MINMAX_f|; the feature efficiency F(f) of feature f is then 1 minus the ratio of this number to the union size NUM of feature f.
3. The feature selection method for encrypted traffic analysis based on feature efficiency according to claim 1, characterized in that the original feature set size n and the feature set size m after screening are set by the user.
4. The feature selection method for encrypted traffic analysis based on feature efficiency according to claim 1, characterized in that each class's sample set is a set of website samples obtained by feature extraction, and feature extraction derives representative features from the captured packet sequence according to the time and size attributes of the packets, including the number of packets and the interval time between packets.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810896859.3A CN109194622B (en) | 2018-08-08 | 2018-08-08 | Encrypted flow analysis feature selection method based on feature efficiency |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810896859.3A CN109194622B (en) | 2018-08-08 | 2018-08-08 | Encrypted flow analysis feature selection method based on feature efficiency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109194622A true CN109194622A (en) | 2019-01-11 |
CN109194622B CN109194622B (en) | 2020-03-31 |
Family
ID=64920584
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810896859.3A Active CN109194622B (en) | 2018-08-08 | 2018-08-08 | Encrypted flow analysis feature selection method based on feature efficiency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109194622B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140304505A1 (en) * | 2013-03-15 | 2014-10-09 | William Johnson Dawson | Abstraction layer for default encryption with orthogonal encryption logic session object; and automated authentication, with a method for online litigation |
CN103780501A (en) * | 2014-01-03 | 2014-05-07 | 濮阳职业技术学院 | Peer-to-peer network traffic identification method of inseparable-wavelet support vector machine |
US20150281278A1 (en) * | 2014-03-28 | 2015-10-01 | Southern California Edison | System For Securing Electric Power Grid Operations From Cyber-Attack |
CN105281973A (en) * | 2015-08-07 | 2016-01-27 | 南京邮电大学 | Webpage fingerprint identification method aiming at specific website category |
CN105554152A (en) * | 2015-12-30 | 2016-05-04 | 北京神州绿盟信息安全科技股份有限公司 | Method and device for extracting data features |
CN107070812A (en) * | 2017-05-02 | 2017-08-18 | 武汉绿色网络信息服务有限责任公司 | A kind of HTTPS protocal analysises method and its system |
Non-Patent Citations (1)
Title |
---|
SHENGTUO HU, XIAOBO MA, MUHUI JIANG: "AutoFlowLeaker: Circumventing Web Censorship through Automation Services", 《2017 IEEE 36TH SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110011931A (en) * | 2019-01-25 | 2019-07-12 | 中国科学院信息工程研究所 | A kind of encryption traffic classes detection method and system |
CN110011931B (en) * | 2019-01-25 | 2020-10-16 | 中国科学院信息工程研究所 | Encrypted flow type detection method and system |
CN109981335A (en) * | 2019-01-28 | 2019-07-05 | 重庆邮电大学 | The feature selection approach of combined class uneven traffic classification |
CN109981335B (en) * | 2019-01-28 | 2022-02-22 | 重庆邮电大学 | Feature selection method for combined type unbalanced flow classification |
Also Published As
Publication number | Publication date |
---|---|
CN109194622B (en) | 2020-03-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107395590B (en) | A kind of intrusion detection method classified based on PCA and random forest | |
CN107358146B (en) | Method for processing video frequency, device and storage medium | |
CN102110122B (en) | Method and device for establishing sample picture index table, method and device for filtering pictures and method and device for searching pictures | |
CN104794192B (en) | Multistage method for detecting abnormality based on exponential smoothing, integrated study model | |
CN105095856B (en) | Face identification method is blocked based on mask | |
Ektefa et al. | Intrusion detection using data mining techniques | |
US20210034840A1 (en) | Method for Recognzing Face from Monitoring Video Data | |
CN107633084A (en) | Based on the public sentiment managing and control system and its method from media | |
CN111428231B (en) | Safety processing method, device and equipment based on user behaviors | |
CN110781308B (en) | Anti-fraud system for constructing knowledge graph based on big data | |
CN109697416A (en) | A kind of video data handling procedure and relevant apparatus | |
CN109005145A (en) | A kind of malice URL detection system and its method extracted based on automated characterization | |
CN106355154B (en) | Method for detecting frequent passing of people in surveillance video | |
CN102945366A (en) | Method and device for face recognition | |
CN105426762A (en) | Static detection method for malice of android application programs | |
CN103037339A (en) | Short message filtering method based on user creditworthiness and short message spam degree | |
CN103279744A (en) | Multi-scale tri-mode texture feature-based method and system for detecting counterfeit fingerprints | |
US20150113651A1 (en) | Spammer group extraction apparatus and method | |
CN106326835A (en) | Human face data collection statistical system and method for gas station convenience store | |
CN106681980B (en) | A kind of refuse messages analysis method and device | |
CN108319672A (en) | Mobile terminal malicious information filtering method and system based on cloud computing | |
CN106446124A (en) | Website classification method based on network relation graph | |
CN109194622A (en) | A kind of encryption flow analysis feature selection approach based on feature efficiency | |
CN110020161B (en) | Data processing method, log processing method and terminal | |
WO2021012913A1 (en) | Data recognition method and system, electronic device and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||