CN109194622B - Encrypted flow analysis feature selection method based on feature efficiency - Google Patents


Info

Publication number
CN109194622B
Authority
CN
China
Prior art keywords
feature
samples
efficiency
characteristic
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810896859.3A
Other languages
Chinese (zh)
Other versions
CN109194622A (en)
Inventor
马小博
安冰玉
师马玮
焦洪山
赵延康
李剑锋
彭嘉豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-08-08
Filing date
2018-08-08
Publication date
2020-03-31

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a feature selection method for encrypted traffic analysis based on feature efficiency. A feature efficiency measure F(f) is first defined. The feature efficiency of each feature is then calculated on two classes of samples, the feature with the largest feature efficiency is selected, the samples lying outside the overlapping part of that feature's value ranges are removed, and the number of removed samples is recorded; these steps are repeated until all features have been processed. Features are then selected according to a predetermined number of features or a feature efficiency threshold. Given the full candidate feature set, the invention can effectively calculate the feature efficiency of each feature and screen the features according to a given feature efficiency threshold or a specified number of features, which helps improve the identification accuracy of website fingerprinting and reduces the time and space cost of constructing the classification model.

Description

Encrypted flow analysis feature selection method based on feature efficiency
Technical Field
The invention belongs to the field of network security and user privacy, and in particular relates to a feature selection method for encrypted traffic analysis based on feature efficiency.
Background
In recent years, with the rapid development of the Internet, networks have become closely integrated into production and daily life, and network security has become a problem that cannot be ignored. Public awareness of network security is gradually improving, and more and more users and enterprises pay attention to protecting information and transmitting it securely. Network behavior identification based on encrypted traffic can be used for the security supervision of networks, in particular the supervision of illegal business and harmful information, such as human trafficking, gambling, and arms trading. Encrypted traffic analysis can analyze a user's browsing behavior, and at present it is mainly based on website fingerprinting. Website fingerprinting is a technique that uses machine learning to identify websites accessed over encrypted connections: features are extracted from network traffic and the websites are classified with a supervised classification technique. The key to the technique is constructing a classification model that can classify websites from the extracted features.
Selecting an efficient set of features is very important, for two reasons: (1) Website fingerprinting, the key technique of encrypted traffic analysis, essentially uses a machine learning classification algorithm to construct a classifier that can classify websites; too many features therefore increase the training and testing time of the classifier and can even reduce classification accuracy. (2) The classifier constructed during encrypted traffic analysis is trained on the extracted numerical features, so the extracted features must characterize a website accurately and must not be overly redundant or sparse.
At present, the only feature selection method in common use for encrypted traffic analysis, in China and abroad, is the feature importance evaluation built into the random forest construction process. The importance of each feature is calculated during training and is measured relative to overall performance: the more important a feature, the larger its influence on the classification result. Specifically, feature importance is assessed through the out-of-bag error computed on the out-of-bag (OOB) data while the classification model is trained. This method depends on the random forest model and does not optimize feature selection on the basis of the data itself.
Therefore, the feature selection problem in encrypted traffic analysis has not been studied extensively, and the related techniques have not been widely applied.
Disclosure of Invention
The invention aims to provide a feature selection method for encrypted traffic analysis based on feature efficiency, so as to solve the above problems.
To achieve this purpose, the invention adopts the following technical scheme:
A feature selection method for encrypted traffic analysis based on feature efficiency comprises the following steps:
Step 1: given sample sets of several classes and an extracted feature set of size n, where each class sample set contains a number of website samples and each sample is represented by its n extracted feature values;
Step 2: from the perspective of data complexity, examine the discriminative power of each feature through the composition and distribution of its value space, and define the feature efficiency of each feature as
$$F(f) = 1 - \frac{N_f}{|C_{jf} \cup C_{kf}|}$$
where f denotes a feature; $C_j$ denotes the full sample set of class j; $|C_{jf} \cup C_{kf}|$ denotes the size of the union of classes j and k on feature f, i.e. the union of the two classes is taken on the feature values and its size is then counted; and $N_f$ denotes, for feature f, the number of samples of classes k and j whose value lies between $\mathrm{MAXMIN}_f$ and $\mathrm{MINMAX}_f$;
Step 3: with the calculation method defined, choose any two of the given sample classes, calculate F(f) for each feature on these two classes, and select the feature f_max with the largest value;
Step 4: from the two selected classes, delete the samples whose value of f_max lies outside the overlapping part of the two value ranges, count the number num of deleted samples, and remove feature f_max from the feature set;
Step 5: repeat steps 3 and 4 on the remaining feature set and sample set until the sample set or the feature set is empty;
Step 6: finally, for each feature, the ratio of the number num of samples it removed to the total number of samples is the importance f_value of that feature; the m most important features are selected according to f_value, where m is the size of the feature set after screening.
Further, in step 2, the feature efficiency of each feature is calculated as follows:
1) choose any feature f and count the size NUM of the union of the class sample sets on feature f, i.e. take the union of all the classes on the feature values and then count its size;
2) find the maximum value of feature f within each class and record the smallest of these per-class maxima as $\mathrm{MINMAX}_f$; then find the minimum value of feature f within each class and record the largest of these per-class minima as $\mathrm{MAXMIN}_f$;
3) with $\mathrm{MAXMIN}_f$ and $\mathrm{MINMAX}_f$ obtained, count the number $N_f$ of samples, over all classes, whose value of feature f lies between $\mathrm{MAXMIN}_f$ and $\mathrm{MINMAX}_f$; the feature efficiency F(f) of feature f is then 1 minus the ratio of $N_f$ to the union size NUM of feature f.
Further, the size n of the given original feature set and the size m of the screened feature set are set by the user.
Further, each class sample set is a set of website samples after feature extraction, where feature extraction derives representative features, including the number of packets and the inter-packet intervals, from the captured packet sequence according to the timing and size characteristics of the packets.
Compared with the prior art, the invention has the following technical effects:
Given the full candidate feature set, the invention can effectively calculate the feature efficiency of each feature and screen the features according to a given feature efficiency threshold or a specified number of features, which helps improve the identification accuracy of website fingerprinting and reduces the time and space cost of constructing the classification model. When the feature set extracted from network traffic is very large, the feature set can be reduced as far as possible while feature coverage is preserved, which saves training cost when the classification model is constructed and improves identification accuracy.
The invention does not depend on any classification model. Starting from the data itself and from the perspective of data complexity, it evaluates how important each feature is for distinguishing samples, and it focuses on solving the aforementioned problem of functional overlap among different features, so that features are screened effectively.
Drawings
FIG. 1 is a flow chart of the present invention;
Detailed Description
The invention is further described below with reference to the accompanying drawings:
Referring to FIG. 1, a feature selection method for encrypted traffic analysis based on feature efficiency comprises the following steps:
Step 1: given sample sets of several classes and an extracted feature set of size n, where each class sample set contains a number of website samples and each sample is represented by its n extracted feature values;
Step 2: from the perspective of data complexity, examine the discriminative power of each feature through the composition and distribution of its value space, and define the feature efficiency of each feature as
$$F(f) = 1 - \frac{N_f}{|C_{jf} \cup C_{kf}|}$$
where f denotes a feature; $C_j$ denotes the full sample set of class j; $|C_{jf} \cup C_{kf}|$ denotes the size of the union of classes j and k on feature f, i.e. the union of the two classes is taken on the feature values and its size is then counted; and $N_f$ denotes, for feature f, the number of samples of classes k and j whose value lies between $\mathrm{MAXMIN}_f$ and $\mathrm{MINMAX}_f$;
Step 3: with the calculation method defined, choose any two of the given sample classes, calculate F(f) for each feature on these two classes, and select the feature f_max with the largest value;
Step 4: from the two selected classes, delete the samples whose value of f_max lies outside the overlapping part of the two value ranges, count the number num of deleted samples, and remove feature f_max from the feature set;
Step 5: repeat steps 3 and 4 on the remaining feature set and sample set until the sample set or the feature set is empty;
Step 6: finally, for each feature, the ratio of the number num of samples it removed to the total number of samples is the importance f_value of that feature; the m most important features are selected according to f_value, where m is the size of the feature set after screening.
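The six steps can be sketched as a greedy loop over two classes. This is a minimal sketch, not the patented implementation: the function names are hypothetical, the union size is assumed to be the total number of samples in the two classes, ties between equally efficient features are broken by the lowest index, and the loop is assumed to stop as soon as either class has no samples left.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Sequence[float]


def overlap_bounds(a: List[Sample], b: List[Sample], f: int) -> Tuple[float, float]:
    """Return (MAXMIN_f, MINMAX_f) for feature index f over two classes."""
    maxmin = max(min(s[f] for s in a), min(s[f] for s in b))
    minmax = min(max(s[f] for s in a), max(s[f] for s in b))
    return maxmin, minmax


def efficiency(a: List[Sample], b: List[Sample], f: int) -> float:
    """F(f) = 1 - (samples inside the overlap) / (union size).

    Assumption: the union size is the number of samples in both classes.
    """
    lo, hi = overlap_bounds(a, b, f)
    inside = sum(1 for s in list(a) + list(b) if lo <= s[f] <= hi)
    return 1.0 - inside / (len(a) + len(b))


def select_features(
    a: List[Sample], b: List[Sample], m: int
) -> Tuple[List[int], Dict[int, float]]:
    """Steps 3 to 6: return (selected feature indices, importance per feature)."""
    total = len(a) + len(b)
    remaining = list(range(len(a[0])))   # feature indices still in play
    removed = {f: 0 for f in remaining}  # num of removed samples per feature
    a, b = list(a), list(b)
    # Assumption: stop once the feature set or either class is empty.
    while remaining and a and b:
        # Step 3: feature with the largest efficiency (first one wins ties).
        f_max = max(remaining, key=lambda f: efficiency(a, b, f))
        # Step 4: drop samples outside the overlap, then retire the feature.
        lo, hi = overlap_bounds(a, b, f_max)
        keep_a = [s for s in a if lo <= s[f_max] <= hi]
        keep_b = [s for s in b if lo <= s[f_max] <= hi]
        removed[f_max] = (len(a) - len(keep_a)) + (len(b) - len(keep_b))
        a, b = keep_a, keep_b
        remaining.remove(f_max)
    # Step 6: importance is the share of all samples a feature removed.
    importance = {f: n / total for f, n in removed.items()}
    selected = sorted(importance, key=importance.get, reverse=True)[:m]
    return selected, importance
```

Features are tracked by index rather than physically deleted from each sample, which keeps the original feature numbering. Because the tie-breaking rule and the union size are assumptions here, per-feature removal counts produced by this sketch need not coincide with those of a worked example computed under different conventions.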
In step 2, the feature efficiency of each feature is calculated as follows:
1) choose any feature f and count the size NUM of the union of the class sample sets on feature f, i.e. take the union of all the classes on the feature values and then count its size;
2) find the maximum value of feature f within each class and record the smallest of these per-class maxima as $\mathrm{MINMAX}_f$; then find the minimum value of feature f within each class and record the largest of these per-class minima as $\mathrm{MAXMIN}_f$;
3) with $\mathrm{MAXMIN}_f$ and $\mathrm{MINMAX}_f$ obtained, count the number $N_f$ of samples, over all classes, whose value of feature f lies between $\mathrm{MAXMIN}_f$ and $\mathrm{MINMAX}_f$; the feature efficiency F(f) of feature f is then 1 minus the ratio of $N_f$ to the union size NUM of feature f.
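The three sub-steps map onto a short function. The sketch below is illustrative only: `feature_efficiency` and its `union_size` parameter are hypothetical names, the bounds are treated as inclusive, and when no union size is supplied the total sample count is used as an assumption, since the text does not fully pin down how the union is counted.

```python
from typing import Dict, List, Optional, Sequence


def feature_efficiency(
    classes: Dict[str, List[Sequence[float]]],
    f: int,
    union_size: Optional[int] = None,
) -> float:
    """Compute F(f) = 1 - (samples in the overlap) / NUM for feature index f."""
    # Sub-step 2: per-class maxima and minima of feature f.
    maxima = [max(s[f] for s in samples) for samples in classes.values()]
    minima = [min(s[f] for s in samples) for samples in classes.values()]
    minmax_f = min(maxima)  # smallest of the per-class maxima
    maxmin_f = max(minima)  # largest of the per-class minima

    # Sub-step 3: samples whose value lies in [MAXMIN_f, MINMAX_f] (inclusive).
    in_overlap = sum(
        1
        for samples in classes.values()
        for s in samples
        if maxmin_f <= s[f] <= minmax_f
    )

    # Sub-step 1: union size NUM. Assumption: default to the total sample count.
    if union_size is None:
        union_size = sum(len(samples) for samples in classes.values())
    return 1.0 - in_overlap / union_size


data = {
    "baidu.com": [[1, 2, 3, 0], [2, 0, 5, 6], [1, 3, 4, 6], [0, 3, 4, 5]],
    "google.com": [[1, 3, 5, 2], [3, 4, 3, 7], [2, 5, 3, 6], [2, 5, 4, 8]],
}
print(round(feature_efficiency(data, 0, union_size=7), 3))  # 0.143
```

The usage line feeds in the two-class data of Example 1 and the union size of 7 that the example states for its first feature. Leaving `union_size` unset gives a different absolute value, so the denominator should be treated as a modelling choice rather than a settled definition.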
The size n of the given original feature set and the size m of the screened feature set are set by the user.
Each class sample set is a set of website samples after feature extraction, where feature extraction derives representative features, including the number of packets and the inter-packet intervals, from the captured packet sequence according to the timing and size characteristics of the packets.
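As an illustration of this kind of feature extraction, the sketch below derives the packet count and inter-packet intervals from a captured sequence. The input layout (a list of `(timestamp, size)` pairs), the function name, and the particular summary statistics are assumptions made for the example; the text names only the packet count and the interval times as representative features.

```python
from typing import Dict, List, Tuple


def extract_features(packets: List[Tuple[float, int]]) -> Dict[str, float]:
    """packets: (timestamp in seconds, size in bytes), in capture order."""
    times = [t for t, _ in packets]
    # Interval between each packet and the one that follows it.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packet_count": len(packets),
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
        "max_gap": max(gaps) if gaps else 0.0,
        "total_bytes": sum(size for _, size in packets),
    }


# Hypothetical four-packet trace.
trace = [(0.00, 60), (0.02, 1500), (0.05, 1500), (0.30, 52)]
print(extract_features(trace)["packet_count"])  # 4
```

Each website sample would then be one such feature vector, and a class sample set would collect the vectors of repeated visits to the same site.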
Example 1:
Step 1: Acquire the full feature set, including all samples and their class information.
Assume that the two classes and the samples they contain are as follows:
{{baidu.com}:[[1,2,3,0],[2,0,5,6],[1,3,4,6],[0,3,4,5]]}
{{google.com}: [[1,3,5,2],[3,4,3,7],[2,5,3,6],[2,5,4,8]]}
Here baidu.com and google.com denote two website classes, and each inner "[ ]" denotes one traffic sample. There are 4 features in total, i.e. fea_num = 4; the four features are assumed to represent packet counts for the packet-size ranges 32, 64, 128, and 256, respectively. The total number of samples is 8. Now assume that the two most effective features are to be selected.
Step 2: Using the predefined feature efficiency formula, calculate the value of F(f) for each feature on the two classes of samples, and select the feature f_max with the largest value. For the sample information of Step 1, the calculation is as follows:
First, the feature efficiency of the first feature is calculated:
$$F(f_1) = 1 - \frac{6}{7} \approx 0.143$$
For the first feature, the number of samples in the two classes whose feature value lies between 1 and 2 is 6, and $|C_{jf} \cup C_{kf}|$, the size of the union of the two classes on this feature, is 7.
The feature efficiency of this feature is therefore 0.143.
The feature efficiencies of the remaining three features are then calculated in the same way and are 0.571, 0, and 0.429, respectively.
Finally, the feature f_max with the largest value is the second feature.
Step 3: Remove the samples of the two classes that lie outside the overlapping part of the value range of f_max, count the removed samples, and remove feature f_max. The overlapping part of the value ranges of f_max obtained in Step 2 is {3}, so the samples whose value of this feature does not belong to that set are removed first, namely [1,2,3,0], [2,0,5,6], [3,4,3,7], [2,5,3,6], and [2,5,4,8]; the number of removed samples is 5. Feature f_max is then deleted from the remaining samples, each of which now contains 3 features.
Step 4: Repeat Steps 2 and 3 on the remaining feature set and sample set until the sample set or the feature set is empty.
The sample information remaining after the first feature is removed is:
{{baidu.com}:[ [1,4,6],[0,4,5]]}
{{google.com}: [[1,5,2]]}
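The first removal pass can be checked with a few lines of code. This is a sketch with hypothetical helper names; it uses zero-based indexing, so the second feature is index 1, and treats the overlap bounds as inclusive.

```python
baidu = [[1, 2, 3, 0], [2, 0, 5, 6], [1, 3, 4, 6], [0, 3, 4, 5]]
google = [[1, 3, 5, 2], [3, 4, 3, 7], [2, 5, 3, 6], [2, 5, 4, 8]]
F_MAX = 1  # zero-based index of the second feature


def prune(samples, f, lo, hi):
    """Keep samples with lo <= value <= hi on feature f, then drop column f."""
    kept = [s for s in samples if lo <= s[f] <= hi]
    return [s[:f] + s[f + 1:] for s in kept], len(samples) - len(kept)


# Value ranges on the second feature are [0, 3] and [3, 5], so the overlap is {3}.
lo = max(min(s[F_MAX] for s in baidu), min(s[F_MAX] for s in google))
hi = min(max(s[F_MAX] for s in baidu), max(s[F_MAX] for s in google))

baidu_left, n1 = prune(baidu, F_MAX, lo, hi)
google_left, n2 = prune(google, F_MAX, lo, hi)
print(baidu_left, google_left, n1 + n2)
# [[1, 4, 6], [0, 4, 5]] [[1, 5, 2]] 5
```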
Steps 2 and 3 are applied again to this sample information to obtain the next removed feature and the number of samples it removes. This is repeated until the sample set or the feature set is empty, and the number of removed samples is recorded for every feature. The final results are as follows:
Number of samples removed by the 1st feature: 1
Number of samples removed by the 2nd feature: 5
Number of samples removed by the 3rd feature: 2
Number of samples removed by the 4th feature: 0
Step 5: The importance of each feature is calculated as the ratio of the number of samples removed by that feature to the total number of samples. The importances of the four features are 0.125, 0.625, 0.25, and 0, respectively. The two features finally selected are the second and third features.
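The final ranking follows directly from the removal counts listed above. The sketch below reproduces that arithmetic; the variable names are illustrative, and the threshold value in the last lines is a hypothetical choice showing the alternative screening rule mentioned in the abstract.

```python
# Feature number -> samples removed, as listed in the final results.
removed = {1: 1, 2: 5, 3: 2, 4: 0}
total_samples = 8
m = 2  # number of features to keep

importance = {f: n / total_samples for f, n in removed.items()}
selected = sorted(importance, key=importance.get, reverse=True)[:m]
print(importance)  # {1: 0.125, 2: 0.625, 3: 0.25, 4: 0.0}
print(selected)    # [2, 3]

# Alternative: keep every feature whose importance reaches a threshold.
threshold = 0.2  # hypothetical value
print([f for f, v in importance.items() if v >= threshold])  # [2, 3]
```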

Claims (4)

1. A feature selection method for encrypted traffic analysis based on feature efficiency, characterized by comprising the following steps:
step 1: given sample sets of several classes and an extracted feature set of size n, where each class sample set contains a number of website samples and each sample is represented by its n extracted feature values;
step 2: from the perspective of data complexity, examining the discriminative power of each feature through the composition and distribution of its value space, and defining the feature efficiency of each feature as
$$F(f) = 1 - \frac{N_f}{|C_{jf} \cup C_{kf}|}$$
wherein f denotes a feature; $C_j$ denotes the full sample set of class j; $|C_{jf} \cup C_{kf}|$ denotes the size of the union of classes j and k on feature f, i.e. the union of the two classes is taken on the feature values and its size is then counted; and $N_f$ denotes, for feature f, the number of samples of classes k and j whose value lies between $\mathrm{MAXMIN}_f$ and $\mathrm{MINMAX}_f$;
step 3: with the calculation method defined, choosing any two of the given sample classes, calculating F(f) for each feature on these two classes, and selecting the feature f_max with the largest value;
step 4: from the two selected classes, deleting the samples whose value of f_max lies outside the overlapping part of the two value ranges, counting the number num of deleted samples, and removing feature f_max from the feature set;
step 5: repeating steps 3 and 4 on the remaining feature set and sample set until the sample set or the feature set is empty;
step 6: finally, for each feature, taking the ratio of the number num of samples it removed to the total number of samples as the importance f_value of that feature, and selecting the m most important features according to f_value, where m is the size of the feature set after screening.
2. The feature selection method for encrypted traffic analysis based on feature efficiency according to claim 1, wherein in step 2 the feature efficiency of each feature is calculated as follows:
1) choosing any feature f and counting the size NUM of the union of the class sample sets on feature f, i.e. taking the union of all the classes on the feature values and then counting its size;
2) finding the maximum value of feature f within each class and recording the smallest of these per-class maxima as $\mathrm{MINMAX}_f$; then finding the minimum value of feature f within each class and recording the largest of these per-class minima as $\mathrm{MAXMIN}_f$;
3) with $\mathrm{MAXMIN}_f$ and $\mathrm{MINMAX}_f$ obtained, counting the number $N_f$ of samples, over all classes, whose value of feature f lies between $\mathrm{MAXMIN}_f$ and $\mathrm{MINMAX}_f$; the feature efficiency F(f) of feature f is then 1 minus the ratio of $N_f$ to the union size NUM of feature f.
3. The feature selection method for encrypted traffic analysis based on feature efficiency according to claim 1, wherein the size n of the given original feature set and the size m of the screened feature set are set by the user.
4. The feature selection method for encrypted traffic analysis based on feature efficiency according to claim 1, wherein each class sample set is a set of website samples after feature extraction, and the feature extraction derives representative features, including the number of packets and the inter-packet intervals, from the captured packet sequence according to the timing and size characteristics of the packets.
CN201810896859.3A 2018-08-08 2018-08-08 Encrypted flow analysis feature selection method based on feature efficiency Active CN109194622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810896859.3A CN109194622B (en) 2018-08-08 2018-08-08 Encrypted flow analysis feature selection method based on feature efficiency

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810896859.3A CN109194622B (en) 2018-08-08 2018-08-08 Encrypted flow analysis feature selection method based on feature efficiency

Publications (2)

Publication Number Publication Date
CN109194622A CN109194622A (en) 2019-01-11
CN109194622B true CN109194622B (en) 2020-03-31

Family

ID=64920584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810896859.3A Active CN109194622B (en) 2018-08-08 2018-08-08 Encrypted flow analysis feature selection method based on feature efficiency

Country Status (1)

Country Link
CN (1) CN109194622B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110011931B (en) * 2019-01-25 2020-10-16 中国科学院信息工程研究所 Encrypted flow type detection method and system
CN109981335B (en) * 2019-01-28 2022-02-22 重庆邮电大学 Feature selection method for combined type unbalanced flow classification

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103780501A (en) * 2014-01-03 2014-05-07 濮阳职业技术学院 Peer-to-peer network traffic identification method of inseparable-wavelet support vector machine
CN105281973A (en) * 2015-08-07 2016-01-27 南京邮电大学 Webpage fingerprint identification method aiming at specific website category
CN105554152A (en) * 2015-12-30 2016-05-04 北京神州绿盟信息安全科技股份有限公司 Method and device for extracting data features

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9141823B2 (en) * 2013-03-15 2015-09-22 Veridicom, Sa De Cv Abstraction layer for default encryption with orthogonal encryption logic session object; and automated authentication, with a method for online litigation
US20150281278A1 (en) * 2014-03-28 2015-10-01 Southern California Edison System For Securing Electric Power Grid Operations From Cyber-Attack
CN107070812A (en) * 2017-05-02 2017-08-18 武汉绿色网络信息服务有限责任公司 A kind of HTTPS protocal analysises method and its system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AutoFlowLeaker: Circumventing Web Censorship through Automation Services;Shengtuo Hu, Xiaobo Ma, Muhui Jiang;《2017 IEEE 36th Symposium on Reliable Distributed Systems》;20171019;第214-223页 *

Also Published As

Publication number Publication date
CN109194622A (en) 2019-01-11

Similar Documents

Publication Publication Date Title
CN108874927B (en) Intrusion detection method based on hypergraph and random forest
CN104794192B (en) Multistage method for detecting abnormality based on exponential smoothing, integrated study model
CN103793484B (en) The fraud identifying system based on machine learning in classification information website
CN110046297B (en) Operation and maintenance violation identification method and device and storage medium
CN111385297A (en) Wireless device fingerprint identification method, system, device and readable storage medium
CN109309675A (en) A kind of network inbreak detection method based on convolutional neural networks
CN108334758A (en) A kind of detection method, device and the equipment of user's ultra vires act
CN111191720B (en) Service scene identification method and device and electronic equipment
CN109903053B (en) Anti-fraud method for behavior recognition based on sensor data
CN107368856A (en) Clustering method and device, the computer installation and readable storage medium storing program for executing of Malware
CN109194622B (en) Encrypted flow analysis feature selection method based on feature efficiency
CN113746952B (en) DGA domain name detection method and device, electronic equipment and computer storage medium
CN113052577B (en) Class speculation method and system for block chain digital currency virtual address
CN111866196A (en) Domain name traffic characteristic extraction method, device, equipment and readable storage medium
WO2021012913A1 (en) Data recognition method and system, electronic device and computer storage medium
CN112839014A (en) Method, system, device and medium for establishing model for identifying abnormal visitor
CN111444501B (en) LDoS attack detection method based on combination of Mel cepstrum and semi-space forest
CN114841705B (en) Anti-fraud monitoring method based on scene recognition
CN108268877A (en) A kind of method and apparatus for identifying target terminal
CN112199388A (en) Strange call identification method and device, electronic equipment and storage medium
CN111782908A (en) WEB violation operation behavior detection method based on data mining cluster analysis
CN113746780A (en) Abnormal host detection method, device, medium and equipment based on host image
CN109657703B (en) Crowd classification method based on space-time data trajectory characteristics
CN114817518B (en) License handling method, system and medium based on big data archive identification
CN113746707B (en) Encrypted traffic classification method based on classifier and network structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant