CN105281973A - Webpage fingerprint identification method aiming at specific website category - Google Patents

Info

Publication number
CN105281973A
Authority
CN
China
Prior art keywords
class
fingerprint
training set
feature
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201510481183.8A
Other languages
Chinese (zh)
Inventor
陈伟
李晨阳
沈婧
张伟
杨庚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201510481183.8A
Publication of CN105281973A
Legal status: Withdrawn

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14: Network analysis or design
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00: Arrangements for monitoring or testing data switching networks
    • H04L43/04: Processing captured monitoring data, e.g. for logfile generation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a webpage fingerprint identification method for a specific website category. It combines a feature selection method based on classification effect with a classification method that combines training set division and result integration, which addresses the imbalanced classification problem that arises when identifying webpage fingerprints of a specific website category. The webpage fingerprint collection method is also improved so that fingerprints can be identified when a caching mechanism is in use. The method is simple, the different fingerprint data generated by different browser operations are taken into account during data collection, the ability of a fingerprint identification system to cope with practical application environments is substantially improved, and the method is of value for network monitoring.

Description

A webpage fingerprint identification method for a specific website category
Technical field
The present invention relates to a webpage fingerprint identification method for a specific website category, and belongs to the fields of data mining and information security technology.
Background art
With the deepening of informatization and the development of the Internet, activities and exchanges carried over the Internet have become increasingly active. This has two sides: while people enjoy the convenience of obtaining information and communicating online, various forms of cybercrime are also on the rise, for example online espionage, online fraud, online pornography and online gambling, and these activities seriously harm national security and social stability. To deal with these potential security threats, identifying and monitoring the network behavior of target groups has become extremely important. However, much Internet behavior today is cross-border: the web servers being accessed are often physically located outside the country, and because of the restrictions of China's Great Firewall and users' own concern for confidentiality, users often rely on anonymous communication technology when operating across borders.
Traditional network behavior monitoring is mostly based on traffic analysis. Traffic classification techniques based on packet payload features are effective when the payload is plaintext, but anonymous communication technology, which is now widely used, encrypts the packet payload with a cryptographic algorithm. A data probe can then no longer obtain the plaintext of a packet, and payload-based traffic classification fails. At present, encrypted traffic analysis mainly relies on statistics-based traffic analysis, and webpage fingerprint identification is a concrete application of this technique in a practical scenario.
In many network information activities, websites are important information carriers, and webpages, as the basic elements of websites, play a very important role: people visit a website and browse its pages to obtain information or to leave their own information on a page. For monitoring network behavior it is therefore very important to determine the category of website that a target group browses, and determining the website category depends on determining the webpage category.
When a target visits a webpage and the browser starts loading it, the anonymous communication tool encrypts the payload of the packets in the loading traffic, but other packet features are not concealed, such as packet size, packet transmission direction, and the transmission order and time intervals of the packets. Webpage fingerprint identification identifies the category of webpage the target is browsing by analyzing these channel features of the encrypted traffic on the communication link. The "fingerprint" of a webpage is the instance of channel features produced on the communication link by the data flow that loads the page. When webpages with different content are loaded, the channel features of the loading traffic on the communication link and the webpage content exhibit a one-to-one correspondence. Encrypted session flows for the relevant webpages are collected in advance, statistical regularities are extracted from them to construct a training set, and a supervised classification algorithm then classifies the encrypted page-loading flow produced by the target, thereby identifying the category of webpage the target visited.
Webpage fingerprint identification is still maturing. The core problem at present is how to improve its identification performance and applicability in real environments, where the environment includes both the system's operating environment and its application environment. In practical applications it is often necessary to decide whether a webpage fingerprint belongs to a specific website. Applicability mainly refers to the ability of a webpage fingerprint identification method to cope with newer web technologies: most browsers now enable caching by default, and caching produces unstable webpage fingerprints, which reduces the accuracy of fingerprint identification.
Therefore, meeting the need to identify webpage fingerprints of a specific website category under real caching conditions is essential for improving the applicability of webpage fingerprint identification in practical application environments. The present invention addresses the above problems.
Summary of the invention
The object of the invention is to provide a webpage fingerprint identification method for a specific website category. The method targets the problem of identifying webpage fingerprints of a specific website category in a caching environment. It mainly uses a feature selection method based on classification effect and a classification method that combines training set division with result integration, which solves the imbalanced classification problem that arises when identifying webpage fingerprints of a specific website category. It also improves the webpage fingerprint collection method, so that the method can be applied to webpage fingerprint identification under a caching mechanism.
The technical solution adopted by the invention is a webpage fingerprint identification method for a specific website category, comprising a feature selection method based on a classification effect value, a classification method that combines training set division with result integration, and a webpage fingerprint collection method based on user operations.
Method flow:
Step 1: Training data collection. On the communication link, collect webpage fingerprint data for all websites the target may visit, under different browser operation modes.
Step 2: Data preprocessing. Remove noise data and redundant data. Noise data include retransmitted packets and bad packets; redundant data include protocol control data.
Step 3: Training set construction. First perform feature extraction: extract the feature values corresponding to the fingerprint features from the preprocessed page-loading data flow. Then combine the feature values of all features into a feature value vector, and append the website category of this page-loading instance to the end of the vector as its class label, forming a training instance. All training instances together constitute the original fingerprint training set.
Step 4: Feature selection. The invention proposes a feature selection method based on a classification effect value. The method divides the fingerprint data set into a positive class and a negative class: the website category to be identified is the positive class and all other website categories form the negative class, as shown in the table below:

                 Contains feature t    Does not contain feature t
$C_i$            A                     B
$\bar{C}_i$      C                     D

Here $C_i$ denotes the positive class to be identified and $\bar{C}_i$ denotes all categories in the sample set other than the website to be identified. A is the number of samples in the positive class that contain feature t; B is the number of samples in the positive class that do not contain feature t; C is the number of samples in the negative class that contain feature t; and D is the number of samples in the negative class that do not contain feature t. The invention measures the correlation between a flow attribute and a website category with the DCR (Distinguish Classification Result) algorithm:
$$DCR(t, C_i) = \frac{A}{A+B} \times \frac{A}{A+C} \times \left(\frac{A}{A+B} - \frac{C}{C+D}\right) \times \frac{FC_i}{F\bar{C}_i}$$
Here $\frac{A}{A+B}$ is the proportion of positive-class fingerprint instances that contain feature attribute t; the larger this value, the better t represents the positive class. $\frac{A}{A+C}$ is the proportion of positive-class instances among the fingerprint instances that contain t; the larger this value, the better this flow attribute distinguishes the class. $\frac{A}{A+B} - \frac{C}{C+D}$ grows as more positive-class samples and fewer negative-class samples contain t; the larger this value, the better the feature attribute discriminates between the classes. $\frac{FC_i}{F\bar{C}_i}$ is the ratio of the average frequency with which t occurs in the positive class to the average frequency with which it occurs in the negative class; the larger this ratio, the stronger the correlation between the feature attribute and the positive class. The classification effect value of each feature with respect to the positive class is calculated over the fingerprint data set, the features are sorted in descending order of this value, and the top N features are selected as the final classification features.
Step 5: Training set division. First divide the whole training set into a positive-class training set and a negative-class training set, denoted $C$ and $\bar{C}$ respectively:
$$C = \{(c_i, +)\}_{i=1}^{n}, \quad \bar{C} = \{(\bar{c}_i, -)\}_{i=1}^{m}$$
Here $c_i$ is the i-th positive-class sample and n is the number of positive-class samples; $\bar{c}_i$ is the i-th negative-class sample and m is the number of negative-class samples. The negative-class training set is then divided by random partitioning:
$$\bar{C}_i = \{(\bar{c}_k^{\,i}, -)\}_{k=1}^{l_i}, \quad i = 1, 2, \ldots, N$$
Here $\bar{C}_i$ is the i-th negative-class sub-training set after division and $l_i$ is the number of samples in it. The number of blocks N is determined from m, the number of negative-class training samples, and n, the number of positive-class training samples; since each block is meant to match the size of the positive class, this corresponds to $N = m/n$. Finally, the positive-class training set is merged with each negative-class sub-training set, giving N sub-training sets:
$$T_i = C \cup \bar{C}_i, \quad i = 1, 2, \ldots, N$$
Here $T_i$ is a final sub-training set. Because the number of positive-class samples equals the number of negative-class samples in each sub-training set, a conventional classifier can be applied to these training sets.
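The division in step 5 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `split_training_set`, the fixed random seed, and the use of a ceiling when m is not a multiple of n are all assumptions made for the example.

```python
import math
import random


def split_training_set(positives, negatives, seed=0):
    """Build sub-training sets T_i = C + C_bar_i as lists of (sample, label).

    Assumption: the number of blocks is ceil(m / n), so that each negative
    block is about the size of the positive class.
    """
    n, m = len(positives), len(negatives)
    if n == 0:
        raise ValueError("the positive class must not be empty")
    num_blocks = max(1, math.ceil(m / n))

    # Random division of the negative class into num_blocks blocks.
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)
    blocks = [shuffled[i::num_blocks] for i in range(num_blocks)]

    # Every sub-training set reuses the whole positive class.
    labelled_pos = [(s, "+") for s in positives]
    return [labelled_pos + [(s, "-") for s in block] for block in blocks]
```

With 10 positive and 390 negative samples this yields 39 sub-training sets of 20 samples each, which mirrors the 1:39 ratio used later in the embodiment.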
Step 6: Classification. After the training set has been divided, a conventional classifier is used on each sub-training set to classify the fingerprint data produced by the target. The invention uses a KNN classifier based on cosine similarity matching. When the target's page-loading traffic arrives, the operations of step 2 and step 3 are applied to convert it into a webpage fingerprint instance to be classified. The features chosen by feature selection are then used for the similarity computation: the cosine similarity between the fingerprint instance to be classified and every instance in the training set is calculated, and the K fingerprints most similar to the fingerprint to be classified are selected. The similarity is computed as:
$$Sim(d_i, d_j) = \frac{\sum_{k=1}^{M} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{M} W_{ik}^{2}\right)\left(\sum_{k=1}^{M} W_{jk}^{2}\right)}}$$
Here $d_i$ is the fingerprint to be classified, $d_j$ is a fingerprint in the training set, $W_{ik}$ is the k-th feature value of the fingerprint to be classified, and $W_{jk}$ is the k-th feature value of the training set fingerprint. The class weights over the K neighbors of the fingerprint to be classified are then calculated as $p(d_i, C_j) = \sum_{d_l \in KNN(d_i)} Sim(d_i, d_l)\, y(d_l, C_j)$, where $d_i$ is the fingerprint to be classified, $d_l$ is one of its K neighboring fingerprints, $Sim(d_i, d_l)$ is the similarity matching algorithm, and $y(d_l, C_j)$ is the classification function, whose value is 1 when $d_l$ belongs to class $C_j$ and 0 otherwise. Finally, the class weights of the classes are compared to determine the website category of the fingerprint to be classified.
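As a rough illustration of step 6, the sketch below computes the cosine similarity between two feature value vectors and sums the similarities of the K nearest neighbors per class. The function names and the representation of the training set as a list of `(vector, label)` pairs are assumptions for the example, and K defaults to 5 only because the embodiment uses the top 5 neighbors.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature value vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # an all-zero vector matches nothing


def knn_class_weights(query, training, k=5):
    """Return {label: summed similarity} over the k most similar fingerprints.

    training is a list of (feature_vector, label) pairs.
    """
    scored = sorted(
        ((cosine_similarity(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    weights = defaultdict(float)
    for sim, label in scored[:k]:
        weights[label] += sim
    return dict(weights)
```

The class with the largest weight in the returned dictionary is the result of the classifier on that sub-training set.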
Step 7: Result integration. After the classifier has classified on every sub-training set, N classification results are produced, one for each block of the training set division. These results are then integrated on a maximization principle to obtain the final classification result, as follows:
$$W_i = F(T_i), \quad i = 1, 2, \ldots, N$$
$$W = MAX(W_1, W_2, \ldots, W_N)$$
Classifying on each sub-training set $T_i$ gives a result $W_i$, which consists of two parts: the website category c to which the fingerprint to be classified belongs, and the classification weight p with which it belongs to that category. The $W_k$ with the largest p among all classification results is chosen as the final classification result.
Beneficial effects:
1. The invention is simple and, during data collection, takes into account the different fingerprint data generated by different browser operations, which greatly strengthens the ability of a fingerprint identification system to cope with practical application environments.
2. The invention applies the feature selection method to the fingerprint features, which reduces the computational complexity of the fingerprint similarity calculation and effectively improves identification performance.
3. For the imbalanced classification that arises in specific website identification, the invention proposes a training set division, classification and integration approach, which effectively improves specific website identification performance and the identification rate of webpage fingerprints for a website category.
Brief description of the drawings
Fig. 1 is a flow chart of the webpage fingerprint identification method of the invention.
Fig. 2 is a schematic diagram of training set division.
Fig. 3 is a schematic diagram of the usage environment.
Fig. 4 shows the 20 features with the highest DCR values in a training set.
Fig. 5 shows the fingerprint instance similarities in the sub-training sets.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings.
The technical terms used in the invention are as follows:
Supervised classification: learning or building a decision model from class-labeled training data and using that model to infer the class of unlabeled instances. In supervised learning each training instance consists of a feature value vector and a class; applying a supervised learning algorithm to the training data produces a classification model that can decide the class of new unlabeled instances. The related terms are defined below.
Training set: when verifying the effectiveness of supervised classification, sample data are generally divided into two parts, a training set and a test set. The training set is used to build the classification model and the test set is used to check classification performance. In most cases raw data cannot be used directly as a training set; the training set is built after steps such as data preprocessing and feature selection.
Classifier: the classification rule formed by applying a classification algorithm to a training set. A classifier maps an instance to be classified to one of the given classes. The invention uses a KNN classifier as the webpage fingerprint identification classifier.
Imbalanced classification: classification on training samples whose distribution over the classes is unbalanced. The sample size of some classes is far smaller than that of the others, yet these are the classes of interest. Because their samples are scarce, conventional classifiers are insensitive to these classes, so specific methods are needed to identify rare-class samples.
Feature selection: selecting N features from the existing M features so as to optimize some performance metric of the classification system. The most effective features are chosen from the original feature set to reduce the dimensionality of the training set, which reduces classification complexity and improves classification performance.
When the method is applied in a real environment, many commercial anonymous communication tools and browsers are available; the invention is described using the shadowsocks circumvention tool and the Chrome browser. As shown in Fig. 3, the target uses the shadowsocks tool, which connects to a remote SOCKS proxy server, together with the Chrome browser to visit overseas websites. The shadowsocks tool establishes an anonymous encrypted communication channel between the target user and the remote proxy server. This channel passes through a switching device controlled by the monitor. The switching device is configured with a mirror port, so the monitor can capture the target user's traffic, extract the corresponding page-loading data from it, and analyze it. The monitor and the target user are in the same communication link environment, the traffic produced by the target can be obtained by the monitor, and the payload of the traffic is encrypted. The invention operates in this environment according to the method flow shown in Fig. 1; the concrete analysis steps are:
Step 1: Using a data collection tool on the controlled switching device, the monitor collects data, which comprises target data collection and training data collection. Target data collection uses the data collection tool to capture the traffic produced when the target browses websites and extracts the page-loading traffic from it. In training data collection the monitor uses a browser to visit the websites the target may browse, and collects and extracts the page-loading traffic of each website on the communication link; each website is visited with 4 different browser operation modes, and 10 fingerprint data items are collected for each mode. The data collection tool exports the fingerprint data to CSV files. Each fingerprint data item records all traffic between the browser and the remote web server during one page load and consists of a series of TCP packets. Because of the encryption, the payload of the TCP packets cannot be obtained; the remaining TCP packet information comprises: packet sequence number, packet transmission time, source IP address, destination IP address, packet size and packet description.
Step 2: After the fingerprint data have been collected, they must be preprocessed. The TCP data in the fingerprint data contain a large amount of protocol control data, which is mainly used to establish and tear down TCP connections. The raw fingerprint data also contain other redundant and noise data, including TCP retransmitted packets and bad TCP packets. In this case, packets in the shadowsocks fingerprint data whose size is less than 70 are treated as protocol control packets and removed, and packets whose description contains "Retransmission", "Dup" or "Out-of-order" are treated as bad or retransmitted packets and removed.
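The filtering rule of step 2 might look like the following sketch. The `Packet` record mirrors the six fields listed in step 1, but its field names, the `preprocess` function and the constants are hypothetical names chosen for illustration; the size threshold of 70 and the three description markers are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """One captured TCP packet; field names are illustrative."""
    seq: int      # packet sequence number
    time: float   # transmission time
    src: str      # source IP address
    dst: str      # destination IP address
    size: int     # packet size
    info: str     # packet description


MIN_SIZE = 70  # smaller packets are treated as protocol control packets
NOISE_MARKERS = ("Retransmission", "Dup", "Out-of-order")


def preprocess(packets):
    """Drop protocol control packets and bad or retransmitted packets."""
    return [
        p for p in packets
        if p.size >= MIN_SIZE
        and not any(marker in p.info for marker in NOISE_MARKERS)
    ]
```

A substring test on the description field is the simplest reading of "contains"; a real capture tool may need a stricter match.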
Step 3: After redundant and noise data have been removed, the raw data must be organized into a data set. First, the distinct (packet size, packet direction) pairs occurring in the packets of all page-loading traffic are extracted and used as the original features. Then, for each page-loading flow, the number of packets matching a given (packet size, packet direction) feature is taken as the feature value, and the website category of the fingerprint flow is appended, which yields the webpage fingerprint training set:
(1048,-)    (116,-)    ...    (148,-)    Website Class
449         3          ...    0          Asahi Shimbun
438         4          ...    0          Asahi Shimbun
...         ...        ...    ...        ...
678         0          ...    2          youtube
578         0          ...    0          New York Times
In the table, each row represents one complete page-loading data flow instance. Each feature is a pair consisting of a packet size and a packet transmission direction; for example, (1048,-) denotes packets of size 1048 transmitted from the server to the browser. Each feature value is the number of packets in the fingerprint instance that match the feature; for example, 449 means that in a page-loading flow whose website category is Asahi Shimbun, 449 packets of size 1048 were transmitted from the server to the browser.
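The construction of this table can be sketched as follows. The sketch assumes each flow has already been reduced to a list of `(size, direction)` pairs, with direction written as `"+"` or `"-"` as in the table; the function name `build_training_set` and the sorted column order are illustrative choices, not part of the source.

```python
from collections import Counter


def build_training_set(flows):
    """Turn page-loading flows into count vectors over (size, direction) features.

    flows is a list of (packets, website_class), where packets is a list of
    (size, direction) pairs. Returns (features, rows); each row is
    (feature_value_vector, website_class).
    """
    # Original features: every distinct (size, direction) pair seen anywhere.
    features = sorted({pkt for packets, _ in flows for pkt in packets})

    rows = []
    for packets, website_class in flows:
        counts = Counter(packets)
        # Feature value: number of packets in this flow matching the feature.
        rows.append(([counts[f] for f in features], website_class))
    return features, rows
```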
Step 4: Feature selection. Suppose the specific website category to be identified in this case is facebook and that a training set contains the feature (962,+). Among the fingerprint instances of category facebook, 19 contain this feature and 21 do not; among the fingerprint instances of other categories, 0 contain it and 80 do not. The number of packets matching the feature is 38 in the facebook fingerprint instances and 0 in the non-facebook fingerprint instances. The DCR value of feature (962,+) is therefore:
$$DCR(962,+) = \frac{19}{19+21} \times \frac{19}{19+0} \times \left(\frac{19}{19+21} - \frac{0}{0+80}\right) \times \frac{38}{0+1} = 8.57375$$
The DCR value is calculated for every feature in the training set, and the 20 features with the highest DCR values are selected as classification features, as shown in Fig. 4.
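The worked example can be checked with a short script. The sketch below is an interpretation rather than the patent's code: the function names are invented, and adding 1 to the negative-class packet count in the last factor is an assumption taken from the worked example (its denominator is written as 0 + 1), since the general formula does not show it.

```python
def dcr(a, b, c, d, f_pos, f_neg):
    """DCR value of one feature for the positive class.

    a, b: positive-class instances with / without the feature
    c, d: negative-class instances with / without the feature
    f_pos, f_neg: packets matching the feature in each class
    """
    if a == 0:
        return 0.0  # the feature never occurs in the positive class
    coverage = a / (a + b)                        # share of positives with t
    purity = a / (a + c)                          # share of positives among instances with t
    contrast = coverage - (c / (c + d) if (c + d) else 0.0)
    freq_ratio = f_pos / (f_neg + 1)              # +1 as in the worked example (assumption)
    return coverage * purity * contrast * freq_ratio


def select_top_features(stats, top_n=20):
    """Rank features by DCR, highest first; stats maps feature -> dcr() arguments."""
    ranked = sorted(stats, key=lambda t: dcr(*stats[t]), reverse=True)
    return ranked[:top_n]
```

Calling `dcr(19, 21, 0, 80, 38, 0)` reproduces the value 8.57375 given above, up to floating-point rounding.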
Step 5: Training set division. As shown in Fig. 2, suppose that webpage fingerprint data have been collected for 40 websites the target may visit. The fingerprint data of category facebook then make up 1/40 of all data in the training set, and the ratio of positive-class to negative-class fingerprint data in the training set is 1:39. The negative-class fingerprint data are randomly divided into 39 negative-class sub-data sets, and each of these is merged with the positive-class data set to form a sub-training set in which the numbers of positive-class and negative-class fingerprints are equal.
Step 6: Classification. The cosine similarity over the selected features is calculated between the target fingerprint to be classified and every fingerprint instance in each sub-training set. As shown in Fig. 5, taking ('facebook.59', 0.995499913195721) as an example, this entry means that the fingerprint instance with sequence number 59 and category facebook in the training set has a similarity of 0.995499913195721 to the fingerprint under test. The fingerprint instances of each sub-training set are sorted from high to low by their similarity to the fingerprint under test, and the top 5 similarity results take part in the classification decision. Here the top 5 results are: (facebook, 0.995), (facebook, 0.692), (zhaori, 0.626), (facebook, 0.614), (facebook, 0.610), where zhaori is the class label of the Asahi Shimbun website. The class weight of the target fingerprint is then calculated as the sum of the similarities of each class: the class weight of facebook is 0.995 + 0.692 + 0.614 + 0.610 = 2.911 and the class weight of non-facebook is 0.626. The sub-classifier takes the class with the largest weight as its classification result, so the classifier for this sub-training set outputs facebook.
Step 7: Result integration. After the KNN classifier has produced a classification result for the fingerprint under test on every sub-training set, the results of the sub-classifiers are combined on the maximization principle: the sub-classification result with the largest class weight among all sub-classifiers is chosen as the final classification result. For the two sub-training sets in Fig. 5, the classification results on the first are (facebook, 2.911) and (non-facebook, 0.626), and on the second are (facebook, 2.157) and (non-facebook, 1.261). The largest class weight is 2.911, so its class, facebook, is taken as the result of combining these two sub-classifiers.
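The maximization step can be illustrated with the two sub-classifier results quoted above. The sketch assumes each sub-classifier reports its class weights as a dictionary; the names `classify_on_subset` and `integrate` are hypothetical.

```python
def classify_on_subset(class_weights):
    """W_i for one sub-training set: the class with the largest weight, and that weight."""
    label = max(class_weights, key=class_weights.get)
    return label, class_weights[label]


def integrate(sub_results):
    """Final result W: the (class, weight) pair with the largest weight overall."""
    return max((classify_on_subset(w) for w in sub_results), key=lambda r: r[1])


sub_results = [
    {"facebook": 2.911, "non-facebook": 0.626},  # first sub-training set
    {"facebook": 2.157, "non-facebook": 1.261},  # second sub-training set
]
final_label, final_weight = integrate(sub_results)
```

For these inputs the selected pair is facebook with weight 2.911, in agreement with the text.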

Claims (5)

1. A webpage fingerprint identification method for a specific website, characterized in that the method comprises the following steps:
Step 1: Training data collection; on the communication link, collecting webpage fingerprint data for all websites the target may visit, under different browser operation modes;
Step 2: Data preprocessing; removing noise data and redundant data, wherein the noise data include retransmitted packets and bad packets, and the redundant data include protocol control data;
Step 3: Training set construction; first performing feature extraction by extracting the feature values corresponding to the fingerprint features from the preprocessed page-loading data flow, then combining the feature values of all features into a feature value vector, and appending the website category of this page-loading instance to the end of the vector as the class label of the feature value vector to form a training instance, all training instances together constituting the original fingerprint training set;
Step 4: Feature selection; dividing the fingerprint data set into a positive class and a negative class, wherein the website category to be identified is the positive class and the other website categories are the negative class;
Step 5: Training set division; first dividing the whole training set into a positive-class training set and a negative-class training set, denoted $C$ and $\bar{C}$ respectively:
$$C = \{(c_i, +)\}_{i=1}^{n}, \quad \bar{C} = \{(\bar{c}_i, -)\}_{i=1}^{m}$$
wherein $c_i$ is the i-th positive-class sample and n is the number of positive-class samples; $\bar{c}_i$ is the i-th negative-class sample and m is the number of negative-class samples; the negative-class training set is then divided by random partitioning:
$$\bar{C}_i = \{(\bar{c}_k^{\,i}, -)\}_{k=1}^{l_i}, \quad i = 1, 2, \ldots, N$$
wherein $\bar{C}_i$ is the i-th negative-class sub-training set after division and $l_i$ is the number of samples in it; the number of blocks N is determined from m, the number of negative-class training samples, and n, the number of positive-class training samples, which, given the equal class sizes stated below, corresponds to $N = m/n$; finally, the positive-class training set is merged with each negative-class sub-training set, giving N sub-training sets:
$$T_i = C \cup \bar{C}_i, \quad i = 1, 2, \ldots, N$$
wherein $T_i$ is a final sub-training set; in each sub-training set the number of positive-class samples equals the number of negative-class samples, and a conventional classifier is used to classify on these training sets;
Step 6: classification; After training set divides, traditional classifier is used to classify to the finger print data to be sorted that target produces on each training subset;
Step 7: result integration; after the training set has been divided and a classifier has been applied on each sub-training set, N classification results are produced, equal in number to the blocks of the training-set division; finally, these classification results are integrated on a maximization principle to obtain the final classification result, as follows:
$$W_i = F(T_i), \quad i = 1, 2, \ldots, N$$
$$W = \mathrm{MAX}(W_1, W_2, \ldots, W_N)$$
Classifying with each divided training subset yields a result $W_i$ for that subset; this result consists of two parts: the website category c to which the fingerprint to be classified belongs, and the classification weight p with which that fingerprint belongs to the category; the $W_k$ with the largest p value among all classification results is chosen as the final classification result.
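The division of step 5 and the integration of step 7 can be sketched in Python as follows. This is only an illustrative sketch, not the claimed implementation: the function names `split_training_set` and `integrate` are hypothetical, and the block count is assumed to be N = ceil(m / n), which roughly balances each sub-training set but is not taken from the formula referred to in step 5.

```python
import math
import random


def split_training_set(positive, negative, seed=0):
    """Randomly split the negatives into N blocks and merge each with all positives."""
    n, m = len(positive), len(negative)
    if n == 0:
        raise ValueError("at least one positive-class sample is required")
    # Assumption: N = ceil(m / n); the source's own formula for N is not used here.
    num_blocks = max(1, math.ceil(m / n))
    shuffled = list(negative)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled negatives round-robin so every sample lands in one block.
    blocks = [shuffled[i::num_blocks] for i in range(num_blocks)]
    # T_i = C union C-bar_i, with "+" and "-" as class labels.
    return [
        [(x, "+") for x in positive] + [(x, "-") for x in block]
        for block in blocks
    ]


def integrate(results):
    """Return the (category, weight) pair with the largest weight p."""
    return max(results, key=lambda r: r[1])
```

Each per-subset classifier is expected to return a pair (c, p); `integrate` then keeps the pair whose p is largest, which corresponds to W = MAX(W_1, ..., W_N).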
2. The web page fingerprint identification method for a specific website according to claim 1, characterized in that the feature selection of step 4 is as shown in the table, comprising:
where $C_i$ represents the positive class to be identified, and $\bar{C}_i$ represents all categories in the sample set other than the website to be identified; A represents the frequency of positive-class samples containing feature t; B represents the frequency of positive-class samples not containing feature t; C represents the frequency of negative-class samples containing feature t; and D represents the frequency of negative-class samples not containing feature t.
3. The web page fingerprint identification method for a specific website according to claim 2, characterized in that the feature selection of step 4 is based on the following assumptions: a) if a feature occurs widely in the positive-class fingerprint samples, that is, the more evenly this feature attribute is distributed within the positive class, the better the feature represents the website category; b) if a feature occurs widely in the positive-class fingerprint samples and seldom occurs in the negative-class fingerprint samples, the feature distinguishes the category well; c) if the positive-class fingerprint samples widely contain a feature and the negative-class samples seldom contain it, the feature is an identifying feature of the website category; d) considering how frequently a feature occurs within the samples, if its occurrence frequency in the positive-class samples is much larger than its occurrence frequency in the negative-class samples, the feature distinguishes the category well; the DCR algorithm used to measure the correlation between a stream attribute and a website category is:
$$\mathrm{DCR}(t, C_i) = \frac{A}{A+B} \times \frac{A}{A+C} \times \left(\frac{A}{A+B} - \frac{C}{C+D}\right) \times \frac{FC_i}{F\bar{C}_i}$$
where $\frac{A}{A+B}$ is the ratio of the number of positive-class samples containing stream attribute t to the positive-class fingerprint sample set; the larger this value, the better stream attribute t represents the positive class; $\frac{A}{A+C}$ is the ratio of the number of positive-class samples containing stream attribute t to the number of samples containing t in the whole fingerprint sample set; the larger this value, the better the stream attribute distinguishes the category; $\frac{A}{A+B} - \frac{C}{C+D}$ expresses that many positive-class samples contain stream attribute t while few negative-class samples contain it; the larger this value, the better the stream attribute distinguishes the category; $\frac{FC_i}{F\bar{C}_i}$ is the ratio of the average occurrence frequency of stream attribute t in the positive class to its average occurrence frequency in the negative class; the larger this ratio, the more strongly the stream attribute correlates with the positive class; the classification-effect value of each feature in the fingerprint data set is calculated for the positive class, the features are sorted in descending order of this value, and the top N ranked features are selected as the final classification features.
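A minimal Python sketch of the DCR score and the top-N selection described in claim 3 is given below. The names `dcr`, `select_top_features`, `freq_pos` and `freq_neg` are hypothetical, and the sketch assumes every denominator (A+B, A+C, C+D and the negative-class average frequency) is nonzero.

```python
def dcr(a, b, c, d, freq_pos, freq_neg):
    """DCR(t, C_i) from the counts A, B, C, D and the two average frequencies."""
    # Assumption: all denominators are nonzero; no smoothing is applied.
    representativeness = a / (a + b)          # share of positive samples containing t
    concentration = a / (a + c)               # share of t-containing samples that are positive
    contrast = a / (a + b) - c / (c + d)      # positive coverage minus negative coverage
    frequency_ratio = freq_pos / freq_neg     # average frequency in positive vs negative class
    return representativeness * concentration * contrast * frequency_ratio


def select_top_features(stats, top_n):
    """Rank features by DCR in descending order and keep the first top_n names."""
    scores = {name: dcr(*values) for name, values in stats.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Here `stats` maps a feature name to the tuple (A, B, C, D, freq_pos, freq_neg) gathered from the training fingerprints.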
4. The web page fingerprint identification method for a specific website according to claim 1, characterized in that step 6 comprises: using a KNN classifier based on a cosine similarity matching algorithm; KNN, that is, K-Nearest Neighbor, calculates the similarity between the instance to be classified and the training samples of known category in the training set, and selects the K training samples closest to the sample to be classified; if these K samples all belong to the same category, the sample to be classified also belongs to that category; otherwise each category is evaluated, and the category of the sample to be classified is finally determined according to a certain rule; after the target web page loading traffic arrives, the operations of step 2 and step are performed again to convert it into a web page fingerprint instance to be classified; the features chosen by feature selection then take part in the similarity calculation, the cosine similarity between the fingerprint instance to be classified and each instance in the training set is calculated, and the K fingerprints most similar to the fingerprint to be classified are selected; the similarity is calculated as:
$$\mathrm{Sim}(d_i, d_j) = \frac{\sum_{k=1}^{M} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{M} W_{ik}^{2}\right)\left(\sum_{k=1}^{M} W_{jk}^{2}\right)}}$$
where $d_i$ is the fingerprint to be classified, $d_j$ is a fingerprint in the training set, $W_{ik}$ is the k-th feature value of the fingerprint to be classified, and $W_{jk}$ is the k-th feature value of the training-set fingerprint; the class weights over the K nearest neighbors of the fingerprint to be classified are then calculated from the fingerprint to be classified $d_i$, each of its K neighboring fingerprints, the similarity matching algorithm $\mathrm{Sim}$, and a classification function whose value is 1 when the neighboring fingerprint belongs to category $C_j$ and 0 otherwise; finally, the class weights of the categories are compared to determine the website category of the fingerprint to be classified.
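The cosine-similarity KNN classification of claim 4 might look like the following Python sketch. The names `cosine_similarity` and `knn_classify` are hypothetical. The class weight is assumed here to be the sum of the similarities of the neighbors belonging to each category, a common choice that matches the description of a similarity multiplied by a 0/1 classification function, but the exact class-weight formula is not reproduced from the source.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Sim(d_i, d_j): dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_classify(query, training, k):
    """Return (category, weight) for the query fingerprint using its K nearest neighbors."""
    # Score every training fingerprint and keep the K most similar ones.
    scored = sorted(
        ((cosine_similarity(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Assumption: class weight = sum of similarities of neighbors in that class.
    weights = defaultdict(float)
    for sim, label in scored:
        weights[label] += sim
    best = max(weights, key=weights.get)
    return best, weights[best]
```

The returned pair has the same shape as the per-subset result W_i of step 7, a category c and a weight p, so the outputs of several sub-training sets can be compared directly by their weights.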
5. The web page fingerprint identification method for a specific website according to claim 1, characterized in that the method is applied to web page fingerprint identification in the presence of a caching mechanism.
CN201510481183.8A 2015-08-07 2015-08-07 Webpage fingerprint identification method aiming at specific website category Withdrawn CN105281973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510481183.8A CN105281973A (en) 2015-08-07 2015-08-07 Webpage fingerprint identification method aiming at specific website category

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510481183.8A CN105281973A (en) 2015-08-07 2015-08-07 Webpage fingerprint identification method aiming at specific website category

Publications (1)

Publication Number Publication Date
CN105281973A true CN105281973A (en) 2016-01-27

Family

ID=55150342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510481183.8A Withdrawn CN105281973A (en) 2015-08-07 2015-08-07 Webpage fingerprint identification method aiming at specific website category

Country Status (1)

Country Link
CN (1) CN105281973A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107819664A (en) * 2016-09-12 2018-03-20 阿里巴巴集团控股有限公司 A kind of recognition methods of spam, device and electronic equipment
CN106850333B (en) * 2016-12-23 2019-11-29 中国科学院信息工程研究所 A kind of network equipment recognition methods and system based on feedback cluster
CN106850333A (en) * 2016-12-23 2017-06-13 中国科学院信息工程研究所 A kind of network equipment recognition methods and system based on feedback cluster
CN107273416A (en) * 2017-05-05 2017-10-20 深信服科技股份有限公司 The dark chain detection method of webpage, device and computer-readable recording medium
CN107273416B (en) * 2017-05-05 2021-05-04 深信服科技股份有限公司 Webpage hidden link detection method and device and computer readable storage medium
CN109214193A (en) * 2017-07-05 2019-01-15 阿里巴巴集团控股有限公司 Data encryption, machine learning model training method, device and electronic equipment
CN107766481A (en) * 2017-10-13 2018-03-06 国家计算机网络与信息安全管理中心 A kind of method and system for finding internet financial platform
CN108924090A (en) * 2018-06-04 2018-11-30 上海交通大学 A kind of shadowsocks flow rate testing methods based on convolutional neural networks
CN108924090B (en) * 2018-06-04 2020-12-11 上海交通大学 Method for detecting traffics of shadowsocks based on convolutional neural network
CN109194622B (en) * 2018-08-08 2020-03-31 西安交通大学 Encrypted flow analysis feature selection method based on feature efficiency
CN109194622A (en) * 2018-08-08 2019-01-11 西安交通大学 A kind of encryption flow analysis feature selection approach based on feature efficiency
CN109831448A (en) * 2019-03-05 2019-05-31 南京理工大学 For the detection method of particular encryption web page access behavior
CN110324310A (en) * 2019-05-21 2019-10-11 国家工业信息安全发展研究中心 Networked asset fingerprint identification method, system and equipment
CN111224940A (en) * 2019-11-15 2020-06-02 中国科学院信息工程研究所 Anonymous service traffic correlation identification method and system nested in encrypted tunnel
CN111698223A (en) * 2020-05-22 2020-09-22 哈尔滨工程大学 Encrypted WEB fingerprint identification method based on automatic feature engineering
CN111698223B (en) * 2020-05-22 2022-02-22 哈尔滨工程大学 Encrypted WEB fingerprint identification method based on automatic feature engineering
CN111835720A (en) * 2020-06-10 2020-10-27 南京邮电大学 VPN flow WEB fingerprint identification method based on feature enhancement
CN111835720B (en) * 2020-06-10 2023-04-07 南京邮电大学 VPN flow WEB fingerprint identification method based on feature enhancement
CN112583738A (en) * 2020-12-29 2021-03-30 北京浩瀚深度信息技术股份有限公司 Method, equipment and storage medium for analyzing and classifying network flow
CN113641935A (en) * 2021-08-12 2021-11-12 厦门大学 Method for enhancing and improving anonymous network webpage fingerprint monitoring capability by using data
CN113641935B (en) * 2021-08-12 2023-10-20 厦门大学 Method for improving anonymous network webpage fingerprint monitoring capability by utilizing data enhancement

Similar Documents

Publication Publication Date Title
CN105281973A (en) Webpage fingerprint identification method aiming at specific website category
Yu et al. PBCNN: packet bytes-based convolutional neural network for network intrusion detection
CN107749859B (en) Malicious mobile application detection method for network encryption traffic
Singh Panwar et al. Evaluation of network intrusion detection with features selection and machine learning algorithms on CICIDS-2017 dataset
CN104009836B (en) Encryption data detection method and system
Husain et al. Development of an efficient network intrusion detection model using extreme gradient boosting (XGBoost) on the UNSW-NB15 dataset
Bajaj et al. Improving the intrusion detection using discriminative machine learning approach and improve the time complexity by data mining feature selection methods
CN107370752B (en) Efficient remote control Trojan detection method
CN107332848A (en) A kind of exception of network traffic real-time monitoring system based on big data
Groleat et al. Hardware acceleration of SVM-based traffic classification on FPGA
Bajaj et al. Dimension reduction in intrusion detection features using discriminative machine learning approach
CN109286576A (en) A kind of network agent encryption traffic characteristic extracting method of data packet frequency analysis
CN114257428A (en) Encrypted network traffic identification and classification method based on deep learning
Silva et al. A statistical analysis of intrinsic bias of network security datasets for training machine learning mechanisms
Bian et al. Host in danger? detecting network intrusions from authentication logs
Yan et al. Principal Component Analysis Based Network Traffic Classification.
CN102984131B (en) A kind of information identifying method and device
Harbola et al. Improved intrusion detection in DDoS applying feature selection using rank & score of attributes in KDD-99 data set
CN101764754B (en) Sample acquiring method in business identifying system based on DPI and DFI
CN105429817A (en) Illegal business identification device and illegal business identification method based on DPI and DFI
CN110519228B (en) Method and system for identifying malicious cloud robot in black-production scene
Tian et al. A transductive scheme based inference techniques for network forensic analysis
Thanh et al. An approach to reduce data dimension in building effective network intrusion detection systems
Kato et al. Large-scale network packet analysis for intelligent DDoS attack detection development
CN107306252B (en) A kind of data analysing method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C04 Withdrawal of patent application after publication (patent law 2001)
WW01 Invention patent application withdrawn after publication

Application publication date: 20160127