CN105824822A - Method for locating a target page by clustering phishing pages

Info

Publication number: CN105824822A
Application number: CN201510003979.2A
Authority: CN
Inventors: 唐新民; 景晓军; 沈智杰
Original assignee: SURFILTER NETWORK TECHNOLOGY Co Ltd
Current assignee: SURFILTER NETWORK TECHNOLOGY Co Ltd
Priority date: 2015-01-05
Filing date: 2015-01-05
Publication date: 2016-08-03

Abstract

The invention provides a method for locating a target page by clustering phishing pages. The method comprises the following steps: 1, searching for the related page set of a given phishing page; 2, extracting and modeling the web page feature information of the phishing page and the related page set; 3, using the DBSCAN algorithm to cluster the pages by the similarity of their feature information, thereby obtaining a class of similar pages; 4, using the domain name similarity relation to locate the target page among the similar pages of that class. Starting from a given phishing page, the method searches for related pages and clusters them according to multiple web page features, screens and identifies the related page set, copes better with the deception techniques of phishing pages, and finds the target page imitated by the phishing page with high accuracy and on a large scale.

Description

A method for locating a target web page by clustering phishing web pages

Technical field

The present invention relates to the field of information security, and more particularly to a method for locating a target web page from a phishing web page.

Background art

With the wide use of the Internet and the growth and spread of e-commerce, more and more users must identify themselves by entering personal information when trading online. At the same time, phishing by the illegal industry has become increasingly common in recent years: criminals imitate the message format of a real site and lure users into logging in to a counterfeit page, thereby stealing personal information such as bank or credit card account numbers and passwords. Because these fake pages are increasingly realistic, many careless users are easily deceived, which leads to the exposure of sensitive information and to personal financial loss.

At present, Chinese patent No. CN102629261A discloses a method for finding a target web page from a phishing web page. It works mainly from the angle of visual similarity and locates the target page with a perceptual hash method, that is, it captures pages that are "similar in form". However, when imitating a target page, many current phishing pages merely match its color scheme or overall style, so that they are "similar in spirit" to the target page and still deceive users. That method cannot handle this case of being "similar in spirit" rather than "similar in form".

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for locating a target web page from a phishing web page that addresses the above defect of existing methods for finding a target web page from a phishing web page.

The technical solution by which the present invention solves the above problem is a method for locating a target web page by clustering phishing web pages, characterized in that it comprises the following steps:

S1, searching for the related web page set of a given phishing web page;

S2, extracting and modeling the web page feature information of the phishing web page and of the related web page set;

S3, using the DBSCAN algorithm to perform cluster analysis on the similarity of the web page feature information, thereby obtaining web pages of the same similar class;

S4, locating the target web page among the similar web pages of the same class by means of the domain name similarity relation.

In the above method for locating a target web page by clustering phishing web pages, step S1 includes:

S101, extracting the URL hyperlinks from the HTML source code of the phishing web page to obtain the addresses of directly associated web pages;

S102, extracting the keywords of the phishing web page and searching for them with a search engine to obtain the addresses of indirectly associated web pages;

S103, crawling with a web crawler according to the addresses of the directly and indirectly associated web pages to obtain the related web page set.

In the above method, step S2 further includes calculating the similarity distance between the phishing web page and each web page in the related web page set. This similarity distance is V_i(L_i, R_i, US_i, TS_i, LS_i), where L_i is the link relation similarity, R_i is the hierarchical relation similarity, US_i is the domain name similarity, TS_i is the text similarity, and LS_i is the visual similarity.

In the above method, in step S2, the web page feature information includes the link relation, the hierarchical relation, the domain name similarity relation, the text similarity relation and the visual similarity relation.

In the above method, step S2 further includes calculating the modeling length of each similarity relation with respect to the phishing web page, where the similarity relations include the domain name similarity relation, the text similarity relation and the visual similarity relation.

In the above method, step S3 further includes using the modeling length to perform the cluster analysis.

Starting from a known phishing web page, the method provided by the present invention searches for related web pages and clusters them according to multiple web page features, and screens and identifies the related web page set. It can therefore cope better with the deception techniques of phishing web pages and find the target web page imitated by a phishing website with high accuracy and on a large scale.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method for locating a target web page from a phishing web page according to an embodiment of the present invention.

Fig. 2 is a detailed flow chart of step S1 in Fig. 1.

Detailed description of the invention

To make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.

Fig. 1 shows the flow chart of the method for locating a target web page from a phishing web page according to an embodiment of the present invention. In the present embodiment, the method comprises the following steps:

S1, searching for the related web page set of a given phishing web page;

In this step, a phishing web page is one whose URL address and page content counterfeit a real site. Suppose the given phishing web page is denoted P and its related web page set is denoted W_p. As shown in Fig. 2, this step further includes:

S101, extracting the URL hyperlinks from the HTML source code of the phishing web page P to obtain the addresses of directly associated web pages, where the URL hyperlinks are those contained in the BODY tag;

S102, extracting the keywords of the phishing web page P and searching for them with a search engine to obtain the addresses of indirectly associated web pages;

In this step, the keywords include those in the title, the meta tags and the body. The search engine is Google, but is not limited to it; Baidu or another engine may also be used.

S103, crawling with a web crawler according to the addresses of the directly and indirectly associated web pages to obtain the related web page set. The related web page set W_p of the phishing web page P is formally defined as W_p = {W₁, W₂, ..., W_n}, where n is the number of web pages in the related web page set of P.
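As an illustration of S101 and the keyword part of S102, the following minimal Python sketch collects the hyperlinks found inside the BODY tag and gathers candidate keywords from the title and the meta keywords tag. The class name `PageScanner`, the function `scan_page` and the whitespace-based keyword splitting are assumptions made for this example; the patent does not specify an implementation, and the search engine query and the crawler of S103 are not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageScanner(HTMLParser):
    """Collects hyperlinks inside <body> plus title and meta keywords (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_body = False
        self.in_title = False
        self.links = []     # absolute URLs of directly associated pages (S101)
        self.keywords = []  # candidate query terms for the search engine (S102)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self.in_body = True
        elif tag == "title":
            self.in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]
        elif tag == "a" and self.in_body and attrs.get("href"):
            # Resolve relative links against the address of the phishing page.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.keywords += data.split()


def scan_page(html, base_url):
    """Returns (links, keywords) for one page."""
    scanner = PageScanner(base_url)
    scanner.feed(html)
    return scanner.links, scanner.keywords
```

The returned links would feed the crawler directly, while the keywords would be submitted to a search engine to find indirectly associated pages.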

S2, extracting and modeling the web page feature information of the phishing web page and of the related web page set;

In this step, each web page is modeled by five relations and represented as a feature vector V_p = {f₁, f₂, f₃, f₄, f₅}, where f₁, f₂, f₃, f₄ and f₅ denote the link relation, the hierarchical relation, the domain name similarity relation, the text similarity relation and the visual similarity relation, respectively. Each relation is defined as follows.

The link relation L_{i,j} represents the probability that a link in the phishing web page points to the target web page. It is calculated as follows:

L_{i, j} = \frac{NL_{ij}}{N_{i}},

where NL_{ij} is the number of links in web page i that point to any page of the website hosting web page j, and N_{i} is the total number of links contained in web page i.
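The link relation formula can be sketched in a few lines of Python. This is only an illustration: the function name `link_relation` is hypothetical, the "website hosting web page j" is approximated by the host name of its URL, and a page without any links is assumed to score 0.

```python
from urllib.parse import urlparse


def link_relation(links_in_page_i, url_of_page_j):
    """L_ij = NL_ij / N_i, with the website of page j identified by host name (assumption)."""
    n_i = len(links_in_page_i)
    if n_i == 0:
        return 0.0  # assumption: a page without links has no link relation
    site_j = urlparse(url_of_page_j).netloc.lower()
    # NL_ij: links of page i that point to any page on the site of page j.
    nl_ij = sum(1 for link in links_in_page_i
                if urlparse(link).netloc.lower() == site_j)
    return nl_ij / n_i
```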

The hierarchical relation R_{i,j} is based on the rank of web page j in the result list obtained by querying with the representative keywords of web page i; that is, it defines a rank association from web page i to web page j. It is calculated as follows:

R_{i, j} = \frac{N_{r} - (R_{s} - 1)}{N_{r}},

where N_{r} is the length of the result list returned by the query and can be adjusted as a parameter, and R_{s} is the rank of web page j in the returned list. If web page j is not in the returned list, R_{s} is set to 0.
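A small Python sketch of the rank formula follows. The function name `rank_relation` is hypothetical, ranks are assumed to be 1-based, and a page that is missing from the result list is assumed to receive a score of 0, which is one reading of the rule for absent pages and is not confirmed by the text.

```python
def rank_relation(result_urls, url_of_page_j):
    """R_ij = (N_r - (R_s - 1)) / N_r for a 1-based rank R_s."""
    n_r = len(result_urls)
    if n_r == 0 or url_of_page_j not in result_urls:
        return 0.0  # assumption: pages absent from the list get the lowest score
    r_s = result_urls.index(url_of_page_j) + 1  # 1-based rank in the result list
    return (n_r - (r_s - 1)) / n_r
```

With a list of ten results, the top result scores 1.0 and the fourth result scores 0.7, so the score falls linearly with the rank.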

The domain name similarity relation US_{i,j} measures the similarity between two domain names treated as character strings. An edit distance algorithm is used to analyze the domain name similarity of a suspected website. The edit distance between two strings is the minimum number of edit operations needed to turn one into the other; the larger the distance, the more the two strings differ.
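The edit distance described above can be computed with the classic dynamic programming (Levenshtein) scheme. The sketch below is illustrative only; in particular, turning the distance into a similarity in [0, 1] by dividing by the longer string length is an assumption of this example, since the text does not say how the distance is normalized.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def domain_similarity(a, b):
    """Hypothetical normalization: 1.0 for identical names, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a.lower(), b.lower()) / longest
```

For example, `paypal.com` and `paypa1.com` differ by a single substitution, so their edit distance is 1.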

The text similarity relation TS_{i,j} measures the text similarity of web page i to web page j and is calculated as follows:

S201, using the TF-IDF algorithm to extract keywords and their frequencies and to construct term frequency vectors;

S202, using the cosine similarity algorithm to measure the distance between the two term frequency vectors. The closer the cosine value is to 1, the closer the angle is to 0 and the more similar the two vectors are.
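Steps S201 and S202 can be sketched as follows. The tokenization by whitespace, the smoothed IDF term `log((1 + N) / (1 + df)) + 1` and the function names are assumptions chosen to keep the example short and self-contained; the patent does not state which TF-IDF variant is used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Builds one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two pages with identical text give a cosine value of 1 (up to rounding), and two pages without any shared term give 0.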

The visual similarity relation LS_{i,j} measures the layout similarity of web page i to web page j and is established with a perceptual hash algorithm. The perceptual hash algorithm comprises the following steps: reducing the image size; simplifying the colors; calculating the mean value; comparing the pixel gray levels; and calculating the hash value and comparing hashes.
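The last three steps of the perceptual hash can be illustrated with the simple average-hash variant. The sketch assumes that the page screenshot has already been shrunk and converted to a small grayscale grid by an imaging library (the first two steps), that both grids have the same size, and that the similarity is one minus the normalized Hamming distance; none of these details is fixed by the text, and all function names are hypothetical.

```python
def average_hash(gray_pixels):
    """Hashes a small grayscale grid: one bit per pixel, set if pixel >= mean."""
    flat = [value for row in gray_pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value >= mean else 0)
    return bits, len(flat)


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def visual_similarity(gray_a, gray_b):
    """Hypothetical score in [0, 1]; both grids must have the same dimensions."""
    hash_a, n_bits = average_hash(gray_a)
    hash_b, _ = average_hash(gray_b)
    return 1.0 - hamming_distance(hash_a, hash_b) / n_bits
```

Because each bit only records whether a pixel is brighter than the mean, small changes in brightness leave the hash unchanged, while an inverted layout flips every bit.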

In the present embodiment, this step also includes calculating the similarity distance between the phishing web page P and each web page in the related web page set W_p. This similarity distance is V_i(L_i, R_i, US_i, TS_i, LS_i), where i denotes any web page in W_p, i = 1, 2, ..., n, L_i is the link relation similarity, R_i is the hierarchical relation similarity, US_i is the domain name similarity, TS_i is the text similarity, and LS_i is the visual similarity. For the phishing web page P, the similarity distance to itself is written V_p = {1, 1, 1, 1, 1}.

In the present embodiment, this step further includes calculating the modeling length of each similarity relation with respect to the phishing web page P.

S3, using the DBSCAN algorithm to perform cluster analysis on the similarity of the web page feature information, thereby obtaining web pages of the same similar class;

In this step, the similarity distance V_p of the phishing web page P is combined with the similarity distances V_i to form a new set. Based on the coordinate point of the phishing web page, and using the modeling length of each similarity relation with respect to P, the distances between the web pages in the related web page set are obtained. Cluster analysis is then performed with the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, which can form clusters of arbitrary shape. By adjusting the key parameters Eps (scan radius) and MinPts (minimum number of points contained), the clustering distance between the web pages in the related web page set and the phishing web page is kept within a suitable range. As the values of Eps and MinPts increase, related web pages cluster more easily and the correct recognition rate of phishing web pages increases accordingly. In the present embodiment, Eps is about 0.1 to 0.2 and MinPts is 4 or more. The result of the analysis shows whether V_p can be grouped into one class with some V_i, that is, into web pages of the same similar class.
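The clustering step can be illustrated with a compact, textbook DBSCAN written in plain Python. It is a sketch under stated assumptions: the distance between two pages is taken as the plain Euclidean distance between their five-component similarity vectors, the "modeling length" weighting mentioned in the text is not modeled, and the sample vectors in the usage note are invented for illustration.

```python
import math

NOISE = -1


def region_query(points, index, eps):
    """Indices of all points within eps of points[index] (the point itself included)."""
    return [j for j, other in enumerate(points)
            if math.dist(points[index], other) <= eps]


def dbscan(points, eps, min_pts):
    """Returns one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster_id += 1
    return labels
```

As a usage example, placing V_p = (1, 1, 1, 1, 1) first and calling `dbscan(vectors, eps=0.2, min_pts=4)` on three nearby vectors such as (0.95, 0.9, 0.95, 0.95, 0.95) plus two distant ones puts the nearby pages in the same cluster as P and marks the distant ones as noise. The pages that share a label with P are then the candidates examined in the next step.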

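S4, locating the target web page among the similar web pages of the same class by means of the domain name similarity relation;

In this step, the other web pages in the same cluster as the phishing web page P are the target web pages that the phishing web page may imitate.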

Thus, the present invention searches for similar web pages according to multiple web page features and models them according to multiple web page similarity relations. It draws on the edit distance algorithm, the TF-IDF algorithm and the perceptual hash algorithm to extract different types of web page feature information, from domain name to text to image, which ensures that the information is comprehensive. For the extracted feature information, the DBSCAN clustering algorithm is used to cluster the multiple similarity relation models together. The suitable distance range of each similarity relation model is controlled by adjusting the parameters Eps (scan radius) and MinPts (minimum number of points contained) so as to find suitable clusters. The clustering result is thereby obtained efficiently, authenticity is analyzed comprehensively, and a higher recognition accuracy is achieved.

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of the claims.

Claims

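1. A method for locating a target web page by clustering phishing web pages, characterized in that it comprises the following steps:

S1, searching for the related web page set of a given phishing web page;

S2, extracting and modeling the web page feature information of the phishing web page and of the related web page set;

S3, using the DBSCAN algorithm to perform cluster analysis on the similarity of the web page feature information, thereby obtaining web pages of the same similar class;

S4, locating the target web page among the similar web pages of the same class by means of the domain name similarity relation.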

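2. The method for locating a target web page by clustering phishing web pages according to claim 1, characterized in that step S1 includes:

S101, extracting the URL hyperlinks from the HTML source code of the phishing web page to obtain the addresses of directly associated web pages;

S102, extracting the keywords of the phishing web page and searching for them with a search engine to obtain the addresses of indirectly associated web pages;

S103, crawling with a web crawler according to the addresses of the directly and indirectly associated web pages to obtain the related web page set.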

3. The method for locating a target web page by clustering phishing web pages according to claim 1, characterized in that step S2 further includes calculating the similarity distance between the phishing web page and each web page in the related web page set, this similarity distance being V_i(L_i, R_i, US_i, TS_i, LS_i), where L_i is the link relation similarity, R_i is the hierarchical relation similarity, US_i is the domain name similarity, TS_i is the text similarity, and LS_i is the visual similarity.

4. The method for locating a target web page by clustering phishing web pages according to claim 3, characterized in that, in step S2, the web page feature information includes the link relation, the hierarchical relation, the domain name similarity relation, the text similarity relation and the visual similarity relation.

5. The method for locating a target web page by clustering phishing web pages according to claim 4, characterized in that step S2 further includes calculating the modeling length of each similarity relation with respect to the phishing web page, where the similarity relations include the domain name similarity relation, the text similarity relation and the visual similarity relation.

6. The method for locating a target web page by clustering phishing web pages according to claim 5, characterized in that step S3 further includes using the modeling length to perform the cluster analysis.