CN105824822A - Method clustering phishing page to locate target page - Google Patents

Info

Publication number
CN105824822A
Authority
CN
China
Prior art keywords
webpage
similarity
web page
phishing webpage
similarity relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510003979.2A
Other languages
Chinese (zh)
Inventor
唐新民
景晓军
沈智杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SURFILTER NETWORK TECHNOLOGY Co Ltd
Original Assignee
SURFILTER NETWORK TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SURFILTER NETWORK TECHNOLOGY Co Ltd
Priority to CN201510003979.2A
Publication of CN105824822A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for locating a target page by clustering phishing pages. The method comprises the following steps: 1, searching for the related page set of a given phishing page; 2, extracting and modeling webpage feature information of the phishing page and the related page set; 3, using the DBSCAN algorithm to cluster the pages by the similarity of their webpage feature information, thereby obtaining similar pages of the same class; 4, using the domain name similarity relation to locate the target page among those similar pages. Starting from a given phishing page, the method searches for related webpages and clusters them according to multiple webpage features, thereby screening and identifying the related page set, coping better with the deception techniques of phishing pages, and finding the target page imitated by the phishing pages with high accuracy and on a large scale.

Description

A method for locating a target webpage by clustering phishing webpages
Technical Field
The present invention relates to the field of information security, and more particularly to a method for locating a target webpage from a phishing webpage.
Background
With the widespread use of the Internet and the continued growth and popularization of e-commerce, more and more users must enter personal information to identify themselves when transacting online. At the same time, phishing by illegal operators has become increasingly common in recent years: attackers imitate the format of a genuine site and lure users into logging in to a counterfeit webpage, thereby stealing personal information such as bank or credit card account numbers and passwords. Because these counterfeit webpages are increasingly realistic, many careless users are easily deceived, which exposes sensitive information and causes personal financial loss.
Chinese patent CN102629261A discloses a method for finding the target webpage from a phishing webpage. It works mainly from the angle of visual similarity, locating the target webpage with a perceptual hash method, i.e. it detects pages that are "similar in form". However, many current phishing webpages imitate the target only through a matching color scheme or a consistent overall style, deceiving users by being "similar in spirit" to the target webpage. That method cannot handle pages that are "similar in spirit" but not "similar in form".
Summary of the Invention
The technical problem to be solved by the present invention is to provide a method for locating a target webpage from a phishing webpage that addresses the above defect of the existing method.
To solve this problem, the present invention provides a method for locating a target webpage by clustering phishing webpages, characterized in that it comprises the following steps:
S1: find the set of webpages related to a given phishing webpage;
S2: extract and model the webpage feature information of the phishing webpage and of the related webpage set;
S3: use the DBSCAN algorithm to cluster the pages by the similarity of the webpage feature information, obtaining similar webpages of the same class;
S4: locate the target webpage among the same-class similar webpages by means of the domain name similarity relation.
In the above method, step S1 comprises:
S101: extract the URL hyperlinks from the HTML source code of the phishing webpage to obtain the addresses of directly associated webpages;
S102: extract the keywords of the phishing webpage and search for them with a search engine to obtain the addresses of indirectly associated webpages;
S103: crawl the addresses of the directly and indirectly associated webpages with a web crawler to obtain the related webpage set.
In the above method, step S2 further comprises calculating the similarity distance between the phishing webpage and each webpage in the related webpage set. The similarity distance is $V_i = (L_i, R_i, US_i, TS_i, LS_i)$, where $L_i$ is the link relation similarity, $R_i$ is the hierarchical relation similarity, $US_i$ is the domain name similarity, $TS_i$ is the text similarity, and $LS_i$ is the visual similarity.
In the above method, in step S2 the webpage feature information comprises the link relation, the hierarchical relation, the domain name similarity relation, the text similarity relation and the visual similarity relation.
In the above method, step S2 further comprises calculating the modeling length of each similarity relation with respect to the phishing webpage, where the similarity relations comprise the domain name similarity relation, the text similarity relation and the visual similarity relation.
In the above method, step S3 further comprises using the modeling length to perform the cluster analysis.
Starting from a known phishing webpage, the method provided by the present invention searches for related webpages and clusters them according to multiple webpage features in order to screen and identify the related webpage set. It copes better with the deception techniques used by phishing webpages and finds the target webpage imitated by a phishing website with high accuracy and on a large scale.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of the method for locating a target webpage from a phishing webpage according to an embodiment of the present invention.
Fig. 2 is a detailed flowchart of step S1 in Fig. 1.
Detailed Description
To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
Fig. 1 is a flowchart of the method for locating a target webpage from a phishing webpage according to an embodiment of the present invention. In this embodiment, the method comprises the following steps:
S1: find the set of webpages related to a given phishing webpage.
In this step, a phishing webpage is a page whose URL and page content counterfeit a genuine site. Let the given phishing webpage be denoted P and its related webpage set be denoted $W_p$. As shown in Fig. 2, this step further comprises:
S101: extract the URL hyperlinks from the HTML source code of phishing webpage P to obtain the addresses of directly associated webpages. The URL hyperlinks are those contained in the BODY tag.
S102: extract the keywords of phishing webpage P and search for them with a search engine to obtain the addresses of indirectly associated webpages.
In this step, the keywords include those in the title, the meta tags and the body. The search engine is Google, but it is not limited to this; Baidu or another engine may also be used.
S103: crawl the addresses of the directly and indirectly associated webpages with a web crawler to obtain the related webpage set. The related webpage set of phishing webpage P is formally defined as $W_p = \{W_1, W_2, \ldots, W_n\}$, where n is the number of webpages in the set.
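Step S101 can be illustrated with a minimal sketch that collects the hyperlinks found inside the BODY tag of a page. This is only one possible reading of the step: the class name `BodyLinkParser`, the function `direct_links`, and the choices to resolve relative links and keep only http(s) addresses are assumptions made for the example and are not specified by the patent.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class BodyLinkParser(HTMLParser):
    """Collect href values of <a> tags that appear inside <body>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_body = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "a" and self.in_body:
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


def direct_links(html, base_url):
    """Return the http(s) addresses linked from the page body, without repeats."""
    parser = BodyLinkParser(base_url)
    parser.feed(html)
    seen, result = set(), []
    for url in parser.links:
        if urlparse(url).scheme in ("http", "https") and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

The returned addresses would then be handed to whatever crawler implements S103; the keyword search of S102 depends on a search engine interface and is left out of the sketch.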
S2: extract and model the webpage feature information of the phishing webpage and of the related webpage set.
In this step, each webpage is modeled by five relations and represented as a feature vector $V_p = \{f_1, f_2, f_3, f_4, f_5\}$, where $f_1, f_2, f_3, f_4, f_5$ denote the link relation, the hierarchical relation, the domain name similarity relation, the text similarity relation and the visual similarity relation, respectively. Each relation is defined as follows.
The link relation $L_{i,j}$ represents the probability that a link in the phishing webpage points to the target webpage. It is calculated as:
$L_{i,j} = NL_{i,j} / N_i$,
where $NL_{i,j}$ is the number of links in webpage i that point to any webpage of the website hosting webpage j, and $N_i$ is the total number of links contained in webpage i.
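The link relation formula can be sketched in a few lines. The function name `link_relation` is hypothetical, and treating "the website hosting webpage j" as "same hostname" is an assumption; the patent does not say how a website is identified.

```python
from urllib.parse import urlparse


def link_relation(links_in_i, url_j):
    """L_ij = NL_ij / N_i for the list of links found in page i.

    Assumption: two URLs belong to the same website when their hostnames match.
    """
    if not links_in_i:
        return 0.0
    site_j = urlparse(url_j).hostname
    nl_ij = sum(1 for link in links_in_i if urlparse(link).hostname == site_j)
    return nl_ij / len(links_in_i)
```

For example, a phishing page with four links, three of which point into the imitated bank's site, gets a link relation of 0.75 with any page of that site.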
The hierarchical relation $R_{i,j}$ is based on the rank of webpage j in the result list returned when the representative keywords of webpage i are used as a query; that is, it defines a rank-based association from webpage i to webpage j. It is calculated as:
$R_{i,j} = \dfrac{N_r - (R_s - 1)}{N_r}$,
where $N_r$ is the length of the result list returned by the query, which can be adjusted as a parameter, and $R_s$ is the rank of webpage j in the returned list. If webpage j is not in the returned list, $R_s$ is set to 0.
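A short sketch of the rank-based relation follows. The function name `rank_relation` and the default list length of 10 are assumptions. One further assumption needs flagging: the text sets the rank to 0 when page j is absent from the list, but inserting 0 literally into the formula would give a value above 1, so the sketch returns a relation of 0 for an absent page instead.

```python
def rank_relation(result_urls, url_j, n_r=10):
    """R_ij = (N_r - (R_s - 1)) / N_r, with ranks counted from 1.

    result_urls: search results for the keywords of page i, best first.
    Assumption: a page missing from the top n_r results has relation 0.
    """
    top = result_urls[:n_r]
    if url_j not in top:
        return 0.0
    r_s = top.index(url_j) + 1  # 1-based rank of page j
    return (n_r - (r_s - 1)) / n_r
```

With this reading, the first result scores 1.0 and each lower rank loses 1/N_r.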
The domain name similarity relation $US_{i,j}$ measures the similarity between two domain names (character strings). An edit distance algorithm is used to analyze the domain name similarity of suspect websites. The edit distance between two strings is the minimum number of edit operations required to transform one into the other; the larger the distance, the more the two strings differ.
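The edit distance described here is commonly computed with the Levenshtein dynamic program, sketched below. The patent only names the distance; converting it into a similarity in [0, 1] by dividing by the longer string's length, as `domain_similarity` does, is an assumption made so that the value fits a similarity vector.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def domain_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer name."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a.lower(), b.lower()) / longest
```

A lookalike such as `paypa1.com` is one substitution away from `paypal.com`, so the two names score as highly similar.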
The text similarity relation $TS_{i,j}$ measures the text similarity of webpage i to webpage j. It is calculated as follows:
S201: use the TF-IDF algorithm to extract keywords and their term frequencies, and construct term frequency vectors;
S202: use the cosine similarity algorithm to measure the distance between the two term frequency vectors. The closer the cosine value is to 1, the closer the angle is to 0 and the more similar the two vectors are.
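Steps S201 and S202 can be sketched with the standard library alone. Several details are assumptions not found in the patent: the simple tokenizer only handles ASCII words (Chinese text would need a word segmenter), the IDF uses a smoothed variant, and the names `tokenize`, `tfidf_vectors` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase ASCII words and numbers only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency of each term
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Smoothed IDF keeps terms present in every document non-zero.
            t: (c / total) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two pages with the same wording score close to 1, while pages that share no terms score 0.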
The visual similarity relation $LS_{i,j}$ measures the layout similarity of webpage i to webpage j. It is established with a perceptual hash algorithm, which comprises the following steps: reduce the image size; simplify the colors; calculate the average value; compare each pixel's gray level with the average; and calculate the hash value and compare hashes.
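The five steps listed match the widely used average-hash variant of perceptual hashing, which the sketch below follows. It assumes the page screenshot is already available as a 2D grid of grayscale values from 0 to 255; capturing the screenshot and converting it to grayscale are out of scope. The function names, the 8x8 hash size, and the conversion of Hamming distance into a similarity score are assumptions.

```python
def shrink(gray, size=8):
    """Step 1: downscale a 2D grayscale grid to size x size by block averaging."""
    h, w = len(gray), len(gray[0])
    out = []
    for r in range(size):
        r0 = r * h // size
        r1 = max((r + 1) * h // size, r0 + 1)
        row = []
        for c in range(size):
            c0 = c * w // size
            c1 = max((c + 1) * w // size, c0 + 1)
            block = [gray[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(gray, size=8):
    """Steps 2 to 5: quantize to 64 gray levels, average, threshold, pack bits."""
    pixels = [int(p) // 4 for row in shrink(gray, size) for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits


def hash_similarity(h1, h2, size=8):
    """Assumed score: 1 - Hamming distance / number of hash bits."""
    distance = bin(h1 ^ h2).count("1")
    return 1.0 - distance / (size * size)
```

Small brightness changes leave the hash unchanged, which is why the technique tolerates minor differences between a phishing page and the page it imitates.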
In this embodiment, this step also comprises calculating the similarity distance between phishing webpage P and each webpage in the related webpage set $W_p$. The similarity distance is $V_i = (L_i, R_i, US_i, TS_i, LS_i)$, where i denotes any webpage in $W_p$, i = 1, 2, ..., n, $L_i$ is the link relation similarity, $R_i$ is the hierarchical relation similarity, $US_i$ is the domain name similarity, $TS_i$ is the text similarity, and $LS_i$ is the visual similarity. For phishing webpage P, the similarity distance to itself is $V_p = \{1, 1, 1, 1, 1\}$.
In this embodiment, this step further comprises calculating the modeling length of each similarity relation with respect to phishing webpage P.
S3: use the DBSCAN algorithm to cluster the pages by the similarity of the webpage feature information, obtaining similar webpages of the same class.
In this step, the similarity distance $V_p$ of phishing webpage P is combined with the similarity distances $V_i$ to form a new set. Based on the coordinate point of the phishing webpage, and using the modeling length of each similarity relation with respect to P, the distances between the webpages in the related webpage set are obtained. Cluster analysis is then performed with the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, which can form clusters of arbitrary shape. By adjusting the key parameters Eps (scan radius) and MinPts (minimum number of points), the clustering distance between the webpages in the related set and the phishing webpage is kept within a suitable range. According to this embodiment, as the values of Eps and MinPts increase, related webpages cluster more easily and the correct recognition rate for phishing webpages increases accordingly. In this embodiment, Eps is about 0.1 to 0.2 and MinPts is 4 or more. The result of the analysis shows whether $V_p$ can be grouped into one class with some $V_i$, i.e. into the same class of similar webpages.
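The clustering step can be sketched with a small textbook DBSCAN over the five-dimensional similarity vectors. The sketch uses plain Euclidean distance between vectors, which is an assumption: the patent derives distances from a "modeling length" that it does not define precisely. The function names and the example defaults (Eps = 0.2, MinPts = 4, taken from the stated ranges) are also illustrative.

```python
import math


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx], including idx itself."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


def same_cluster_as_phish(vectors, eps=0.2, min_pts=4):
    """Indices of related pages that fall in the cluster of V_p = (1, 1, 1, 1, 1)."""
    points = [(1.0,) * 5] + [tuple(v) for v in vectors]
    labels = dbscan(points, eps, min_pts)
    if labels[0] == -1:
        return []
    return [i for i, lab in enumerate(labels[1:]) if lab == labels[0]]
```

Pages whose vectors lie near (1, 1, 1, 1, 1) join the phishing page's cluster and become candidate targets; pages far from it are labelled noise or fall into other clusters.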
S4: locate the target webpage among the same-class similar webpages by means of the domain name similarity relation.
In this step, the other webpages in the same cluster as phishing webpage P are the target webpages that the phishing webpage may be imitating.
Thus, the present invention searches for similar webpages according to multiple webpage features and models them by multiple webpage similarity relations. It draws on the edit distance algorithm, the TF-IDF algorithm and the perceptual hash algorithm to extract different types of webpage feature information, from domain name to text to image, which ensures comprehensive information. For the extracted feature information, the DBSCAN clustering algorithm integrates and clusters the multiple similarity relation models. Adjusting the parameters Eps (scan radius) and MinPts (minimum number of points) controls the suitable distance range of each similarity relation model and finds suitable clusters, so that clustering results are obtained efficiently, genuine and fake pages are analyzed comprehensively, and a higher recognition accuracy is achieved.
The above is only a preferred embodiment of the present invention, but the scope of protection is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection. The scope of protection of the present invention shall therefore be determined by the claims.

Claims (6)

1. A method for locating a target webpage by clustering phishing webpages, characterized in that it comprises the following steps:
S1: find the set of webpages related to a given phishing webpage;
S2: extract and model the webpage feature information of the phishing webpage and of the related webpage set;
S3: use the DBSCAN algorithm to cluster the pages by the similarity of the webpage feature information, obtaining similar webpages of the same class;
S4: locate the target webpage among the same-class similar webpages by means of the domain name similarity relation.
2. The method for locating a target webpage by clustering phishing webpages according to claim 1, characterized in that step S1 comprises:
S101: extract the URL hyperlinks from the HTML source code of the phishing webpage to obtain the addresses of directly associated webpages;
S102: extract the keywords of the phishing webpage and search for them with a search engine to obtain the addresses of indirectly associated webpages;
S103: crawl the addresses of the directly and indirectly associated webpages with a web crawler to obtain the related webpage set.
3. The method for locating a target webpage by clustering phishing webpages according to claim 1, characterized in that step S2 further comprises calculating the similarity distance between the phishing webpage and each webpage in the related webpage set, the similarity distance being $V_i = (L_i, R_i, US_i, TS_i, LS_i)$, where $L_i$ is the link relation similarity, $R_i$ is the hierarchical relation similarity, $US_i$ is the domain name similarity, $TS_i$ is the text similarity, and $LS_i$ is the visual similarity.
4. The method for locating a target webpage by clustering phishing webpages according to claim 3, characterized in that, in step S2, the webpage feature information comprises the link relation, the hierarchical relation, the domain name similarity relation, the text similarity relation and the visual similarity relation.
5. The method for locating a target webpage by clustering phishing webpages according to claim 4, characterized in that step S2 further comprises calculating the modeling length of each similarity relation with respect to the phishing webpage, where the similarity relations comprise the domain name similarity relation, the text similarity relation and the visual similarity relation.
6. The method for locating a target webpage by clustering phishing webpages according to claim 5, characterized in that step S3 further comprises using the modeling length to perform the cluster analysis.
CN201510003979.2A 2015-01-05 2015-01-05 Method clustering phishing page to locate target page Pending CN105824822A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510003979.2A CN105824822A (en) 2015-01-05 2015-01-05 Method clustering phishing page to locate target page

Publications (1)

Publication Number Publication Date
CN105824822A true CN105824822A (en) 2016-08-03

Family

ID=56513609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510003979.2A Pending CN105824822A (en) 2015-01-05 2015-01-05 Method clustering phishing page to locate target page

Country Status (1)

Country Link
CN (1) CN105824822A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090300768A1 (en) * 2008-05-30 2009-12-03 Balachander Krishnamurthy Method and apparatus for identifying phishing websites in network traffic using generated regular expressions
CN102096781A (en) * 2011-01-18 2011-06-15 南京邮电大学 Phishing detection method based on webpage relevance
CN102629261A (en) * 2012-03-01 2012-08-08 南京邮电大学 Method for finding landing page from phishing page
CN104143008A (en) * 2014-08-11 2014-11-12 北京奇虎科技有限公司 Method and device for detecting phishing webpage based on picture matching

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GANG LIU ET AL: ""Automatic Detection of Phishing Target from Phishing Webpage"", 《2010 20TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION》 *
田先桃: ""一种基于网页关联性特征的钓鱼检测方法"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106330861A (en) * 2016-08-09 2017-01-11 中国信息安全测评中心 Website detection method and apparatus
CN106330861B (en) * 2016-08-09 2020-03-03 中国信息安全测评中心 Website detection method and device
CN106302440B (en) * 2016-08-11 2019-12-10 国家计算机网络与信息安全管理中心 Method for acquiring suspicious phishing websites through multiple channels
CN106302438A (en) * 2016-08-11 2017-01-04 国家计算机网络与信息安全管理中心 Method for actively monitoring phishing websites through multiple channels based on behavior features
CN106302440A (en) * 2016-08-11 2017-01-04 国家计算机网络与信息安全管理中心 Method for acquiring suspicious phishing websites through multiple channels
GB2555801A (en) * 2016-11-09 2018-05-16 F Secure Corp Identifying fraudulent and malicious websites, domain and subdomain names
CN107948168A (en) * 2017-11-29 2018-04-20 四川无声信息技术有限公司 Page detection method and device
CN109067723A (en) * 2018-07-24 2018-12-21 国家计算机网络与信息安全管理中心 Method, controller and medium for tracing information of phishing website user
CN109067723B (en) * 2018-07-24 2021-03-02 国家计算机网络与信息安全管理中心 Method, controller and medium for tracing information of phishing website user
CN111669353A (en) * 2019-03-08 2020-09-15 顺丰科技有限公司 Phishing website detection method and system
CN110138758A (en) * 2019-05-05 2019-08-16 哈尔滨英赛克信息技术有限公司 Mistake based on domain name vocabulary plants domain name detection method
CN110442775A (en) * 2019-08-13 2019-11-12 杭州安恒信息技术股份有限公司 Acquisition methods, device and the electronic equipment of multiple level marketing Website publicity address
CN110825941A (en) * 2019-10-17 2020-02-21 北京天融信网络安全技术有限公司 Content management system identification method, device and storage medium
CN111444961A (en) * 2020-03-26 2020-07-24 国家计算机网络与信息安全管理中心黑龙江分中心 Method for judging internet website affiliation through clustering algorithm
CN111444961B (en) * 2020-03-26 2023-08-18 国家计算机网络与信息安全管理中心黑龙江分中心 Method for judging attribution of Internet website through clustering algorithm
CN113556308A (en) * 2020-04-23 2021-10-26 深信服科技股份有限公司 Method, system, equipment and computer storage medium for detecting flow security
CN112163145A (en) * 2020-10-09 2021-01-01 杭州安恒信息技术股份有限公司 Website retrieval method, device and equipment based on edit distance and cosine included angle
CN112163145B (en) * 2020-10-09 2024-01-30 杭州安恒信息技术股份有限公司 Website retrieval method, device and equipment based on editing distance and cosine included angle
CN113378090A (en) * 2021-04-23 2021-09-10 国家计算机网络与信息安全管理中心 Internet website similarity analysis method and device and readable storage medium
CN113378090B (en) * 2021-04-23 2022-09-06 国家计算机网络与信息安全管理中心 Internet website similarity analysis method and device and readable storage medium
CN113791656A (en) * 2021-07-29 2021-12-14 六盘水市农业科学研究院 Stepped heating circulating drying method
CN113726824A (en) * 2021-11-03 2021-11-30 成都无糖信息技术有限公司 Fraud website searching method and system based on image characteristics
CN113726824B (en) * 2021-11-03 2022-01-07 成都无糖信息技术有限公司 Fraud website searching method and system based on image characteristics

Similar Documents

Publication Publication Date Title
CN105824822A (en) Method clustering phishing page to locate target page
US9015802B1 (en) Personally identifiable information detection
US11204972B2 (en) Comprehensive search engine scoring and modeling of user relevance
CN103744981B (en) System for automatic classification analysis for website based on website content
US11550856B2 (en) Artificial intelligence for product data extraction
WO2019218514A1 (en) Method for extracting webpage target information, device, and storage medium
US10198635B2 (en) Systems and methods for associating an image with a business venue by using visually-relevant and business-aware semantics
CN103544436B (en) System and method for distinguishing phishing websites
US20180165370A1 (en) Methods and systems for object recognition
CN104899508B (en) A kind of multistage detection method for phishing site and system
WO2019041521A1 (en) Apparatus and method for extracting user keyword, and computer-readable storage medium
CN103294781B (en) A kind of method and apparatus for processing page data
CN107145496A (en) The method for being matched image with content item based on keyword
CN107705066A (en) Information input method and electronic equipment during a kind of commodity storage
Wong et al. An unsupervised framework for extracting and normalizing product attributes from multiple web sites
CN107346326A (en) For generating the method and system of neural network model
CN107341183A (en) A kind of Website classification method based on darknet website comprehensive characteristics
CN106204156A (en) A kind of advertisement placement method for network forum and device
CN102129470A (en) Tag clustering method and system
CN104077396A (en) Method and device for detecting phishing website
CN102170446A (en) Fishing webpage detection method based on spatial layout and visual features
CN108038173B (en) Webpage classification method and system and webpage classification equipment
CN102446255A (en) Method and device for detecting page tamper
CN102629261A (en) Method for finding landing page from phishing page
Deng et al. Enhanced models for expertise retrieval using community-aware strategies

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160803