CN102170446A - Phishing webpage detection method based on spatial layout and visual features - Google Patents

Phishing webpage detection method based on spatial layout and visual features

Info

Publication number
CN102170446A
Authority
CN
China
Prior art keywords
webpage
fishing
spatial
feature
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011101124281A
Other languages
Chinese (zh)
Inventor
张卫丰
曾兵
张迎周
周国强
许碧欢
陆柳敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN2011101124281A
Publication of CN102170446A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The phishing webpage detection method based on spatial layout and visual features is a design that starts from the visual layout features of a webpage and combines them with a spatial database and a comparison of picture-feature similarity. It addresses the problem of detecting phishing webpages quickly from the standpoint of page layout and visual similarity. The system is composed of six modules. The uppermost layer is a user interface module, which acquires the user input and feeds the result back to the user. The intermediate layer is a control module, which dispatches all the function modules to complete the phishing webpage detection. The core of the system consists of four function modules: the layout feature extraction module, the spatial database module, the machine learning matching module, and the picture feature extraction and comparison module. Extensive experimental data show that the method yields a phishing webpage detection system that is both fast and precise: data processing capacity is greatly increased and detection time is shortened while a high accuracy rate is maintained.

Description

A phishing webpage detection method based on spatial layout and visual features
Technical field
The present invention relates to a method for detecting phishing webpages. It matches and identifies phishing webpages mainly from the standpoint of the visual layout and visual similarity of webpages, and belongs to the field of information security.
Background art
Phishing websites are a form of online fraud that has become extremely rampant as network access has spread and online transactions have grown. A phishing website is a fraudulent site built by an offender. It is usually almost identical to a bank website or another well-known website, and thereby lures users into submitting sensitive information (such as user names, passwords, account numbers or credit card details) on the phishing site [Zhang2007].
Fig. 1 shows the architecture of a phishing website. The most typical phishing attack proceeds as follows: the user is first lured to a carefully designed phishing website that closely resembles the website of the target organization; the personal sensitive information that the user enters on this site, such as account numbers and bank passwords, is then captured. The attack usually does not arouse the victim's suspicion. This personal information is highly attractive to the operators of phishing websites: with the stolen information they can impersonate the victim and carry out fraudulent financial transactions for great economic gain, so that victims suffer heavy economic losses. Moreover, the stolen personal information may also be used for other illegal activities. How to identify phishing websites, and how to guarantee the confidentiality and integrity of information transmitted to websites, is therefore increasingly important and necessary.
Most users are deceived, in many cases, because a phishing webpage is always highly similar to the genuine webpage. Many methods of computing similarity have been proposed; see [Liu2005], [Fu2006], [Chen2009], [Afroz2009].
As noted in [Dhamija2006], [Jackson2007] and [Afroz2009], people generally concentrate on their main purpose in browsing a page and ignore security warnings, so the success rate of visual deception is very high. This led to the idea of detecting phishing from the visual standpoint. Vision-based detection is divided into detection based on HTML text, detection based on layout [Liu2005], [Afroz2009], and detection based on images [Chen2009]. Because HTML is flexible and web page elements are dynamic and rich, a counterfeiter can easily produce a page that looks the same but has a different HTML structure, in which case HTML-based matching fails. Similarity detection based on layout features and image features computes the similarity of webpages in accordance with theories of human vision, and is a general detection method. For example, Fu et al. proposed in 2006 a matching algorithm based on the pixel-level EMD (Earth Mover's Distance) [Fu2006], which discovers phishing webpages from the standpoint of visual similarity at the pixel level. The experimental results show that it performs significantly better than detection based on HTML content, but it also has limitations: the algorithm considers only the colors in the page image and their distribution, and does not consider the positional relations between the different parts of the page. According to Gestalt visual theory, relative position is dominant in human vision, particularly the relative positions of multiple objects, and a change in relative position necessarily causes a visual difference. Because the algorithm ignores relative position, its similarity detection may fail, so the method can only detect webpages that are visually similar to the genuine webpage. [Cao2009] addressed the problem of relative position on the basis of Fu's work by first dividing the webpage into blocks and then computing similarity with the EMD algorithm.
A spatial database is a spatial query technology widely used in geographic information systems, and uses the R-tree as its data structure. Because the analysis of page layout features needs to find all visually close rectangles, the features of webpages are stored here in a spatial database, so that all layout features in the feature library that are visually similar and close in position can be retrieved quickly.
[Zhang2007] Y. Zhang, J. Hong, and L. Cranor. Cantina: A content-based approach to detecting phishing websites. WWW, 2007.
[Fu2006] Anthony Y. Fu, Wenyin Liu, Xiaotie Deng. Detecting Phishing Web Pages with Visual Similarity Assessment based on Earth Mover's Distance (EMD). IEEE Transactions on Dependable and Secure Computing, 2006, 3(4), pages 301-311.
[Dong2010] X. Dong, J. A. Clark, J. L. Jacob. Defending the weakest link: phishing websites detection by analysing user behaviours. Springer Science+Business Media, LLC, 2010.
[Liu2010] W. Y. Liu, N. Fang, X. J. Quan, B. Qiu, G. Liu. Discovering phishing target based on semantic link network. Future Generation Comp. Syst., 2010: 381-388.
[Cao2009] Jiuxin Cao, Bo Mao, Junzhou Luo, and Bo Liu. A Phishing Web Pages Detection Algorithm Based on Nested Structure of Earth Mover's Distance (Nested-EMD). Chinese Journal of Computers, 2009, (05): 922-929. (In Chinese.)
Summary of the invention
Technical problem: the purpose of the present invention is to provide a phishing webpage detection method based on spatial layout and visual features. In the past, phishing webpages were identified mainly by hand. Existing computer-based detection techniques mainly match the page under test element by element, and the matching speed often cannot meet the requirements of practical use. The present invention performs feature extraction and feature matching from the standpoint of page layout, which greatly increases the speed of page matching, and combines this with a similarity comparison of the picture visual features of corresponding blocks, thereby ensuring high accuracy and a low misjudgment rate while increasing detection speed.
Technical scheme: the present invention uses a browser rendering engine to extract visual layout features from a specified suspicious webpage, then uses a spatial database index to search for layout features that are close in spatial position and visually similar. Statistical analysis finds the most similar legitimate webpage in the sample space, and the visual feature similarity of corresponding blocks of the two pages is compared, thereby achieving phishing webpage detection.
The method is made up of 5 modules. The top layer is a user interface module, which is mainly responsible for obtaining the user's input and feeding the result back to the user. The middle layer is a control module, which is responsible for scheduling all the function modules to complete phishing website detection. Below it are the 3 core function modules, namely the layout feature extraction module, the spatial database module and the machine learning matching module. The layout feature extraction module extracts block-level page layout features; in the training stage these layout features are delivered to the spatial database module to be indexed, and in the phishing webpage detection stage the layout features extracted by this module are delivered to the spatial database module to query for similar features. The spatial database module builds a spatial index over the feature data in the training stage and performs fast searches for similar features in the detection stage; the similar features returned by a query are passed to the machine learning matching module for identification. In the system training stage, the machine learning matching module receives the feature data sent by the feature extraction module, trains on it, and optimizes the webpage similarity threshold parameter; in the detection stage it receives the feature data sent by the feature extraction module together with the similar features from the spatial database, computes the similarity between webpages, and finally judges phishing webpages according to the webpage similarity threshold.
The browser kernel parses the webpage source code and extracts the spatial layout features, which serve as the basis for phishing webpage detection; a spatial database is used to increase the speed of queries against the feature library during detection. The specific implementation steps are:
Step 1) Training-stage data preparation: collect at least 100 legitimate website pages that are likely to be imitated by phishing webpages, extract their layout features and organize them into sample data; insert the layout features of all the sample data into the spatial database; collect at least 100 phishing website pages and 100 ordinary webpages, extract their features and organize them into test data.
A layout feature is made up of the following four numerical attributes:
● The height of the Document Object Model node
The height is the number of pixels that the web page element occupies in the vertical direction after the browser engine has parsed the page's HTML source code, CSS source code and JavaScript source code.
● The width of the Document Object Model node
The width is the number of pixels that the web page element occupies in the horizontal direction after the browser engine has parsed the page's HTML source code, CSS source code and JavaScript source code.
● The X coordinate of the Document Object Model node
Let the pixel at the upper-left corner of the Web browser display area have coordinates (0, 0). The X coordinate is the horizontal distance from (0, 0) of the top-left pixel of the web page element after the browser engine has parsed the page's HTML source code, CSS source code and JavaScript source code, where a distance of 1 means a difference of one pixel.
● The Y coordinate of the Document Object Model node
Let the pixel at the upper-left corner of the Web browser display area have coordinates (0, 0). The Y coordinate is the vertical distance from (0, 0) of the top-left pixel of the web page element after the browser engine has parsed the page's HTML source code, CSS source code and JavaScript source code, where a distance of 1 means a difference of one pixel.
Step 2) Labeling of the test data set: label all ordinary webpages in the test data as "0", denoting non-phishing webpages; label all phishing webpages in the test data as "1", denoting phishing webpages.
Step 3) For the features of every test webpage, search the spatial database for similar features, find by statistics the webpage in the library most similar to each test webpage, and compute their similarity as the similarity between the test webpage and the library.
Step 4) Send the labels of all test webpages and the similarities between the test webpages and the library to the machine learning matching module, and traverse all possible similarity thresholds to find the value T that maximizes the difference between the number of phishing webpages whose similarity is greater than T and the number of phishing webpages whose similarity is less than T; T is used as the phishing webpage layout similarity threshold.
Second stage: feature extraction and comparison of the picture content of corresponding blocks
Step 5) Extract picture features from the corresponding similar blocks of the phishing webpage and the ordinary webpage respectively, obtaining the feature vectors of the corresponding pictures.
Step 6) Process the picture feature vectors and use a suitable algorithm to compute the similarity between corresponding pictures, checking whether the result exceeds the set visual feature similarity threshold P.
Step 7) Detection of a suspected phishing webpage: collect the layout features of the suspicious webpage; use these features to search the spatial database for webpages with similar features; compute the similarity between the image and layout features of the suspicious webpage and those of the webpages retrieved from the feature library, and check whether the result exceeds the set visual feature similarity threshold. If it is greater than the threshold, the page is judged to be a phishing webpage; otherwise it is an ordinary webpage.
Beneficial effects: because it analyses webpages on the basis of the spatial topology of their layout, the present invention has the following particular advantages and useful results:
High accuracy: the main evaluation metrics in machine learning are precision and recall. In phishing website detection, precision is the proportion of pages judged by the machine to be phishing webpages that really are phishing webpages, and recall is the proportion of all phishing webpages that are identified by the machine. Clearly, the higher the precision and recall, the better the result. Experiments show that, after training with the machine learning model, the phishing webpage detection proposed by the present invention achieves a precision of 97.9% and a recall of 95%, which is comparable to the best current automatic phishing webpage detection techniques.
High-speed detection: the greatest advantage of the present invention is that it greatly shortens detection time. Because a spatial database is used and its queries are optimized, the tree structure can be fully exploited to reduce time complexity. In addition, picture features are extracted and compared only between blocks considered to correspond spatially, which greatly reduces the computation required for picture comparison and increases detection speed.
Wide application: because the present invention in effect proposes a scheme for computing the similarity of page layout and picture appearance, it is widely applicable to problems of webpage similarity.
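The precision and recall defined under "High accuracy" can be computed from labeled predictions as in the following minimal sketch. The function name `precision_recall` and the toy label lists are illustrative assumptions, not data from the patent's experiments.

```python
def precision_recall(labels, predictions):
    """Compute precision and recall for the phishing class.

    labels, predictions: sequences of 0 (ordinary) or 1 (phishing).
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # phishing, flagged
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # ordinary, flagged
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # phishing, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: 4 phishing pages and 4 ordinary pages.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall = precision_recall(labels, predictions)
print(precision, recall)  # 0.75 0.75
```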
Description of drawings
Fig. 1 is a schematic diagram of the architecture of a phishing website.
Fig. 2 is a schematic diagram of the phishing webpage detection process.
Embodiment
The technical solution of the present invention is mainly divided into four parts:
1. Layout feature extraction part
Layout features here are the rectangular boundaries of all visual information on a webpage, such as the rectangular boundary of a passage of text, the rectangular boundary of a picture, or the rectangular boundary of a group of visually close elements. The main job of the layout feature extraction module is to extract all suitably sized rectangular blocks from the webpage, using the browser kernel together with a Document Object Model tree analysis tool.
The function of this module is therefore to traverse the Document Object Model tree of a webpage, analyze the page's HTML, CSS and JavaScript source code with the layout rendering engine in the browser kernel, obtain the display position and size of the tag represented by each node, and record this information in a specified format to form the page layout feature information.
In the feature library collection stage, this module passes the collected layout feature data of legitimate pages that may be imitated to the spatial database module for storage; in the phishing webpage analysis stage, it passes the layout feature data of the suspect page to the page layout analysis module for comprehensive analysis.
2. Spatial database part
The spatial database uses the R-tree as its data structure and is a spatial query technology widely used in geographic information systems. Because the analysis of page layout features needs to find all visually close rectangles, the features of webpages are stored here in a spatial database, so that all layout features in the feature library that are visually similar and close in position can be retrieved quickly.
The R-tree data structure used by the spatial database is briefly introduced below:
The R-tree is a tree data structure similar to the B-tree, but it is mainly used for storing and retrieving spatial data; for example, a spatial database can be used to "find all gas stations within two kilometers of the current location". The data structure partitions space by hierarchical aggregation, and the resulting regions may overlap. Each region is represented by a minimum bounding rectangle (MBR). Every node in the R-tree has a number of entries (up to a specified upper limit). Each entry of a non-leaf node stores two kinds of information: a reference to the child node corresponding to the entry, and the MBR of that child node. With this tree structure and the idea of the minimum bounding rectangle, a spatial database can quickly find geographically or visually close data within a large data set.
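The way an R-tree prunes a search with minimum bounding rectangles can be illustrated with a deliberately small sketch. It only shows the query side: the tree is assembled by hand, and node splitting, the entry limit and rebalancing are omitted. The names `Rect`, `Node`, `make_leaf`, `make_inner` and `search` are assumptions made for this illustration and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle given by two corners; also used as an MBR."""
    x1: float
    y1: float
    x2: float
    y2: float

    def intersects(self, other: "Rect") -> bool:
        return (self.x1 <= other.x2 and other.x1 <= self.x2 and
                self.y1 <= other.y2 and other.y1 <= self.y2)

    def union(self, other: "Rect") -> "Rect":
        return Rect(min(self.x1, other.x1), min(self.y1, other.y1),
                    max(self.x2, other.x2), max(self.y2, other.y2))


@dataclass
class Node:
    mbr: Rect                                      # bounds everything below
    children: list = field(default_factory=list)   # child nodes (non-leaf)
    items: list = field(default_factory=list)      # (Rect, name) pairs (leaf)


def bounding(rects):
    """Smallest rectangle that encloses all the given rectangles."""
    mbr = rects[0]
    for r in rects[1:]:
        mbr = mbr.union(r)
    return mbr


def make_leaf(items):
    return Node(bounding([r for r, _ in items]), items=list(items))


def make_inner(children):
    return Node(bounding([c.mbr for c in children]), children=list(children))


def search(node, query):
    """Return names of stored rectangles that intersect the query."""
    if not node.mbr.intersects(query):
        return []  # the whole subtree is skipped: this is the pruning step
    hits = [name for rect, name in node.items if rect.intersects(query)]
    for child in node.children:
        hits.extend(search(child, query))
    return hits


left = make_leaf([(Rect(0, 0, 10, 10), "A"), (Rect(20, 0, 30, 10), "B")])
right = make_leaf([(Rect(100, 100, 120, 120), "C")])
root = make_inner([left, right])
found = search(root, Rect(5, 5, 25, 8))
print(found)  # ['A', 'B']
```

The query never visits the leaf holding "C", because that leaf's MBR does not intersect the query rectangle.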
This property of the spatial database is used here to index all the layout information in the feature library effectively. Once the layout information of the webpage under test has been obtained, a fast search returns all visually close layout information.
This module builds the spatial data index when the feature library is built, and performs spatial data queries in the phishing webpage detection stage.
3. Machine learning matching part
Its core task is to compile statistics over the layout information of the page under test and the feature blocks in the feature library that are similar to those of the page under test and, according to the specified webpage similarity algorithm, to find the n webpages with the highest similarity. If the similarity exceeds a certain threshold, the page under test is regarded as a webpage that requires further picture feature extraction and comparison; if it is below the threshold, the page is regarded as a normal webpage. The threshold is determined by training our machine learning algorithm on the labeled data of the training stage.
4. Picture feature extraction and comparison part
Its core task is to extract and compare the picture features of corresponding blocks for the webpages that the machine learning matching part has found to be similar in spatial layout. The module uses a suitable feature algorithm to extract picture feature vectors, then computes the similarity of the picture feature vectors of the suspicious webpage and the ordinary webpage, and checks whether the result exceeds the visual similarity threshold, thereby judging whether the suspicious webpage is a phishing webpage.
● The steps of the phishing webpage detection method based on spatial-topology page layout and visual features are as follows.
They can be divided into three main parts:
1. Training of the machine learning module
Step 1) Collect at least 100 legitimate website pages that are likely to be imitated by phishing webpages, extract their layout features and organize them into sample data.
Step 2) Insert the layout features of all the sample data into the spatial database.
Step 3) Collect at least 100 phishing website pages and 100 ordinary webpages, extract their features and organize them into test data; label phishing websites "1" and ordinary websites "0".
Step 4) For the features of every test webpage, search the spatial database for similar features, use the Sim formula to find the webpage in the library most similar to each test webpage, and compute their similarity as the similarity between the test webpage and the library.
Step 5) Send the labels of all test webpages and the similarities between the test webpages and the library to the machine learning matching module, and use the data training algorithm of the machine learning matching part to compute the similarity threshold for the spatial layout of phishing webpages.
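Threshold training over labeled similarities (steps 3 to 5 above) can be sketched as a scan over candidate thresholds. The selection criterion used here, the number of correctly classified pages, is an assumption chosen for illustration and stands in for the patent's own training criterion; the function name `train_threshold` and the sample values are likewise hypothetical.

```python
def train_threshold(similarities, labels):
    """Pick the threshold T that classifies the most labeled pages correctly.

    similarities: similarity of each test page to its closest library page.
    labels: 1 for phishing, 0 for ordinary.
    A page is predicted phishing when its similarity is >= T (assumption).
    """
    best_t, best_correct = None, -1
    for t in sorted(set(similarities)):  # every observed value is a candidate
        correct = sum(1 for s, y in zip(similarities, labels)
                      if (s >= t) == (y == 1))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


# Hypothetical labeled data: phishing pages resemble a library page closely.
sims = [0.95, 0.90, 0.80, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 0, 0]
threshold = train_threshold(sims, labels)
print(threshold)  # 0.8
```

The same routine could be reused for the picture similarity threshold, since the text derives both thresholds by the same method.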
2. Extraction of visual features and similarity calculation
Step 1) According to the result of the page layout similarity detection, extract picture features (for example, local invariant features) from the corresponding blocks of the two webpages judged most similar, obtaining the feature vectors of the corresponding pictures.
Step 2) Compute the similarity of the extracted picture feature vectors with a suitable algorithm (for example, the Euclidean distance or the Mahalanobis distance between the feature vectors), obtaining the similarity comparison result.
Step 3) Using the same method as for the spatial layout similarity threshold of phishing websites, compute the picture similarity threshold for phishing websites.
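As a small illustration of step 2, the Euclidean distance between two picture feature vectors can be computed as below. The mapping from distance to a similarity score, 1 / (1 + d), is an assumption introduced here so that larger values mean more similar pictures; the patent does not specify one, and the vectors are made-up examples.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_to_similarity(d):
    """Map a distance in [0, inf) to a similarity in (0, 1] (assumed form)."""
    return 1.0 / (1.0 + d)


# Hypothetical feature vectors of two corresponding picture blocks.
v1 = [0.0, 3.0, 1.0]
v2 = [4.0, 0.0, 1.0]
dist = euclidean_distance(v1, v2)
sim_score = distance_to_similarity(dist)
print(dist)  # 5.0
```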
3. Phishing website detection
Step 1) Collect the layout features of the suspicious webpage.
Step 2) Use the features of the suspicious webpage to search the spatial database for similar features, use the Sim formula to find the webpage in the library most similar to the suspicious webpage, and compute their similarity as the similarity between the suspicious webpage and the library.
Step 3) Send the similarity between the suspicious webpage and the library to the trained machine learning matching module for prediction. The prediction determines whether picture feature similarity detection is needed: if the spatial layouts are judged inconsistent, the suspicious webpage is regarded as an ordinary webpage; if the spatial layouts are judged consistent, the picture similarity of corresponding blocks is compared.
Step 4) For a webpage whose spatial layout was judged consistent in step 3, extract the picture features of the corresponding blocks to obtain the picture feature vectors, then compute the similarity of the feature vectors with a suitable algorithm. If the result exceeds the set threshold, the page is regarded as a phishing webpage; otherwise it is regarded as an ordinary webpage.
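The two-stage decision in steps 3 and 4 can be summarized by the following sketch. The picture comparison is passed in as a callable so that it runs only when the layout stage judges the layouts consistent, which is the source of the speed-up the text describes. The function `detect`, its parameters and the threshold values are illustrative assumptions.

```python
def detect(layout_similarity, image_similarity_fn,
           layout_threshold, image_threshold):
    """Two-stage decision: layout first, picture comparison only if needed."""
    if layout_similarity <= layout_threshold:
        return "ordinary"            # layouts inconsistent: stop early
    if image_similarity_fn() > image_threshold:
        return "phishing"            # layout and pictures both match
    return "ordinary"


# Hypothetical stand-in for the costly picture comparison; it records calls.
calls = []


def fake_image_similarity():
    calls.append(1)
    return 0.9


first = detect(0.2, fake_image_similarity,
               layout_threshold=0.5, image_threshold=0.8)
calls_after_first = len(calls)       # picture stage was skipped
second = detect(0.7, fake_image_similarity,
                layout_threshold=0.5, image_threshold=0.8)
print(first, second)  # ordinary phishing
```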
1. Relations between the system modules
The system is made up of 6 modules. The top layer is a user interface module, which is mainly responsible for obtaining the user's input and feeding the result back to the user. The middle layer is a control module, which is responsible for scheduling all the function modules to complete phishing website detection. There are 4 function modules:
The layout feature extraction module extracts page layout features according to the feature extraction algorithm. In the training stage these layout features are delivered to the spatial database module to be indexed; in the phishing webpage detection stage the layout features extracted by this module are delivered to the spatial database module to query for similar features.
The spatial database module improves on earlier spatial database modules; for the improvements, see the spatial database index and search algorithm below. Its function is to build a spatial index over the training data features and, in the phishing webpage detection stage, to perform fast searches for similar features; the similar features returned by a query are passed to the machine learning matching module for identification.
The machine learning matching module, in the system training stage, receives the feature data sent by the feature extraction module, trains on it, and optimizes the webpage similarity threshold parameter. In the phishing webpage detection stage, it receives the feature data sent by the feature extraction module together with the similar features from the spatial database, and computes the similarity between webpages.
The picture feature extraction and comparison module receives the results sent by the machine learning matching module, then performs picture feature extraction and picture feature similarity computation on the corresponding blocks of the webpages, finally determines the similarity between the suspicious webpage and the ordinary webpage, and thereby decides whether the page is a phishing webpage.
2. Implementation of the system modules
A) Layout feature extraction module
The layout feature extraction module calls the browser layout engine and a Document Object Model source code analysis tool to analyze the HTML document at the specified URL together with its associated picture files, CSS files and JavaScript files, and finally extracts the layout features. A layout feature is made up of the following four numerical attributes:
● The height of the Document Object Model node
The height is the number of pixels that the web page element occupies in the vertical direction after the browser engine has parsed the page's HTML source code, CSS source code and JavaScript source code.
● The width of the Document Object Model node
The width is the number of pixels that the web page element occupies in the horizontal direction after the browser engine has parsed the page's HTML source code, CSS source code and JavaScript source code.
● The X coordinate of the Document Object Model node
Let the pixel at the upper-left corner of the Web browser display area have coordinates (0, 0). The X coordinate is the horizontal distance from (0, 0) of the top-left pixel of the web page element after the browser engine has parsed the page's HTML source code, CSS source code and JavaScript source code, where a distance of 1 means a difference of one pixel.
● The Y coordinate of the Document Object Model node
Let the pixel at the upper-left corner of the Web browser display area have coordinates (0, 0). The Y coordinate is the vertical distance from (0, 0) of the top-left pixel of the web page element after the browser engine has parsed the page's HTML source code, CSS source code and JavaScript source code, where a distance of 1 means a difference of one pixel.
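The four attributes above map naturally onto a small record type. The sketch below is one possible representation; the class name `LayoutFeature` and the helper members `center` and `as_rect` are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutFeature:
    """Rendered bounding box of one DOM node, in pixels.

    (x, y) is the top-left corner relative to the top-left corner (0, 0)
    of the browser display area.
    """
    x: int
    y: int
    width: int
    height: int

    @property
    def center(self):
        """Center point of the block, used for center-distance comparisons."""
        return (self.x + self.width / 2, self.y + self.height / 2)

    def as_rect(self):
        """Corner form (x1, y1, x2, y2), convenient for spatial indexing."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)


# Hypothetical logo block: 200 x 60 pixels at position (20, 10).
logo = LayoutFeature(x=20, y=10, width=200, height=60)
print(logo.center)     # (120.0, 40.0)
print(logo.as_rect())  # (20, 10, 220, 70)
```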
With an understanding of how Web browsers work, select a mainstream browser kernel, learn its API, and learn how to call a Document Object Model source code analysis tool to analyze the HTML source code of a webpage. The concrete implementation steps of this module are as follows:
Step 1) Use the selected Web browser to parse the webpage from which features are to be extracted.
Step 2) Obtain the HTML source code of the page and analyze it with the Document Object Model analysis tool.
Step 3) Obtain the layout features of all Document Object Model nodes according to the algorithm.
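Step 3 amounts to a traversal of the rendered DOM tree. The sketch below assumes the browser engine has already computed each node's bounding box and that the tree is available as nested dictionaries with hypothetical keys `tag`, `box` and `children`; a real implementation would read these values through the API of the chosen browser kernel. The minimum-size filter reflects the requirement to keep only suitably sized blocks, and its limits are assumed values.

```python
def extract_layout_features(node, min_width=20, min_height=10):
    """Depth-first traversal that collects sufficiently large node boxes.

    node: dict with "tag", "box" = (x, y, width, height) and optional
    "children" (hypothetical structure standing in for a rendered DOM).
    """
    features = []
    x, y, width, height = node["box"]
    if width >= min_width and height >= min_height:
        features.append({"tag": node["tag"], "x": x, "y": y,
                         "width": width, "height": height})
    for child in node.get("children", []):
        features.extend(extract_layout_features(child, min_width, min_height))
    return features


# Hypothetical rendered page: a logo, a tiny tracking element and a login form.
page = {
    "tag": "body", "box": (0, 0, 1024, 768),
    "children": [
        {"tag": "img", "box": (20, 10, 200, 60)},
        {"tag": "span", "box": (0, 0, 5, 5)},
        {"tag": "form", "box": (300, 200, 400, 250),
         "children": [{"tag": "input", "box": (320, 220, 360, 30)}]},
    ],
}
features = extract_layout_features(page)
print([f["tag"] for f in features])  # ['body', 'img', 'form', 'input']
```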
B) Spatial database module
This module uses a spatial database to index the data and, on that basis, makes algorithmic improvements to the traditional spatial database so that it is better suited to phishing webpage detection queries. The concrete implementation steps are as follows:
Step 1) Design and implement a spatial database module with the R-tree as its data structure, which can insert, modify, delete and query shapes of arbitrary form. Queries should support general spatial queries, for example: given a rectangle, find all shapes in the database contained in that rectangle, or find all rectangles whose distance from the center of that rectangle is less than 15.
Step 2) Improve the query procedure of the spatial database according to the corresponding algorithm.
Step 3) Sort all query results in descending order of center distance from the query feature.
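The center-distance query of step 1 and the ordering of step 3 can be sketched as follows. For brevity the sketch scans a plain list, which stands in for the R-tree index; the result set is the same, only the lookup cost differs. The function names and the sample rectangles, given as (x, y, width, height), are assumptions.

```python
import math


def center(rect):
    """Center point of a rectangle given as (x, y, width, height)."""
    x, y, width, height = rect
    return (x + width / 2, y + height / 2)


def query_by_center_distance(database, query, max_distance=15):
    """Return (distance, rect) pairs with center distance below the limit,
    sorted in descending order of distance as described in step 3."""
    qx, qy = center(query)
    hits = []
    for rect in database:  # linear scan standing in for an R-tree lookup
        cx, cy = center(rect)
        distance = math.hypot(cx - qx, cy - qy)
        if distance < max_distance:
            hits.append((distance, rect))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits


# Hypothetical feature library of four blocks.
database = [
    (100, 100, 50, 20),   # same center as the query
    (103, 104, 50, 20),   # center offset (3, 4)
    (112, 100, 50, 20),   # center offset (12, 0)
    (200, 100, 50, 20),   # center offset (100, 0): too far
]
results = query_by_center_distance(database, (100, 100, 50, 20))
print([d for d, _ in results])  # [12.0, 5.0, 0.0]
```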
C) Machine learning matching module
The features of the webpage A under test, together with all the features retrieved from the spatial database that are similar to the features of A, are analyzed together to find the webpage B in the feature library that is most similar to A. The similarity of A and B is then computed; if it exceeds the preset threshold, A is considered a phishing webpage, otherwise A is considered not to be a phishing webpage.
This module computes the similarity between two pages from the layout features extracted earlier. This first requires the notion of corresponding feature blocks. Suppose webpages A and B contain feature blocks A-1 and B-1 respectively. If the distance between the centers of A-1 and B-1 is less than a predetermined center distance threshold D, the ratio of the widths of A-1 and B-1 is within a predetermined range, and the ratio of their heights is also within a predetermined range, then block A-1 is considered to correspond to block B-1. Based on experimental results, D is taken as 50 pixels, the width ratio range as [0.8, 1.2] and the height ratio range as [0.8, 1.2]. The similarity is calculated as follows:
$$\mathrm{Sim}(n_q, n_r, n_{cor}) = \left(1 - \frac{|n_q - n_r|}{\max(n_q, n_r)}\right) \cdot \frac{n_{cor}^2}{n_q \cdot n_r}$$
Here $n_q$ denotes the total number of feature blocks in page A, $n_r$ the total number of feature blocks in page B, and $n_{cor}$ the total number of corresponding feature blocks between the two pages. The computed Sim is the similarity value between the two webpages.
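The corresponding-block test and the similarity formula can be sketched as follows, using D = 50 pixels and the ratio range [0.8, 1.2] given above. The greedy one-to-one matching in `count_corresponding` is an assumption, since the text does not say how a block with several candidates is paired; blocks are assumed to be (x, y, width, height) tuples with positive width and height.

```python
import math

D = 50.0                  # center-distance threshold in pixels (from the text)
RATIO_RANGE = (0.8, 1.2)  # allowed width and height ratio range (from the text)


def center(block):
    x, y, w, h = block
    return (x + w / 2, y + h / 2)


def corresponds(a, b):
    """True if blocks a and b satisfy the center-distance and size-ratio conditions."""
    (ax, ay), (bx, by) = center(a), center(b)
    if math.hypot(ax - bx, ay - by) >= D:
        return False
    lo, hi = RATIO_RANGE
    return lo <= a[2] / b[2] <= hi and lo <= a[3] / b[3] <= hi


def count_corresponding(blocks_a, blocks_b):
    """Count corresponding blocks, pairing each block of B with at most one block of A.

    Assumption: greedy first-match pairing; the source does not specify tie handling.
    """
    used = set()
    n_cor = 0
    for a in blocks_a:
        for j, b in enumerate(blocks_b):
            if j not in used and corresponds(a, b):
                used.add(j)
                n_cor += 1
                break
    return n_cor


def sim(n_q, n_r, n_cor):
    """Sim(n_q, n_r, n_cor) = (1 - |n_q - n_r| / max(n_q, n_r)) * n_cor^2 / (n_q * n_r)."""
    if n_q == 0 or n_r == 0:
        return 0.0
    return (1 - abs(n_q - n_r) / max(n_q, n_r)) * n_cor ** 2 / (n_q * n_r)
```

Two pages with the same number of blocks, all of which correspond, reach the maximum value of 1; the first factor penalizes a mismatch in block counts and the second penalizes unmatched blocks.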
D) Picture feature extraction and similarity calculation
For the webpages found in C) to have a consistent layout, extract picture features from the corresponding blocks to obtain a feature vector for each picture, then use a suitable algorithm to compute the similarity between the vectors and compare the result with the preset threshold. This finally determines whether the suspicious webpage is a phishing webpage.
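The text does not name the picture features or the vector similarity algorithm. Purely as an illustration, the sketch below assumes a normalized gray-level histogram as the feature vector and cosine similarity as the comparison; both choices, and the function names, are assumptions rather than the method of the patent.

```python
import math


def gray_histogram(pixels, bins=8):
    """Normalized histogram of gray values in the range 0..255 (assumed feature vector)."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels) or 1
    return [c / total for c in hist]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

The resulting value would then be compared with the preset visual similarity threshold, in the same way that the layout similarity is compared with its threshold in C).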

Claims (1)

1. A phishing webpage detection method based on spatial layout and visual features, characterized in that the method consists of 5 modules: the top layer is the user interface module, mainly responsible for obtaining user input and feeding the result back to the user; the middle layer is the control module, responsible for scheduling all functional modules to complete phishing website detection; below these are the 3 core functional modules, namely the layout feature extraction module, the spatial database module, and the machine learning matching module; the layout feature extraction module extracts block-level page layout features, which in the training stage are passed to the spatial database module for index building or querying, and which in the phishing webpage detection stage are sent to the spatial database module to query similar features; the spatial database module builds a spatial index over the feature data during training and performs fast search for similar features in the phishing webpage detection stage, and the similar features returned by the query are handed to the machine learning matching module for recognition; in the system training stage the machine learning matching module receives the feature data sent by the feature extraction module, trains on it, and optimizes the webpage similarity threshold parameter; in the phishing webpage detection stage it receives the feature data sent by the feature extraction module together with the similar features from the spatial database, computes the similarity between webpages, and finally judges phishing webpages according to the webpage similarity threshold;
The webpage source code is parsed by the browser kernel and the spatial layout features are extracted as the basis for phishing webpage detection, and a spatial database is used to speed up queries against the feature database during detection; the specific implementation steps are:
Step 1) Training-stage data preparation: collect at least 100 legitimate website pages that may be imitated by phishing webpages, extract their layout features, and organize them into sample data; insert the layout features of all sample data into the spatial database; collect at least 100 phishing website pages and 100 ordinary webpages, extract their features, and organize them into test data;
Layout features consist of the following four numerical attributes:
● The height of the DOM node
The height here is the number of pixels the web page element occupies in the vertical direction after the browser engine has parsed the HTML source code, CSS source code, and JavaScript source code of the webpage,
● The width of the DOM node
The width here is the number of pixels the web page element occupies in the horizontal direction after the browser engine has parsed the HTML source code, CSS source code, and JavaScript source code of the webpage,
● The X coordinate of the DOM node
Let the pixel coordinate of the upper-left corner of the Web browser display area be (0, 0). The X coordinate here is the horizontal distance, relative to (0, 0), of the top-left pixel of the web page element after the browser engine has parsed the HTML source code, CSS source code, and JavaScript source code of the webpage, where a distance of 1 means a difference of one pixel,
● The Y coordinate of the DOM node
Let the pixel coordinate of the upper-left corner of the Web browser display area be (0, 0). The Y coordinate here is the vertical distance, relative to (0, 0), of the top-left pixel of the web page element after the browser engine has parsed the HTML source code, CSS source code, and JavaScript source code of the webpage, where a distance of 1 means a difference of one pixel,
Step 2) Labeling of the test data set: label all ordinary webpages in the test data "0", denoting non-phishing webpages; then label all phishing webpages in the test data "1", denoting phishing webpages;
Step 3) For the features of every test webpage, search the spatial database for similar features, determine by counting the webpage in the database most similar to each test webpage, and compute their similarity as the similarity between the test webpage and the database;
Step 4) Send the labels of all test webpages and the similarities between the test webpages and the database to the machine learning matching module; traverse all possible similarity thresholds and find a value T that maximizes the difference between the number of phishing webpages with similarity greater than T and the number of phishing webpages with similarity less than T; T is used as the phishing webpage layout similarity threshold;
Second stage: feature extraction and comparison of content pictures in corresponding blocks
Step 5) Extract picture features from the corresponding similar blocks of the phishing webpage and the ordinary webpage respectively, obtaining the feature vectors of the corresponding pictures;
Step 6) Process the picture feature vectors and use a related algorithm to compute the similarity between corresponding pictures, checking whether the similarity exceeds the set visual feature similarity threshold P;
Step 7) Detection of a suspected phishing webpage: collect the layout features of the suspicious webpage; use these features to search the spatial database for webpages with similar features; compute the similarity between the image and layout features of the suspicious webpage and those of the webpages remaining after filtering the feature database, and check whether the similarity exceeds the set visual feature similarity threshold; if it is greater than the threshold, the webpage is judged to be a phishing webpage, otherwise it is an ordinary webpage.
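The threshold search in step 4 can be sketched as follows. The objective stated in step 4 is open to interpretation, so this sketch assumes one reading: T is chosen to maximize the number of correctly separated test pages, where a page is predicted to be phishing when its similarity to the database exceeds T. The function name `choose_threshold` and the candidate set (the observed similarity values plus 0.0) are assumptions.

```python
def choose_threshold(similarities, labels):
    """Pick the layout similarity threshold T from labeled test pages.

    similarities: similarity of each test page to its most similar database page.
    labels: 1 for phishing, 0 for ordinary, as assigned in step 2.
    Returns (T, number of correctly classified pages). Ties keep the smallest T.
    """
    candidates = sorted(set(similarities) | {0.0})
    best_t, best_correct = None, -1
    for t in candidates:
        # A page counts as correct when "similarity > t" agrees with its label.
        correct = sum((s > t) == (y == 1) for s, y in zip(similarities, labels))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t, best_correct
```

With similarities `[0.1, 0.2, 0.3, 0.8, 0.9]` and labels `[0, 0, 0, 1, 1]`, the search settles on T = 0.3, which separates all five pages.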
CN2011101124281A 2011-04-29 2011-04-29 Fishing webpage detection method based on spatial layout and visual features Pending CN102170446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101124281A CN102170446A (en) 2011-04-29 2011-04-29 Fishing webpage detection method based on spatial layout and visual features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011101124281A CN102170446A (en) 2011-04-29 2011-04-29 Fishing webpage detection method based on spatial layout and visual features

Publications (1)

Publication Number Publication Date
CN102170446A true CN102170446A (en) 2011-08-31

Family

ID=44491423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101124281A Pending CN102170446A (en) 2011-04-29 2011-04-29 Fishing webpage detection method based on spatial layout and visual features

Country Status (1)

Country Link
CN (1) CN102170446A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662959A (en) * 2012-03-07 2012-09-12 南京邮电大学 Method for detecting phishing web pages with spatial mixed index mechanism
CN102999638A (en) * 2013-01-05 2013-03-27 南京邮电大学 Phishing website detection method excavated based on network group
CN103023874A (en) * 2012-11-21 2013-04-03 北京航空航天大学 Phishing website detection method
CN103049483A (en) * 2012-11-30 2013-04-17 北京奇虎科技有限公司 System for recognizing web page dangerousness
CN103136251A (en) * 2011-11-29 2013-06-05 星云融创(北京)科技有限公司 Method and device of webpage identification
CN103179095A (en) * 2011-12-22 2013-06-26 阿里巴巴集团控股有限公司 Method and client device for detecting phishing websites
CN103442014A (en) * 2013-09-03 2013-12-11 中国科学院信息工程研究所 Method and system for automatic detection of suspected counterfeit websites
CN103823758A (en) * 2014-03-13 2014-05-28 北京金山网络科技有限公司 Browser testing method and device
CN103986731A (en) * 2014-05-30 2014-08-13 北京奇虎科技有限公司 Method and device for detecting phishing web pages through picture matching
CN104113869A (en) * 2014-06-20 2014-10-22 北京拓明科技有限公司 Signaling data-based prediction method and system for potential complaint user
CN104166725A (en) * 2014-08-26 2014-11-26 哈尔滨工业大学(威海) Phishing website detection method
CN104765882A (en) * 2015-04-29 2015-07-08 中国互联网络信息中心 Internet website statistics method based on web page characteristic strings
CN105763543A (en) * 2016-02-03 2016-07-13 百度在线网络技术(北京)有限公司 Phishing site identification method and device
CN106127042A (en) * 2016-07-06 2016-11-16 苏州仙度网络科技有限公司 Webpage visual similarity recognition method
CN106303757A (en) * 2015-06-23 2017-01-04 中国科学院信息工程研究所 A kind of view-based access control model feature and the network audio-video address resolution method of stream reduction
CN106685936A (en) * 2016-12-14 2017-05-17 深圳市深信服电子科技有限公司 Webpage defacement detection method and apparatus
CN106874926A (en) * 2016-08-04 2017-06-20 阿里巴巴集团控股有限公司 Service exception detection method and device based on characteristics of image
CN107636650A (en) * 2015-05-18 2018-01-26 微软技术许可有限责任公司 Meet the document based on the condition for rendering assessment to present
CN107729386A (en) * 2017-09-19 2018-02-23 杭州安恒信息技术有限公司 A kind of dark chain detection technique based on degree of polymerization analysis
WO2020259036A1 (en) * 2019-06-26 2020-12-30 扬州大学 Method for generating web code based on ui of generative adversarial and convolutional neural networks.
CN113641933A (en) * 2021-06-30 2021-11-12 北京百度网讯科技有限公司 Abnormal webpage identification method, abnormal site identification method and device
CN117596054A (en) * 2023-11-29 2024-02-23 北京中电汇通科技有限公司 Network security method and system based on dynamic network information security

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101145902A (en) * 2007-08-17 2008-03-19 东南大学 Fishing webpage detection method based on image processing
CN101510887A (en) * 2009-03-27 2009-08-19 腾讯科技(深圳)有限公司 Method and device for identifying website
CN101894134A (en) * 2010-06-21 2010-11-24 南京邮电大学 Spatial layout-based fishing webpage detection and implementation method

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136251A (en) * 2011-11-29 2013-06-05 星云融创(北京)科技有限公司 Method and device of webpage identification
CN103179095B (en) * 2011-12-22 2016-03-30 阿里巴巴集团控股有限公司 A kind of method and client terminal device detecting fishing website
CN103179095A (en) * 2011-12-22 2013-06-26 阿里巴巴集团控股有限公司 Method and client device for detecting phishing websites
CN102662959B (en) * 2012-03-07 2014-07-16 南京邮电大学 Method for detecting phishing web pages with spatial mixed index mechanism
CN102662959A (en) * 2012-03-07 2012-09-12 南京邮电大学 Method for detecting phishing web pages with spatial mixed index mechanism
CN103023874A (en) * 2012-11-21 2013-04-03 北京航空航天大学 Phishing website detection method
CN103023874B (en) * 2012-11-21 2015-08-26 北京航空航天大学 A kind of detection method for phishing site
CN103049483B (en) * 2012-11-30 2016-04-20 北京奇虎科技有限公司 The recognition system of webpage danger
CN103049483A (en) * 2012-11-30 2013-04-17 北京奇虎科技有限公司 System for recognizing web page dangerousness
CN102999638A (en) * 2013-01-05 2013-03-27 南京邮电大学 Phishing website detection method excavated based on network group
CN103442014A (en) * 2013-09-03 2013-12-11 中国科学院信息工程研究所 Method and system for automatic detection of suspected counterfeit websites
CN103823758A (en) * 2014-03-13 2014-05-28 北京金山网络科技有限公司 Browser testing method and device
CN103986731A (en) * 2014-05-30 2014-08-13 北京奇虎科技有限公司 Method and device for detecting phishing web pages through picture matching
CN104113869B (en) * 2014-06-20 2017-12-22 北京拓明科技有限公司 A kind of potential report user's Forecasting Methodology and system based on signaling data
CN104113869A (en) * 2014-06-20 2014-10-22 北京拓明科技有限公司 Signaling data-based prediction method and system for potential complaint user
CN104166725A (en) * 2014-08-26 2014-11-26 哈尔滨工业大学(威海) Phishing website detection method
CN104166725B (en) * 2014-08-26 2018-01-12 哈尔滨工业大学(威海) A kind of detection method for phishing site
CN104765882A (en) * 2015-04-29 2015-07-08 中国互联网络信息中心 Internet website statistics method based on web page characteristic strings
CN107636650A (en) * 2015-05-18 2018-01-26 微软技术许可有限责任公司 Meet the document based on the condition for rendering assessment to present
CN106303757B (en) * 2015-06-23 2019-07-16 中国科学院信息工程研究所 A kind of view-based access control model feature and the network audio-video address resolution method of stream reduction
CN106303757A (en) * 2015-06-23 2017-01-04 中国科学院信息工程研究所 A kind of view-based access control model feature and the network audio-video address resolution method of stream reduction
CN105763543A (en) * 2016-02-03 2016-07-13 百度在线网络技术(北京)有限公司 Phishing site identification method and device
CN105763543B (en) * 2016-02-03 2019-08-30 百度在线网络技术(北京)有限公司 A kind of method and device identifying fishing website
CN106127042A (en) * 2016-07-06 2016-11-16 苏州仙度网络科技有限公司 Webpage visual similarity recognition method
CN106874926A (en) * 2016-08-04 2017-06-20 阿里巴巴集团控股有限公司 Service exception detection method and device based on characteristics of image
CN106685936A (en) * 2016-12-14 2017-05-17 深圳市深信服电子科技有限公司 Webpage defacement detection method and apparatus
CN107729386A (en) * 2017-09-19 2018-02-23 杭州安恒信息技术有限公司 A kind of dark chain detection technique based on degree of polymerization analysis
CN107729386B (en) * 2017-09-19 2019-09-13 杭州安恒信息技术股份有限公司 A kind of dark chain detection technique based on degree of polymerization analysis
WO2020259036A1 (en) * 2019-06-26 2020-12-30 扬州大学 Method for generating web code based on ui of generative adversarial and convolutional neural networks.
US11579850B2 (en) 2019-06-26 2023-02-14 Yangzhou University Method for generating web code for UI based on a generative adversarial network and a convolutional neural network
CN113641933A (en) * 2021-06-30 2021-11-12 北京百度网讯科技有限公司 Abnormal webpage identification method, abnormal site identification method and device
CN113641933B (en) * 2021-06-30 2023-10-20 北京百度网讯科技有限公司 Abnormal webpage identification method, abnormal site identification method and device
CN117596054A (en) * 2023-11-29 2024-02-23 北京中电汇通科技有限公司 Network security method and system based on dynamic network information security
CN117596054B (en) * 2023-11-29 2024-05-07 北京中电汇通科技有限公司 Network security method and system based on dynamic network information security

Similar Documents

Publication Publication Date Title
CN102170446A (en) Fishing webpage detection method based on spatial layout and visual features
CN101894134B (en) Spatial layout-based fishing webpage detection and implementation method
CN103544436B (en) System and method for distinguishing phishing websites
CN103179095B (en) A kind of method and client terminal device detecting fishing website
CN101826105B (en) Phishing webpage detection method based on Hungary matching algorithm
CN102096781A (en) Fishing detection method based on webpage relevance
US7173632B2 (en) Information display
CN102662959B (en) Method for detecting phishing web pages with spatial mixed index mechanism
CN104899508B (en) A kind of multistage detection method for phishing site and system
CN107368718B (en) User browsing behavior authentication method and system
CN102170447A (en) Method for detecting phishing webpage based on nearest neighbour and similarity measurement
CN105824822A (en) Method clustering phishing page to locate target page
CN103605794A (en) Website classifying method
CN105337987B (en) A kind of method for authentication of identification of network user and system
CN101820366A (en) Pre-fetching-based phishing web page detection method
CN109922065B (en) Quick identification method for malicious website
CN103577755A (en) Malicious script static detection method based on SVM (support vector machine)
CN106779278A (en) The evaluation system of assets information and its treating method and apparatus of information
CN107341183A (en) A kind of Website classification method based on darknet website comprehensive characteristics
Bohunsky et al. Visual structure-based web page clustering and retrieval
CN101515272A (en) Method and device for extracting webpage content
CN103023874B (en) A kind of detection method for phishing site
CN107273416A (en) The dark chain detection method of webpage, device and computer-readable recording medium
CN107911360A (en) One kind is hacked website detection method and system
CN110020075A (en) Device is excavated in illegal website automatically

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20110831