CN102316081A - Method and device for identifying similar webpage - Google Patents


Info

Publication number
CN102316081A
Authority
CN
China
Prior art keywords
subtree
node
similarity
isomorphism
maximum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010102222145A
Other languages
Chinese (zh)
Inventor
胡振宇
叶润国
黄宇鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Venus Information Security Technology Co Ltd
Beijing Venus Information Technology Co Ltd
Original Assignee
Beijing Venus Information Security Technology Co Ltd
Beijing Venus Information Technology Co Ltd
Application filed by Beijing Venus Information Security Technology Co Ltd, Beijing Venus Information Technology Co Ltd filed Critical Beijing Venus Information Security Technology Co Ltd
Priority to CN2010102222145A priority Critical patent/CN102316081A/en
Publication of CN102316081A publication Critical patent/CN102316081A/en
Pending legal-status Critical Current

Abstract

The invention provides a method and a device for identifying similar webpages. The device comprises a receiving module, a comparison module and a judgment module, wherein the receiving module is used for receiving document object model (DOM) trees of two webpages respectively; the comparison module is used for comparing the received two DOM trees so as to obtain the similarity of the two DOM trees; and the judgment module is used for comparing the similarity with a preset threshold value, and determining that the two webpages are similar webpages if the similarity is more than or equal to the threshold value. By the method, the comparison of similar webpages can be performed outside a server, further judgment on a suspected phishing website can be performed according to the comparison result and influence on the server and the phishing website is avoided.

Description

Method and device for identifying similar web pages
Technical field
The present invention relates to the field of network security, and in particular to a method and a device for identifying similar web pages.
Background
In the field of network security it is sometimes necessary to judge the degree of similarity between web pages automatically as a security precaution. For example, when identifying a fake website such as a phishing website, a page of that website (such as its home page) is compared with the corresponding page of the genuine website. If the degree of similarity is too high, a user is very likely to mistake the fake page for the genuine one and be deceived.
A phishing website is a malicious forgery built to imitate a real website. Most such websites have a high visual similarity to the original in order to deceive victims. They look the same as the real website, so a careless user is easily fooled. Victims of a phishing attack may expose their bank accounts, passwords, credit card numbers and other important information to the operators behind the fake website.
Compared with other network crimes (such as viruses and hacker attacks), phishing is a relatively new network crime, and its incidence has been rising at an accelerating rate in recent years. A report by the Anti-Phishing Working Group shows that phishing attacks are growing at a rate of 50% per month, and that about 5% of users respond to the phishing e-mails they receive. The report also shows that 15,050 phishing attacks occurred in June 2005 alone. The problem has attracted close attention from industry and academia because it is a serious security and privacy problem with a large negative effect on the Internet. It threatens people's confidence in carrying out online financial activities over the Web.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and a device for identifying similar web pages that can compare web pages outside the server. The comparison result can be used to further judge a suspected phishing website, without any effect on the server or on the phishing website.
To solve the above problem, the invention provides a device for identifying similar web pages, comprising:
A receiving module, used to receive the Document Object Model (DOM) trees of two web pages respectively;
A comparison module, used to compare the two received DOM trees and obtain the similarity of the two DOM trees;
A judgment module, used to compare the similarity with a preset threshold and, if the similarity is greater than or equal to the threshold, judge that the two web pages are similar.
Further, the comparison module comprises:
A serialization unit, used to convert the two received DOM trees into a first node sequence and a second node sequence respectively;
An extraction unit, used to find all maximal isomorphic subtree pairs of the two DOM trees from the first and second node sequences. An isomorphic subtree pair is a pair of subtrees whose tree structures are isomorphic. A maximal isomorphic subtree pair is a pair of subtrees that are isomorphic, while the subtrees rooted at the parents of their respective root nodes are no longer isomorphic;
A measurement unit, used to calculate, pair by pair, the similarity of each maximal isomorphic subtree pair found, and then to calculate the similarity of the two DOM trees from the sum of the similarities of the maximal isomorphic subtree pairs.
Further, when the serialization unit converts a DOM tree into a node sequence, it records the node label and the node depth of each node in the sequence;
The extraction unit comprises:
A subtree search subunit, used to find the subtree of greatest depth in the first node sequence and record the root node of that subtree; after receiving an instruction to continue searching, it finds in the first node sequence the subtree of greatest depth whose root node has not yet been recorded and records its root node; it stops working when no new subtree can be found;
An isomorphic-subtree search subunit, used to judge whether the second node sequence contains a subtree isomorphic to the subtree found. Two subtrees are judged isomorphic if they have the same number of nodes and if, when the depth of each node in the sequence of one subtree is reduced by the depth of the node at the same position in the sequence of the other subtree, all the resulting differences are equal. If such a subtree exists, the subunit records the two subtrees as a maximal isomorphic subtree pair, outputs the pair, deletes the nodes of the two subtrees from the first and second node sequences or replaces them with special symbols, and then instructs the subtree search subunit to continue searching; if not, it instructs the subtree search subunit to continue searching.
Further, the measurement unit calculating the similarity $\mathrm{sim}(T_1, T_2)$ of a maximal isomorphic subtree pair means:
The measurement unit calculates the similarity of each pair of corresponding nodes in subtrees $T_1$ and $T_2$, adds up the similarities of all pairs of corresponding nodes, and divides the sum by $n$ to obtain $\mathrm{sim}(T_1, T_2)$, where $n$ is the number of nodes of subtree $T_1$ or $T_2$. The similarity of a pair of corresponding nodes $v_i$ and $u_i$ in subtrees $T_1$ and $T_2$ is:
$$s(v_i, u_i) = \frac{\text{length\_of\_matched\_character}}{\max(|v_i|, |u_i|)}$$
where length_of_matched_character is the number of matched characters, and $|v_i|$ and $|u_i|$ are the character lengths of the node labels of $v_i$ and $u_i$ respectively.
Further, the measurement unit calculating the similarity of the two DOM trees from the sum of the similarities of the maximal isomorphic subtree pairs means:
The measurement unit multiplies the similarity of each maximal isomorphic subtree pair by the number of nodes of one subtree of that pair and adds up all the products; the similarity of the two DOM trees is this sum divided by the larger of $|T_1|$ and $|T_2|$, where $|T_1|$ and $|T_2|$ are the numbers of nodes of the two DOM trees.
The present invention also provides a method for identifying similar web pages, comprising:
Receiving the Document Object Model (DOM) trees of two web pages respectively;
Comparing the two received DOM trees to obtain the similarity of the two DOM trees;
If the similarity is greater than or equal to a preset threshold, judging that the two web pages are similar.
Further, the step of comparing the two received DOM trees to obtain their similarity comprises:
A. converting the two received DOM trees into a first node sequence and a second node sequence respectively;
B. finding all maximal isomorphic subtree pairs in the first and second node sequences;
C. calculating, pair by pair, the similarity of each maximal isomorphic subtree pair found, and then calculating the similarity of the two DOM trees from the sum of the similarities of the maximal isomorphic subtree pairs.
In step A, the nodes of each DOM tree may be, but need not be, arranged into a sequence in pre-order traversal order, with the node label and node depth of each node recorded in the sequence.
Further, step B comprises:
B1. finding the subtree of greatest depth in the first node sequence and recording the root node of that subtree;
B2. judging whether the second node sequence contains a subtree isomorphic to the subtree found. Two subtrees are judged isomorphic if they have the same number of nodes and if, when the depth of each node in the sequence of one subtree is reduced by the depth of the node at the same position in the sequence of the other subtree, all the resulting differences are equal. If such a subtree exists, recording the two subtrees as a maximal isomorphic subtree pair, and deleting the nodes of the two subtrees from the first and second node sequences or replacing them with special symbols;
B3. finding in the first node sequence the subtree of greatest depth whose root node has not yet been recorded, recording its root node, and returning to step B2; if no new subtree can be found, proceeding to step B4;
B4. outputting the recorded maximal isomorphic subtree pairs.
Further, in step C, the step of calculating the similarity of the two DOM trees from the sum of the similarities of the maximal isomorphic subtree pairs comprises:
Multiplying the similarity of each maximal isomorphic subtree pair by the number of nodes of one subtree of that pair;
Summing all the products;
Taking as the similarity of the two DOM trees the quotient of this sum divided by the larger of $|T_1|$ and $|T_2|$, where $|T_1|$ and $|T_2|$ are the numbers of nodes of the two DOM trees.
Further, in step C, the similarity $\mathrm{sim}(T_1, T_2)$ of a maximal isomorphic subtree pair is:
The sum of the similarities of all pairs of corresponding nodes in subtrees $T_1$ and $T_2$, divided by $n$;
where $n$ is the number of nodes of $T_1$ or $T_2$. The similarity of a pair of corresponding nodes $v_i$ and $u_i$ in subtrees $T_1$ and $T_2$ is:
$$s(v_i, u_i) = \frac{\text{length\_of\_matched\_character}}{\max(|v_i|, |u_i|)}$$
where length_of_matched_character is the number of matched characters, and $|v_i|$ and $|u_i|$ are the character lengths of the labels of nodes $v_i$ and $u_i$ respectively.
The technical scheme of the present invention can identify similar web pages outside the server and so provides a basis for identifying phishing websites.
Other advantages, objects and features of the present invention will be set forth to some extent in the description that follows, and to some extent will be apparent to those skilled in the art on examination of what follows, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and attained by the structures particularly pointed out in the following description, the claims and the accompanying drawings.
Description of drawings
To make the objects, technical scheme and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings, in which:
Fig. 1 is a structural diagram of the web page identification device of embodiment one;
Fig. 2 is a structural diagram of the comparison module of embodiment one;
Fig. 3 is a diagram of a DOM tree that the serialization unit converts into a node sequence in embodiment one;
Fig. 4 is a diagram of two DOM trees from which the extraction unit extracts maximal isomorphic subtree pairs in embodiment one.
Embodiments
The technical scheme of the present invention is explained in more detail below with reference to the accompanying drawings and embodiments.
It should be noted that, although the invention is described below in connection with certain exemplary implementations and methods of use, those skilled in the art will understand that this is not intended to limit the invention to these embodiments. On the contrary, it is intended to cover all alternatives, modifications and equivalents included within the spirit and scope of the invention as defined by the appended claims. Where they do not conflict, the embodiments of the invention and the features within them may be combined with one another, all within the scope of protection of the invention. In addition, the steps shown in the flow charts of the drawings may be executed in a computer system as a set of computer-executable instructions, and although a logical order is shown in the flow charts, in some cases the steps shown or described may be executed in a different order.
The present invention represents the page DOM trees of a genuine website and a suspect website by their node sequences, then looks for similarity between the genuine and suspect websites through maximal isomorphic subtree pairs, and compares the degree of similarity of the two websites' web pages to judge whether the suspect website is a phishing website.
Embodiment one: a device for identifying similar web pages, as shown in Fig. 1, comprising:
A receiving module 101, used to receive the DOM (Document Object Model) trees of two web pages respectively;
A comparison module 102, used to compare the two DOM trees received by the receiving module and obtain the similarity of the two DOM trees;
A judgment module 103, used to compare the similarity with a preset threshold and, if the similarity is greater than or equal to the threshold, judge that the two web pages are similar.
In this embodiment, the two web pages may be a page of a genuine website and a page of a suspected phishing website. They may be specified by the user, selected by a network element such as a server, or selected in some other way. The home pages of the genuine and suspect websites may be, but need not be, the pages chosen for comparison.
If the two web pages come from a genuine website and a suspect website respectively, then when the two pages are similar the suspect website can be considered very likely to be a phishing website.
In this embodiment, the comparison module 102, as shown in Fig. 2, may specifically comprise:
A serialization unit 201, used to convert the two received DOM trees into a first node sequence and a second node sequence respectively. Here "first" and "second" only distinguish the two node sequences; either DOM tree may be converted into the first or the second node sequence;
An extraction unit 202, used to find all maximal isomorphic subtree pairs of the two DOM trees from the first and second node sequences. An isomorphic subtree pair is a pair of subtrees whose tree structures are isomorphic when their node labels are ignored. A maximal isomorphic subtree pair is a pair of subtrees that are isomorphic, while the subtrees rooted at the parents of their respective root nodes are no longer isomorphic.
A measurement unit 203, used to calculate, pair by pair, the similarity of each maximal isomorphic subtree pair found, and then to calculate the similarity of the two DOM trees from the sum of those similarities. The similarity of the two DOM trees may be obtained by adding up the similarities of the maximal isomorphic subtree pairs, or by averaging after the addition; the addition may be a direct sum or a weighted sum.
In this embodiment, when the serialization unit 201 converts a DOM tree into a node sequence, it may, but need not, arrange the nodes of the DOM tree into a sequence in pre-order traversal order, recording the node label and node depth of each node in the sequence.
For example, for the DOM tree shown in Fig. 3, starting from the root node, the root node is recorded with node label A and node depth 0. Its left child is then recorded with label G and depth 1, the child of node G with label B and depth 2, and the child of node B with label C and depth 3. The left subtree is now fully recorded. Next the right child of the root node is recorded with label D and depth 1, and the child of node D with label E and depth 2. The left child of node E is recorded with label B and depth 3, and its right child with label G and depth 3. The right subtree is now fully recorded as well, the root node has no other children, and serialization of the DOM tree is complete. The resulting node sequence is: {(A, 0) (G, 1) (B, 2) (C, 3) (D, 1) (E, 2) (B, 3) (G, 3)}.
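The serialization just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` class and the `serialize` function are hypothetical names introduced here, and the tree is rebuilt from the textual description of Fig. 3.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM-like node: a label plus an ordered list of children."""
    label: str
    children: list = field(default_factory=list)


def serialize(root):
    """Return the pre-order list of (node label, node depth) pairs."""
    sequence = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        sequence.append((node.label, depth))
        # Push children right-to-left so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, depth + 1))
    return sequence


# Tree reconstructed from the description of Fig. 3.
fig3 = Node("A", [
    Node("G", [Node("B", [Node("C")])]),
    Node("D", [Node("E", [Node("B"), Node("G")])]),
])

print(serialize(fig3))
# [('A', 0), ('G', 1), ('B', 2), ('C', 3), ('D', 1), ('E', 2), ('B', 3), ('G', 3)]
```

An explicit stack is used instead of recursion so that very deep DOM trees do not hit the interpreter's recursion limit; a recursive traversal would produce the same sequence.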
In this embodiment, the extraction unit 202 specifically comprises:
A subtree search subunit, used to find the subtree of greatest depth in the first node sequence and record the root node of that subtree; after receiving an instruction to continue searching, it finds in the first node sequence the subtree of greatest depth whose root node has not yet been recorded and records its root node; it stops working when no new subtree can be found;
An isomorphic-subtree search subunit, used to judge whether the second node sequence contains a subtree isomorphic to the subtree found. Two subtrees are judged isomorphic if they have the same number of nodes and if, when the depth of each node in the sequence of one subtree is reduced by the depth of the node at the same position in the sequence of the other subtree, all the resulting differences are equal. If such a subtree exists, the subunit records the two subtrees as a maximal isomorphic subtree pair and deletes the nodes of the two subtrees from the first and second node sequences or replaces them with special symbols, so that these nodes are not checked again in later extraction, and then instructs the subtree search subunit to continue searching; if not, it instructs the subtree search subunit to continue searching.
For a node in a node sequence, the sequence of the subtree rooted at that node runs from the node itself up to, but not including, the first later node whose depth is not greater than that node's depth. For example, for the left DOM tree in Fig. 4, the sequence of the subtree rooted at node 3 is {(3, 1) (4, 2)} (because node 5 has the same depth as node 3), and the sequence of the subtree rooted at node 5 is {(5, 1) (6, 2) (7, 3) (8, 4) (9, 3) (10, 4)} (because node 11 has the same depth as node 5); and so on.
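The rule for reading a subtree out of a node sequence can be sketched as below. The function name `subtree_sequence` is hypothetical, and because Fig. 4 is not reproduced here, the `left` sequence is an assumption reconstructed from the text (root node 2 with children 3, 5 and 11; the depth of node 12 is assumed to be 2).

```python
def subtree_sequence(sequence, start):
    """Slice out the subtree rooted at sequence[start].

    The subtree runs from its root up to, but not including, the first
    later node whose depth is not greater than the root's depth.
    """
    root_depth = sequence[start][1]
    end = start + 1
    while end < len(sequence) and sequence[end][1] > root_depth:
        end += 1
    return sequence[start:end]


# Assumed node sequence of the left DOM tree of Fig. 4: (label, depth) pairs.
left = [(2, 0), (3, 1), (4, 2), (5, 1), (6, 2), (7, 3),
        (8, 4), (9, 3), (10, 4), (11, 1), (12, 2)]

print(subtree_sequence(left, 1))  # subtree rooted at node 3
print(subtree_sequence(left, 3))  # subtree rooted at node 5
```

With the assumed sequence, the two calls return {(3, 1) (4, 2)} and {(5, 1) (6, 2) (7, 3) (8, 4) (9, 3) (10, 4)}, matching the sequences given in the text.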
The two DOM trees shown in Fig. 4 are used below as a detailed example.
The subtree search subunit first finds the whole first node sequence converted from the left DOM tree. The isomorphic-subtree search subunit cannot find an isomorphic subtree in the second node sequence converted from the right DOM tree, so it instructs the subtree search subunit to continue searching.
The subtree search subunit then finds, in the first node sequence, the sequence of subtree {5, 6, 7, 8, 9, 10}. The isomorphic-subtree search subunit finds subtree {16, 17, 18, 19, 20, 21} in the second node sequence. Both subtrees have 6 nodes, and when the depths of nodes 16, 17, 18, 19, 20 and 21 are subtracted from the depths of nodes 5, 6, 7, 8, 9 and 10 respectively, all 6 differences are 0, so the two subtrees are isomorphic. Subtree {5, 6, 7, 8, 9, 10} and subtree {16, 17, 18, 19, 20, 21} are recorded as a maximal isomorphic subtree pair.
The isomorphic-subtree search subunit replaces the nodes of subtree {5, 6, 7, 8, 9, 10} and subtree {16, 17, 18, 19, 20, 21} in the first and second node sequences with special elements, and then instructs the subtree search subunit to continue searching. The subtree search subunit now finds subtree {3, 4} (it could equally be {11, 12}), and the isomorphic-subtree search subunit finds the isomorphic subtree {25, 26} in the second node sequence. Subtree {3, 4} and subtree {25, 26} are recorded as a maximal isomorphic subtree pair.
The isomorphic-subtree search subunit also replaces the nodes of subtree {3, 4} and subtree {25, 26} in the first and second node sequences with special elements, and then instructs the subtree search subunit to continue searching. The subtree search subunit now finds subtree {11, 12}, but the isomorphic-subtree search subunit finds no isomorphic subtree in the second node sequence. When the subtree search subunit continues, it finds no new subtree, so the maximal isomorphic subtree pairs output are subtree {3, 4} with subtree {25, 26}, and subtree {5, 6, 7, 8, 9, 10} with subtree {16, 17, 18, 19, 20, 21}.
In this embodiment, the measurement unit calculating the similarity $\mathrm{sim}(T_1, T_2)$ of a maximal isomorphic subtree pair means:
The measurement unit calculates the similarity of each pair of corresponding nodes in subtrees $T_1$ and $T_2$, adds up the similarities of all pairs of corresponding nodes, and divides the sum by $n$ to obtain $\mathrm{sim}(T_1, T_2)$;
Corresponding nodes are the nodes at the same position in the two subtrees (or in the node sequences of the two subtrees). For example, node 8 in subtree {5, 6, 7, 8, 9, 10} and node 19 in subtree {16, 17, 18, 19, 20, 21} are a pair of corresponding nodes.
The measurement unit 203 may, but need not, calculate the similarity $\mathrm{sim}(T_1, T_2)$ of a maximal isomorphic subtree pair with the following formula:
$$\mathrm{sim}(T_1, T_2) = \frac{1}{n} \sum_{i=1}^{n} s(v_i, u_i)$$
where $v_i \in V(T_1)$ and $u_i \in V(T_2)$ are nodes of $T_1$ and $T_2$ respectively, and $n$ is the number of nodes of $T_1$ or $T_2$. $s(v_i, u_i)$ is the similarity of the corresponding nodes $v_i$ and $u_i$ in subtrees $T_1$ and $T_2$; it can be measured by the number of matched characters or by a simple Hamming weight.
One way to calculate it is as follows:
$$s(v_i, u_i) = \frac{\text{length\_of\_matched\_character}}{\max(|v_i|, |u_i|)}$$
where length_of_matched_character is the number of matched characters, and $|v_i|$ and $|u_i|$ are the character lengths of the labels of nodes $v_i$ and $u_i$ respectively.
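The two formulas above can be sketched in Python as follows. The text does not define exactly how matched characters are counted, so this sketch assumes a position-by-position comparison of the two labels (a Hamming-style count); a longest-common-prefix or longest-common-subsequence count would be another reasonable reading. The function names and the sample labels are hypothetical.

```python
def node_similarity(label_v, label_u):
    """s(v, u): matched characters divided by the longer label length.

    Assumption: "matched characters" means positions at which both labels
    hold the same character.
    """
    longest = max(len(label_v), len(label_u))
    if longest == 0:
        return 1.0  # two empty labels are treated as identical
    matched = sum(1 for a, b in zip(label_v, label_u) if a == b)
    return matched / longest


def subtree_pair_similarity(labels_1, labels_2):
    """sim(T1, T2): mean node similarity over corresponding nodes.

    Both arguments list the node labels of one subtree in sequence order,
    so equal positions hold corresponding nodes.
    """
    if not labels_1 or len(labels_1) != len(labels_2):
        raise ValueError("isomorphic subtrees must be non-empty and equal in size")
    total = sum(node_similarity(v, u) for v, u in zip(labels_1, labels_2))
    return total / len(labels_1)


print(node_similarity("div", "div"))                       # 1.0
print(subtree_pair_similarity(["div", "a"], ["div", "b"]))  # 0.5
```

Because the two subtrees of a pair are isomorphic, corresponding nodes can be paired simply by position in the node sequences, which is why plain `zip` suffices here.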
The measurement unit 203 calculating the similarity of the two DOM trees from the sum of the similarities of the maximal isomorphic subtree pairs may specifically mean:
The measurement unit 203 multiplies the similarity of each maximal isomorphic subtree pair by the number of nodes of one subtree of that pair (the two subtrees of a maximal isomorphic pair necessarily have the same number of nodes, so either may be used) and adds up all the products; the similarity of the two DOM trees is this sum divided by the larger of $|T_1|$ and $|T_2|$, where $|T_1|$ and $|T_2|$ are the numbers of nodes of the two DOM trees.
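The weighted aggregation and the threshold comparison performed by the judgment module might look like the sketch below. The function names, the pair similarities, the tree sizes and the threshold value are all hypothetical and chosen only to illustrate the arithmetic.

```python
def tree_similarity(pairs, size_1, size_2):
    """Similarity of two DOM trees from their maximal isomorphic subtree pairs.

    pairs: iterable of (pair_similarity, node_count), one entry per pair,
           where node_count is the size of either subtree of the pair.
    size_1, size_2: total node counts |T1| and |T2| of the two DOM trees.
    """
    weighted_sum = sum(sim * count for sim, count in pairs)
    return weighted_sum / max(size_1, size_2)


def pages_are_similar(similarity, threshold):
    """Judgment step: similar when the similarity reaches the threshold."""
    return similarity >= threshold


# Hypothetical figures: a 6-node pair with similarity 1.0 and a 2-node pair
# with similarity 0.5, taken from trees of 11 and 14 nodes.
score = tree_similarity([(1.0, 6), (0.5, 2)], 11, 14)
print(score)                          # 0.5
print(pages_are_similar(score, 0.4))  # True
```

Dividing by the larger tree size keeps the score in the range 0 to 1 and penalizes nodes that belong to no matched pair in either tree.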
Embodiment two: a method for identifying similar web pages, comprising:
Receiving the DOM trees of two web pages respectively;
Comparing the two received DOM trees to obtain the similarity of the two DOM trees;
If the similarity is greater than or equal to a preset threshold, judging that the two web pages are similar.
In this embodiment, the step of comparing the two received DOM trees to obtain their similarity may specifically comprise:
A. converting the two received DOM trees into a first node sequence and a second node sequence respectively;
B. finding all maximal isomorphic subtree pairs in the first and second node sequences;
C. calculating, pair by pair, the similarity of each maximal isomorphic subtree pair found, and then calculating the similarity of the two DOM trees from the sum of the similarities of the maximal isomorphic subtree pairs.
In this embodiment, in step A, the nodes of each DOM tree may be, but need not be, arranged into a sequence in pre-order traversal order, with the node label and node depth of each node recorded in the sequence.
In this embodiment, step B may specifically comprise:
B1. finding the subtree of greatest depth in the first node sequence and recording the root node of that subtree;
B2. judging whether the second node sequence contains a subtree isomorphic to the subtree found. Two subtrees are judged isomorphic if they have the same number of nodes and if, when the depth of each node in the sequence of one subtree is reduced by the depth of the node at the same position in the sequence of the other subtree, all the resulting differences are equal. If such a subtree exists, recording the two subtrees as a maximal isomorphic subtree pair, and deleting the nodes of the two subtrees from the first and second node sequences or replacing them with special symbols, so that these nodes are not checked again in later extraction;
B3. finding in the first node sequence the subtree of greatest depth whose root node has not yet been recorded, recording its root node, and returning to step B2; if no new subtree can be found, proceeding to step B4;
B4. outputting the recorded maximal isomorphic subtree pairs.
In this embodiment, in step C, the step of calculating the similarity of the two DOM trees from the sum of the similarities of the maximal isomorphic subtree pairs may specifically comprise:
Multiplying the similarity of each maximal isomorphic subtree pair by the number of nodes of one subtree of that pair;
Summing all the products;
Taking as the similarity of the two DOM trees the quotient of this sum divided by the larger of $|T_1|$ and $|T_2|$, where $|T_1|$ and $|T_2|$ are the numbers of nodes of the two DOM trees.
In this embodiment, in step C, the similarity $\mathrm{sim}(T_1, T_2)$ of a maximal isomorphic subtree pair is:
The sum of the similarities of all pairs of corresponding nodes in subtrees $T_1$ and $T_2$, divided by $n$;
where $v_i \in V(T_1)$ and $u_i \in V(T_2)$ are nodes of $T_1$ and $T_2$ respectively, and $n$ is the number of nodes of $T_1$ or $T_2$. $s(v_i, u_i)$ is the similarity of a pair of corresponding nodes $v_i$ and $u_i$ in subtrees $T_1$ and $T_2$; it can be measured by the number of matched characters or by a simple Hamming weight.
One way to calculate it is as follows:
$$s(v_i, u_i) = \frac{\text{length\_of\_matched\_character}}{\max(|v_i|, |u_i|)}$$
where length_of_matched_character is the number of matched characters, and $|v_i|$ and $|u_i|$ are the character lengths of the labels of nodes $v_i$ and $u_i$ respectively.
Other implementation details may be the same as in embodiment one.
A concrete example follows. In this example, similar web page identification is used to judge whether a suspect website is a phishing website: when two web pages, one from the genuine website and one from the suspect website, are similar, the suspect website is considered a phishing website. Practical application of this embodiment is not limited to this example. The detailed process is as follows:
Step S601: preset a threshold $s$ and obtain the web page DOM trees $T_1$ and $T_2$ of the suspect website and the genuine website (assume $|T_1| \le |T_2|$, where $|T_1|$ and $|T_2|$ are the numbers of nodes of the two DOM trees); their root nodes are $v_0$ and $u_0$ respectively.
Step S602: serialize $T_1$ and $T_2$ to obtain the first and second node sequences respectively:
$\mathrm{ldfts}(T_1) = \{(v_0, 0), (v_1, d_1), \ldots, (v_k, d_k)\}$ and $\mathrm{ldfts}(T_2) = \{(u_0, 0), (u_1, d_1), \ldots, (u_m, d_m)\}$.
Step S603: set the similarity of the two DOM trees $\mathrm{Similar}(T_1, T_2) = 0$;
Step S604: from $\mathrm{ldfts}(T_1)$ and $\mathrm{ldfts}(T_2)$, find each maximal isomorphic subtree pair $T_1(v_i)$ and $T_2(u_i)$ of $T_1$ and $T_2$; calculate the similarity $\mathrm{sim}(T_1(v_i), T_2(u_i))$ of each maximal isomorphic subtree pair; multiply the similarity of each pair by the number of nodes of one subtree of that pair, and take the sum of all the products as $\mathrm{Similar}(T_1, T_2)$;
Step S605: set $\mathrm{Similar}(T_1, T_2) = \dfrac{1}{\max(|T_1|, |T_2|)}\,\mathrm{Similar}(T_1, T_2)$;
If $\mathrm{Similar}(T_1, T_2) > s$, output "yes", indicating that the two web pages are similar and that the suspect website is probably a phishing website; otherwise output "no", indicating that the two web pages are not similar and that the suspect website is probably not a phishing website.
The step of finding the maximal isomorphic subtree pairs in step S604 may specifically comprise:
Step S801: set the pending node set S to empty, then put the root node of tree $T_1$ into S.
Step S802: if S is empty, end the search. Otherwise, for each node in S, compute the depth of the subtree rooted at that node, and take the subtree $T_1(v_i)$ with the greatest depth; if several subtrees have the same depth, the one found first may be taken.
Step S803: search $\mathrm{ldfts}(T_2)$ for a subtree $T_2(u_i)$ isomorphic to $T_1(v_i)$. If one is found, output $T_1(v_i)$, $T_2(u_i)$, then replace every element of the node sequence corresponding to $T_1(v_i)$ in $\mathrm{ldfts}(T_1)$ with the special element (1, -1), replace every element of the node sequence corresponding to $T_2(u_i)$ in $\mathrm{ldfts}(T_2)$ with the special element (2, -2), and go to step S804. Otherwise, put the child nodes of $v_i$ into S and go to step S804.
Step S804: delete node $v_i$ from S and return to step S802.
For example, for the DOM tree on the left of Fig. 4, the first node put into S is the root node 2. Since no isomorphic subtree can be found for it, nodes 3, 5 and 11 are put into S; the deepest subtree found is then the one rooted at node 5. Since an isomorphic subtree can be found for it, all nodes of the subtree rooted at node 5 are replaced with the special element and node 5 is deleted from S, leaving nodes 3 and 11 in S. If instead no isomorphic subtree had been found for the subtree rooted at node 5, the child node 6 of node 5 would be put into S and node 5 deleted from S, leaving nodes 3, 6 and 11 in S.
In step S803, whether $T_1(v_i)$ and $T_2(u_i)$ are isomorphic may be judged as follows:
If the depth of each node in the node sequence of $T_1(v_i)$ differs from the depth of the corresponding node in the node sequence of $T_2(u_i)$ by the same amount, then $T_1(v_i)$ and $T_2(u_i)$ are isomorphic; otherwise they are not.
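Steps S801 to S804 and the isomorphism test can be sketched as below. This is a sketch under stated assumptions rather than the patented implementation: the function names are hypothetical, sequences are lists of (label, depth) pairs in preorder, the isomorphism test also requires equal node counts (as the claims do), and a boolean `used2` list stands in for the special-element replacement in the second sequence.

```python
def subtree_end(seq, i):
    """Return the index one past the last node of the subtree rooted at seq[i]."""
    j = i + 1
    while j < len(seq) and seq[j][1] > seq[i][1]:
        j += 1
    return j


def subtree_height(seq, i):
    """Depth of the subtree rooted at seq[i], measured from its own root."""
    return max(d for _, d in seq[i:subtree_end(seq, i)]) - seq[i][1]


def is_isomorphic(sub1, sub2):
    """Same node count and one constant depth offset at every position."""
    if len(sub1) != len(sub2):
        return False
    return len({a[1] - b[1] for a, b in zip(sub1, sub2)}) == 1


def find_maximal_pairs(seq1, seq2):
    """Return [((start1, end1), (start2, end2)), ...] as half-open index spans."""
    used2 = [False] * len(seq2)   # assumption: flags replace the special elements
    pending = [0]                 # S801: S holds the root of T1
    pairs = []
    while pending:                # S802: stop once S is empty
        # max() returns the first maximal item, so ties go to the node found first.
        i = max(pending, key=lambda p: subtree_height(seq1, p))
        end1 = subtree_end(seq1, i)
        match = None
        for k in range(len(seq2)):  # S803: scan T2 for an unused isomorphic subtree
            end2 = subtree_end(seq2, k)
            if not any(used2[k:end2]) and is_isomorphic(seq1[i:end1], seq2[k:end2]):
                match = (k, end2)
                break
        if match is not None:
            pairs.append(((i, end1), match))
            for t in range(match[0], match[1]):
                used2[t] = True
        else:
            # No match: queue the children of v_i (nodes one level below it).
            pending.extend(t for t in range(i + 1, end1)
                           if seq1[t][1] == seq1[i][1] + 1)
        pending.remove(i)         # S804: delete v_i from S
    return pairs


seq1 = [("html", 0), ("tr", 1), ("td", 2), ("table", 1), ("tr", 2), ("tr", 2)]
seq2 = [("html", 0), ("table", 1), ("tr", 2), ("tr", 2), ("tr", 1), ("td", 2), ("p", 1)]
pairs = find_maximal_pairs(seq1, seq2)
print(pairs)
```

The first sequence is not masked in this sketch because a node only enters S when its parent failed to match, so S never contains a node inside an already matched subtree. Note that the test compares structure only; node labels are taken into account later, when the similarity of each pair is calculated.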
In step S604, the similarity $\mathrm{sim}(T_1, T_2)$ of a maximal isomorphic subtree pair $T_1(v_i)$, $T_2(u_i)$ is calculated as follows:
$$\mathrm{sim}(T_1, T_2) = \frac{1}{n} \sum s(v_i, u_i)$$
where $v_i \in V(T_1)$ and $u_i \in V(T_2)$ are nodes of $T_1$ and $T_2$ respectively, and $n$ denotes the node count of $T_1$ or $T_2$. $s(v_i, u_i)$ denotes the similarity of nodes $v_i$ and $u_i$, which can be measured by the number of matched characters or by a simple Hamming weight.
One calculation method is as follows:
$$s(v_i, u_i) = \frac{\text{length\_of\_matched\_charater}}{\max(|v_i|, |u_i|)}$$
where $|v_i|$ and $|u_i|$ denote the character lengths of the node labels of $v_i$ and $u_i$ respectively, and length_of_matched_charater is the length of the matched characters.
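The two formulas above can be sketched as follows. The text does not say how the matched characters are counted, so this sketch assumes a position-by-position comparison of the two labels; the function names and the handling of two empty labels are likewise assumptions made for illustration.

```python
def node_similarity(label1, label2):
    """s(v, u): matched characters divided by the longer label length."""
    if not label1 and not label2:
        return 1.0  # assumption: two empty labels count as identical
    # Assumption: characters match when equal at the same position.
    matched = sum(1 for a, b in zip(label1, label2) if a == b)
    return matched / max(len(label1), len(label2))


def pair_similarity(labels1, labels2):
    """sim(T1, T2): mean node similarity over corresponding nodes."""
    n = len(labels1)  # isomorphic subtrees have the same node count
    return sum(node_similarity(a, b) for a, b in zip(labels1, labels2)) / n


print(node_similarity("TR", "TR"))
print(node_similarity("TD", "TR"))
print(pair_similarity(["TR", "TD"], ["TR", "TH"]))
```

Identical labels give 1, labels that share one of two characters give 0.5, and the pair similarity is the average of the node similarities.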
The calculation is illustrated below with the two maximal isomorphic subtree pairs found in Fig. 4.
Following Embodiment 1, two maximal isomorphic subtree pairs are found from the layered node sequences of the two trees: subtree {3, 4} with subtree {25, 26}, and subtree {5, 6, 7, 8, 9, 10} with subtree {16, 17, 18, 19, 20, 21}. According to the node labels in the figure, the node labels of subtree {3, 4} and subtree {25, 26} are {TR, TD} and {TR, TD} respectively, so they are identical (matched). The node similarities of this isomorphic subtree pair are therefore:
s(3,25)=s(4,26)=1
The similarity of subtree {3, 4} and subtree {25, 26} is therefore:
$$\frac{1}{2}(1 + 1) = 1.$$
The node labels of subtree {5, 6, 7, 8, 9, 10} and subtree {16, 17, 18, 19, 20, 21} are {TR, table, TR, TD, TR, TD} and {TR, table, TR, TD, TR, TD} respectively, so they are also identical (matched). The node similarities of this isomorphic subtree pair are therefore:
s(5,16)=s(6,17)=s(7,18)=s(8,19)=s(9,20)=s(10,21)=1
The similarity of this isomorphic subtree pair is therefore:
$$\frac{1}{6}(1 + 1 + 1 + 1 + 1 + 1) = 1.$$
The larger node count of the two trees in Fig. 4 is 12 (the tree on the left has 11 nodes and the tree on the right has 12), so the similarity of the two trees is:
$$\frac{6 \times 1 + 2 \times 1}{12} \approx 0.67.$$
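The arithmetic of this example (steps S604 and S605) can be checked with a short sketch. The function name `dom_similarity` and the threshold value 0.6 are hypothetical and chosen only for illustration; the pair similarities and node counts are the ones stated above for Fig. 4.

```python
def dom_similarity(pairs, size1, size2):
    """pairs: iterable of (pair similarity, node count of one subtree of the pair)."""
    weighted = sum(sim * count for sim, count in pairs)  # S604: weighted sum
    return weighted / max(size1, size2)                  # S605: normalize


# Fig. 4: pair {3, 4}/{25, 26} has 2 nodes, pair {5..10}/{16..21} has 6 nodes.
fig4_pairs = [(1.0, 2), (1.0, 6)]
similar = dom_similarity(fig4_pairs, 11, 12)
print(round(similar, 2))

s = 0.6  # hypothetical threshold, not taken from the text
print("yes" if similar > s else "no")
```

Because the weighted sum is divided by the larger node count, nodes that fall outside every matched pair lower the score, so the result only reaches 1 when both trees are fully covered by identically labelled isomorphic subtrees.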
Obviously, those skilled in the art will understand that the modules or steps of the present invention described above may be implemented with general-purpose computing devices. They may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may each be made into individual integrated circuit modules, or several of the modules or steps may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make corresponding changes and variations according to the present invention, but all such changes and variations shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A device for identifying similar webpages, characterized by comprising:
a receiving module, configured to receive the Document Object Model (DOM) trees of two webpages respectively;
a comparison module, configured to compare the two received DOM trees to obtain the similarity of the two DOM trees;
a judgment module, configured to compare the similarity with a preset threshold and, if the similarity is greater than or equal to the threshold, determine that the two webpages are similar.
2. The device according to claim 1, characterized in that the comparison module comprises:
a serialization unit, configured to convert the two received DOM trees into a first node sequence and a second node sequence respectively;
an extraction unit, configured to find all maximal isomorphic subtree pairs in the two DOM trees according to the first and second node sequences; an isomorphic subtree pair means two subtrees that are isomorphic as trees; a maximal isomorphic subtree pair means two subtrees that are isomorphic, while the subtrees rooted at the parent nodes of their respective root nodes are no longer isomorphic;
a measurement unit, configured to calculate, pair by pair, the similarity of each maximal isomorphic subtree pair found, and then calculate the similarity of the two DOM trees from the sum of the similarities of the maximal isomorphic subtree pairs.
3. The device according to claim 2, characterized in that:
when the serialization unit converts a DOM tree into a node sequence, it records the node label and the node depth of each node in the sequence;
the extraction unit comprises:
a subtree search subunit, configured to find the subtree with the greatest depth from the first node sequence and record the root node of that subtree; upon receiving an instruction to continue searching, to find from the first node sequence the subtree that has the greatest depth and whose root node has not been recorded, and record the root node of that subtree; and to stop working when no new subtree can be found;
an isomorphic-tree search subunit, configured to judge whether the second node sequence contains a subtree isomorphic to the found subtree; if the two subtrees have the same number of nodes, and the differences obtained by subtracting, from the node depth of each node in the sequence of one subtree, the node depth of the node at the same position in the sequence of the other subtree are all the same, the two subtrees are judged to be isomorphic; if such a subtree exists, to record the two subtrees as a maximal isomorphic subtree pair, output the pair, delete the nodes of the two subtrees from the first and second node sequences or replace them with a special symbol, and then instruct the subtree search subunit to continue searching; if not, to instruct the subtree search subunit to continue searching.
4. The device according to claim 2, characterized in that the measurement unit calculating the similarity $\mathrm{sim}(T_1, T_2)$ of a maximal isomorphic subtree pair means:
the measurement unit calculates the similarity of each pair of corresponding nodes in subtrees $T_1$ and $T_2$, adds up the similarities of all corresponding nodes, and divides the sum by $n$ to obtain $\mathrm{sim}(T_1, T_2)$; where $n$ denotes the node count of subtree $T_1$ or $T_2$; the similarity of a pair of corresponding nodes $v_i$ and $u_i$ in subtrees $T_1$ and $T_2$ is:
$$s(v_i, u_i) = \frac{\text{length\_of\_matched\_charater}}{\max(|v_i|, |u_i|)}$$
where length_of_matched_charater is the length of the matched characters, and $|v_i|$ and $|u_i|$ denote the character lengths of the node labels of $v_i$ and $u_i$ respectively.
5. The device according to claim 4, characterized in that the measurement unit calculating the similarity of the two DOM trees from the sum of the similarities of the maximal isomorphic subtree pairs means:
the measurement unit multiplies the similarity of each maximal isomorphic subtree pair by the node count of one subtree of that pair, and adds up all the products; the similarity of the two DOM trees is the quotient obtained by dividing the sum of the products by the larger of $|T_1|$ and $|T_2|$; $|T_1|$ and $|T_2|$ denote the node counts of the two DOM trees respectively.
6. A method for identifying similar webpages, comprising:
receiving the Document Object Model (DOM) trees of two webpages respectively;
comparing the two received DOM trees to obtain the similarity of the two DOM trees;
if the similarity is greater than or equal to a preset threshold, determining that the two webpages are similar.
7. The method according to claim 6, characterized in that the step of comparing the two received DOM trees to obtain the similarity of the two DOM trees comprises:
A. converting the two received DOM trees into a first node sequence and a second node sequence respectively;
B. finding all maximal isomorphic subtree pairs in the first and second node sequences;
C. calculating, pair by pair, the similarity of each maximal isomorphic subtree pair found, and then calculating the similarity of the two DOM trees from the sum of the similarities of the maximal isomorphic subtree pairs.
8. The method according to claim 7, characterized in that step B comprises:
B1. finding the subtree with the greatest depth from the first node sequence, and recording the root node of that subtree;
B2. judging whether the second node sequence contains a subtree isomorphic to the found subtree; if the two subtrees have the same number of nodes, and the differences obtained by subtracting, from the node depth of each node in the sequence of one subtree, the node depth of the node at the same position in the sequence of the other subtree are all the same, judging the two subtrees to be isomorphic; if such a subtree exists, recording the two subtrees as a maximal isomorphic subtree pair, and deleting the nodes of the two subtrees from the first and second node sequences or replacing them with a special symbol;
B3. finding from the first node sequence the subtree that has the greatest depth and whose root node has not been recorded, recording the root node of that subtree, and returning to step B2; if no new subtree can be found, proceeding to step B4;
B4. outputting the recorded maximal isomorphic subtree pairs.
9. The method according to claim 7, characterized in that, in step C, the step of calculating the similarity of the two DOM trees from the sum of the similarities of the maximal isomorphic subtree pairs comprises:
multiplying the similarity of each maximal isomorphic subtree pair by the node count of one subtree of that pair;
obtaining the sum of all the products;
the similarity of the two DOM trees being the quotient obtained by dividing that sum by the larger of $|T_1|$ and $|T_2|$; $|T_1|$ and $|T_2|$ denote the node counts of the two DOM trees respectively.
10. The method according to claim 7, characterized in that, in step C, the similarity $\mathrm{sim}(T_1, T_2)$ of a maximal isomorphic subtree pair is:
the quotient obtained by adding up the similarities of each pair of corresponding nodes in subtrees $T_1$ and $T_2$ and dividing the sum by $n$;
where $n$ denotes the node count of $T_1$ or $T_2$; the similarity of a pair of corresponding nodes $v_i$ and $u_i$ in subtrees $T_1$ and $T_2$ is:
$$s(v_i, u_i) = \frac{\text{length\_of\_matched\_charater}}{\max(|v_i|, |u_i|)}$$
where length_of_matched_charater is the length of the matched characters, and $|v_i|$ and $|u_i|$ denote the character lengths of the labels of nodes $v_i$ and $u_i$ respectively.
CN2010102222145A 2010-06-30 2010-06-30 Method and device for identifying similar webpage Pending CN102316081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102222145A CN102316081A (en) 2010-06-30 2010-06-30 Method and device for identifying similar webpage

Publications (1)

Publication Number Publication Date
CN102316081A true CN102316081A (en) 2012-01-11

Family

ID=45428905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102222145A Pending CN102316081A (en) 2010-06-30 2010-06-30 Method and device for identifying similar webpage

Country Status (1)

Country Link
CN (1) CN102316081A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080232266A1 (en) * 2007-03-20 2008-09-25 Kouji Jitsui Network monitoring apparatus, network monitoring method and recording medium
CN101309272A (en) * 2008-07-09 2008-11-19 中兴通讯股份有限公司 Authentication server and mobile communication terminal access controlling method of virtual private network
CN101694668A (en) * 2009-09-29 2010-04-14 百度在线网络技术(北京)有限公司 Method and device for confirming web structure similarity

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102624713A (en) * 2012-02-29 2012-08-01 深信服网络科技(深圳)有限公司 Website tampering identification method and website tampering identification device
CN102624713B (en) * 2012-02-29 2016-01-06 深信服网络科技(深圳)有限公司 The method of website tamper Detection and device
CN102662959A (en) * 2012-03-07 2012-09-12 南京邮电大学 Method for detecting phishing web pages with spatial mixed index mechanism
CN102662959B (en) * 2012-03-07 2014-07-16 南京邮电大学 Method for detecting phishing web pages with spatial mixed index mechanism
CN102682098A (en) * 2012-04-27 2012-09-19 北京神州绿盟信息安全科技股份有限公司 Method and device for detecting web page content changes
CN102682098B (en) * 2012-04-27 2014-05-14 北京神州绿盟信息安全科技股份有限公司 Method and device for detecting web page content changes
CN103049562B (en) * 2012-12-31 2016-07-13 华为技术有限公司 A kind of method identifying similar web page and device
CN103049562A (en) * 2012-12-31 2013-04-17 华为技术有限公司 Method and device for recognizing similar webpages
CN104050198B (en) * 2013-03-15 2018-08-24 阿里巴巴集团控股有限公司 A kind of recognition methods of webpage information and device
CN104050198A (en) * 2013-03-15 2014-09-17 阿里巴巴集团控股有限公司 Method and device for identifying webpage information
CN103577526A (en) * 2013-08-01 2014-02-12 星云融创(北京)信息技术有限公司 Method and system as well as browser for verifying page modification
CN103577526B (en) * 2013-08-01 2017-06-06 星云融创(北京)科技有限公司 It is a kind of to verify method, system and browser that whether the page is changed
CN103544213A (en) * 2013-09-16 2014-01-29 青岛英网资讯股份有限公司 Network content upgrading detection assessment method and system
CN103544213B (en) * 2013-09-16 2016-10-12 青岛英网资讯股份有限公司 Web site contents updates method of determination and evaluation and system
CN104852883A (en) * 2014-02-14 2015-08-19 腾讯科技(深圳)有限公司 Method and system for protecting safety of account information
US10484424B2 (en) 2014-02-14 2019-11-19 Tencent Technology (Shenzhen) Company Limited Method and system for security protection of account information
CN104008131A (en) * 2014-04-30 2014-08-27 广州市动景计算机科技有限公司 Processing method and device for web page data
CN106095674A (en) * 2016-06-07 2016-11-09 百度在线网络技术(北京)有限公司 A kind of website automation test method and device
CN106095674B (en) * 2016-06-07 2019-05-24 百度在线网络技术(北京)有限公司 A kind of website automation test method and device
CN107566354A (en) * 2017-08-22 2018-01-09 北京小米移动软件有限公司 Web page contents detection method, device and storage medium
CN108306878A (en) * 2018-01-30 2018-07-20 平安科技(深圳)有限公司 Detection method for phishing site, device, computer equipment and storage medium
WO2019148712A1 (en) * 2018-01-30 2019-08-08 平安科技(深圳)有限公司 Phishing website detection method, device, computer equipment and storage medium
CN108650250A (en) * 2018-04-27 2018-10-12 北京奇安信科技有限公司 Illegal page detection method, system, computer system and readable storage medium storing program for executing
CN109542776A (en) * 2018-11-07 2019-03-29 北京潘达互娱科技有限公司 Page comparison method, device and equipment
CN110049052A (en) * 2019-04-23 2019-07-23 哈尔滨工业大学(威海) The malice domain name detection method of label and attribute similarity based on dom tree
CN110781497A (en) * 2019-10-21 2020-02-11 新华三信息安全技术有限公司 Method for detecting web page link and storage medium
CN113312029A (en) * 2021-06-11 2021-08-27 四川大学 Interface recommendation method and device, electronic equipment and medium
CN113312029B (en) * 2021-06-11 2023-09-08 四川大学 Interface recommendation method and device, electronic equipment and medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120111