CN110134901A

CN110134901A - Multi-link webpage tampering determination method based on traffic analysis

Info

Publication number: CN110134901A
Application number: CN201910364169.8A
Authority: CN
Inventors: 杨武
Original assignee: Harbin Talent Information Technology Co Ltd
Current assignee: Harbin Talent Information Technology Co Ltd
Priority date: 2019-04-30
Filing date: 2019-04-30
Publication date: 2019-08-16
Anticipated expiration: 2039-04-30
Also published as: CN110134901B

Abstract

The invention discloses a multi-link webpage tampering determination method based on traffic analysis. The method comprises the following steps. Step 1: configure the website rules. Step 2: capture the webpage at multiple link nodes and compare the historical page with the current page using a similarity comparison algorithm, to conclude whether the webpage has been tampered with. Step 3: collect the conclusions of the multiple link nodes and analyze them together, to determine whether the webpage was tampered with in transit (traffic tampering) or at the source. The invention takes the characteristics of webpage structure fully into account and proposes the concept of layer weight. Combining webpage structure with the characteristics of website updates and of tampering, it proposes the concepts of element classification, element impact factor, and important attribute. Combining webpage content with the characteristics of website updates and of tampering, it proposes an abnormality criterion for attribute sets. The false alarm rate of the method is much lower than that of other webpage tampering determination techniques, and the method can detect whether a webpage has been hijacked in transit.

Description

Multi-link webpage tampering determination method based on traffic analysis

Technical field

The present invention relates to webpage tampering determination technology, and in particular to a multi-link webpage tampering determination method based on traffic analysis.

Background technique

According to where they are deployed, existing webpage tampering determination techniques fall into two broad classes: local determination and remote determination. Current webpage anti-tampering software mostly uses local determination. Among existing products, WebGuard from Tianjin StarNet, InforGuard from the middleware company CVIC SE, iGuard from Shanghai Tiancun, barracuda from Barracuda, and Tripwire webalarm from E-lock all use local determination, while mature software that uses remote determination is still rare.

Among determination techniques based on webpage similarity, the comparison method based on edit distance obtains the similarity of static pages simply and quickly and judges from it whether the page has been tampered with. For dynamic pages the method still yields a similarity value, but that value has no reference value: it cannot reliably distinguish a normal website update from tampering. The comparison method based on webpage structure judges structural changes accurately. Its similarity algorithm targets the structure, that is, it judges from the similarity to the historical page whether the two pages are structurally identical and hence whether the page has been tampered with. Applied to tampering determination, however, this method covers tampering behavior incompletely: if a malicious program alters only the text displayed on the page, the method will not reach the correct conclusion. The comparison method based on semantic analysis suits webpages with a clear topic. Applied to tampering determination, it will not reach the correct conclusion for websites whose topic is not clear enough, such as news or notice sites. Even for a page with a clear topic, the tampering goes undetected if the tampered content stays on the same topic as the original content.

Summary of the invention

In view of the above problems of existing webpage tampering determination methods, the present invention provides a multi-link webpage tampering determination method based on traffic analysis that has a low false alarm rate and can detect whether a webpage has been hijacked in transit.

The object of the present invention is achieved through the following technical solution:

A multi-link webpage tampering determination method based on traffic analysis, comprising the following steps:

Step 1: configure the website rules;

Step 2: capture the webpage at multiple link nodes, and compare the historical page with the current page using a similarity comparison algorithm to conclude whether the webpage has been tampered with;

Step 3: collect the conclusions of the multiple link nodes and analyze them together to determine whether the webpage was tampered with in transit or at the source.

Compared with the prior art, the present invention has the following advantages:

1. The characteristics of webpage structure are fully considered. The structure of a webpage is an inverted tree: the closer an element is to the root node, the greater its influence on the whole structure. If a parent node changes, its child nodes are very likely to change too; in other words, child nodes change along with their parent. Different layers of the webpage tree are therefore given different weights, which leads to the concept of layer weight, i.e. the degree to which a layer influences the whole webpage structure.

2. Combining webpage structure with the characteristics of website updates and of tampering, the invention proposes three concepts: element classification, meaning that certain elements belong to the same class and express the same meaning; the element impact factor, meaning the degree to which a change in an element affects the structure; and the important attribute, meaning that certain attributes of elements are frequently tampered with and are therefore significant for the determination.

3. Combining webpage content with the characteristics of website updates and of tampering, the invention proposes an abnormality criterion for attribute sets, which states which changes to an attribute set belong to a normal website update and which belong to tampering. The resulting false alarm rate is much lower than that of other webpage tampering determination techniques.

Description of the drawings

Fig. 1 shows the webpage tampering determination model.

Fig. 2 is the flow chart of the webpage similarity comparison algorithm.

Fig. 3 is the classification chart of text collection update cases.

Fig. 4 shows the webpage tampering determination model: the network topology based on multi-point similarity comparison.

Specific embodiment

The technical solution of the present invention is further described below with reference to the accompanying drawings, but it is not limited thereto. Any modification or equivalent replacement of the technical solution that does not depart from its spirit and scope shall fall within the protection scope of the present invention.

The present invention provides a multi-link webpage tampering determination method based on traffic analysis. As shown in Fig. 1, the method comprises the following steps:

Step 1: add the target website and configure the website rules.

The webpage is divided into a dynamic area and a fixed area. As the website is updated, the content of the dynamic area keeps changing, while the content of the fixed area remains essentially unchanged. The main purpose of configuring the website rules is to determine these areas of the webpage. There are two ways to configure the website rules:

1) Manual mode: an operator specifies which parts of the target webpage are the fixed area and which parts are the dynamic area. What must be configured: which attribute sets the fixed area has in the webpage's DOM tree, and which attribute sets the dynamic area has in the webpage's DOM tree.

2) Automatic mode: the first M versions of the website's page are crawled (each crawled version differs from the previous one), and each consecutive pair of versions is compared, which yields the dynamic area and the fixed area of the page. If the value of M is too small, the fixed area may be identified wrongly, which affects the final conclusion. What must be configured: the value of M.
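As a non-limiting illustration, the automatic mode can be sketched as follows. The sketch assumes that each crawled version has already been reduced to a mapping from a DOM path to the text found at that path; this representation, the type name `Snapshot`, and the function name `classify_regions` are hypothetical and are not defined by the method itself.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: DOM path -> text content found at that path.
Snapshot = Dict[str, str]


def classify_regions(snapshots: List[Snapshot]) -> Tuple[Set[str], Set[str]]:
    """Split DOM paths into (fixed, dynamic) using M consecutive snapshots.

    A path is treated as dynamic if its content differs between any two
    consecutive snapshots, and as fixed otherwise.
    """
    if len(snapshots) < 2:
        raise ValueError("need at least two snapshots (M >= 2)")
    all_paths: Set[str] = set().union(*(s.keys() for s in snapshots))
    dynamic: Set[str] = set()
    # Compare each snapshot with the one crawled just before it.
    for prev, curr in zip(snapshots, snapshots[1:]):
        for path in all_paths:
            if prev.get(path) != curr.get(path):
                dynamic.add(path)
    return all_paths - dynamic, dynamic


snaps = [
    {"/html/body/h1": "Site", "/html/body/div[1]": "news A"},
    {"/html/body/h1": "Site", "/html/body/div[1]": "news B"},
    {"/html/body/h1": "Site", "/html/body/div[1]": "news C"},
]
fixed, dynamic = classify_regions(snaps)
```

With a small M, a path that changes only rarely would never be observed changing and would be classified as fixed, which matches the caution about the value of M given above.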

Step 2: crawl the configured website and compare pages using the webpage similarity comparison algorithm.

After a webpage is crawled, its relevant data is first extracted. The URL of the page is then looked up to determine whether historical information for the page exists. If it does, the relevant data of the current page is compared with the historical information to obtain a similarity value, which is then compared with the similarity reference value of the page to reach a conclusion. The conclusions are collected so that the next step can analyze them together. If the conclusion is "updated", the historical information is replaced with the current relevant data. The relevant data of a webpage comprises: the virtual DOM, the position information of the fixed area, the position information of the dynamic area, and the similarity reference value. As shown in Fig. 2, the specific steps of the webpage similarity comparison algorithm are as follows:

First step, initialization: initialize two sufficiently large queues q₁, q₂ for traversing the nodes of the virtual DOMs; initialize a map<string, int> map_tag_affectoi to store the elements whose impact factor is 2, i.e. the elements in the set β; initialize a map<string, int> map_tag_classify to store the element classification; initialize a sufficiently large integer array array1, each element of which stores the change ratio of one layer, i.e. the inner summation in formula (4); initialize a sufficiently large two-dimensional integer array arrayb, which records whether the corresponding text collection belongs to the fixed area or the dynamic area; initialize two vector<text> v1, v2 for storing text collections; initialize an integer array array2, each element of which stores the change ratio of one collection, i.e. the inner summation in the formula; initialize a double-precision floating-point number nu to record the accumulated sum over the changed elements in one layer, i.e. the sum of α_{i,j}·X_{i,j} in the formula; initialize a double-precision floating-point number de to record the accumulated sum over all elements in one layer, i.e. the sum of α_{i,j} in the formula. Execute the second step.

Second step: traverse the two virtual DOMs simultaneously, layer by layer, adding a layer-number attribute and a parent-node-number attribute to each node during the traversal. Enqueue the two root nodes (when enqueuing, only elements, text, the src attribute, and the href attribute are considered). Execute the third step.

Third step, dequeue: if one of q₁, q₂ is empty and the other is not, execute the ninth step; if both queues are empty, execute the tenth step; otherwise, dequeue from q₁ and q₂ to obtain the nodes N1 and N2, enqueue the child nodes of N1 and N2 in order into their respective queues, and execute the fourth step.

Fourth step, compare the two nodes N1, N2: compare the historical parent node number with the current parent node number (skip the comparison if there is no historical parent node number). If they differ, execute the fifth step; otherwise, execute the sixth step.

Fifth step, compare the two text collections v1, v2: record the collection change ratio in array array2 and execute the sixth step. The text collection comparison algorithm used in this step is shown in Table 1:

The specific steps of the text collection comparison algorithm are as follows:

Step 1: take the total entry counts textN1 and textN2 of the two text collections. If textN1 == textN2, initialize the updated-entry count textU to textN1 and execute Step 2; otherwise execute Step 10.

Step 2: take the front pointers (forward iterators may be used) and the rear pointers (reverse iterators may be used) of v1 and v2, namely oldhead, newhead, oldrear, newrear. Execute Step 3.

Step 3: if v1 contains only one entry and the length of that entry is not less than static_len (the static webpage text length), execute Step 9; if oldhead < oldrear and newhead < newrear, execute Step 4; otherwise, execute Step 11.

Step 4: compare the entries at oldrear and newrear. If they are identical, mark both entries, decrement oldrear and newrear together, decrement textU, and execute Step 3; otherwise, execute Step 5.

Step 5: if oldhead < oldrear and newhead < newrear, execute Step 6; if oldhead == oldrear and newhead == newrear, execute Step 11.

Step 6: compare the entries at oldhead and newhead. If they are identical, mark both entries, increment oldhead and newhead together, decrement textU, and execute Step 5; otherwise, execute Step 7.

Step 7: if all entries in v1 are marked or the one-by-one comparison has finished, execute Step 11; otherwise, compare the unmarked entries in v1 one by one with the unmarked entries in v2. If two entries are identical, mark both, decrement textU, and execute Step 7; if they differ, execute Step 8.

Step 8: compare the two entries by simple edit distance; if ldmatch(text1, text2) = 1 holds, mark both entries and execute Step 7.

Step 9: use the edit distance algorithm to compute the similarity of the whole webpage; the whole comparison algorithm terminates.

Step 10: determine that the webpage has been tampered with; the text collection comparison algorithm terminates.

Step 11: determine that the webpage has not been tampered with; the text collection comparison algorithm terminates.

Sixth step, record the layer change ratio: store the texts text1, text2 of the nodes in v1, v2 respectively. Compare the historical layer number with the current layer number (skip the comparison if there is no historical layer number). If they differ, record the structural change ratio of the layer, computed from nu and de, in array array1. Execute the seventh step.

Seventh step, compare the important attributes propl1, propl2: if propl1 and propl2 both exist, are both src or both href, and have the same attribute value, or if neither propl1 nor propl2 exists, execute the eighth step; otherwise execute the ninth step.

Eighth step, compare the two elements tag₁, tag₂: if neither tag₁ nor tag₂ is empty and they do not belong to the same element class (according to map_tag_classify), execute the ninth step; otherwise, accumulate nu and de (according to map_tag_affectoi) and execute the third step.

Ninth step: determine that the webpage has been tampered with; the algorithm terminates.

Tenth step: determine that the webpage has not been tampered with; the algorithm terminates.
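As a non-limiting illustration, the queue handling of the third step and the checks of the seventh and eighth steps can be sketched as below. This is a simplified sketch, not the complete algorithm: it omits the text collection comparison and the accumulation of nu and de. The `Node` class, the function name `structure_tampered`, and the `tag_class` mapping (tag name to class name, with unlisted tags forming their own class) are hypothetical stand-ins for the virtual DOM and map_tag_classify.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Optional

IMPORTANT_ATTRS = ("src", "href")  # the important attribute set Y


@dataclass
class Node:
    """Hypothetical virtual-DOM node: optional tag, optional text, attributes."""
    tag: Optional[str] = None
    text: Optional[str] = None
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def structure_tampered(old_root: Node, new_root: Node,
                       tag_class: Dict[str, str]) -> bool:
    """Traverse both trees in lockstep, layer by layer; True means 'tampered'."""
    q1, q2 = deque([old_root]), deque([new_root])
    while True:
        if not q1 and not q2:
            return False  # tenth step: both queues exhausted together
        if not q1 or not q2:
            return True   # ninth step: one tree has nodes the other lacks
        n1, n2 = q1.popleft(), q2.popleft()
        q1.extend(n1.children)
        q2.extend(n2.children)
        # Seventh step: src/href must be absent on both sides or identical.
        for name in IMPORTANT_ATTRS:
            if n1.attrs.get(name) != n2.attrs.get(name):
                return True
        # Eighth step: two non-empty tags must belong to the same class.
        if n1.tag is not None and n2.tag is not None:
            if tag_class.get(n1.tag, n1.tag) != tag_class.get(n2.tag, n2.tag):
                return True


classes = {"b": "emphasis", "strong": "emphasis"}
old = Node("div", children=[Node("a", attrs={"href": "/home"}), Node("b", text="hi")])
updated = Node("div", children=[Node("a", attrs={"href": "/home"}), Node("strong", text="hi")])
hijacked = Node("div", children=[Node("a", attrs={"href": "http://evil.example/"}), Node("b", text="hi")])
```

In this sketch, `updated` differs from `old` only by a tag of the same class and is not flagged, whereas `hijacked` changes an href value and is flagged.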

The algorithm uses the following parameters and related definitions: node, element classification, element variation degree, element impact factor, layer weight, structural similarity, text collection, text collection variation degree, important attribute, content similarity, and similarity reference value, where:

Node: given a tree T of height H, a node N comprises an element tag, text text, and attributes prop. A node N may contain at most one element tag, at most one text text, and any number of attributes prop, but a node N cannot be empty. In the tree structure a node is written N_{i,j}, where i is the layer number, with minimum value 1 and maximum value H, and j is the ordinal position of the node within layer i, with minimum value 1 and maximum value numN_i; numN_i denotes the number of nodes in layer i.

Element classification: the elements commonly found in webpages are classified manually, according to the following rule: elements whose meanings are similar are placed in the same class.

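Element variation degree: given two elements tag₁, tag₂, the variation degree X is defined as follows. If the two elements are identical, the variation degree is 0. If the two elements are different but belong to the same class, the variation degree is 0.5. If one of tag₁ and tag₂ exists and the other does not, the variation degree is 1. If the two elements are different and do not belong to the same class, the webpage is directly determined to be tampered with. The variation degree X may be expressed as:

$$X = \begin{cases} 0, & \text{tag}_1 \text{ and } \text{tag}_2 \text{ are identical} \\ 0.5, & \text{tag}_1 \neq \text{tag}_2 \text{ and both belong to the same class} \\ 1, & \text{exactly one of } \text{tag}_1, \text{tag}_2 \text{ exists} \end{cases}$$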

Element impact factor: given an element tag, if tag belongs to the set β, the impact factor α of the element is 2; otherwise it is 1. The element impact factor α is expressed as follows:

$$\alpha_{i,j} = \begin{cases} 2, & \text{tag}_{i,j} \in \beta \\ 1, & \text{otherwise} \end{cases}$$

where i is the layer number, with minimum value 1 and maximum value H; j is the ordinal position of the node within layer i, with minimum value 1 and maximum value numN_i; and the set β comprises {div, table, form, tr, td} together with the elements that belong to the same class as these elements.
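As a non-limiting illustration, the element variation degree X and the element impact factor α defined above can be sketched as follows. Several points are assumptions of this sketch rather than statements of the method: the `tag_class` mapping (tag name to class name, with unlisted tags forming their own class) stands in for map_tag_classify; the per-layer change ratio is taken to be nu divided by de; and for a pair of tags the impact factor is taken from the historical tag when it exists. The names `Tampered`, `variation_degree`, `impact_factor`, and `layer_change_ratio` are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

BETA_BASE = frozenset({"div", "table", "form", "tr", "td"})


class Tampered(Exception):
    """Raised when the definitions call for a direct 'tampered' verdict."""


def variation_degree(tag1: Optional[str], tag2: Optional[str],
                     tag_class: Dict[str, str]) -> float:
    """Element variation degree X for a (historical, current) tag pair."""
    if tag1 == tag2:
        return 0.0  # identical (this sketch also treats 'both absent' as identical)
    if tag1 is None or tag2 is None:
        return 1.0  # exactly one of the two elements exists
    if tag_class.get(tag1, tag1) == tag_class.get(tag2, tag2):
        return 0.5  # different elements of the same class
    raise Tampered(f"{tag1!r} and {tag2!r} belong to different classes")


def impact_factor(tag: Optional[str], tag_class: Dict[str, str]) -> int:
    """Element impact factor: 2 for elements in the set beta, otherwise 1."""
    beta_classes = {tag_class[t] for t in BETA_BASE if t in tag_class}
    if tag in BETA_BASE or tag_class.get(tag) in beta_classes:
        return 2
    return 1


def layer_change_ratio(pairs: Iterable[Tuple[Optional[str], Optional[str]]],
                       tag_class: Dict[str, str]) -> float:
    """Assumed per-layer ratio nu / de, with nu = sum(alpha*X) and de = sum(alpha)."""
    nu = de = 0.0
    for old_tag, new_tag in pairs:
        # Assumption: weight the pair by the historical tag when it exists.
        alpha = impact_factor(old_tag if old_tag is not None else new_tag, tag_class)
        nu += alpha * variation_degree(old_tag, new_tag, tag_class)
        de += alpha
    return nu / de if de else 0.0


classes = {"b": "emphasis", "strong": "emphasis"}
ratio = layer_change_ratio([("div", "div"), ("b", "strong"), ("p", None)], classes)
```

For the example layer, the impact factors are 2, 1, and 1 and the variation degrees are 0, 0.5, and 1, so nu is 1.5 and de is 4.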

Layer weight: given a tree T of height H, the layer weight W can be expressed as:

where i is the layer number, with minimum value 1 and maximum value H.

Structural similarity: given two trees T₁, T₂, the structural similarity Sss of the two trees can be expressed as:

Text collection: given a text text contained in a leaf node node. If node has sibling nodes and those siblings contain the texts text₁, text₂, ..., text_n, then text, text₁, text₂, ..., text_n form one text collection textS; otherwise text alone forms a text collection textS.

Text collection variation degree: given two text collections text₁, text₂ with entry counts textN₁ and textN₂, let textU be the number of updated entries obtained by comparing the two collections, let ld(text₁, text₂) be the edit distance between two entries, and let len(text) be the length of a text. The condition ldmatch(text₁, text₂) = 1 expresses that part of some entry in the collection has been tampered with, that is: textS₁ contains text₁, textS₂ contains text₂, ld(text₁, text₂) > 1/3·max(len(text₁), len(text₂)), and ld(text₁, text₂) < 1/2·max(len(text₁), len(text₂)). When the text collection is in the fixed area: if textN₁ = textN₂ and textU = 0, the text collection impact factor β = 0; otherwise the webpage is directly determined to be tampered with. When the text collection is in the dynamic area: if textN₁ = textN₂ and textU = 0, then β = 0; if textN₁ ≠ textN₂, the webpage is directly determined to be tampered with; if textN₁ = textN₂, textU ≠ 0, and no pair with ldmatch(text₁, text₂) = 1 exists, then β = 0.5; if textN₁ = textN₂, textU ≠ 0, and ldmatch(text₁, text₂) = 1, then β = 1. The text collection variation degree β can be expressed as:

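Fixed area:

$$\beta_{i,k} = 0, \quad textN_1 = textN_2 \text{ and } textU = 0$$

Dynamic area:

$$\beta_{i,k} = \begin{cases} 0, & textN_1 = textN_2,\ textU = 0 \\ 0.5, & textN_1 = textN_2,\ textU \neq 0,\ \text{no pair with } ldmatch(text_1, text_2) = 1 \\ 1, & textN_1 = textN_2,\ textU \neq 0,\ ldmatch(text_1, text_2) = 1 \end{cases}$$

where i is the layer number, with minimum value 1 and maximum value H; k is the ordinal position of the text collection within layer i, with minimum value 1 and maximum value numT_i; and numT_i denotes the number of text collections in layer i. In every case not listed, the webpage is directly determined to be tampered with.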
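As a non-limiting illustration, the text collection variation degree can be sketched as follows. The sketch assumes that textU is obtained by a simple multiset comparison of the entries rather than by the pointer-based comparison described earlier, and the names `Tampered`, `edit_distance`, `ldmatch`, and `collection_variation` are hypothetical. The edit distance used is the standard Levenshtein distance.

```python
from collections import Counter
from typing import List


class Tampered(Exception):
    """Raised when the definition calls for a direct 'tampered' verdict."""


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def ldmatch(t1: str, t2: str) -> bool:
    """True when the distance lies strictly between 1/3 and 1/2 of the longer length."""
    longest = max(len(t1), len(t2))
    return longest / 3 < edit_distance(t1, t2) < longest / 2


def collection_variation(old: List[str], new: List[str], fixed: bool) -> float:
    """Return beta for one text collection, or raise Tampered."""
    if len(old) != len(new):
        raise Tampered("entry counts differ")
    # Assumption: entries without an identical counterpart count as updated.
    unmatched_old = list((Counter(old) - Counter(new)).elements())
    unmatched_new = list((Counter(new) - Counter(old)).elements())
    if not unmatched_old:  # textU == 0
        return 0.0
    if fixed:
        raise Tampered("text in the fixed area changed")
    partial = any(ldmatch(o, n) for o in unmatched_old for n in unmatched_new)
    return 1.0 if partial else 0.5


beta_partial = collection_variation(["x", "aaaaaaaaaa"], ["x", "aaaaaabbbb"], fixed=False)
beta_replaced = collection_variation(["x", "aaaaaaaaaa"], ["x", "bbbbbbbbbb"], fixed=False)
```

In the first call, four of ten characters of one entry change, which falls between one third and one half of the entry length, so a partial modification is detected; in the second call the entry is replaced entirely.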

Important attribute: certain attributes of elements that play an important role in webpage tampering determination are called important attributes. When the text or picture is unchanged, if an important attribute changes, the webpage is directly determined to be tampered with. The important attribute set is Y = {src, href}.

Content similarity: given two trees T₁, T₂, the content similarity Sc of the two trees can be expressed as:

where i is the layer number, w is the index of a text collection within layer i, textS_{i,w}U denotes the number of changed entries in the text collection, textS_{i,w}N denotes the total number of entries in the text collection, and numS_i denotes the number of text collections in layer i.

Similarity reference value: the formulas for the structural similarity reference value and the content similarity reference value are as follows:

Step 3: analyze the conclusions obtained at the multiple nodes.

The reason for deploying at multiple nodes is as follows. After the similarity comparison algorithm reaches a conclusion, if the whole system has only one node, it is impossible to tell whether the webpage was tampered with at the source or on an intermediate link. With multiple nodes, the conclusions of the nodes can be compared to judge further whether the webpage was tampered with in transit on an intermediate link. Suppose there are n nodes, denoted k₁, k₂, k₃, ..., k_n, and the target website server is denoted s. The specific comparison scheme is:

1) If the conclusions of k₁, k₂, k₃, ..., k_n are all "not tampered", the final conclusion is: no traffic tampering of the webpage has occurred;

2) If some nodes conclude "tampered" (for example k₁ and k₂) and the other nodes conclude "not tampered", the final conclusion is: traffic tampering exists on the links from s to the tampered nodes (k₁ and k₂). As shown in Fig. 4, if nodes k2 and k3 conclude "tampered", the links from s to k2 and k3 have been hijacked.

3) If the conclusions of k₁, k₂, k₃, ..., k_n are all "tampered", the webpage may have been tampered with at the source, or traffic tampering may exist on all the links from s to k₁, k₂, k₃, ..., k_n.
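As a non-limiting illustration, the three cases of the comparison scheme can be sketched as follows. The per-node conclusions are assumed to be available as a mapping from a node name to a boolean that is true when that node concluded "tampered"; the names `Verdict` and `aggregate` are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Verdict(Enum):
    NO_TAMPERING = "no traffic tampering"
    TRAFFIC_TAMPERING = "tampered in transit on some links"
    SOURCE_OR_ALL_LINKS = "tampered at the source, or in transit on every link"


def aggregate(conclusions: Dict[str, bool]) -> Tuple[Verdict, List[str]]:
    """Combine per-node conclusions (True = tampered) into a final verdict.

    Returns the verdict and the nodes whose link to the server s is suspect.
    """
    if not conclusions:
        raise ValueError("at least one node conclusion is required")
    tampered = sorted(name for name, flag in conclusions.items() if flag)
    if not tampered:
        return Verdict.NO_TAMPERING, []           # case 1
    if len(tampered) == len(conclusions):
        return Verdict.SOURCE_OR_ALL_LINKS, tampered  # case 3
    return Verdict.TRAFFIC_TAMPERING, tampered    # case 2


verdict, suspects = aggregate({"k1": False, "k2": True, "k3": True, "k4": False})
```

With a single node that concludes "tampered", the sketch returns the third verdict, which reflects the observation above that one node alone cannot separate source tampering from tampering in transit.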

Claims

1. A multi-link webpage tampering determination method based on traffic analysis, characterized in that the method comprises the following steps:

Step 1: configure the website rules;

Step 2: capture the webpage at multiple link nodes, and compare the historical page with the current page using a similarity comparison algorithm to conclude whether the webpage has been tampered with;

Step 3: collect the conclusions of the multiple link nodes and analyze them together to determine whether the webpage was tampered with in transit or at the source.

2. The multi-link webpage tampering determination method based on traffic analysis according to claim 1, characterized in that in Step 1 there are two ways to configure the website rules:

1) Manual mode: an operator specifies which parts of the target webpage are the fixed area and which parts are the dynamic area; what must be configured is: which attribute sets the fixed area has in the webpage's DOM tree, and which attribute sets the dynamic area has in the webpage's DOM tree;

2) Automatic mode: the first M versions of the website's page are crawled and each consecutive pair of versions is compared, which yields the dynamic area and the fixed area of the page; what must be configured is: the value of M.

3. The multi-link webpage tampering determination method based on traffic analysis according to claim 1, characterized in that the specific steps of Step 2 are as follows:

After a webpage is crawled, its relevant data is first extracted; the URL of the page is then looked up to determine whether historical information for the page exists; if it does, the relevant data of the current page is compared with the historical information to obtain a similarity value, which is compared with the similarity reference value of the page to reach a conclusion; if the conclusion is "updated", the historical information is replaced with the current relevant data.

4. The multi-link webpage tampering determination method based on traffic analysis according to claim 3, characterized in that the relevant data of the webpage comprises: the virtual DOM, the position information of the fixed area, the position information of the dynamic area, and the similarity reference value.

5. The multi-link webpage tampering determination method based on traffic analysis according to claim 1, characterized in that the specific steps of the similarity comparison algorithm are as follows:

First step, initialization: initialize two sufficiently large queues q₁, q₂ for traversing the nodes of the virtual DOMs; initialize a map<string, int> map_tag_affectoi to store the elements whose impact factor is 2; initialize a map<string, int> map_tag_classify to store the element classification; initialize a sufficiently large integer array array1, each element of which stores the change ratio of one layer; initialize a sufficiently large two-dimensional integer array arrayb, which records whether the corresponding text collection belongs to the fixed area or the dynamic area; initialize two vector<text> v1, v2 for storing text collections; initialize an integer array array2, each element of which stores the change ratio of one collection; initialize a double-precision floating-point number nu to record the accumulated sum over the changed elements in one layer; initialize a double-precision floating-point number de to record the accumulated sum over all elements in one layer; execute the second step;

Second step: traverse the two virtual DOMs simultaneously, layer by layer, adding a layer-number attribute and a parent-node-number attribute to each node during the traversal; enqueue the two root nodes; execute the third step;

Third step, dequeue: if one of q₁, q₂ is empty and the other is not, execute the ninth step; if both queues are empty, execute the tenth step; otherwise, dequeue from q₁ and q₂ to obtain the nodes N1 and N2, enqueue the child nodes of N1 and N2 in order into their respective queues, and execute the fourth step;

Fourth step, compare the two nodes N1, N2: compare the historical parent node number with the current parent node number; if they differ, execute the fifth step; otherwise, execute the sixth step;

Fifth step, compare the two text collections v1, v2: record the collection change ratio in array array2 and execute the sixth step;

Sixth step, record the layer change ratio: store the texts text1, text2 of the nodes in v1, v2 respectively; compare the historical layer number with the current layer number; if they differ, record the structural change ratio of the layer, computed from nu and de, in array array1; execute the seventh step;

Seventh step, compare the important attributes propl1, propl2: if propl1 and propl2 both exist, are both src or both href, and have the same attribute value, or if neither propl1 nor propl2 exists, execute the eighth step; otherwise execute the ninth step;

Eighth step, compare the two elements tag₁, tag₂: if neither tag₁ nor tag₂ is empty and they do not belong to the same element class, execute the ninth step; otherwise, accumulate nu and de and execute the third step;

Ninth step: determine that the webpage has been tampered with; the algorithm terminates;

6. The multi-link webpage tampering determination method based on traffic analysis according to claim 1, characterized in that in Step 3 the method for judging whether the webpage was tampered with in transit or at the source is as follows:

Suppose there are n nodes, denoted k₁, k₂, k₃, ..., k_n, and the target website server is denoted s; the specific comparison scheme is:

1) If the conclusions of k₁, k₂, k₃, ..., k_n are all "not tampered", the final conclusion is: no traffic tampering of the webpage has occurred;

2) If some nodes conclude "tampered" and the other nodes conclude "not tampered", the final conclusion is: traffic tampering exists on the links from s to the tampered nodes;