CN106599242A - Webpage change monitoring method and system based on similarity calculation - Google Patents

Webpage change monitoring method and system based on similarity calculation

Publication number
CN106599242A
Authority
CN
China
Prior art keywords
web page
page contents
webpage
module
judge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611182671.XA
Other languages
Chinese (zh)
Other versions
CN106599242B (en)
Inventor
刘坤朋
郑杭
练军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FUJIAN LIUREN NETWORK SECURITY Co Ltd
Original Assignee
FUJIAN LIUREN NETWORK SECURITY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FUJIAN LIUREN NETWORK SECURITY Co Ltd
Priority to CN201611182671.XA
Publication of CN106599242A
Application granted
Publication of CN106599242B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/554 Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Virology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage change monitoring method and system based on similarity calculation. The method uses web crawler technology to store webpage content locally, fetches the webpage content again after a set time interval, and uses a fuzzy hash algorithm to compare the similarity of the newly fetched content with the locally stored content. The attribute of each webpage content can be customized: for webpage content that will not change, the monitoring steps are simpler and the monitoring efficiency is high; for webpage content that can change, the method further performs a difference analysis to recognize tampering with characters or images. The method can thus promptly and accurately identify whether webpage content has been tampered with or updated normally, which improves the security of webpage content.

Description

A webpage change monitoring method and system based on similarity calculation
Technical field
The present invention relates to webpage information monitoring technology, and in particular to a webpage change monitoring method and system based on similarity calculation.
Background art
A key part of ensuring that users can browse webpages normally is preventing the pages published by a website from being tampered with by hackers. Tampering, as distinct from a legitimate modification (refresh) of webpage content, means a change to the webpage content that does not match what the website administrator or the user requesting the page expects. With the explosive growth of information on the Internet, webpages face the risk of tampering every day. If tampering is not discovered in time, it can cause immeasurable loss to the website and its users.
The main way hackers tamper with a webpage is to break into the website and directly modify the published webpage content. One prior-art scheme for detecting tampering monitors the website periodically with a scanner: scanner software is installed, the URL (Uniform Resource Locator) of the monitored webpage is accessed periodically, a baseline page is set according to some algorithm, the monitored page is compared with the baseline page to obtain the proportion of changed page elements among all page elements of the webpage, and that proportion is compared with a preset threshold. If the proportion is below the threshold, the monitored webpage is considered untampered; otherwise it is considered tampered. Alternatively, a set of sensitive words is preset, and the page is considered tampered by a hacker when the monitored webpage contains such a word. Because websites today use many dynamic webpage technologies, these existing schemes have difficulty accurately distinguishing tampering from a normal content refresh, so false positives and missed detections are unavoidable.
Summary of the invention
The technical problem to be solved by the present invention is therefore that prior-art real-time webpage monitoring cannot accurately identify whether a webpage has been tampered with or its content has been updated normally.
To solve the above technical problem, the present invention adopts the following technical solution:
A webpage change monitoring method based on similarity calculation, comprising the following steps:
S1: Use a web crawler to store the webpage content from the network in a local storage device, and calculate the fuzzy hash value of the webpage content;
S2: Determine whether the webpage content belongs to the first webpage type or the second webpage type and mark it accordingly, where the first webpage type is a webpage whose content will not change and the second webpage type is a webpage whose content can change;
S3: After a set time interval, crawl the webpage content from the network again and calculate the fuzzy hash value of the webpage content at that moment;
S4: Calculate the similarity between the fuzzy hash value obtained in step S3 and the fuzzy hash value obtained in step S1, where the similarity ranges from 0 to 100;
S5: Determine the webpage type of the webpage content; if it belongs to the first webpage type, go to step S6; if it belongs to the second webpage type, go to step S7;
S6: Determine whether the similarity is 100; if yes, go to step S61; if no, go to step S62;
S61: End the monitoring of the webpage content;
S62: Issue a warning and end the monitoring of the webpage content;
S7: Determine whether the similarity is 100; if yes, end the monitoring of the webpage content; if no, go to step S71;
S71: Use a diff tool to find the differences between the webpage content and its original state;
S72: Determine whether the difference is caused by an image change; if yes, go to step S8; if no, go to step S9;
S8: Match the image content against malicious-content features to detect whether the image contains abnormal content; if yes, go to step S81; if no, go to step S82;
S81: Issue a warning and end the monitoring of the webpage content;
S82: End the monitoring of the webpage content;
S9: Match against a sensitive-word dictionary; if a sensitive word is matched, issue a warning.
Step S9 further comprises matching against a Trojan signature library; if a Trojan signature is matched, a warning is issued.
In step S8, an image recognition program is called to recognize the image content, and the image content is matched against malicious-content features to detect whether the image contains abnormal content; if yes, go to step S81; otherwise go to step S82.
A webpage change monitoring system based on similarity calculation, comprising the following modules:
Initial acquisition module: uses a web crawler to store the webpage content from the network in a local storage device, and calculates the fuzzy hash value of the webpage content;
Judging module: determines whether the webpage content belongs to the first webpage type or the second webpage type and marks it accordingly, where the first webpage type is a webpage whose content will not change and the second webpage type is a webpage whose content can change;
Real-time acquisition module: after a set time interval, crawls the webpage content from the network again and calculates the fuzzy hash value of the webpage content at that moment;
Computing module: calculates the similarity between the fuzzy hash value obtained in the real-time acquisition module and the fuzzy hash value obtained in the initial acquisition module, where the similarity ranges from 0 to 100;
Webpage judging module: determines the webpage type of the webpage content; if it belongs to the first webpage type, control passes to the first judging module; if it belongs to the second webpage type, control passes to the second judging module;
First judging module: determines whether the similarity is 100; if yes, ends the monitoring of the webpage content; if no, control passes to the first warning module;
First warning module: issues a warning and ends the monitoring of the webpage content;
Second judging module: determines whether the similarity is 100; if yes, control passes to the first termination module; if no, control passes to the difference analysis module;
First termination module: ends the monitoring of the webpage content;
Difference analysis module: uses a diff tool to find the differences between the webpage content and its original state;
Third judging module: determines whether the difference is caused by an image change; if yes, control passes to the first matching module; if no, control passes to the second matching module;
First matching module: matches the image content against malicious-content features to detect whether the image contains abnormal content; if yes, control passes to the second warning module; if no, control passes to the second termination module;
Second warning module: issues a warning and ends the monitoring of the webpage content;
Second termination module: ends the monitoring of the webpage content;
Second matching module: matches against a sensitive-word dictionary; if a sensitive word is matched, issues a warning.
The second matching module further matches against a Trojan signature library; if a Trojan signature is matched, a warning is issued.
In the third judging module, an image recognition program is called to recognize the image content and to determine whether the difference is caused by an image change; if yes, control passes to the first matching module; if no, control passes to the second matching module.
Compared with the prior art, the above technical solution of the present invention has the following advantages.
The webpage change monitoring method and system based on similarity calculation of the present invention use web crawler technology to save webpage content locally, fetch the webpage content again after a set time interval, and use a fuzzy hash algorithm to compare its similarity with the locally saved page content. The attribute of each webpage content can be customized: for webpage content that will not change, the monitoring steps are simpler and the monitoring efficiency is high. For webpage content that can change, a difference analysis is further performed to recognize tampering with characters or images, so that it can be promptly and accurately identified whether the webpage content has been tampered with or updated normally, which improves the security of webpage content.
Description of the drawings
To make the content of the present invention easier to understand clearly, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings, in which:
Fig. 1 is a flow chart of the webpage change monitoring method based on similarity calculation of the present invention;
Fig. 2 is a structural block diagram of the webpage change monitoring system based on similarity calculation of the present invention.
The reference numerals in the figures denote: 1: initial acquisition module; 2: judging module; 3: real-time acquisition module; 4: computing module; 5: webpage judging module; 6: first judging module; 61: first warning module; 7: second judging module; 71: first termination module; 72: difference analysis module; 8: third judging module; 81: first matching module; 82: second matching module; 811: second warning module; 812: second termination module.
Detailed description of the embodiments
A webpage change monitoring method based on similarity calculation, as shown in Fig. 1, comprises the following steps:
S1: Use a web crawler to store the webpage content from the network in a local storage device, and calculate the fuzzy hash value of the webpage content. The fuzzy hash value is computed with a fuzzy hash algorithm, for which the ssdeep tool can be called. The fuzzy hash algorithm, also known as context triggered piecewise hashing (CTPH), is mainly used to compare the similarity of files. In 2006, Jesse Kornblum proposed CTPH and gave an example of the algorithm named spamsum. Subsequently, Jason Sherman developed the ssdeep tool (http://ssdeep.sourceforge.net/). In the present invention the algorithm can be used for malicious code detection, and it can also be used for vulnerability mining and the like. The main principle of fuzzy hashing is to use a weak hash to compute over local content of the file and split the file into chunks under a specific condition, then use a strong hash to compute a hash value for each chunk, take part of each of these values and concatenate them, and combine the result with the chunking condition to form the fuzzy hash result. A string similarity comparison algorithm then measures how similar two fuzzy hash values are, and thereby how similar the two files are. When part of a file is changed (including modifications in several places, or additions and deletions of part of the content), fuzzy hashing can still find the similarity to the source file, which makes it one of the better methods currently available for judging similarity.
S2: Determine whether the webpage content belongs to the first webpage type or the second webpage type and mark it accordingly, where the first webpage type is a webpage whose content will not change and the second webpage type is a webpage whose content can change. The classification can be done manually, or the webpage content can be classified with prior-art webpage content recognition and classification techniques (such as those recorded in Chinese patent documents 201210299843.7 and 201210376933.1).
S3: After a set time interval, crawl the webpage content from the network again and calculate the fuzzy hash value of the webpage content at that moment.
The fuzzy hash value in steps S1 and S3 is calculated as follows:
Split the file of the webpage content into chunks with a weak hash algorithm. The specific method is:
Read part of the file content and compute over it with the weak hash algorithm Adler-32 in rolling-hash fashion to obtain a 4-byte hash value. A rolling hash means that, for example, if the hash value h1 of abcdef has already been computed and the hash value of bcdefg is needed next, it does not have to be recomputed from scratch; it is enough to compute h1 - X(a) + Y(g), where X and Y are two functions. In other words, only the effect of the outgoing and incoming bytes on the hash value needs to be adjusted. This kind of hash greatly speeds up the chunk-boundary test.
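The rolling update h1 - X(a) + Y(g) described above can be sketched with an Adler-32-style pair of running sums. This is only an illustration under stated assumptions: the class name RollingSum, the window size, and the use of a modulus of 65536 instead of Adler-32's 65521 are choices made here, and the code is not the exact rolling hash used by ssdeep.

```python
from collections import deque


class RollingSum:
    """Adler-32-style rolling checksum over a fixed-size byte window (illustrative)."""

    def __init__(self, window_size=7):  # window size is an assumption
        self.window_size = window_size
        self.window = deque()
        self.a = 0  # plain sum of the bytes in the window
        self.b = 0  # weighted sum: oldest byte weighs most, newest weighs 1

    def update(self, byte):
        """Slide the window forward by one byte and return a 4-byte hash."""
        if len(self.window) == self.window_size:
            oldest = self.window.popleft()
            # Remove the outgoing byte's contribution (the "X(a)" term).
            self.a -= oldest
            self.b -= self.window_size * oldest
        # Add the incoming byte's contribution (the "Y(g)" term).
        self.window.append(byte)
        self.a += byte
        self.b += self.a
        return ((self.b & 0xFFFF) << 16) | (self.a & 0xFFFF)


def direct_hash(data):
    """Compute the same hash from scratch, for comparison with the rolling form."""
    a = sum(data)
    b = sum((len(data) - i) * byte for i, byte in enumerate(data))
    return ((b & 0xFFFF) << 16) | (a & 0xFFFF)
```

Feeding b"abcdefg" into a RollingSum with a 6-byte window gives, after the sixth and seventh bytes, the same values as direct_hash(b"abcdef") and direct_hash(b"bcdefg"), while each update touches only the outgoing and incoming byte.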
Set the chunking value n, which controls the chunking condition. The value of n is determined by the file size, file content and so on. The principle and method are as follows:
The value of n is always an integer power of 2, so that the remainder of the Adler-32 hash value divided by n is close to uniformly distributed. A chunk boundary is placed only when the remainder equals n-1, which means a boundary occurs in only about 1/n of the cases. That is, each time the window moves over the file there is a probability of about 1/n of splitting. If a pass yields too few chunks, n is reduced so that the probability of splitting increases and the number of chunks grows. If a pass yields too many chunks, n is increased so that the probability of splitting decreases and the number of chunks shrinks. Each adjustment divides or multiplies n by 2, so that the final number of chunks lies between 32 and 64 as far as possible. Because the probability of splitting is almost 1/n, the first n tried on each run of ssdeep is a value close to the file length divided by 64.
When the remainder of the Adler-32 hash value divided by n is exactly n-1, a chunk boundary is placed at the current position; otherwise no boundary is placed, the window rolls forward by one byte, the Adler-32 hash value is computed again and tested, and so on.
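The boundary test and the adjustment of n described above can be sketched as follows. Several points are assumptions made for this illustration: the 7-byte window, the function names, the guard that stops when a value of n repeats, and the use of zlib.adler32 recomputed on each window rather than updated in rolling fashion (simpler to read, slower to run).

```python
import zlib

WINDOW = 7  # assumed window size; the text does not specify one


def initial_block_size(length):
    """Smallest power of two n with n * 64 >= length (about length / 64)."""
    n = 1
    while n * 64 < length:
        n *= 2
    return n


def split_chunks(data, n):
    """Cut data wherever the window hash modulo n equals n - 1."""
    chunks, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        if zlib.adler32(window) % n == n - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def choose_block_size(data, low=32, high=64):
    """Halve or double n until the chunk count falls in [low, high] if possible."""
    n = initial_block_size(len(data))
    tried = set()
    while n not in tried:  # stop if the search starts to oscillate
        tried.add(n)
        count = len(split_chunks(data, n))
        if count < low and n > 1:
            n //= 2   # too few chunks: raise the splitting probability
        elif count > high:
            n *= 2    # too many chunks: lower the splitting probability
        else:
            break
    return n
```

For very small or highly repetitive inputs the target range may be unreachable, in which case this sketch simply returns the last value of n it considered.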
Compute a hash value for each chunk obtained in S101 with a strong hash algorithm. The Fowler-Noll-Vo (FNV) hash algorithm can be used.
Compress the hash values. After a hash value has been computed for each file chunk, the result can optionally be shortened. Specifically, the lowest 6 bits of the hash result are taken and represented by one ASCII character, which serves as the final hash value of that chunk.
Concatenate the hash values. The compressed hash values of all chunks are joined together to give the fuzzy hash value of the file. If the chunking value n differs from file to file, n should also be included in the fuzzy hash value; specifically, n is appended directly to the end of the hash value as part of it.
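The three steps above (strong hash per chunk, compression to 6 bits, concatenation with n) can be sketched as below. The 32-bit FNV-1 variant, the Base64 alphabet used to represent the 6 bits, the ":" separator, and the function names are assumptions made for illustration; the text only requires an FNV hash, one ASCII character per chunk, and n appended at the end.

```python
import string

# 64 printable characters, one per possible 6-bit value (assumed alphabet).
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def fnv1_32(data):
    """32-bit FNV-1 hash: multiply by the FNV prime, then XOR in each byte."""
    h = 0x811C9DC5  # FNV-1 32-bit offset basis
    for byte in data:
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV 32-bit prime
        h ^= byte
    return h


def compress(hash_value):
    """Keep the lowest 6 bits and express them as one ASCII character."""
    return B64[hash_value & 0x3F]


def fuzzy_signature(chunks, n):
    """Join the compressed per-chunk hashes and append the chunking value n."""
    body = "".join(compress(fnv1_32(chunk)) for chunk in chunks)
    return f"{body}:{n}"
```

Because every chunk contributes exactly one character, a local edit to the file changes only the characters of the chunks it touches, which is what allows two signatures to be compared as strings.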
S4: Calculate the similarity between the fuzzy hash value obtained in step S3 and the fuzzy hash value obtained in step S1, where the similarity ranges from 0 to 100. The similarity in step S4 is calculated as follows: the two fuzzy hash values of the webpage content are character strings, denoted s1 and s2. The weighted edit distance from s1 to s2 is used as the basis for evaluating their similarity. The weighted edit distance is obtained by first determining the minimum number of operations (insertions, deletions and modifications) needed to turn s1 into s2, and then assigning a weight to each kind of operation. The weights for insertion, deletion and modification are set to 0.2, 0.3 and 0.5 respectively. Finally the weighted results are summed to give the weighted edit distance.
This distance is divided by the sum of the lengths of s1 and s2 to turn the absolute result into a relative one, which is then mapped to an integer from 0 to 100, where 100 means the two strings are identical and 0 means they are completely dissimilar. The result can be used to judge how similar the two webpage contents are.
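A weighted edit distance with the weights given above can be sketched with a standard dynamic-programming table. Two points are assumptions rather than the patent's stated procedure: the table minimizes the weighted cost directly, and the score is normalized by the cost of deleting all of s1 and inserting all of s2 instead of by the plain sum of lengths, because the text does not give the exact mapping onto 0-100 and this choice lets the score reach 0. Weights are held in tenths to avoid floating-point drift.

```python
# Operation weights from the text, expressed in tenths (0.2, 0.3, 0.5).
INSERT, DELETE, CHANGE = 2, 3, 5


def weighted_distance(s1, s2):
    """Minimum weighted cost (in tenths) of turning s1 into s2."""
    prev = [j * INSERT for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, 1):
        cur = [i * DELETE]
        for j, c2 in enumerate(s2, 1):
            keep_or_change = prev[j - 1] + (0 if c1 == c2 else CHANGE)
            cur.append(min(prev[j] + DELETE, cur[j - 1] + INSERT, keep_or_change))
        prev = cur
    return prev[-1]


def similarity(s1, s2):
    """Map the distance to an integer in 0..100 (100 means identical)."""
    worst = DELETE * len(s1) + INSERT * len(s2)  # delete all of s1, insert all of s2
    if worst == 0:
        return 100  # two empty strings are identical
    return round(100 * (1 - weighted_distance(s1, s2) / worst))
```

With these definitions, identical signatures score 100 and signatures with no characters in common and equal length score 0; a first-type page would trigger a warning on any score below 100.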
S5: Determine the webpage type of the webpage content; if it belongs to the first webpage type, go to step S6; if it belongs to the second webpage type, go to step S7.
S6: Determine whether the similarity is 100; if yes, go to step S61; if no, go to step S62.
S61: End the monitoring of the webpage content.
S62: Issue a warning and end the monitoring of the webpage content.
S7: Determine whether the similarity is 100; if yes, end the monitoring of the webpage content; if no, go to step S71.
S71: Use a diff tool to find the differences between the webpage content and its original state.
S72: Determine whether the difference is caused by an image change; if yes, go to step S8; if no, go to step S9.
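Steps S71 and S72 can be sketched with Python's standard difflib module standing in for the diff tool. The function names and the heuristic for S72 (treating a change as image-related when a changed line contains an img tag) are assumptions for illustration; the text does not say how the image test is made.

```python
import difflib
import re

IMG_TAG = re.compile(r"<img\b", re.IGNORECASE)


def changed_lines(old_html, new_html):
    """Return (removed, added) lines between the stored and the new snapshot."""
    removed, added = [], []
    for line in difflib.ndiff(old_html.splitlines(), new_html.splitlines()):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are ignored.
    return removed, added


def caused_by_image(removed, added):
    """Assumed S72 heuristic: any changed line that carries an <img> tag."""
    return any(IMG_TAG.search(line) for line in removed + added)
```

A page whose only change is a different src attribute on an image would be routed to step S8, while a change confined to text lines would be routed to step S9.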
S8: Match the image content against malicious-content features to detect whether the image contains abnormal content; if yes, go to step S81; if no, go to step S82.
S81: Issue a warning and end the monitoring of the webpage content.
S82: End the monitoring of the webpage content.
S9: Match against a sensitive-word dictionary; if a sensitive word is matched, issue a warning. If the changed part is a character string, it is matched against the preset sensitive-word dictionary using regular expressions, and an alert is raised if a sensitive word is matched.
Step S9 further comprises matching against a Trojan signature library; if a Trojan signature is matched, a warning is issued. This matching against the preset Trojan signature library also uses regular expressions.
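The regular-expression matching in step S9 can be sketched as follows. The dictionary entries and signature patterns shown are hypothetical placeholders invented for this example; a real deployment would load its own sensitive-word dictionary and Trojan signature library.

```python
import re

# Hypothetical sample entries, not taken from the patent.
SENSITIVE_WORDS = ["hacked by", "casino"]
TROJAN_PATTERNS = [
    r"<iframe[^>]*\bwidth=[\"']?0\b",   # hidden zero-width iframe
    r"eval\s*\(\s*unescape\s*\(",       # obfuscated script loader
]

# Dictionary words are literals, so they are escaped before being joined.
WORD_RE = re.compile("|".join(re.escape(w) for w in SENSITIVE_WORDS), re.IGNORECASE)
# Signatures are already regular expressions and are compiled as written.
TROJAN_RES = [re.compile(p, re.IGNORECASE) for p in TROJAN_PATTERNS]


def check_text(changed_text):
    """Return the list of warnings raised by the changed text (empty if none)."""
    warnings = []
    if WORD_RE.search(changed_text):
        warnings.append("sensitive word")
    if any(p.search(changed_text) for p in TROJAN_RES):
        warnings.append("trojan signature")
    return warnings
```

Only the changed part of the page, as found by the diff step, needs to be passed to check_text, which keeps the matching cost proportional to the size of the change.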
In step S8, an image recognition program is called to recognize the image content, and the image content is matched against malicious-content features to detect whether the image contains abnormal content; if yes, go to step S81; otherwise go to step S82.
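The branching of steps S5 to S9 can be summarized in a small decision function. The enum, the function name, the parameter names, and the returned labels are assumptions made for this sketch; the inputs stand for the results of the similarity, diff, image, and text checks performed in the earlier steps.

```python
from enum import Enum


class PageType(Enum):
    STATIC = 1   # first webpage type: content will not change
    DYNAMIC = 2  # second webpage type: content can change


def monitor_decision(page_type, similarity, image_changed=False,
                     image_malicious=False, text_hits=()):
    """Return 'unchanged', 'normal update', or 'warning' for one monitoring pass."""
    if similarity == 100:
        return "unchanged"                      # S6/S7: nothing changed
    if page_type is PageType.STATIC:
        return "warning"                        # S62: a static page changed
    if image_changed:                           # S72: difference is in an image
        return "warning" if image_malicious else "normal update"  # S8
    return "warning" if text_hits else "normal update"            # S9
```

For example, a dynamic page with similarity 87 whose changed text matches no dictionary entry is reported as a normal update, whereas any change at all to a static page is reported as a warning.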
A webpage change monitoring system based on similarity calculation, as shown in Fig. 2, comprises the following modules:
Initial acquisition module 1: uses a web crawler to store the webpage content from the network in a local storage device, and calculates the fuzzy hash value of the webpage content. The fuzzy hash value is computed with a fuzzy hash algorithm, for which the ssdeep tool can be called.
Judging module 2: determines whether the webpage content belongs to the first webpage type or the second webpage type and marks it accordingly, where the first webpage type is a webpage whose content will not change and the second webpage type is a webpage whose content can change. The classification can be done manually, or the webpage content can be classified with prior-art webpage content recognition and classification techniques (such as those recorded in Chinese patent documents 201210299843.7 and 201210376933.1).
Real-time acquisition module 3: after a set time interval, crawls the webpage content from the network again and calculates the fuzzy hash value of the webpage content at that moment.
The fuzzy hash value in initial acquisition module 1 and real-time acquisition module 3 is calculated as follows:
Split the file of the webpage content into chunks with a weak hash algorithm. The specific method is:
Read part of the file content and compute over it with the weak hash algorithm Adler-32 in rolling-hash fashion to obtain a 4-byte hash value. A rolling hash means that, for example, if the hash value h1 of abcdef has already been computed and the hash value of bcdefg is needed next, it does not have to be recomputed from scratch; it is enough to compute h1 - X(a) + Y(g), where X and Y are two functions. In other words, only the effect of the outgoing and incoming bytes on the hash value needs to be adjusted. This kind of hash greatly speeds up the chunk-boundary test.
Set the chunking value n, which controls the chunking condition. The value of n is determined by the file size, file content and so on. The principle and method are as follows:
The value of n is always an integer power of 2, so that the remainder of the Adler-32 hash value divided by n is close to uniformly distributed. A chunk boundary is placed only when the remainder equals n-1, which means a boundary occurs in only about 1/n of the cases. That is, each time the window moves over the file there is a probability of about 1/n of splitting. If a pass yields too few chunks, n is reduced so that the probability of splitting increases and the number of chunks grows. If a pass yields too many chunks, n is increased so that the probability of splitting decreases and the number of chunks shrinks. Each adjustment divides or multiplies n by 2, so that the final number of chunks lies between 32 and 64 as far as possible. Because the probability of splitting is almost 1/n, the first n tried on each run of ssdeep is a value close to the file length divided by 64.
When the remainder of the Adler-32 hash value divided by n is exactly n-1, a chunk boundary is placed at the current position; otherwise no boundary is placed, the window rolls forward by one byte, the Adler-32 hash value is computed again and tested, and so on.
Compute a hash value for each chunk obtained in S101 with a strong hash algorithm. The Fowler-Noll-Vo (FNV) hash algorithm can be used.
Compress the hash values. After a hash value has been computed for each file chunk, the result can optionally be shortened. Specifically, the lowest 6 bits of the hash result are taken and represented by one ASCII character, which serves as the final hash value of that chunk.
Concatenate the hash values. The compressed hash values of all chunks are joined together to give the fuzzy hash value of the file. If the chunking value n differs from file to file, n should also be included in the fuzzy hash value; specifically, n is appended directly to the end of the hash value as part of it.
Computing module 4: calculates the similarity between the fuzzy hash value obtained in the real-time acquisition module and the fuzzy hash value obtained in the initial acquisition module, where the similarity ranges from 0 to 100. The similarity in computing module 4 is calculated as follows: the two fuzzy hash values of the webpage content are character strings, denoted s1 and s2. The weighted edit distance from s1 to s2 is used as the basis for evaluating their similarity. The weighted edit distance is obtained by first determining the minimum number of operations (insertions, deletions and modifications) needed to turn s1 into s2, and then assigning a weight to each kind of operation. The weights for insertion, deletion and modification are set to 0.2, 0.3 and 0.5 respectively. Finally the weighted results are summed to give the weighted edit distance.
Webpage judging module 5: determines the webpage type of the webpage content; if it belongs to the first webpage type, control passes to the first judging module 6; if it belongs to the second webpage type, control passes to the second judging module 7.
First judging module 6: determines whether the similarity is 100; if yes, ends the monitoring of the webpage content; if no, control passes to the first warning module 61.
First warning module 61: issues a warning and ends the monitoring of the webpage content.
Second judging module 7: determines whether the similarity is 100; if yes, control passes to the first termination module 71; if no, control passes to the difference analysis module 72.
First termination module 71: ends the monitoring of the webpage content.
Difference analysis module 72: uses a diff tool to find the differences between the webpage content and its original state.
Third judging module 8: determines whether the difference is caused by an image change; if yes, control passes to the first matching module 81; if no, control passes to the second matching module 82.
First matching module 81: matches the image content against malicious-content features to detect whether the image contains abnormal content; if yes, control passes to the second warning module 811; if no, control passes to the second termination module 812.
Second warning module 811: issues a warning and ends the monitoring of the webpage content.
Second termination module 812: ends the monitoring of the webpage content.
Second matching module 82: matches against a sensitive-word dictionary; if a sensitive word is matched, issues a warning.
The second matching module 82 further matches against a Trojan signature library; if a Trojan signature is matched, a warning is issued.
In the third judging module 8, an image recognition program is called to recognize the image content and to determine whether the difference is caused by an image change; if yes, control passes to the first matching module 81; if no, control passes to the second matching module 82.
The webpage change monitoring method and system based on similarity calculation of the present invention use web crawler technology to save webpage content locally, fetch the webpage content again after a set time interval, and use a fuzzy hash algorithm to compare its similarity with the locally saved page content. The attribute of each webpage content can be customized: for webpage content that will not change, the monitoring steps are simpler and the monitoring efficiency is high. For webpage content that can change, a difference analysis is further performed to recognize tampering with characters or images, so that it can be promptly and accurately identified whether the webpage content has been tampered with or updated normally, which improves the security of webpage content.
Obviously, the above embodiments are merely examples given for the sake of clear description and do not limit the implementations. Those of ordinary skill in the art can make other changes or variations of different forms on the basis of the above description. It is neither necessary nor possible to list all implementations exhaustively here. Obvious changes or variations derived from them remain within the protection scope of the invention.

Claims (6)

1. A webpage change monitoring method based on similarity calculation, characterized by comprising the following steps:
S1: Store webpage content from the network on a local storage device using a web crawler, and calculate the fuzzy hash value of the webpage content;
S2: Judge whether the webpage content belongs to a first webpage type or a second webpage type, and mark it accordingly, the first webpage type being a webpage whose content will not change and the second webpage type being a webpage whose content may change;
S3: After a set time interval, crawl the webpage content from the network again, and calculate the fuzzy hash value of the webpage content at that moment;
S4: Calculate the similarity between the fuzzy hash value obtained in step S3 and the fuzzy hash value obtained in step S1, the similarity ranging from 0 to 100;
S5: Judge the webpage type to which the webpage content belongs; if the webpage content belongs to the first webpage type, proceed to step S6; if the webpage content belongs to the second webpage type, proceed to step S7;
S6: Judge whether the similarity value is 100; if yes, proceed to step S61; if no, proceed to step S62;
S61: End the monitoring of the webpage content;
S62: Issue a warning and end the monitoring of the webpage content;
S7: Judge whether the similarity value is 100; if yes, end the monitoring of the webpage content; if no, proceed to step S71;
S71: Use a DIFF tool to find the differences between the webpage content and its original state;
S72: Judge whether the difference is caused by a picture change; if yes, proceed to step S8; if no, proceed to step S9;
S8: Match the picture content against malicious-content features to detect whether the picture contains abnormal content; if yes, proceed to step S81; if no, proceed to step S82;
S81: Issue a warning and end the monitoring of the webpage content;
S82: End the monitoring of the webpage content;
S9: Match against a sensitive-word dictionary, and issue a warning if a sensitive word is matched.
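The branching in steps S5 to S9 can be sketched as a single decision function. This is a minimal illustration, not the patented implementation: `PageType`, `decide` and the three boolean parameters are hypothetical names, and the outcomes of the DIFF analysis, picture check and dictionary match are passed in as precomputed flags. Step S9 does not say what happens when no sensitive word is matched; the sketch assumes monitoring simply ends without a warning.

```python
from enum import Enum


class PageType(Enum):
    STATIC = 1   # first webpage type: content will not change
    DYNAMIC = 2  # second webpage type: content may change


def decide(page_type: PageType, similarity: int,
           diff_is_picture: bool = False,
           picture_is_abnormal: bool = False,
           text_hits_sensitive_word: bool = False) -> str:
    """Return "ok" or "warn" following the branches of steps S5 to S9."""
    if similarity == 100:
        return "ok"                      # S61, or the yes-branch of S7
    if page_type is PageType.STATIC:
        return "warn"                    # S62: a static page changed
    # Dynamic page that changed: S71/S72 decide which check applies.
    if diff_is_picture:
        return "warn" if picture_is_abnormal else "ok"   # S8 -> S81 / S82
    # S9; ending silently on no match is an assumption of this sketch.
    return "warn" if text_hits_sensitive_word else "ok"
```

For example, `decide(PageType.STATIC, 97)` yields `"warn"`, while `decide(PageType.DYNAMIC, 97)` yields `"ok"` unless one of the content checks reports a hit.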
2. The webpage change monitoring method based on similarity calculation according to claim 1, characterized in that step S9 further comprises matching against a Trojan signature library, and issuing a warning if a Trojan signature is matched.
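The text checks of step S9 and claim 2 can be sketched as two lookups over the changed text. The dictionary entries and signature patterns below are invented placeholders, since the patent does not disclose the contents of its sensitive-word dictionary or Trojan signature library; `match_changed_text` is likewise a hypothetical name.

```python
import re
from typing import List

# Hypothetical entries; a real dictionary and signature library would be
# maintained separately and be far larger.
SENSITIVE_WORDS = {"casino", "lottery"}
TROJAN_PATTERNS = [
    re.compile(r"<iframe[^>]*\bwidth\s*=\s*[\"']?0", re.IGNORECASE),
    re.compile(r"eval\s*\(\s*unescape\s*\(", re.IGNORECASE),
]


def match_changed_text(changed_text: str) -> List[str]:
    """Return one alert string per sensitive word or Trojan signature found."""
    alerts = []
    lowered = changed_text.lower()
    for word in sorted(SENSITIVE_WORDS):
        if word in lowered:
            alerts.append(f"sensitive word: {word}")
    for pattern in TROJAN_PATTERNS:
        if pattern.search(changed_text):
            alerts.append(f"trojan signature: {pattern.pattern}")
    return alerts
```

An empty list means neither check matched; any non-empty result corresponds to the "issue a warning" branch.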
3. The webpage change monitoring method based on similarity calculation according to claim 2, characterized in that, in step S8, a picture recognition program is called to recognize the picture content, and the picture content is matched against malicious-content features to detect whether the picture contains abnormal content; if yes, proceed to step S81; otherwise, proceed to step S82.
4. A webpage change monitoring system based on similarity calculation, characterized by comprising the following modules:
Initial acquisition module: stores webpage content from the network on a local storage device using a web crawler, and calculates the fuzzy hash value of the webpage content;
Judgment module: judges whether the webpage content belongs to a first webpage type or a second webpage type, and marks it accordingly, the first webpage type being a webpage whose content will not change and the second webpage type being a webpage whose content may change;
Real-time acquisition module: after a set time interval, crawls the webpage content from the network again, and calculates the fuzzy hash value of the webpage content at that moment;
Calculation module: calculates the similarity between the fuzzy hash value obtained in the real-time acquisition module and the fuzzy hash value obtained in the initial acquisition module, the similarity ranging from 0 to 100;
Webpage judgment module: judges the webpage type to which the webpage content belongs; if the webpage content belongs to the first webpage type, proceeds to the first judgment module; if the webpage content belongs to the second webpage type, proceeds to the second judgment module;
First judgment module: judges whether the similarity value is 100; if yes, ends the monitoring of the webpage content; if no, proceeds to the first warning module;
First warning module: issues a warning and ends the monitoring of the webpage content;
Second judgment module: judges whether the similarity value is 100; if yes, proceeds to the first termination module; if no, proceeds to the difference analysis module;
First termination module: ends the monitoring of the webpage content;
Difference analysis module: uses a DIFF tool to find the differences between the webpage content and its original state;
Third judgment module: judges whether the difference is caused by a picture change; if yes, proceeds to the first matching module; if no, proceeds to the second matching module;
First matching module: matches the picture content against malicious-content features to detect whether the picture contains abnormal content; if yes, proceeds to the second warning module; if no, proceeds to the second termination module;
Second warning module: issues a warning and ends the monitoring of the webpage content;
Second termination module: ends the monitoring of the webpage content;
Second matching module: matches against a sensitive-word dictionary, and issues a warning if a sensitive word is matched.
5. The webpage change monitoring system based on similarity calculation according to claim 4, characterized in that the second matching module further matches against a Trojan signature library, and issues a warning if a Trojan signature is matched.
6. The webpage change monitoring system based on similarity calculation according to claim 5, characterized in that, in the third judgment module, a picture recognition program is called to recognize the picture content and judge whether the difference is caused by a picture change; if yes, proceed to the first matching module; if no, proceed to the second matching module.
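The difference analysis module and the third judgment module can be sketched together using Python's standard `difflib` as the DIFF tool. This is an assumption-laden illustration: the patent calls a picture recognition program at this point, whereas the sketch substitutes a much simpler heuristic that treats a difference as picture-related when every changed line contains an `<img` tag. The names `changed_lines` and `caused_by_picture_change` are hypothetical.

```python
import difflib
import re
from typing import List

IMG_TAG = re.compile(r"<img\b", re.IGNORECASE)


def changed_lines(old_html: str, new_html: str) -> List[str]:
    """Lines removed from or added to the page relative to the saved copy."""
    diff = difflib.ndiff(old_html.splitlines(), new_html.splitlines())
    # ndiff prefixes: "- " removed, "+ " added, "  " unchanged, "? " hints.
    return [line[2:] for line in diff if line.startswith(("+ ", "- "))]


def caused_by_picture_change(old_html: str, new_html: str) -> bool:
    """Heuristic stand-in for the third judgment module (not image recognition)."""
    lines = changed_lines(old_html, new_html)
    return bool(lines) and all(IMG_TAG.search(line) for line in lines)
```

A `True` result would route the page to the first matching module (picture check) and a `False` result to the second matching module (sensitive-word and Trojan signature check); recognizing what a changed picture actually shows is outside the scope of this sketch.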
CN201611182671.XA 2016-12-20 2016-12-20 Webpage change monitoring method and system based on similarity calculation Active CN106599242B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611182671.XA CN106599242B (en) 2016-12-20 2016-12-20 Webpage change monitoring method and system based on similarity calculation

Publications (2)

Publication Number Publication Date
CN106599242A true CN106599242A (en) 2017-04-26
CN106599242B CN106599242B (en) 2019-03-26

Family

ID=58600081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611182671.XA Active CN106599242B (en) 2016-12-20 2016-12-20 Webpage change monitoring method and system based on similarity calculation

Country Status (1)

Country Link
CN (1) CN106599242B (en)

Also Published As

Publication number Publication date
CN106599242B (en) 2019-03-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant