CN106685936A - Webpage defacement detection method and apparatus - Google Patents

Info

Publication number
CN106685936A
Authority
CN
China
Prior art keywords
webpage
text
detected
website
eigenvector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611158763.4A
Other languages
Chinese (zh)
Other versions
CN106685936B (en)
Inventor
王立明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shenxinfu Electronic Technology Co Ltd
Original Assignee
Shenzhen Shenxinfu Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Shenxinfu Electronic Technology Co Ltd filed Critical Shenzhen Shenxinfu Electronic Technology Co Ltd
Priority to CN201611158763.4A priority Critical patent/CN106685936B/en
Publication of CN106685936A publication Critical patent/CN106685936A/en
Application granted granted Critical
Publication of CN106685936B publication Critical patent/CN106685936B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing

Abstract

The invention discloses a webpage defacement detection method. The method includes the following steps: obtaining a text characteristic vector of a webpage to be detected and a text characteristic vector of a website to which the webpage to be detected belongs; calculating text similarity between the webpage to be detected and the website on the basis of the obtained text characteristic vector of the webpage to be detected and the text characteristic vector of the website; determining whether the text similarity is smaller than a preset threshold value; and if the text similarity is smaller than the preset threshold value, determining that the webpage to be detected is a defaced webpage. The invention also discloses a webpage defacement detection apparatus. The webpage defacement detection accuracy and efficiency can be improved.

Description

Method and apparatus for detecting webpage tampering
Technical field
The present invention relates to the technical field of network security, and in particular to a method and apparatus for detecting webpage tampering.
Background
Webpage tampering is a malicious act carried out by an attacker after a website has been compromised. The attacker typically creates new webpages containing malicious content, or replaces part or all of the content of existing webpages with malicious content. Webpage tampering not only disrupts the normal operation of the website but also spreads large amounts of invalid information to the public, causing serious harm. Two methods are currently used to detect webpage tampering:
1) Blacklist keyword detection: a blacklist of keywords associated with malicious content is built, and a webpage is judged to be tampered with if it contains a keyword on the blacklist. This method produces false negatives when the blacklist is not comprehensive enough, and it can also produce false positives. For example, if a government public security department publishes a notice about cracking down on illegal activity, the notice itself contains illegal keywords; if those keywords are on the blacklist, the page is flagged even though it is actually a normal webpage.
2) Webpage digital fingerprint comparison: the digital fingerprint (for example, the MD5 value) of every webpage of a website is computed in advance and stored in a fingerprint database. The fingerprint of each webpage is then recomputed at intervals, and if the earlier and later fingerprints of the same webpage differ, the webpage is judged to have been tampered with. This method requires the fingerprint database to be built before the website is tampered with, and the database must be updated after every legitimate modification or addition of a webpage file, which is cumbersome and inefficient. In addition, such a detection system must be deployed locally on the website server by the site administrator, so it cannot be applied to large-scale detection across the Internet.
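The fingerprint comparison described in method 2) can be sketched as follows. This is only an illustration of the prior-art approach, not code from the patent; the function names `fingerprint` and `changed_pages` and the dictionary-based fingerprint database are assumptions made for the example.

```python
import hashlib


def fingerprint(page_bytes: bytes) -> str:
    """Return the MD5 digest of a page body as a hex string."""
    return hashlib.md5(page_bytes).hexdigest()


def changed_pages(fingerprint_db: dict, current_pages: dict) -> list:
    """List URLs whose current fingerprint differs from the stored one.

    fingerprint_db maps URL -> stored hex digest (hypothetical layout).
    current_pages maps URL -> current page body as bytes.
    Pages missing from the database are reported as well.
    """
    return sorted(
        url
        for url, body in current_pages.items()
        if fingerprint_db.get(url) != fingerprint(body)
    )
```

Any byte-level change alters the digest, so a legitimate edit is reported in the same way as an attack unless the database is refreshed first, which is the maintenance burden noted above.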
Summary of the invention
The main objective of the present invention is to provide a method and apparatus for detecting webpage tampering, with the aim of improving the accuracy and efficiency of webpage tampering detection.
To achieve the above objective, the present invention provides a method for detecting webpage tampering, the method comprising the following steps:
obtaining a text feature vector of a webpage to be detected and a text feature vector of the website to which the webpage to be detected belongs;
calculating the text similarity between the webpage to be detected and the website according to the obtained text feature vector of the webpage to be detected and the text feature vector of the website;
judging whether the text similarity is less than a preset threshold;
if so, determining that the webpage to be detected is a tampered webpage.
Optionally, the step of obtaining the text feature vector of the webpage to be detected and the text feature vector of the website to which the webpage to be detected belongs comprises:
obtaining a text feature set of the webpage to be detected and a text feature set of the website to which the webpage to be detected belongs, wherein the text feature set of the webpage to be detected and the text feature set of the website contain the same keywords;
performing a calculation according to the term frequency and weight of the keywords in the text feature set of the webpage to be detected, to obtain the text feature vector of the webpage to be detected;
performing a calculation according to the term frequency and weight of the keywords in the text feature set of the website, to obtain the text feature vector of the website.
Optionally, the step of obtaining the text feature set of the webpage to be detected and the text feature set of the website to which the webpage to be detected belongs comprises:
obtaining the text of the website to which the webpage to be detected belongs;
performing Chinese word segmentation and stop-word removal on the obtained text;
extracting a number of keywords from the processing result, to obtain the text feature set of the website;
using the text feature set of the website as the text feature set of the webpage to be detected.
Optionally, the step of calculating the text similarity between the webpage to be detected and the website according to the obtained text feature vector of the webpage to be detected and the text feature vector of the website comprises:
calculating the cosine of the angle between the text feature vector of the webpage to be detected and the text feature vector of the website;
using the calculation result as the text similarity between the webpage to be detected and the website.
Optionally, before the step of obtaining the text feature vector of the webpage to be detected and the text feature vector of the website to which the webpage to be detected belongs, the method further comprises:
crawling preset webpages to be detected at scheduled times by means of a crawler program;
or, when a network access request is detected, using the webpage corresponding to the network access request as the webpage to be detected.
In addition, to achieve the above objective, the present invention also provides an apparatus for detecting webpage tampering, the apparatus comprising:
an acquisition module, configured to obtain a text feature vector of a webpage to be detected and a text feature vector of the website to which the webpage to be detected belongs;
a computing module, configured to calculate the text similarity between the webpage to be detected and the website according to the obtained text feature vector of the webpage to be detected and the text feature vector of the website;
a judging module, configured to judge whether the text similarity is less than a preset threshold and, if so, to determine that the webpage to be detected is a tampered webpage.
Optionally, the acquisition module comprises:
an acquiring unit, configured to obtain a text feature set of the webpage to be detected and a text feature set of the website to which the webpage to be detected belongs, wherein the text feature set of the webpage to be detected and the text feature set of the website contain the same keywords;
a first computing unit, configured to perform a calculation according to the term frequency and weight of the keywords in the text feature set of the webpage to be detected, to obtain the text feature vector of the webpage to be detected;
a second computing unit, configured to perform a calculation according to the term frequency and weight of the keywords in the text feature set of the website, to obtain the text feature vector of the website.
Optionally, the acquiring unit is further configured to:
obtain the text of the website to which the webpage to be detected belongs;
perform Chinese word segmentation and stop-word removal on the obtained text;
extract a number of keywords from the processing result, to obtain the text feature set of the website;
use the text feature set of the website as the text feature set of the webpage to be detected.
Optionally, the computing module is further configured to:
calculate the cosine of the angle between the text feature vector of the webpage to be detected and the text feature vector of the website;
use the calculation result as the text similarity between the webpage to be detected and the website.
Optionally, the apparatus further comprises:
a crawling module, configured to crawl preset webpages to be detected at scheduled times by means of a crawler program;
the acquisition module is further configured to, when a network access request is detected, use the webpage corresponding to the network access request as the webpage to be detected.
The present invention obtains the text feature vector of a webpage to be detected and the text feature vector of the website to which the webpage belongs; calculates the text similarity between the webpage and the website from the two vectors; judges whether the text similarity is less than a preset threshold; and, if so, determines that the webpage to be detected is a tampered webpage. Because the present invention detects tampering through text similarity, it needs no blacklist keyword collection and produces fewer false positives and false negatives than existing blacklist keyword detection, which improves the accuracy of webpage tampering detection. Compared with existing webpage digital fingerprint comparison, it needs no local deployment and supports remote, large-scale detection, which improves the efficiency of webpage tampering detection.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a first embodiment of the webpage tampering detection method of the present invention;
Fig. 2 is a schematic diagram of the detailed sub-steps of step S100 in Fig. 1;
Fig. 3 is a schematic diagram of the detailed sub-steps of step S110 in Fig. 2;
Fig. 4 is a schematic flowchart of a second embodiment of the webpage tampering detection method of the present invention;
Fig. 5 is a schematic diagram of the angular relationship between the text feature vector Dk of a webpage and the text feature vector D0 of the website to which the webpage belongs;
Fig. 6 is a schematic flowchart of a third embodiment of the webpage tampering detection method of the present invention;
Fig. 7 is a functional block diagram of a first embodiment of the webpage tampering detection apparatus of the present invention;
Fig. 8 is a detailed functional block diagram of the acquisition module in Fig. 7;
Fig. 9 is a functional block diagram of a second embodiment of the webpage tampering detection apparatus of the present invention.
The realization of the objective, the functional characteristics and the advantages of the present invention are further described below with reference to the drawings and in conjunction with the embodiments.
Detailed description
It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it.
The present invention provides a method for detecting webpage tampering.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a first embodiment of the webpage tampering detection method of the present invention. The method comprises the following steps:
Step S100: obtain the text feature vector of a webpage to be detected and the text feature vector of the website to which the webpage to be detected belongs;
In this embodiment, webpage tampering detection can be performed by an application firewall deployed between the Web browser and the Web server. The application firewall obtains the text feature vector of the webpage to be detected and the text feature vector of the website to which the webpage belongs, and thereby builds a vector space model.
In the vector space model, a text (Document, denoted D) refers to any machine-readable record, and a feature item (Term, denoted t) refers to a basic language unit that occurs in text D and can represent the content of that text; feature items consist mainly of words or phrases. A text can be represented by its set of feature items as D(T1, T2, ..., Tn), where Tk is a feature item and 1 ≤ k ≤ n. For example, a document containing the four feature items a, b, c and d can be represented as D(a, b, c, d).
Further, referring to Fig. 2, Fig. 2 is a schematic diagram of the detailed sub-steps of step S100 in Fig. 1. In one implementation, step S100 may comprise:
Step S110: obtain the text feature set of the webpage to be detected and the text feature set of the website to which the webpage to be detected belongs, wherein the text feature set of the webpage to be detected and the text feature set of the website contain the same keywords;
Step S120: perform a calculation according to the term frequency and weight of the keywords in the text feature set of the webpage to be detected, to obtain the text feature vector of the webpage to be detected;
Step S130: perform a calculation according to the term frequency and weight of the keywords in the text feature set of the website, to obtain the text feature vector of the website.
First, the application firewall obtains the text feature set of the webpage to be detected and the text feature set of the website to which the webpage belongs. To make the two text feature sets comparable, both contain the same keywords. For example, if the text feature set obtained for the website is D(T1, T2, ..., Tm), the text feature set obtained for the webpage to be detected should also be D(T1, T2, ..., Tm), where T1, T2, ..., Tm are the feature items, that is, the keywords, and m is the number of keywords. A network administrator who is familiar with the content of the website can preset the keywords of the text feature set according to the main content of the website being accessed; in most cases, however, the application firewall obtains the keywords automatically by processing the text of the webpages of the website being accessed.
After the keywords have been obtained, the application firewall performs a calculation for each text according to the term frequency and weight of the keywords, to obtain the text feature vector of the webpage to be detected and the text feature vector of the website to which the webpage belongs. This embodiment mainly uses the TF-IDF (term frequency-inverse document frequency) technique to calculate the text feature vectors. The principle is as follows. The term frequency is calculated with the formula TF = N/M: if a keyword occurs N times in a document of M words, TF = N/M is the term frequency of that keyword in the document. The inverse document frequency is an index of keyword weight and is calculated with the formula IDF = log(D/Dw), where D is the total number of documents in the corpus and Dw is the number of documents in which the keyword occurs. The larger Dw is, the more documents contain the keyword, the less the keyword serves as a distinguishing feature item of a given document, and the smaller its weight. The IDF-weighted term frequency is obtained by multiplying the term frequency of keyword Tx by the inverse document frequency of Tx (Wx = TF(Tx) * IDF(Tx)), which yields the text feature vector D(W1, W2, ..., Wm) corresponding to the text feature set D(T1, T2, ..., Tm).
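The TF, IDF and weighted term frequency formulas above can be illustrated with a minimal sketch. It assumes that texts have already been segmented into lists of tokens; the function names, the use of the natural logarithm, and the choice to return 0 when a keyword occurs in no corpus document are assumptions made for the example, since the text does not specify them.

```python
import math


def term_frequency(keyword: str, tokens: list) -> float:
    """TF = N / M: occurrences of the keyword over total words."""
    if not tokens:
        return 0.0
    return tokens.count(keyword) / len(tokens)


def inverse_document_frequency(keyword: str, corpus: list) -> float:
    """IDF = log(D / Dw) over a corpus given as a list of token lists."""
    containing = sum(1 for doc in corpus if keyword in doc)
    if containing == 0:
        # Assumption: an unseen keyword gets weight 0 (avoids division by zero).
        return 0.0
    return math.log(len(corpus) / containing)


def weighted_frequency(keyword: str, tokens: list, corpus: list) -> float:
    """Wx = TF(Tx) * IDF(Tx)."""
    return term_frequency(keyword, tokens) * inverse_document_frequency(keyword, corpus)
```

A keyword that occurs in every corpus document gets IDF = log(1) = 0 under this formula, so it contributes nothing to the vector, which matches the remark that widely used words carry little weight.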
Following this principle, the text feature vector of the webpage to be detected is calculated as follows: obtain the text Dk of the webpage to be detected; calculate the term frequency of each keyword in Dk from the number of times the keyword occurs in Dk and the total number of words in Dk; then weight the calculated term frequencies by IDF, finally obtaining the text feature vector Dk(Wk1, Wk2, ..., Wkm) of the webpage to be detected. In particular, the weighted term frequency Wkx of a keyword Tx that does not occur in the webpage to be detected is 0.
Following the same principle, the text feature vector of the whole website is calculated as follows: merge the text of all webpages of the website to obtain the total text D0; calculate the term frequency of each keyword in D0 from the number of times the keyword occurs in D0 and the total number of words in D0; then weight the calculated term frequencies by IDF, finally obtaining the text feature vector D0(W01, W02, ..., W0m) of the whole website.
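The two procedures above differ only in the text they are applied to, which the following sketch tries to make explicit. It assumes tokenized input and a precomputed IDF table; `feature_vector`, `site_feature_vector` and the `idf` dictionary are hypothetical names introduced for illustration.

```python
def feature_vector(tokens: list, keywords: list, idf: dict) -> list:
    """Build (W1, ..., Wm) for one text over a shared keyword list.

    A keyword that does not occur in the text gets weight 0.
    """
    total = len(tokens)
    if total == 0:
        return [0.0] * len(keywords)
    return [tokens.count(k) / total * idf[k] for k in keywords]


def site_feature_vector(pages: list, keywords: list, idf: dict) -> list:
    """Merge the tokens of all pages into D0 and build its vector."""
    merged = [token for page in pages for token in page]
    return feature_vector(merged, keywords, idf)
```

Because both vectors are built over the same ordered keyword list, component i of the page vector and component i of the site vector always refer to the same keyword, which is what makes them comparable.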
Step S200: calculate the text similarity between the webpage to be detected and the website according to the obtained text feature vector of the webpage to be detected and the text feature vector of the website;
It should be noted that a tampered webpage may be plainly visible when accessed through a browser, or it may take the form of hidden links ("dark links") that are hard to notice. Tampered webpages usually account for only a small fraction of the webpages of the whole website, and their content differs considerably from that of the website as a whole. Because the similarity between texts is generally highly correlated with their content, text similarity can be compared using the vector space model described above.
Specifically, after obtaining the text feature vector of the webpage to be detected and the text feature vector of the website, the application firewall calculates the text similarity between the webpage and the website from the relationship between the two feature vectors, for example the distance or the angle between them, and uses the result as the text similarity between the webpage to be detected and the website.
Step S300: judge whether the text similarity is less than a preset threshold;
Step S400: if the text similarity is less than the preset threshold, determine that the webpage to be detected is a tampered webpage.
The application firewall judges whether the calculated text similarity is less than the preset threshold. The preset text similarity threshold can be obtained through self-learning classification over a large number of webpages from websites on which tampering has occurred, or the network administrator can set it flexibly as needed. If the text similarity is less than the preset threshold, the application firewall determines that the detected webpage has been tampered with; it can then report the detection result and block users from accessing the webpage. Otherwise, it determines that the detected webpage is a normal webpage.
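Steps S300 and S400 amount to a single comparison followed by an action. The sketch below is illustrative only: the default threshold of 0.2 is an arbitrary placeholder rather than a value from the patent, and the result dictionary is a hypothetical way of representing "report and block".

```python
def is_tampered(similarity: float, threshold: float) -> bool:
    """A page is judged tampered when similarity is strictly below the threshold."""
    return similarity < threshold


def handle_page(url: str, similarity: float, threshold: float = 0.2) -> dict:
    """Return a verdict and the action a firewall might take (hypothetical format)."""
    if is_tampered(similarity, threshold):
        return {"url": url, "verdict": "tampered", "action": "block"}
    return {"url": url, "verdict": "normal", "action": "allow"}
```

In practice the threshold would come from the self-learning step or from the administrator, as described above; a similarity exactly equal to the threshold is treated as normal because the text specifies "less than".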
In this embodiment, the application firewall obtains the text feature vector of the webpage to be detected and the text feature vector of the website to which the webpage belongs; calculates the text similarity between the webpage and the website from the two vectors; judges whether the text similarity is less than a preset threshold; and, if so, determines that the webpage to be detected is a tampered webpage. Because this embodiment detects tampering through text similarity, it needs no blacklist keyword collection and produces fewer false positives and false negatives than existing blacklist keyword detection, which improves the accuracy of webpage tampering detection. Compared with existing webpage digital fingerprint comparison, it needs no local deployment and supports remote, large-scale detection, which improves the efficiency of webpage tampering detection.
Further, referring to Fig. 3, Fig. 3 is a schematic diagram of the detailed sub-steps of step S110 in Fig. 2. Based on the above embodiment, step S110 may comprise:
Step S111: obtain the text of the website to which the webpage to be detected belongs;
Step S112: perform Chinese word segmentation and stop-word removal on the obtained text;
Step S113: extract a number of keywords from the processing result, to obtain the text feature set of the website;
Step S114: use the text feature set of the website as the text feature set of the webpage to be detected.
In this embodiment, to make keyword extraction more accurate, the application firewall first preprocesses all webpages of the website: it removes all code, including HTML (HyperText Markup Language) code, and keeps only the textual content of each webpage, forming the texts D1, D2, ..., Dn (where n is the number of webpages). These texts are merged to obtain the text D0 of the whole website. Chinese word segmentation and stop-word removal are then applied to D0. Chinese word segmentation cuts a sequence of Chinese characters into individual words. Stop-word removal uses a stop-word list to remove from the corpus the words, symbols, punctuation and garbled characters that occur very frequently but contribute little to identifying the content of a text. Words such as "and", "is" and "this" occur in almost every Chinese text but contribute almost nothing to the meaning the text expresses; once such words are placed in the stop-word list, they can be removed from the text according to that list. This yields the preprocessing result for the text D0 of the whole website.
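The code-stripping and stop-word steps can be sketched with the Python standard library. This is an assumption-laden illustration: the class and function names are invented here, `script` and `style` contents are dropped as one plausible reading of "all code", and Chinese word segmentation is not shown because it requires a dedicated segmentation tool that the text does not name.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def page_text(html: str) -> str:
    """Return only the textual content of an HTML page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def remove_stop_words(tokens: list, stop_words: set) -> list:
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]
```

The texts D1, ..., Dn would be produced by calling `page_text` on each page, and D0 by concatenating them before segmentation and stop-word removal.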
The application firewall can then calculate the term frequency of each word in the preprocessing result. If the term frequency of a word reaches a preset value, that word is taken as a keyword of text D0. All keywords of D0 are extracted in this way, yielding the text feature set D(T1, T2, ..., Tm) of the website, which is also used as the text feature set of the webpage to be detected.
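A minimal sketch of this frequency-based keyword selection follows. It assumes the preprocessed site text is available as a token list; the name `extract_keywords`, the `min_frequency` parameter and the alphabetical ordering of the result are choices made for the example, not details given in the text.

```python
from collections import Counter


def extract_keywords(tokens: list, min_frequency: float) -> list:
    """Return the words whose term frequency (count / total) reaches min_frequency."""
    if not tokens:
        return []
    counts = Counter(tokens)
    total = len(tokens)
    # Sorted so that the keyword order, and hence the vector layout, is stable.
    return sorted(word for word, count in counts.items() if count / total >= min_frequency)
```

The returned list plays the role of D(T1, T2, ..., Tm) and would be reused unchanged for every page of the same website.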
Further, referring to Fig. 4, Fig. 4 is a schematic flowchart of a second embodiment of the webpage tampering detection method of the present invention. Based on the embodiment shown in Fig. 1, step S200 may comprise:
Step S210: calculate the cosine of the angle between the text feature vector of the webpage to be detected and the text feature vector of the website;
Step S220: use the calculation result as the text similarity between the webpage to be detected and the website.
In this embodiment, the application firewall calculates the cosine of the angle between the text feature vector of the webpage to be detected and the text feature vector of the website. If the text feature vector of the website is D0(W01, W02, ..., W0m) and the text feature vector of the webpage is Dk(Wk1, Wk2, ..., Wkm), where k denotes the k-th webpage, the cosine of the angle θ between vector D0 and vector Dk is calculated as:
$$\cos\theta = \frac{\sum_{i=1}^{m} W_{0i}\, W_{ki}}{\sqrt{\sum_{i=1}^{m} W_{0i}^{2}}\;\sqrt{\sum_{i=1}^{m} W_{ki}^{2}}}$$
This cosine value is used as the text similarity between the webpage to be detected and the whole website. The larger the value, the smaller the angle between vector D0 and vector Dk, and the higher the text similarity between the webpage to be detected and the website; the smaller the value, the larger the angle between vector D0 and vector Dk, and the lower the text similarity. Fig. 5 is a schematic diagram of the angular relationship between the text feature vector Dk of a webpage and the text feature vector D0 of the website to which the webpage belongs.
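The cosine of the angle between two feature vectors is their dot product divided by the product of their lengths. A minimal sketch follows; the function name is illustrative, and returning 0 for a zero-length vector is an assumption (a page containing none of the site keywords has an all-zero vector, for which the cosine is otherwise undefined).

```python
import math


def cosine_similarity(d0: list, dk: list) -> float:
    """Cosine of the angle between site vector d0 and page vector dk."""
    if len(d0) != len(dk):
        raise ValueError("vectors must be built over the same keyword list")
    dot = sum(a * b for a, b in zip(d0, dk))
    norm0 = math.sqrt(sum(a * a for a in d0))
    normk = math.sqrt(sum(b * b for b in dk))
    if norm0 == 0.0 or normk == 0.0:
        # Assumption: treat an all-zero vector as sharing nothing with the site.
        return 0.0
    return dot / (norm0 * normk)
```

Since TF-IDF weights are non-negative, the result lies between 0 and 1, and it depends only on the direction of the vectors, so a short page and a long page with the same keyword proportions score the same.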
By calculating the cosine of the angle between the text feature vector of a webpage and the text feature vector of the website to which the webpage belongs, this embodiment enables quantitative analysis of the text similarity between the webpage to be detected and the whole website, which makes the analysis more sound and reliable.
Further, referring to Fig. 6, Fig. 6 is a schematic flowchart of a third embodiment of the webpage tampering detection method of the present invention. Based on the above embodiments, the method may further comprise, before step S100:
Step S500: crawl preset webpages to be detected at scheduled times by means of a crawler program;
or Step S600: when a network access request is detected, use the webpage corresponding to the network access request as the webpage to be detected.
In this embodiment, the application firewall can perform active detection of webpage tampering. Specifically, a crawler program can be set up in the application firewall. According to its configured crawl targets, the crawler periodically visits webpages and related links on the World Wide Web and downloads their content. The crawl targets can be webpages related to a particular topic, or the crawl scope can be widened as needed; in a specific implementation, the targets can be configured in advance by the network administrator. The application firewall then treats the crawled webpages as webpages to be detected and judges, one by one, whether they have been tampered with.
In addition, the application firewall can perform passive detection of webpage tampering. Specifically, when the application firewall detects a network access request, it uses the webpage corresponding to that request as the webpage to be detected. In this way, when the traffic of a user visiting the website passes through the application firewall, the firewall can detect in real time whether the webpage the user is currently accessing has been tampered with. In further embodiments, to improve the efficiency of passive detection, passive detection can also rely on the results of active detection: during active detection, the application firewall stores information such as the website's text feature set and text feature vector in a preset text feature database. When a user accesses the Web server, the HTTP (HyperText Transfer Protocol) traffic passes through the application firewall, which records the URL (Uniform Resource Locator) and the corresponding HTTP response content, obtains the text feature vector of the webpage corresponding to the HTTP response content, and compares it for text similarity with the text feature vector of the corresponding website in the text feature database, to judge whether the webpage has been tampered with.
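The passive path that reuses stored site features can be sketched as a lookup followed by a comparison. Everything structural here is hypothetical: the `SiteProfile` record, keying the store by host name, and the `check_response` function are illustrative stand-ins for the "text feature database", and the page text is assumed to be already extracted and tokenized.

```python
import math
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class SiteProfile:
    """Features saved during active detection (hypothetical record layout)."""
    keywords: list
    idf: dict
    vector: list


def _cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def check_response(url: str, page_tokens: list, store: dict, threshold: float) -> Optional[bool]:
    """Return True if the page looks tampered, False if normal, None if the site is unknown."""
    profile = store.get(urlsplit(url).hostname)
    if profile is None:
        return None  # no stored features for this site yet
    total = len(page_tokens)
    page_vector = [
        (page_tokens.count(k) / total * profile.idf[k]) if total else 0.0
        for k in profile.keywords
    ]
    return _cosine(profile.vector, page_vector) < threshold
```

Only the page vector is computed per request; the site keywords, IDF values and site vector come from the store, which is where the efficiency gain described above would come from.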
In this embodiment, a crawler program crawls the configured webpages at scheduled times to perform active detection of webpage tampering; this requires no manual intervention and supports remote, large-scale detection, which improves the efficiency of webpage tampering detection. By using the webpage a user is currently accessing as the webpage to be detected, the embodiment also achieves real-time webpage tampering detection.
The present invention also provides a kind of detection means of webpage tamper.
With reference to Fig. 7, Fig. 7 is the high-level schematic functional block diagram of the detection means first embodiment of webpage tamper of the present invention.It is described Device includes:
Acquisition module 10, for obtaining the Text eigenvector and the webpage affiliated web site to be detected of webpage to be detected Text eigenvector;
In the present embodiment, webpage can be carried out by the application firewall being arranged between Web browser and Web server to usurp Change detection.Acquisition module 10 obtains the Text eigenvector of webpage to be detected and the text of the webpage affiliated web site to be detected is special Vector is levied, so as to set up vector space model.
In vector space model, text (Document is represented with D) refers to various machine-readable records, characteristic item (Term is represented with t) refers to the basic language unit that occur in text D and can represent text content, mainly by word Or phrase is constituted.Text can be D (T1, T2 ..., Tn) with characteristic item set representations, and wherein Tk is characteristic item, 1<=k<=n, For example there are tetra- characteristic items of a, b, c, d in one document, then this document can just be expressed as D (a, b, c, d).
Referring to Fig. 8, which is a detailed functional block diagram of the acquisition module in Fig. 7, in one implementation the acquisition module 10 may include:
An acquiring unit 11, configured to obtain the text feature set of the webpage to be detected and the text feature set of the website to which it belongs, where the two text feature sets contain the same keywords;
A first computing unit 12, configured to compute the text feature vector of the webpage to be detected from the term frequency and weight of each keyword in the text feature set of the webpage to be detected;
A second computing unit 13, configured to compute the text feature vector of the website from the term frequency and weight of each keyword in the text feature set of the website.
First, the acquiring unit 11 obtains the text feature set of the webpage to be detected and the text feature set of the website to which it belongs. To make the two sets comparable, they contain the same keywords. For example, if the text feature set obtained for the website is D(T1, T2, ..., Tm), the text feature set of the webpage to be detected should also be D(T1, T2, ..., Tm), where T1, T2, ..., Tm are the feature items, that is, the keywords, and m is the number of keywords. A network administrator who is familiar with the site content may preset the keywords of the text feature set according to the main content of the visited website; in most cases, however, the application firewall obtains the keywords automatically by processing the web page text of the visited website.
After the keywords are obtained, the first computing unit 12 and the second computing unit 13 compute, from the term frequency and weight of the keywords, the text feature vector of the webpage to be detected and the text feature vector of the website, respectively. This embodiment computes the text feature vectors mainly with the TF-IDF (term frequency-inverse document frequency) technique, which works as follows. Term frequency is given by TF = N/M: if a keyword occurs N times in a text of M words, TF = N/M is the term frequency of that keyword in the text. Inverse document frequency measures the weight of a keyword and is given by IDF = log(D/Dw), where D is the total number of documents in the corpus and Dw is the number of documents in which the keyword occurs. The larger Dw is, the more documents the keyword appears in, the less it serves as a distinguishing feature of this document, and the smaller its weight. The IDF-weighted term frequency is obtained by multiplying the term frequency of keyword Tx by its inverse document frequency, Wx = TF(Tx) * IDF(Tx), which yields the text feature vector D(W1, W2, ..., Wm) corresponding to the text feature set D(T1, T2, ..., Tm).
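The TF and IDF formulas above can be illustrated with a minimal Python sketch. The function names, the toy corpus, and the choice to return 0 for a keyword that appears in no document are assumptions made for illustration only; the embodiment does not specify them.

```python
import math

def term_frequency(keyword, words):
    """TF = N / M: occurrences of the keyword divided by the total word count."""
    if not words:
        return 0.0
    return words.count(keyword) / len(words)

def inverse_document_frequency(keyword, corpus):
    """IDF = log(D / Dw), where corpus is a list of token lists.

    Returning 0.0 when no document contains the keyword is an assumption
    of this sketch, made to avoid dividing by zero.
    """
    containing = sum(1 for doc in corpus if keyword in doc)
    if containing == 0:
        return 0.0
    return math.log(len(corpus) / containing)

def tfidf_vector(keywords, words, corpus):
    """Weighted term frequencies (W1, ..., Wm) for a fixed keyword list."""
    return [term_frequency(k, words) * inverse_document_frequency(k, corpus)
            for k in keywords]

# Toy corpus of three already-tokenized documents (illustrative data).
corpus = [["firewall", "web", "security"],
          ["web", "page", "news"],
          ["firewall", "rule", "web"]]
keywords = ["firewall", "web"]
vector = tfidf_vector(keywords, corpus[0], corpus)
```

Here "web" occurs in every document, so its IDF is log(3/3) = 0 and its weight vanishes, which matches the observation that a keyword found in many documents is a poor distinguishing feature.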
Following this principle, the first computing unit 12 computes the text feature vector of the webpage to be detected as follows: obtain the text Dk of the webpage to be detected; compute the term frequency of each keyword in Dk from the number of times the keyword occurs in Dk and the total number of words in Dk; then weight the computed term frequencies by IDF to obtain the text feature vector Dk(Wk1, Wk2, ..., Wkm) of the webpage to be detected. In particular, the weighted term frequency Wkx of a keyword Tx that does not occur in the webpage to be detected is 0.
Following the same principle, the second computing unit 13 computes the text feature vector of the whole website as follows: merge the texts of all webpages of the website to obtain the total text D0; compute the term frequency of each keyword in D0 from the number of times the keyword occurs in D0 and the total number of words in D0; then weight the computed term frequencies by IDF to obtain the text feature vector D0(W01, W02, ..., W0m) of the whole website.
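The two procedures above, one for the page text Dk and one for the merged site text D0, can be sketched as follows. The helper names, the sample pages, and the IDF weights passed in as a plain dictionary are hypothetical; how the IDF corpus is chosen is not specified in the embodiment.

```python
from collections import Counter

def weighted_vector(keywords, words, idf):
    """Return [TF(Tx) * IDF(Tx)] over the shared keyword list."""
    total = len(words)
    if total == 0:
        return [0.0] * len(keywords)
    counts = Counter(words)
    # A keyword absent from the text has count 0, so its weight is 0.
    return [counts[k] / total * idf.get(k, 0.0) for k in keywords]

def site_and_page_vectors(keywords, pages, k, idf):
    """pages holds the token lists D1..Dn; k is the index of the page to check."""
    merged = [w for page in pages for w in page]  # total text D0
    return (weighted_vector(keywords, merged, idf),
            weighted_vector(keywords, pages[k], idf))

keywords = ["firewall", "security", "vpn"]
idf = {"firewall": 1.0, "security": 0.5, "vpn": 2.0}  # illustrative weights
pages = [["firewall", "security", "firewall", "news"],
         ["casino", "bonus", "security", "win"]]
d0, dk = site_and_page_vectors(keywords, pages, 1, idf)
```

Both vectors are built over the same keyword list, which is what makes them directly comparable in the similarity step.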
A computing module 20, configured to calculate the text similarity between the webpage to be detected and the website from the obtained text feature vector of the webpage to be detected and the text feature vector of the website;
It should be noted that tampering may be plainly visible when the page is viewed in a browser, or it may take the form of a hidden link (dark chain) that is not readily noticeable. Tampered pages usually make up only a small fraction of the pages of a website, and their content differs considerably from that of the website as a whole. Because the similarity between texts is generally highly correlated with their content, text similarity can be compared using the vector space model described above.
Specifically, after the acquisition module 10 obtains the text feature vector of the webpage to be detected and the text feature vector of the website, the computing module 20 calculates the text similarity between the webpage and the website from the relationship between the two feature vectors, for example the distance or the angle between them, and takes the result as the text similarity between the webpage to be detected and the website.
A judging module 30, configured to judge whether the text similarity is less than a preset threshold and, if so, to determine that the webpage to be detected is a tampered webpage.
The judging module 30 judges whether the calculated text similarity is less than the preset threshold. The preset text similarity threshold may be obtained by self-learning classification over a large number of webpages from websites on which tampering has occurred, or the network administrator may set it flexibly according to actual needs. If the text similarity is less than the preset threshold, the judging module 30 determines that the detected webpage has been tampered with; the application firewall can then report the detection result and block users from accessing the webpage. Otherwise, the detected webpage is determined to be a normal webpage.
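The threshold decision can be sketched in a few lines of Python. The default threshold of 0.2 and the midpoint rule used to derive a threshold from labelled samples are assumptions for illustration; the embodiment only states that the threshold may be learned from tampered pages or set by the administrator, and does not name a specific learning method.

```python
def is_tampered(similarity, threshold=0.2):
    """A page is judged tampered when its similarity is below the threshold."""
    return similarity < threshold

def threshold_from_samples(tampered_scores, normal_scores):
    """Hypothetical stand-in for the self-learning step: take the midpoint
    between the highest tampered score and the lowest normal score."""
    return (max(tampered_scores) + min(normal_scores)) / 2

threshold = threshold_from_samples([0.05, 0.10], [0.50, 0.70])
verdict = is_tampered(0.08, threshold)
```

The midpoint rule only makes sense when the two groups of scores do not overlap; a real deployment would likely need a more robust classifier.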
In this embodiment, the acquisition module 10 obtains the text feature vector of the webpage to be detected and the text feature vector of the website to which it belongs; the computing module 20 calculates the text similarity between the webpage to be detected and the website from these two vectors; and the judging module 30 judges whether the text similarity is less than the preset threshold and, if so, determines that the webpage to be detected is a tampered webpage. Because this embodiment detects tampering through text similarity, it does not require collecting blacklist keywords, unlike existing blacklist keyword detection, and it produces fewer false positives and false negatives, which improves the accuracy of webpage tampering detection. Unlike existing webpage digital fingerprint comparison, it requires no local deployment and can perform remote, large-scale detection, which improves the efficiency of webpage tampering detection.
Further, still referring to Fig. 8, the acquiring unit 11 is further configured to: obtain the text of the website to which the webpage to be detected belongs; perform Chinese word segmentation and stop-word removal on the obtained text; extract a number of keywords from the processing result to obtain the text feature set of the website; and use the text feature set of the website as the text feature set of the webpage to be detected.
In this embodiment, to make keyword extraction more accurate, the acquiring unit 11 first preprocesses all webpages of the website, removing all code, including HTML (HyperText Markup Language) code, and keeping only the textual content of each page, to form the texts D1, D2, ..., Dn (where n is the number of webpages). These texts are merged to obtain the text D0 of the whole website. D0 then undergoes Chinese word segmentation and stop-word removal. Chinese word segmentation cuts a sequence of Chinese characters into individual words. Stop-word removal uses a stop-word list to remove from the corpus the words, symbols, punctuation and garbled characters that contribute little to identifying the content of the text but occur very frequently. Words such as "and", "is" and "this" occur in almost any Chinese text but contribute almost nothing to its meaning; once such words are placed in the stop-word list, they can be removed from the text according to that list. This yields the preprocessed result of the whole-website text D0.
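The preprocessing described above can be sketched with the Python standard library. This is a simplified illustration under stated assumptions: the class and function names are hypothetical, whitespace splitting stands in for Chinese word segmentation (which would need a dedicated segmenter), and the stop-word list is a toy example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    """Strip markup and code, keeping only the textual content of a page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

html = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Firewall news</p><script>var a=1;</script></body></html>")
text = page_text(html)
# Whitespace splitting is only a placeholder for Chinese word segmentation.
tokens = remove_stop_words(text.lower().split(), {"the", "and", "is"})
```

Running the extractor over every page of a site and concatenating the resulting token lists would give the merged text D0 used in the later steps.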
The acquiring unit 11 may calculate the term frequency of each word in the preprocessed result and, if the term frequency of a word reaches a preset value, take that word as a keyword of text D0. All keywords of D0 are extracted in this way, giving the text feature set D(T1, T2, ..., Tm) of the website, which also serves as the text feature set of the webpage to be detected.
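Keyword extraction by a term-frequency cutoff can be sketched as follows. The function name, the sample tokens, and the cutoff of 0.3 are illustrative assumptions; the embodiment only says that a word becomes a keyword when its term frequency reaches a preset value.

```python
from collections import Counter

def extract_keywords(tokens, min_frequency):
    """Return the words whose term frequency in tokens reaches min_frequency,
    in order of first appearance."""
    if not tokens:
        return []
    counts = Counter(tokens)
    total = len(tokens)
    return [word for word in counts if counts[word] / total >= min_frequency]

# Preprocessed site text D0 as a token list (illustrative data).
tokens = ["firewall", "security", "firewall", "vpn", "firewall", "security"]
keywords = extract_keywords(tokens, min_frequency=0.3)
```

With these tokens, "firewall" (3/6) and "security" (2/6) pass the cutoff and "vpn" (1/6) does not, so the feature set shared by the site and its pages is D(firewall, security).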
Further, still referring to Fig. 7, the computing module 20 is further configured to: calculate the cosine of the angle between the text feature vector of the webpage to be detected and the text feature vector of the website; and take the result as the text similarity between the webpage to be detected and the website.
In this embodiment, the computing module 20 calculates the cosine of the angle between the text feature vector of the webpage to be detected and the text feature vector of the website. If the text feature vector of the website is D0(W01, W02, ..., W0m) and the text feature vector of the webpage is Dk(Wk1, Wk2, ..., Wkm), where k denotes the k-th webpage, then the cosine of the angle θ between vector D0 and vector Dk is:

$$\cos\theta = \frac{\sum_{i=1}^{m} W_{0i} W_{ki}}{\sqrt{\sum_{i=1}^{m} W_{0i}^{2}}\,\sqrt{\sum_{i=1}^{m} W_{ki}^{2}}}$$
This cosine is used as the text similarity between the webpage to be detected and the whole website. The larger the value, the smaller the angle between vector D0 and vector Dk, and the higher the text similarity between the webpage to be detected and the website; the smaller the value, the larger the angle, and the lower the text similarity. Fig. 5 is a schematic diagram of the angular relationship between the text feature vector Dk of a webpage and the text feature vector D0 of the website to which it belongs.
By calculating the cosine of the angle between the text feature vector of a webpage and the text feature vector of its website, this embodiment quantifies the text similarity between the webpage to be detected and the whole website, which makes the analysis more reasonable and reliable.
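The cosine of the angle between D0 and Dk can be computed with a short Python function. The function name and the decision to return 0 when either vector is all zeros (for example, a page containing none of the site keywords) are assumptions of this sketch.

```python
import math

def cosine_similarity(d0, dk):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(d0) != len(dk):
        raise ValueError("vectors must share the same keyword list")
    dot = sum(a * b for a, b in zip(d0, dk))
    norm0 = math.sqrt(sum(a * a for a in d0))
    normk = math.sqrt(sum(b * b for b in dk))
    if norm0 == 0 or normk == 0:
        # Assumption: a page with no site keywords has similarity 0.
        return 0.0
    return dot / (norm0 * normk)

same_direction = cosine_similarity([3.0, 4.0], [3.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the TF-IDF weights in this scheme are non-negative, the result lies between 0 and 1, which makes a single lower threshold a natural decision rule.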
Further, referring to Fig. 9, which is a functional block diagram of a second embodiment of the webpage tampering detection apparatus of the present invention, and based on the above embodiments, the apparatus may also include:
A crawling module 40, configured to periodically crawl preset webpages to be detected by means of a crawler program;
The acquisition module 10 is further configured to, when a network access request is detected, take the webpage corresponding to the network access request as the webpage to be detected.
In this embodiment, the application firewall can perform active detection of webpage tampering. Specifically, a crawler program can be set up in the application firewall. According to its configured crawl targets, the crawler periodically visits webpages and related links on the World Wide Web and downloads the page content. The crawl targets may be webpages related to a particular topic, or the crawl scope may be widened as needed; in a specific implementation they can be configured in advance by the network administrator. The application firewall then takes the webpages crawled by the crawler program as webpages to be detected and judges one by one whether they have been tampered with.
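The periodic active scan can be outlined as follows. This is a structural sketch only: the function names are hypothetical, the page download and the tampering check are passed in as callables, and the demo uses canned page bodies and a placeholder check instead of real network access or the similarity comparison.

```python
import time

def scan_once(urls, fetch, is_tampered):
    """Download each target page and return the URLs judged tampered."""
    return [url for url in urls if is_tampered(fetch(url))]

def run_periodically(task, interval_seconds, rounds, sleep=time.sleep):
    """Run task `rounds` times, pausing interval_seconds between runs."""
    results = []
    for i in range(rounds):
        results.append(task())
        if i < rounds - 1:
            sleep(interval_seconds)
    return results

# Stubbed demo: no network access; page bodies are canned strings.
fake_site = {"http://example.test/a": "firewall security news",
             "http://example.test/b": "casino bonus win"}
flagged = run_periodically(
    # The lambda below is a placeholder for the text similarity check.
    lambda: scan_once(fake_site, fake_site.get, lambda text: "casino" in text),
    interval_seconds=0, rounds=2)
```

Keeping the fetch and check steps injectable lets the same loop be exercised offline and later wired to a real crawler and the similarity-based judgment.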
In addition, the application firewall can perform passive detection of webpage tampering. Specifically, when the acquisition module 10 detects a network access request, it takes the webpage corresponding to the request as the webpage to be detected. In this way, when the traffic of a user's visit to a website passes through the application firewall, whether the webpage the user is currently visiting has been tampered with can be detected in real time. In further embodiments, to improve the efficiency of passive detection, passive detection may also rely on the results of active detection: during active detection, the application firewall stores information such as the website's text feature set and text feature vector in a preset text feature database. When a user accesses the Web server, the HTTP (HyperText Transfer Protocol) traffic passes through the application firewall, which records the URL (Uniform Resource Locator) and the corresponding HTTP response content, obtains the text feature vector of the webpage corresponding to the HTTP response content, and compares it for text similarity with the text feature vector of the corresponding website in the text feature database to judge whether the webpage has been tampered with.
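The lookup against the text feature database during passive detection can be sketched as below. The class name, the use of the URL host name as the database key, and the in-memory dictionary are all assumptions for illustration; the embodiment does not describe how the database is keyed or stored.

```python
from urllib.parse import urlsplit

class FeatureStore:
    """In-memory stand-in for the preset text feature database."""

    def __init__(self):
        self._sites = {}

    def save(self, host, keywords, site_vector):
        """Store the feature set and vector produced by active detection."""
        self._sites[host] = (list(keywords), list(site_vector))

    def lookup(self, url):
        """Return (keywords, site_vector) for the URL's host, or None."""
        return self._sites.get(urlsplit(url).hostname)

def check_response(store, url, page_vector_for, similarity, threshold):
    """Return True (tampered), False (normal) or None (site not in the store).

    page_vector_for is a callable that builds the page vector for the
    stored keyword list from the recorded HTTP response content.
    """
    entry = store.lookup(url)
    if entry is None:
        return None
    keywords, site_vector = entry
    return similarity(site_vector, page_vector_for(keywords)) < threshold

store = FeatureStore()
store.save("example.test", ["firewall", "vpn"], [0.6, 0.8])
dot = lambda a, b: sum(x * y for x, y in zip(a, b))  # stand-in similarity
result = check_response(store, "http://example.test/index.html",
                        lambda kws: [0.0, 0.0], dot, threshold=0.2)
```

Reusing the stored site vector means the per-request work is limited to vectorizing one response body and one similarity computation.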
In this embodiment, the crawler program periodically crawls the configured webpages so that active detection of webpage tampering is carried out without manual intervention and can be performed remotely and at large scale, which improves the efficiency of webpage tampering detection. Taking the webpage the user is currently visiting as the webpage to be detected makes webpage tampering detection real-time.
The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A method for detecting webpage tampering, characterized in that the method comprises the following steps:
obtaining the text feature vector of a webpage to be detected and the text feature vector of the website to which the webpage to be detected belongs;
calculating the text similarity between the webpage to be detected and the website according to the obtained text feature vector of the webpage to be detected and the text feature vector of the website;
judging whether the text similarity is less than a preset threshold;
if so, determining that the webpage to be detected is a tampered webpage.
2. The method of claim 1, characterized in that the step of obtaining the text feature vector of the webpage to be detected and the text feature vector of the website to which the webpage to be detected belongs comprises:
obtaining the text feature set of the webpage to be detected and the text feature set of the website to which the webpage to be detected belongs, wherein the text feature set of the webpage to be detected and the text feature set of the website contain the same keywords;
performing a calculation according to the term frequency and weight of the keywords in the text feature set of the webpage to be detected to obtain the text feature vector of the webpage to be detected;
performing a calculation according to the term frequency and weight of the keywords in the text feature set of the website to obtain the text feature vector of the website.
3. The method of claim 2, characterized in that the step of obtaining the text feature set of the webpage to be detected and the text feature set of the website to which the webpage to be detected belongs comprises:
obtaining the text of the website to which the webpage to be detected belongs;
performing Chinese word segmentation and stop-word removal on the obtained text;
extracting a number of keywords from the processing result to obtain the text feature set of the website;
using the text feature set of the website as the text feature set of the webpage to be detected.
4. The method of any one of claims 1 to 3, characterized in that the step of calculating the text similarity between the webpage to be detected and the website according to the obtained text feature vector of the webpage to be detected and the text feature vector of the website comprises:
calculating the cosine of the angle between the text feature vector of the webpage to be detected and the text feature vector of the website;
taking the calculation result as the text similarity between the webpage to be detected and the website.
5. The method of claim 4, characterized in that, before the step of obtaining the text feature vector of the webpage to be detected and the text feature vector of the website to which the webpage to be detected belongs, the method further comprises:
periodically crawling preset webpages to be detected by means of a crawler program;
or, when a network access request is detected, taking the webpage corresponding to the network access request as the webpage to be detected.
6. An apparatus for detecting webpage tampering, characterized in that the apparatus comprises:
an acquisition module, configured to obtain the text feature vector of a webpage to be detected and the text feature vector of the website to which the webpage to be detected belongs;
a computing module, configured to calculate the text similarity between the webpage to be detected and the website according to the obtained text feature vector of the webpage to be detected and the text feature vector of the website;
a judging module, configured to judge whether the text similarity is less than a preset threshold and, if so, to determine that the webpage to be detected is a tampered webpage.
7. The apparatus of claim 6, characterized in that the acquisition module comprises:
an acquiring unit, configured to obtain the text feature set of the webpage to be detected and the text feature set of the website to which the webpage to be detected belongs, wherein the text feature set of the webpage to be detected and the text feature set of the website contain the same keywords;
a first computing unit, configured to perform a calculation according to the term frequency and weight of the keywords in the text feature set of the webpage to be detected to obtain the text feature vector of the webpage to be detected;
a second computing unit, configured to perform a calculation according to the term frequency and weight of the keywords in the text feature set of the website to obtain the text feature vector of the website.
8. The apparatus of claim 7, characterized in that the acquiring unit is further configured to:
obtain the text of the website to which the webpage to be detected belongs;
perform Chinese word segmentation and stop-word removal on the obtained text;
extract a number of keywords from the processing result to obtain the text feature set of the website;
use the text feature set of the website as the text feature set of the webpage to be detected.
9. The apparatus of any one of claims 6 to 8, characterized in that the computing module is further configured to:
calculate the cosine of the angle between the text feature vector of the webpage to be detected and the text feature vector of the website;
take the calculation result as the text similarity between the webpage to be detected and the website.
10. The apparatus of claim 9, characterized in that the apparatus further comprises:
a crawling module, configured to periodically crawl preset webpages to be detected by means of a crawler program;
the acquisition module is further configured to, when a network access request is detected, take the webpage corresponding to the network access request as the webpage to be detected.
CN201611158763.4A 2016-12-14 2016-12-14 Webpage tampering detection method and device Active CN106685936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611158763.4A CN106685936B (en) 2016-12-14 2016-12-14 Webpage tampering detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611158763.4A CN106685936B (en) 2016-12-14 2016-12-14 Webpage tampering detection method and device

Publications (2)

Publication Number Publication Date
CN106685936A true CN106685936A (en) 2017-05-17
CN106685936B CN106685936B (en) 2020-07-31

Family

ID=58868121

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611158763.4A Active CN106685936B (en) 2016-12-14 2016-12-14 Webpage tampering detection method and device

Country Status (1)

Country Link
CN (1) CN106685936B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301355A (en) * 2017-06-20 2017-10-27 深信服科技股份有限公司 A kind of webpage tamper monitoring method and device
CN107566415A (en) * 2017-10-25 2018-01-09 国家电网公司 Homepage method for pushing and device
CN107580075A (en) * 2017-10-25 2018-01-12 国家电网公司 Homepage method for pushing and system
CN108306878A (en) * 2018-01-30 2018-07-20 平安科技(深圳)有限公司 Detection method for phishing site, device, computer equipment and storage medium
CN108520185A (en) * 2018-04-16 2018-09-11 深信服科技股份有限公司 Detect method, apparatus, equipment and the computer readable storage medium of webpage tamper
CN109165529A (en) * 2018-08-14 2019-01-08 杭州安恒信息技术股份有限公司 A kind of dark chain altering detecting method, device and computer readable storage medium
CN109981555A (en) * 2017-12-28 2019-07-05 腾讯科技(深圳)有限公司 To the processing method of web data, device, equipment, terminal and storage medium
CN110134901A (en) * 2019-04-30 2019-08-16 哈尔滨英赛克信息技术有限公司 A kind of multilink webpage tamper determination method based on flow analysis
CN110532784A (en) * 2019-09-04 2019-12-03 杭州安恒信息技术股份有限公司 A kind of dark chain detection method, device, equipment and computer readable storage medium
CN111563276A (en) * 2019-01-25 2020-08-21 深信服科技股份有限公司 Webpage tampering detection method, detection system and related equipment
CN113806732A (en) * 2020-06-16 2021-12-17 深信服科技股份有限公司 Webpage tampering detection method, device, equipment and storage medium
EP3703329B1 (en) * 2017-10-26 2024-03-20 New H3C Security Technologies Co., Ltd. Webpage request identification

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100622129B1 (en) * 2005-04-14 2006-09-19 한국전자통신연구원 Dynamically changing web page defacement validation system and method
CN102170446A (en) * 2011-04-29 2011-08-31 南京邮电大学 Fishing webpage detection method based on spatial layout and visual features
CN102708186A (en) * 2012-05-11 2012-10-03 上海交通大学 Identification method of phishing sites
CN102999638A (en) * 2013-01-05 2013-03-27 南京邮电大学 Phishing website detection method excavated based on network group
CN103077348A (en) * 2012-12-28 2013-05-01 华为技术有限公司 Method and device for vulnerability scanning of Web site
CN103927480A (en) * 2013-01-14 2014-07-16 腾讯科技(深圳)有限公司 Method, device and system for identifying malicious web page
CN104166725A (en) * 2014-08-26 2014-11-26 哈尔滨工业大学(威海) Phishing website detection method

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301355A (en) * 2017-06-20 2017-10-27 深信服科技股份有限公司 A kind of webpage tamper monitoring method and device
CN107301355B (en) * 2017-06-20 2021-07-02 深信服科技股份有限公司 Webpage tampering monitoring method and device
CN107566415A (en) * 2017-10-25 2018-01-09 国家电网公司 Homepage method for pushing and device
CN107580075A (en) * 2017-10-25 2018-01-12 国家电网公司 Homepage method for pushing and system
CN107580075B (en) * 2017-10-25 2021-07-20 国家电网公司 Homepage pushing method and system
EP3703329B1 (en) * 2017-10-26 2024-03-20 New H3C Security Technologies Co., Ltd. Webpage request identification
CN109981555A (en) * 2017-12-28 2019-07-05 腾讯科技(深圳)有限公司 To the processing method of web data, device, equipment, terminal and storage medium
CN108306878A (en) * 2018-01-30 2018-07-20 平安科技(深圳)有限公司 Detection method for phishing site, device, computer equipment and storage medium
CN108520185A (en) * 2018-04-16 2018-09-11 深信服科技股份有限公司 Detect method, apparatus, equipment and the computer readable storage medium of webpage tamper
CN109165529A (en) * 2018-08-14 2019-01-08 杭州安恒信息技术股份有限公司 A kind of dark chain altering detecting method, device and computer readable storage medium
CN111563276A (en) * 2019-01-25 2020-08-21 深信服科技股份有限公司 Webpage tampering detection method, detection system and related equipment
CN111563276B (en) * 2019-01-25 2024-04-09 深信服科技股份有限公司 Webpage tampering detection method, detection system and related equipment
CN110134901A (en) * 2019-04-30 2019-08-16 哈尔滨英赛克信息技术有限公司 A kind of multilink webpage tamper determination method based on flow analysis
CN110134901B (en) * 2019-04-30 2023-06-16 哈尔滨英赛克信息技术有限公司 Multilink webpage tampering judging method based on flow analysis
CN110532784A (en) * 2019-09-04 2019-12-03 杭州安恒信息技术股份有限公司 A kind of dark chain detection method, device, equipment and computer readable storage medium
CN113806732B (en) * 2020-06-16 2023-11-03 深信服科技股份有限公司 Webpage tampering detection method, device, equipment and storage medium
CN113806732A (en) * 2020-06-16 2021-12-17 深信服科技股份有限公司 Webpage tampering detection method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN106685936B (en) 2020-07-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Nanshan District Xueyuan Road in Shenzhen city of Guangdong province 518052 No. 1001 Nanshan Chi Park building A1 layer

Applicant after: SANGFOR TECHNOLOGIES Inc.

Address before: Nanshan District Xueyuan Road in Shenzhen city of Guangdong province 518052 No. 1001 Nanshan Chi Park building A1 layer

Applicant before: Sangfor Technologies Co.,Ltd.

GR01 Patent grant