CN104063494B - Page altering detecting method and black chain data library generating method - Google Patents

Page altering detecting method and black chain data library generating method

Info

Publication number
CN104063494B
Authority
CN
China
Prior art keywords: page, black chain, characteristic, black, layout
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410318946.2A
Other languages
Chinese (zh)
Other versions
CN104063494A (en
Inventor
刘起
郭峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qizhi Business Consulting Co ltd
Beijing Qihoo Technology Co Ltd
360 Digital Security Technology Group Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201410318946.2A priority Critical patent/CN104063494B/en
Priority claimed from CN201110457654.3A external-priority patent/CN102446255B/en
Publication of CN104063494A publication Critical patent/CN104063494A/en
Application granted granted Critical
Publication of CN104063494B publication Critical patent/CN104063494B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

This application provides a method and device for detecting page tampering. The method includes: generating a black chain characteristic database and deploying it on multiple servers, the black chain characteristic database including black chain characteristic data; obtaining the characteristic information of the page currently under detection; determining the corresponding destination server according to the characteristic information of the page; and matching the page currently under detection against the black chain characteristic database on the destination server to judge whether the page contains black chain characteristic data from that database, and if so, judging the current page to be a tampered page. With as little manual intervention as possible, the application improves the efficiency and accuracy of page tampering detection, especially when the number of pages to be detected is large and the black chain characteristic data to be matched is extensive.

Description

Page altering detecting method and black chain data library generating method
This patent application is a divisional application of the Chinese invention patent application filed on December 30, 2011, with application No. 201110457654.3, entitled "Method and device for detecting page tampering".
Technical field
The application relates to the technical field of computer security, and in particular to a method for detecting page tampering and a device for detecting page tampering.
Background art
The World Wide Web has become a carrier of vast amounts of information. To extract and use this information efficiently, search engines, as tools that help people retrieve information, have become the entrance and guide through which users access the World Wide Web.
SEO (Search Engine Optimization) is a popular form of online marketing. Its main purpose is to increase the exposure of specific keywords and thereby the visibility of a website, improving its search engine ranking, raising site traffic, and ultimately strengthening the site's sales or publicity capacity. A website's SEO data represent how much of the site's content is indexed by search engines; the more content is indexed, the more easily users can find the site.
Exploiting this characteristic of search engines, some tools now provide black chain technology. Black chains are a fairly common black-hat SEO technique. Generally, the term refers to backlinks from other websites obtained by improper means. Most commonly, an attacker exploits vulnerabilities in website programs to obtain a WEBSHELL (a degree of permission for an anonymous user (intruder) to operate on the website server through the website's port) on a site with high search engine weight or PR (PageRank), and then plants links to the attacker's own site on the compromised site.
Black chains mainly target search engines. For example, a simple analysis of the top-ranked websites returned by a search engine, checking their site structure, keyword distribution, external links, and so on, may reveal that some sites rank very well and have millions of keyword-related pages, yet their structure is ordinary and their keyword density is not especially appropriate. Most importantly, some of these sites have no outbound links at all, and checking their backlinks shows that the vast majority of their external links come from black chains. SEO ranking is determined mainly by high-quality external links, whose share should exceed 50%, so planting black chains on high-weight websites helps a site's ranking. In addition, black chains usually take the form of hidden links, so it is hard for website administrators to notice during routine checks that black chains have been planted on their site. At present, black chains are generally used in highly profitable black (gray) industries, such as private game servers, medical services, and niche high-profit industries, and black chains themselves have become an industry. In practice, if users do not take adequate security precautions, opening a page that has been tampered with by black chains can easily lead to infection by viruses on the website.
In the prior art, black chains are generally detected manually, for example by the webmaster, who matches a large set of manually collected tampering keywords, such as hack, hacked by, lottery ticket, property experience, plug-in, and private server, against the HTML text of a web page to judge whether the page has been tampered with by black chains. For example, the common characteristics of pages tampered with by black chains include features with which hackers show off. However, this manual approach depends heavily on manually collected tampering keywords and periodic manual checks, so it is very inefficient.
Furthermore, when the number of pages to be detected is large and the black chain characteristic data to be matched (such as tampering keywords) is extensive, a manual approach clearly cannot cope.
Therefore, a technical problem that those skilled in the art currently need to solve is to provide a mechanism for detecting page tampering that, with as little manual intervention as possible, improves the efficiency and accuracy of page tampering detection, especially when the number of pages to be detected is large and the black chain characteristic data to be matched is extensive.
Summary of the invention
The application provides a method for detecting page tampering, so as to improve the efficiency and accuracy of page tampering detection with as little manual intervention as possible, especially when the number of pages to be detected is large and the black chain characteristic data to be matched is extensive.
The application also provides a device for detecting page tampering, to ensure the application and implementation of the above method in practice.
To solve the above problems, the application discloses a method for detecting page tampering, including:
Generating a black chain characteristic database and deploying it on multiple servers, the black chain characteristic database including black chain characteristic data;
Obtaining the characteristic information of the page currently under detection;
Determining the corresponding destination server according to the characteristic information of the page;
Matching the page currently under detection against the black chain characteristic database on the destination server, and judging whether the page contains black chain characteristic data from the black chain characteristic database; if so, judging the current page to be a tampered page.
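As an illustration only, the matching step above might be sketched in Python as follows. The class and function names are hypothetical and do not appear in the application, and a simple case-insensitive substring test is assumed as the matching method.

```python
from dataclasses import dataclass, field


@dataclass
class BlackChainDatabase:
    """Hypothetical in-memory stand-in for the database held by one server."""
    keywords: set = field(default_factory=set)    # tampering keywords
    black_urls: set = field(default_factory=set)  # black chain URLs


def find_black_chain_hits(page_html: str, db: BlackChainDatabase) -> list:
    """Return every database entry that occurs in the page text (assumed substring match)."""
    text = page_html.lower()
    entries = sorted(db.keywords | db.black_urls)
    return [entry for entry in entries if entry.lower() in text]


def is_tampered(page_html: str, db: BlackChainDatabase) -> bool:
    """A page is judged to be tampered with if it contains any database entry."""
    return bool(find_black_chain_hits(page_html, db))
```

A caller would first select the destination server and then run `is_tampered` against that server's copy of the database.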
Preferably, each server has a server identifier, the characteristic information includes page classification information, and the step of determining the corresponding destination server according to the characteristic information of the page includes:
Extracting the server identifier corresponding to the classification information of the current page according to a preset correspondence between page classification information and server identifiers;
Determining the server corresponding to that server identifier as the destination server.
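A minimal sketch of this lookup, assuming the preset correspondence is held in a dictionary. The category names, server identifiers, and addresses below are invented for illustration.

```python
# Hypothetical preset correspondence: page classification -> server identifier.
CATEGORY_TO_SERVER_ID = {
    "news": "server-a",
    "forum": "server-b",
    "shopping": "server-c",
}

# Hypothetical registry: server identifier -> server address.
SERVERS = {
    "server-a": "10.0.0.1",
    "server-b": "10.0.0.2",
    "server-c": "10.0.0.3",
}


def destination_server_for_category(category: str) -> str:
    """Extract the server identifier for the category, then resolve the server."""
    server_id = CATEGORY_TO_SERVER_ID[category]
    return SERVERS[server_id]
```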
Preferably, the characteristic information includes the URL of the page, each server has a numerical identifier, and the step of determining the corresponding destination server according to the characteristic information of the page includes:
Converting the URL of the page currently under detection into a numerical value using a preset algorithm;
Taking the server whose numerical identifier corresponds to that value as the destination server.
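The application does not name the preset algorithm. One possible reading, offered here purely as an assumption, is to hash the URL and reduce the result modulo the number of servers, with servers numbered from 0:

```python
import hashlib


def url_to_number(url: str) -> int:
    """Convert a URL to a stable integer (SHA-256 is an assumed choice of algorithm)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def destination_server_index(url: str, server_count: int) -> int:
    """Map the URL's numerical value onto a server numbered 0..server_count-1."""
    if server_count <= 0:
        raise ValueError("server_count must be positive")
    return url_to_number(url) % server_count
```

Because the hash is deterministic, the same URL is always routed to the same server, while different URLs spread across the servers.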
Preferably, the page classification information includes content classification information, type classification information, and attribute classification information of the page.
Preferably, the step of generating the black chain characteristic database includes:
Using existing black chain characteristic data, searching for pages that contain the black chain characteristic data and taking them as characteristic pages;
Analyzing the layout of the black chain characteristic data in a characteristic page and, when the layout is found to be abnormal, extracting from that characteristic page the page elements that contain the black chain characteristic data;
Generating a black chain rule according to the page elements, matching the black chain rule in other characteristic pages, and extracting new black chain characteristic data from the characteristic pages that match;
Saving the black chain characteristic data to form the black chain characteristic database.
Preferably, the black chain characteristic data includes tampering keywords and black chain URLs.
Preferably, the step of analyzing the layout of the black chain characteristic data in a characteristic page includes:
Judging whether the position of the page element containing the black chain characteristic data is within a preset threshold range, and if so, judging the layout of the black chain characteristic data in the characteristic page to be abnormal;
And/or
Judging whether the page element containing the black chain characteristic data has an invisible attribute, and if so, judging the layout of the black chain characteristic data in the characteristic page to be abnormal;
And/or
Judging whether the page element containing the black chain characteristic data has an attribute that hides it from the browser, and if so, judging the layout of the black chain characteristic data in the characteristic page to be abnormal.
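The three judgments above might be sketched as follows, assuming that a rendering engine has already reported each element's position, size, and visibility. The `RenderedElement` record and the `OFFSCREEN_LIMIT` value are hypothetical; here the preset threshold range is taken to describe positions far off the visible page, which is an assumption rather than something the application specifies.

```python
from dataclasses import dataclass


@dataclass
class RenderedElement:
    """Hypothetical record of one page element after rendering."""
    left: float
    top: float
    width: float
    height: float
    display: str = "block"
    visibility: str = "visible"


# Assumed threshold: coordinates below this value are treated as off-page.
OFFSCREEN_LIMIT = -100.0


def layout_is_abnormal(el: RenderedElement) -> bool:
    """Apply the position, invisibility, and hidden-attribute checks."""
    off_page = el.left < OFFSCREEN_LIMIT or el.top < OFFSCREEN_LIMIT
    invisible = el.width <= 0 or el.height <= 0
    hidden = el.display == "none" or el.visibility == "hidden"
    return off_page or invisible or hidden
```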
Preferably, the step of generating a black chain rule according to the page elements is:
Abstracting a regular expression, as the black chain rule, from the page elements that contain the tampering keyword and/or black chain URL.
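The application does not describe how the regular expression is abstracted. One simple possibility, shown here only as a sketch, is to escape the extracted element literally and replace the concrete URL and keyword with capturing groups. The function name and group names are hypothetical.

```python
import re


def build_black_chain_rule(element_html: str, black_url: str, keyword: str):
    """Generalize one extracted page element into a reusable regular expression."""
    # Treat the element's markup as literal text.
    pattern = re.escape(element_html)
    # Replace the concrete URL and keyword with named capturing groups.
    pattern = pattern.replace(re.escape(black_url), r'(?P<url>https?://[^\s"<>]+)')
    pattern = pattern.replace(re.escape(keyword), r"(?P<keyword>[^<]+)")
    return re.compile(pattern, re.IGNORECASE)
```

A rule built from one hidden link then matches other elements with the same markup but a different URL and keyword, which is how new black chain characteristic data could be harvested.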
Preferably, the method further includes:
Updating the black chain characteristic database at preset time intervals.
The application also discloses a device for detecting page tampering, including:
A database generation module, for generating a black chain characteristic database, the black chain characteristic database including black chain characteristic data;
A database deployment module, for deploying the black chain characteristic database on multiple servers;
A characteristic information acquisition module, for obtaining the characteristic information of the page currently under detection;
A destination server determining module, for determining the corresponding destination server according to the characteristic information of the page;
A tampering detection module, for matching the page currently under detection against the black chain characteristic database on the destination server and judging whether the page contains black chain characteristic data from the black chain characteristic database; if so, judging the current page to be a tampered page.
Preferably, each server has a server identifier, the characteristic information includes page classification information, and the destination server determining module includes:
An identifier extraction submodule, for extracting the server identifier corresponding to the classification information of the current page according to a preset correspondence between page classification information and server identifiers;
An identifier locating submodule, for determining the server corresponding to that server identifier as the destination server.
Preferably, the characteristic information includes the URL of the page, each server has a numerical identifier, and the destination server determining module includes:
A URL conversion submodule, for converting the URL of the page currently under detection into a numerical value using a preset algorithm;
An identifier mapping submodule, for taking the server whose numerical identifier corresponds to that value as the destination server.
Preferably, the database generation module includes:
A characteristic page search submodule, for using existing black chain characteristic data to search for pages that contain the black chain characteristic data and taking them as characteristic pages;
A layout analysis submodule, for analyzing the layout of the black chain characteristic data in a characteristic page;
A page element extraction submodule, for extracting from the characteristic page, when the layout is found to be abnormal, the page elements that contain the black chain characteristic data;
A black chain rule generation submodule, for generating a black chain rule according to the page elements;
A black chain characteristic data extraction submodule, for matching the black chain rule in other characteristic pages, extracting new black chain characteristic data from the characteristic pages that match, and saving the black chain characteristic data to form the black chain characteristic database.
Preferably, the layout analysis submodule further includes:
A first judging unit, for judging whether the position of the page element containing the black chain characteristic data is within a preset threshold range, and if so, judging the layout of the black chain characteristic data in the characteristic page to be abnormal;
And/or
A second judging unit, for judging whether the page element containing the black chain characteristic data has an invisible attribute, and if so, judging the layout of the black chain characteristic data in the characteristic page to be abnormal;
And/or
A third judging unit, for judging whether the page element containing the black chain characteristic data has an attribute that hides it from the browser, and if so, judging the layout of the black chain characteristic data in the characteristic page to be abnormal.
Preferably, the black chain characteristic data includes tampering keywords and black chain URLs, and the black chain rule generation submodule includes:
A regular expression abstraction unit, for abstracting a regular expression, as the black chain rule, from the page elements that contain the tampering keyword and/or black chain URL.
Preferably, the device further includes:
A database update module, for updating the black chain characteristic database at preset time intervals.
Compared with the prior art, the application has the following advantages:
By deploying the generated black chain characteristic database on multiple servers, the application spreads the processing load that would otherwise fall on a single server or client. When multiple concurrent page tampering detection requests are received, the server that handles each detection is determined according to the characteristic information of the requested page, and that server carries out the actual tampering detection. This effectively improves the efficiency and accuracy of page tampering detection when the number of pages to be detected is large and the black chain characteristic data to be matched is extensive.
Furthermore, the application judges according to the black chain characteristic database whether the page currently under detection contains black chain characteristic data, and determines a page that contains black chain characteristic data to be a tampered page. In the embodiments of the application, the black chain characteristic data in the database need not all be collected manually; it can be collected automatically as follows. Starting from known black chain characteristic data and using search engine technology, a web crawler captures pages that contain the black chain characteristic data as characteristic pages. The layout of the black chain characteristic data in these characteristic pages is analyzed; if the layout is abnormal, the page elements containing the black chain characteristic data are extracted from the abnormal characteristic page and generalized into a regular expression that serves as a black chain rule. The black chain rule is then matched in other characteristic pages, and new black chain characteristic data is extracted from the characteristic pages that match. Collecting black chain characteristic data in this way requires no manual intervention and is very fast, and the collected data is highly accurate, so using it for page tampering detection effectively improves the efficiency and accuracy of detection.
In addition, the embodiments of the application use search engine technology and a web crawler to capture pages that contain given black chain characteristic data, then analyze the layout of the pages containing that data to judge whether a page has been tampered with, extract from the tampered page the page elements that contain the black chain characteristic data, and finally form a general regular expression as a black chain rule. The application requires no additional system and no manual intervention. Matching regular expressions as black chain rules in pages, so as to extract more black chain characteristic data and derive more black chain rules, is well suited to the current industrialization of black chains. It not only reduces cost but also finds tampered pages faster and more comprehensively, effectively improving the efficiency of page tampering detection. Moreover, because the implementation is based on web crawler technology and browser kernel isolation sandbox technology, the security, credibility, and accuracy of page tampering detection are also effectively ensured.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of a method for detecting page tampering according to the application;
Fig. 2 is a structural block diagram of an embodiment of a device for detecting page tampering according to the application.
Detailed description of the embodiments
To make the above objects, features, and advantages of the application clearer and easier to understand, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Black chains are also referred to as "network psoriasis". As is well known, search engines have a ranking system: websites that a search engine considers good are ranked higher in the search results, and their click-through rate is correspondingly higher. A search engine measures the quality of a website by many indicators, and one very important indicator is the website's external links. If a website has very good external links, its ranking in the search engine improves accordingly.
For example, suppose a newly opened website ranks very low in a search engine, and some high-weight (well-ranked, high-quality) website later links to it. The search engine then reasons that, since the new website can obtain a link from such a high-weight website, its own weight cannot be low, so its ranking in the search engine rises. If several high-weight websites all link to it, its ranking rises very quickly.
Conversely, if a newly opened website has no background and no connections, its weight will not be high, so the search engine will not rank it highly and it will appear further down in the search results. Exploiting this characteristic of search engines, some tools now provide black chain technology: they intrude into high-weight websites and, once the intrusion succeeds, insert links to their own website into pages of the compromised website, thereby obtaining the effect of a link. By hiding the links, they ensure that no link is visible to a person viewing the pages of the compromised website.
However, a considerable share of the websites that currently use black chain technology to raise their search ranking are dangerous websites, such as private game server sites, account-stealing Trojan sites, phishing sites, and advertising sites. Search engines would not give these dangerous websites a high ranking, but with black chains they can rank very high. In that case, users of the search engine are very likely to click on and open these websites, and if they have not taken security precautions, they can easily be infected by viruses on the websites.
Having recognized the seriousness of this problem, the inventors propose the following as one of the core ideas of the embodiments of the application. By deploying the generated black chain characteristic database on multiple servers, the application spreads the processing load that would otherwise fall on a single server or client. When multiple concurrent page tampering detection requests are received, the server that handles each detection is determined according to the characteristic information of the requested page, and that server carries out the actual tampering detection, which effectively improves the efficiency and accuracy of page tampering detection when the number of pages to be detected is large and the black chain characteristic data to be matched is extensive. Moreover, in the embodiments of the application, the black chain characteristic data in the database need not all be collected manually; it can be collected automatically as follows. Starting from known black chain characteristic data and using search engine technology, a web crawler captures pages that contain the black chain characteristic data as characteristic pages. The layout of the black chain characteristic data in these characteristic pages is analyzed; if the layout is abnormal, the page elements containing the black chain characteristic data are extracted from the abnormal characteristic page and generalized into a regular expression that serves as a black chain rule. The black chain rule is then matched in other characteristic pages, and new black chain characteristic data is extracted from the characteristic pages that match. Collecting black chain characteristic data in this way requires no manual intervention and is very fast, and the collected data is highly accurate, so using it for page tampering detection effectively improves the efficiency and accuracy of detection.
Referring to Fig. 1, a flow chart of the steps of an embodiment of a method for detecting page tampering according to the application is shown. The method may specifically include the following steps:
Step 11: generating a black chain characteristic database and deploying it on multiple servers, the black chain characteristic database including black chain characteristic data;
In a specific implementation, the black chain characteristic data may include tampering keywords and black chain URLs, for example the tampering keyword "Legend private server release" and the black chain URL "http://www.45u.com".
In a preferred embodiment of the application, the black chain characteristic database may be generated by the following sub-steps:
Sub-step 111: using existing black chain characteristic data, searching for pages that contain the black chain characteristic data and taking them as characteristic pages;
Sub-step 112: analyzing the layout of the black chain characteristic data in a characteristic page and, when the layout is found to be abnormal, extracting from that characteristic page the page elements that contain the black chain characteristic data;
Sub-step 113: generating a black chain rule according to the page elements, matching the black chain rule in other characteristic pages, and extracting new black chain characteristic data from the characteristic pages that match;
Sub-step 114: saving the black chain characteristic data to form the black chain characteristic database.
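The sub-steps above can be read as a loop that grows the database from seed data. The sketch below is a simplified single pass under stated assumptions: `pages` is a dictionary standing in for crawler results, and `HIDDEN_LINK_RULE` is a fixed, hypothetical black chain rule rather than one generated from analyzed page elements.

```python
import re

# Hypothetical black chain rule: a link wrapped in a div hidden with display:none.
HIDDEN_LINK_RULE = re.compile(
    r'<div style="display:none">\s*<a href="(?P<url>[^"]+)">(?P<keyword>[^<]+)</a>\s*</div>',
    re.IGNORECASE,
)


def expand_features(seed_features: set, pages: dict) -> set:
    """Grow the set of black chain characteristic data from known seeds.

    `pages` maps page URL -> HTML and stands in for what a crawler would fetch.
    """
    features = set(seed_features)
    # Sub-step 111: pages containing a known feature are characteristic pages.
    feature_pages = [html for html in pages.values()
                     if any(f in html for f in seed_features)]
    # Sub-steps 112-113: match the rule in those pages and harvest new data.
    for html in feature_pages:
        for match in HIDDEN_LINK_RULE.finditer(html):
            features.add(match.group("url"))
            features.add(match.group("keyword").strip())
    # Sub-step 114: the returned set is what would be saved to the database.
    return features
```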
In a specific implementation, the existing black chain characteristic data may include tampering keywords and black chain URLs. According to the existing black chain characteristic data, a web crawler captures the pages that contain the black chain characteristic data, and these pages are taken as characteristic pages.
As is well known, a search engine's ability to extract web pages automatically from the World Wide Web is provided by web crawlers. A web crawler, also called a web spider, finds web pages through the link addresses they contain. Starting from some page of a website (usually the home page), it reads the page content, finds the other link addresses in the page, and follows them to the next pages, repeating this cycle until all pages of the website have been captured. If the whole Internet is regarded as one website, a web spider can use this principle to capture every web page on the Internet.
Current web crawlers can be divided into general crawlers and focused crawlers. A general crawler follows the idea of breadth-first search: starting from the URLs (Uniform Resource Locators) of one or several initial pages, it obtains the URLs on those pages and, while capturing pages, continually extracts new URLs from the current page and puts them into a queue until some stop condition of the system is met. A focused crawler is a program that downloads web pages automatically and captures related page resources in a directed way. According to a set capture target, it selectively visits web pages and related links on the World Wide Web to obtain the required information. Unlike a general crawler, a focused crawler does not pursue broad coverage; it aims to capture pages related to a particular topic and to prepare data resources for topic-oriented user queries.
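The breadth-first behaviour of a general crawler can be illustrated with a short sketch. To keep it self-contained, link extraction is passed in as a hypothetical `get_links` callable instead of performing real network requests, and `max_pages` plays the role of the stop condition.

```python
from collections import deque


def crawl_breadth_first(start_urls, get_links, max_pages=100):
    """Visit pages in breadth-first order, starting from the initial URLs."""
    queue = deque(start_urls)
    seen = set(start_urls)
    visited = []
    while queue and len(visited) < max_pages:  # stop condition
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):  # new URLs found on the current page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

A focused crawler would differ mainly in filtering or prioritizing the links it enqueues according to the capture target.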
In existing black chain technology, there are some fixed tricks for hiding links. For example, because search engines do not recognize javascript well, a hidden div is output through javascript. In this way, a person viewing the page directly cannot see the links, yet the search engine treats them as valid. The code works as follows: first, the opening div is written by javascript with display set to none; then a table containing the black chains to be planted is output; finally, the closing half of the div is output by javascript.
Using the isolation sandbox technology of a browser kernel, page tampering can be discovered quickly and efficiently. Specifically, the isolation sandbox technology of a browser kernel constructs a safe virtual execution environment for the browser kernel, such as IE or Firefox. Any disk write that the user makes through the browser is redirected to a specific temporary folder. Thus, even if a web page contains malicious programs such as viruses, Trojans, or adware that install themselves by force, they are only installed into the temporary folder and do no harm to the user's equipment. The browser kernel is responsible for interpreting web page syntax (such as HTML and JavaScript) and rendering (displaying) the page. The browser kernel is therefore the engine that downloads, parses, executes, and renders a page, and it determines how the browser displays the content and format information of the page.
According to these operating characteristics of the browser kernel, isolation sandbox technology can be used to analyze safely whether the layout of black chain characteristic data in a characteristic page is abnormal. Specifically, this can be judged by analyzing the position and attributes of the page element containing the black chain characteristic data, for example by judging whether the position of the page element falls outside a preset threshold range, whether the page element has an invisible attribute, and/or whether the page element has an attribute that hides it from the browser; if so, the layout of the black chain characteristic data in the characteristic page is judged to be abnormal. For example, if a hyperlink in a page is detected to be invisible, or the length, width, or height of some HTML tag element in the page is negative, the layout of the page can be judged to be abnormal and the page judged to be a tampered page.
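As a rough illustration of the invisible-hyperlink check, the sketch below inspects inline styles in static HTML using Python's standard `html.parser`. This is a simplification and not the approach of the application, which renders pages in a sandboxed browser kernel; links hidden through external stylesheets or script output would not be caught. All names are hypothetical.

```python
import re
from html.parser import HTMLParser

HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)
NEGATIVE_SIZE = re.compile(r"(?:width|height)\s*:\s*-\d", re.IGNORECASE)
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HiddenLinkFinder(HTMLParser):
    """Collect hrefs of links that are hidden or sit inside a hidden element."""

    def __init__(self):
        super().__init__()
        self.stack = []         # one flag per open element: hidden by its own style?
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        style = attr_map.get("style") or ""
        hidden = bool(HIDDEN_STYLE.search(style) or NEGATIVE_SIZE.search(style))
        if tag == "a" and attr_map.get("href") and (hidden or any(self.stack)):
            self.hidden_links.append(attr_map["href"])
        if tag not in VOID_TAGS:
            self.stack.append(hidden)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


def find_hidden_links(html: str) -> list:
    finder = HiddenLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_links
```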
When the layout is found to be abnormal, the page elements containing the tampering keyword and/or black chain URL are extracted from the feature page with the abnormal layout. A regular expression is then abstracted from those page elements and used as a black chain rule.
As is well known, a regular expression is a tool for text matching, generally composed of ordinary characters and metacharacters. Ordinary characters include upper- and lower-case letters and digits, while metacharacters have special meanings. Regular expression matching can be understood as finding, in a given string, the parts that fit a given regular expression. More than one part of the string may satisfy the expression, and each such part is called a match. "Match" is used here in three senses: as an adjective (for example, a string matches an expression), as a verb (for example, matching a regular expression in a string), and as a noun (a part of a string that satisfies the given regular expression, as just mentioned).
The rules for constructing regular expressions are illustrated below by example.
Suppose the goal is to find hi; the regular expression hi can be used. It matches exactly the strings consisting of two characters, h followed by i. In practice, regular expression matching can be made case-insensitive. Many words, such as him, history, and high, contain these two consecutive characters, so a search for hi also finds the hi inside those words. To find exactly the word hi, \bhi\b should be used. Here \b is a metacharacter that represents the beginning or end of a word, that is, a word boundary. Although English words are usually separated by spaces, punctuation, or line breaks, \b does not match any of these separator characters; it matches only a position. To find hi followed, not far behind, by Lucy, \bhi\b.*\bLucy\b should be used. Here . is another metacharacter, which matches any character except a newline. * is also a metacharacter, but it represents a quantity: the content before * may repeat any number of times for the whole expression to match. The meaning of \bhi\b.*\bLucy\b is now clear: first the word hi, then any number of arbitrary characters (but not a newline), and finally the word Lucy.
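The hi and Lucy examples above can be reproduced with Python's standard `re` module, as in the short sketch below. The sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text containing 'hi' both as a word and inside other words.
text = "hi there, this is history; hi, have you seen Lucy today?"

# The plain pattern 'hi' also matches inside 'this' and 'history'.
plain = re.findall(r"hi", text)

# \b matches a word boundary (a position, not a character), so only the word 'hi' matches.
whole_word = re.findall(r"\bhi\b", text)

# '.' matches any character except a newline; '*' repeats the preceding item any number of times.
near_lucy = re.search(r"\bhi\b.*\bLucy\b", text)

# Case-insensitive matching is enabled explicitly with re.IGNORECASE.
ignore_case = re.findall(r"\bhi\b", "Hi and hi", flags=re.IGNORECASE)

print(len(plain), len(whole_word), near_lucy is not None, ignore_case)
```

Because `.*` is greedy, the `near_lucy` match starts at the first word hi and extends to the last occurrence of Lucy on the same line.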
For example, from the HTML fragment of page A, whose layout is abnormal, the page element containing black chain feature data is extracted as follows:
<script>document.write('<d'+'iv st'+'yle'+'="po'+'si'+'tio'+'n:a'+'bso'+'lu'+'te;l'+'ef'+'t:'+'-'+'10'+'00'+'0'+'p'+'x;'+'"'+'>');</script>××××<script>document.write('<'+'/d'+'i'+'v>');</script>
The regular expression generated from this page element as a black chain rule is:
<script.*>document\.write.*\(.*\+.*\+.*\+.*\+.*\+.*\).*</script>([\S\s]+)</div>
As another example, from the HTML fragment of page B, whose layout is abnormal, the page element containing black chain feature data is extracted as follows:
<a href="http://www.45u.com" style="margin-left:-83791;">
The regular expression generated from this page element as a black chain rule is:
<a\s*href\s*=["'].+["']\s*style=["'][\w+\-]+:-[0-9]+.*["'].*>.*</a>
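As a minimal sketch of how the page B rule might be applied, the snippet below compiles that regular expression with Python's `re` module and tests it against two anchor elements. The anchor text, the closing tag, the `normal` element, the function name, and the use of case-insensitive matching are assumptions added for illustration.

```python
import re

# Black chain rule for the page B example; re.IGNORECASE is an assumption.
RULE_B = re.compile(
    r"""<a\s*href\s*=["'].+["']\s*style=["'][\w+\-]+:-[0-9]+.*["'].*>.*</a>""",
    re.IGNORECASE,
)

# The href and style come from the example; the link text and </a> are hypothetical.
tampered = '<a href="http://www.45u.com" style="margin-left:-83791;">xxxx</a>'
# A hypothetical ordinary link with a positive margin.
normal = '<a href="http://www.example.com" style="margin-left:10px;">home</a>'

def matches_black_chain_rule(html: str) -> bool:
    """Return True if the HTML contains an element matched by the rule."""
    return RULE_B.search(html) is not None

print(matches_black_chain_rule(tampered), matches_black_chain_rule(normal))
```

The rule keys on a style property with a negative numeric value, so the ordinary link with `margin-left:10px` is not matched.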
Of course, the above method of generating black chain rules is merely an example. Those skilled in the art may adopt any way of generating black chain rules according to the actual situation, and the application places no limitation on this.
By matching the black chain rules against other feature pages, more black chain feature data can be extracted and more black chain rules trained, ultimately forming a black chain feature database covering black chains across the whole network.
Because planting black chains has now become an industrial chain, the same tampering keywords and/or black chain URLs appear in large numbers in other tampered pages. Using regular expressions as black chain rules to match pages, so as to extract more black chain feature data and train more black chain rules, suits the current industrialization of black chains; tampered pages are found faster and in greater number, which effectively improves the efficiency of page tampering detection.
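The expansion step described above, in which existing rules are matched against other pages to harvest new black chain feature data, might look like the following sketch. Only URL extraction is shown; the helper name `expand_features`, the `HREF` pattern, and the sample rule and pages are hypothetical.

```python
import re

# Hypothetical helper pattern for pulling URLs out of a matched element.
HREF = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)

def expand_features(rules, pages, known_urls):
    """Return URLs found inside rule-matched elements that are not yet known."""
    discovered = set()
    for html in pages:
        for rule in rules:
            for match in rule.finditer(html):
                for url in HREF.findall(match.group(0)):
                    if url not in known_urls:
                        discovered.add(url)
    return discovered

# Hypothetical rule: an anchor whose inline style has a negative numeric value.
sample_rule = re.compile(
    r"""<a\s[^>]*style=["'][^"']*:-[0-9]+[^"']*["'][^>]*>.*?</a>""",
    re.IGNORECASE,
)
sample_pages = [
    '<p>ok</p><a href="http://bad.example/a" style="left:-9999px">x</a>',
    '<a href="http://good.example/" style="color:red">y</a>',
]
new_urls = expand_features([sample_rule], sample_pages, {"http://www.45u.com"})
print(new_urls)
```

The newly discovered URLs would then be added to the black chain feature data, and the process repeated on further pages.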
To handle the case where many pages must be detected and much black chain feature data must be matched, in the embodiment of the application the generated black chain feature database is deployed on multiple servers, for example on 10 back-end servers, and the database content deployed on each server is identical.
In a specific implementation, because black chain feature data is time-sensitive, an update of the black chain feature database may be initiated at preset time intervals; specifically, the update may be completed by repeating sub-steps S111-S114 above.
Step 12: obtain the feature information of the currently detected page;
Step 13: determine the corresponding target server according to the feature information of the page;
In a specific implementation, a server identifier may be set for each server on which the black chain feature database is deployed. The identifier may use any rule and format, such as numeric ordering or character ordering, and the application does not restrict this.
As an example of a specific application of the embodiment of the application, the feature information may include page classification information, in which case step 13 may specifically include the following sub-steps:
Sub-step S311: according to a preset correspondence between page classification information and server identifiers, extract the server identifier corresponding to the classification information of the current page;
Sub-step S312: determine the server corresponding to the server identifier as the target server.
In a specific implementation, the page classification information may be the content category of the page. For example, pages may be divided by content into a game class, film class, novel class, video class, music class, shopping class, mailbox class, life class, bank class, travel class, and so on. Each of these content categories is preset to correspond to a server identifier, as shown in the table below:
With reference to the table above, if the content category of the currently detected page is the game class, the target server is determined to be the server identified as aaa; if the content category is the travel class, the target server is determined to be the server identified as kkk.
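Sub-steps S311 and S312 amount to a table lookup, sketched below. Only the two pairs named in the text (game class to aaa, travel class to kkk) are included; the English category keys, the function name, and the error handling are assumptions.

```python
# Preset correspondence between content category and server identifier.
# Only the two pairs mentioned in the text are listed; the keys are assumed English labels.
CATEGORY_TO_SERVER = {
    "game": "aaa",
    "travel": "kkk",
}

def target_server_for_category(category: str) -> str:
    """Sub-steps S311/S312: look up the server identifier for a page category."""
    try:
        return CATEGORY_TO_SERVER[category]
    except KeyError:
        # Raising on an unconfigured category is an assumption, not part of the source.
        raise ValueError(f"no server configured for category {category!r}") from None

print(target_server_for_category("game"), target_server_for_category("travel"))
```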
In a particular application, the page classification information may also be the page type. For example, pages may be divided by type into: HTML-type home page, Flash-type home page, import-type home page, HTML-type first-level page, second-level page corresponding to a block in an HTML-type page, third-level page corresponding to block content in an HTML-type page, general first-level page, general second-level page, list first-level page, and list second-level page. Each of these page types is preset to correspond to a server identifier, as shown in the table below:
With reference to the table above, if the type of the currently detected page is a general first-level page, the target server is determined to be the server identified as 777; if the type is an HTML-type home page, the target server is determined to be the server identified as 111.
In practice, those skilled in the art may use any page classification information, for example the attribute classification information or the tag classification information of the page, and the embodiment of the application need not be limited in this respect.
In another preferred embodiment of the application, the feature information may include the URL of the page, and each server has a numerical identifier. In this case, step 13 may specifically include the following sub-steps:
Sub-step S321: convert the URL of the currently detected page into a numerical value using a preset algorithm;
Sub-step S322: extract the server whose numerical identifier corresponds to the numerical value as the target server.
For example, suppose the black chain database is currently deployed on n servers. When the URL (Uniform Resource Locator, i.e. the web page address) of the currently detected page is obtained, the URL is used as input to a hash algorithm such as MD5 to obtain a character string (for example, a 32-character string). The string is then mapped to a numerical value using some mapping rule, and that value is taken as the identifier of the corresponding server among the n servers. If the value obtained is 2, for instance, the server identifier is 2, and the target server is determined to be the server identified as 2.
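Sub-steps S321 and S322 can be sketched with Python's `hashlib` as follows. The text leaves the mapping rule open; reducing the MD5 digest modulo n and numbering the servers 0 to n-1 are assumptions chosen for illustration, as is the function name.

```python
import hashlib

def target_server_for_url(url: str, n_servers: int) -> int:
    """Map a URL to a numerical server identifier in the range 0..n_servers-1."""
    if n_servers <= 0:
        raise ValueError("n_servers must be positive")
    # MD5 yields a 128-bit digest, written here as a 32-character hex string.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # Assumed mapping rule: interpret the digest as an integer and reduce it modulo n.
    return int(digest, 16) % n_servers

# The same URL always maps to the same server identifier.
print(target_server_for_url("http://www.example.com/index.html", 10))
```

Because the mapping is deterministic, repeated checks of the same URL are always routed to the same server, while different URLs spread across the n servers.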
Of course, the above methods of determining the corresponding target server according to the feature information of the page are merely examples. Those skilled in the art may use any method according to the actual situation, such as converting a tag character string of the page into a fixed value, and the application places no limitation on this.
Step 14: match the black chain feature database in the target server against the currently detected page to judge whether the currently detected page contains black chain feature data in the black chain feature database; if so, judge the current page to be a tampered page.
In practice, if the currently detected page does not contain any black chain feature data in the black chain feature database, the current page can be judged not to have been tampered with.
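Step 14 can be illustrated by the minimal sketch below, which treats the black chain feature data as plain keyword and URL strings and uses case-insensitive substring matching. The class `BlackChainDB`, its fields, the sample keyword, and the matching strategy are assumptions; an actual deployment could equally match with regular-expression rules.

```python
from dataclasses import dataclass, field

@dataclass
class BlackChainDB:
    """Hypothetical in-memory stand-in for the black chain feature database."""
    keywords: set = field(default_factory=set)
    urls: set = field(default_factory=set)

    def find_hits(self, page_html: str) -> list:
        """Return the features (sorted) that occur in the page, ignoring case."""
        lowered = page_html.lower()
        features = sorted(self.keywords | self.urls)
        return [f for f in features if f.lower() in lowered]

def is_tampered(page_html: str, db: BlackChainDB) -> bool:
    """Step 14: the page is judged tampered if it contains any black chain feature."""
    return bool(db.find_hits(page_html))

# The keyword is made up; the URL is the one used in the page B example.
db = BlackChainDB(keywords={"cheap pills"}, urls={"http://www.45u.com"})
print(is_tampered('<a href="http://www.45u.com">x</a>', db))
print(is_tampered("<p>hello</p>", db))
```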
By adopting an architecture in which the black chain feature database is processed and applied in a distributed manner, the embodiment of the present invention can effectively spread the detection load across servers when detection requests for multiple pages arrive concurrently, thereby effectively saving system resources.
It should be noted that, for brevity, the method embodiments are described as a series of combined actions. Those skilled in the art should understand, however, that the application is not limited by the described order of actions, because according to the application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the application.
Referring to Fig. 2, a structural block diagram of an embodiment of a device for detecting page tampering according to the application is shown. The device may specifically include the following modules:
a database generation module 21, configured to generate a black chain feature database, the black chain feature database including black chain feature data;
a database deployment module 22, configured to deploy the black chain feature database on multiple servers;
a feature information acquisition module 23, configured to obtain the feature information of the currently detected page;
a target server determination module 24, configured to determine the corresponding target server according to the feature information of the page;
a tampering detection module 25, configured to match the black chain feature database in the target server against the currently detected page and judge whether the currently detected page contains black chain feature data in the black chain feature database; if so, the current page is judged to be a tampered page.
In a preferred embodiment of the application, the server has a server identifier and the feature information includes page classification information. In this case, the target server determination module 24 may include the following submodules:
an identifier extraction submodule, configured to extract the server identifier corresponding to the classification information of the current page according to a preset correspondence between page classification information and server identifiers;
an identifier location submodule, configured to determine the server corresponding to the server identifier as the target server.
In another preferred embodiment of the application, the feature information includes the URL of the page and the server has a numerical identifier. In this case, the target server determination module 24 may include the following submodules:
a URL conversion submodule, configured to convert the URL of the currently detected page into a numerical value using a preset algorithm;
an identifier correspondence submodule, configured to extract the server whose numerical identifier corresponds to the numerical value as the target server.
In a specific implementation, the embodiment of the application may also include a database update module, configured to update the black chain feature database at preset time intervals.
In a preferred embodiment of the application, the database generation module 21 may include the following submodules:
a feature page search submodule, configured to use existing black chain feature data to search for pages containing that black chain feature data as feature pages;
a layout analysis submodule, configured to analyze the layout of the black chain feature data in the feature page;
a page element extraction submodule, configured to extract, when the layout is found to be abnormal, the page elements containing the black chain feature data from the feature page;
a black chain rule generation submodule, configured to generate a black chain rule according to the page elements;
a black chain feature data extraction submodule, configured to match the black chain rule against other feature pages, extract new black chain feature data from the matched feature pages, and save the black chain feature data to form the black chain feature database.
In a specific implementation, the black chain feature data may include tampering keywords and black chain URLs.
As an example of a specific application of the embodiment of the application, the layout analysis submodule may include the following units:
a first judging unit, configured to judge whether the position of the page element of the black chain feature data falls outside a preset threshold range, and if so, to judge that the layout of the black chain feature data in the feature page is abnormal;
and/or
a second judging unit, configured to judge whether the page element of the black chain feature data has an invisible attribute, and if so, to judge that the layout of the black chain feature data in the feature page is abnormal;
and/or
a third judging unit, configured to judge whether the page element of the black chain feature data has an attribute that hides it from the browser, and if so, to judge that the layout of the black chain feature data in the feature page is abnormal.
In a particular application, the black chain rule generation submodule may include the following unit:
a regular expression abstraction unit, configured to abstract a regular expression, as the black chain rule, from the page elements containing the tampering keyword and/or black chain URL.
Because the device embodiment essentially corresponds to the method embodiment shown in Fig. 1 above, parts not detailed in the description of this embodiment may be found in the related description of the preceding embodiment and are not repeated here.
The application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
The application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
Finally, it should be noted that relational terms such as first and second are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The method for detecting page tampering and the device for detecting page tampering provided by the application have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the application, and the description of the above embodiments is intended only to help in understanding the method of the application and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of the application, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the application.

Claims (22)

1. A page tampering detection method, comprising:
obtaining feature information of a currently detected page, the feature information including page classification information;
determining a corresponding target server according to the obtained feature information;
matching a black chain feature database in the target server against the currently detected page to judge whether the currently detected page contains black chain feature data in the black chain feature database;
if so, judging the current page to be a tampered page.
2. The method of claim 1, wherein the black chain feature database is deployed on multiple servers.
3. The method of claim 1 or 2, wherein the server has a server identifier, and the step of determining the corresponding target server according to the feature information of the page comprises:
extracting, according to a preset correspondence between page classification information and server identifiers, the server identifier corresponding to the classification information of the current page;
determining the server corresponding to the server identifier as the target server.
4. The method of claim 3, wherein the page classification information includes content category information of the page, page type classification information, tag classification information of the page, and/or attribute classification information of the page.
5. The method of claim 1 or 2, wherein the feature information further includes the URL of the page, the server has a numerical identifier, and the step of determining the corresponding server identifier according to the feature information of the page comprises:
converting the URL of the currently detected page into a numerical value using a preset algorithm;
extracting the server with the numerical identifier corresponding to the numerical value as the target server.
6. The method of any one of claims 1 to 5, wherein the black chain feature database is generated according to the following steps:
using existing black chain feature data to search for pages containing the black chain feature data as feature pages;
analyzing the layout of the black chain feature data in the feature page and, when the layout is found to be abnormal, extracting the page elements containing the black chain feature data from the feature page;
generating a black chain rule according to the page elements, matching the black chain rule against other feature pages, and extracting new black chain feature data from the matched feature pages;
saving the black chain feature data to form the black chain feature database.
7. The method of claim 6, wherein the black chain feature data includes tampering keywords and black chain URLs.
8. The method of claim 6, wherein analyzing the layout of the black chain feature data in the feature page further comprises:
judging whether the position of the page element of the black chain feature data falls outside a preset threshold range, and if so, judging that the layout of the black chain feature data in the feature page is abnormal;
and/or
judging whether the page element of the black chain feature data has an invisible attribute, and if so, judging that the layout of the black chain feature data in the feature page is abnormal;
and/or
judging whether the page element of the black chain feature data has an attribute that hides it from the browser, and if so, judging that the layout of the black chain feature data in the feature page is abnormal.
9. The method of claim 7, wherein the step of generating a black chain rule according to the page elements is:
abstracting a regular expression, as the black chain rule, from the page elements containing the tampering keyword and/or black chain URL.
10. The method of claim 7, further comprising:
updating the black chain feature database at preset time intervals.
11. A black chain database generation method, comprising:
using existing black chain feature data to search for pages containing the black chain feature data as feature pages;
analyzing the layout of the black chain feature data in the feature page and, when the layout is found to be abnormal, extracting the page elements containing the black chain feature data from the feature page;
generating a black chain rule according to the page elements, matching the black chain rule against other feature pages, and extracting new black chain feature data from the matched feature pages;
saving the black chain feature data to form a black chain feature database.
12. The method of claim 11, wherein the black chain feature data includes tampering keywords and black chain URLs.
13. The method of claim 12, wherein analyzing the layout of the black chain feature data in the feature page further comprises:
judging whether the position of the page element of the black chain feature data falls outside a preset threshold range, and if so, judging that the layout of the black chain feature data in the feature page is abnormal;
and/or
judging whether the page element of the black chain feature data has an invisible attribute, and if so, judging that the layout of the black chain feature data in the feature page is abnormal;
and/or
judging whether the page element of the black chain feature data has an attribute that hides it from the browser, and if so, judging that the layout of the black chain feature data in the feature page is abnormal.
14. The method of claim 13, wherein the step of generating a black chain rule according to the page elements is:
abstracting a regular expression, as the black chain rule, from the page elements containing the tampering keyword and/or black chain URL.
15. The method of any one of claims 11 to 14, further comprising:
updating the black chain feature database at preset time intervals.
16. A page tampering detection method, comprising:
obtaining the URL of a currently detected page;
converting the URL of the currently detected page into a numerical value using a preset algorithm;
extracting the server with the numerical identifier corresponding to the numerical value as the target server;
matching a black chain feature database in the target server against the currently detected page to judge whether the currently detected page contains black chain feature data in the black chain feature database;
if so, judging the current page to be a tampered page.
17. The method of claim 16, wherein the black chain feature database is deployed on multiple servers.
18. The method of claim 16 or 17, wherein the black chain feature database is generated according to the following steps:
using existing black chain feature data to search for pages containing the black chain feature data as feature pages;
analyzing the layout of the black chain feature data in the feature page and, when the layout is found to be abnormal, extracting the page elements containing the black chain feature data from the feature page;
generating a black chain rule according to the page elements, matching the black chain rule against other feature pages, and extracting new black chain feature data from the matched feature pages;
saving the black chain feature data to form the black chain feature database.
19. The method of claim 18, wherein the black chain feature data includes tampering keywords and black chain URLs.
20. The method of claim 18, wherein analyzing the layout of the black chain feature data in the feature page further comprises:
judging whether the position of the page element of the black chain feature data falls outside a preset threshold range, and if so, judging that the layout of the black chain feature data in the feature page is abnormal;
and/or
judging whether the page element of the black chain feature data has an invisible attribute, and if so, judging that the layout of the black chain feature data in the feature page is abnormal;
and/or
judging whether the page element of the black chain feature data has an attribute that hides it from the browser, and if so, judging that the layout of the black chain feature data in the feature page is abnormal.
21. The method of claim 18, wherein the step of generating a black chain rule according to the page elements is:
abstracting a regular expression, as the black chain rule, from the page elements containing the tampering keyword and/or black chain URL.
22. The method of any one of claims 16 to 21, further comprising:
updating the black chain feature database at preset time intervals.
CN201410318946.2A 2011-12-30 2011-12-30 Page altering detecting method and black chain data library generating method Active CN104063494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410318946.2A CN104063494B (en) 2011-12-30 2011-12-30 Page altering detecting method and black chain data library generating method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201110457654.3A CN102446255B (en) 2011-12-30 2011-12-30 Method and device for detecting page tamper
CN201410318946.2A CN104063494B (en) 2011-12-30 2011-12-30 Page altering detecting method and black chain data library generating method

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201110457654.3A Division CN102446255B (en) 2011-12-30 2011-12-30 Method and device for detecting page tamper

Publications (2)

Publication Number Publication Date
CN104063494A CN104063494A (en) 2014-09-24
CN104063494B true CN104063494B (en) 2017-11-14

Family

ID=51551208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410318946.2A Active CN104063494B (en) 2011-12-30 2011-12-30 Page altering detecting method and black chain data library generating method

Country Status (1)

Country Link
CN (1) CN104063494B (en)

Also Published As

Publication number Publication date
CN104063494A (en) 2014-09-24

Similar Documents

Publication Publication Date Title
CN102436563B (en) Method and device for detecting page tampering
CN102446255B (en) Method and device for detecting page tamper
CN108965245B (en) Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model
CN103544436B (en) System and method for distinguishing phishing websites
CN102591965B (en) Method and device for detecting black chain
CN103544176B (en) Method and apparatus for generating the page structure template corresponding to multiple pages
CN106685936B (en) Webpage tampering detection method and device
CN103559235B (en) A kind of online social networks malicious web pages detection recognition methods
US20150295942A1 (en) Method and server for performing cloud detection for malicious information
US11907644B2 (en) Detecting compatible layouts for content-based native ads
CN104077396A (en) Method and device for detecting phishing website
CN104156490A (en) Method and device for detecting suspicious phishing webpage based on character recognition
CN104036190A (en) Method and device for detecting page tampering
CN103679053B (en) A kind of detection method of webpage tamper and device
Zhang et al. Web phishing detection based on page spatial layout similarity
Yang et al. Scalable detection of promotional website defacements in black hat SEO campaigns
CN107786537A (en) A kind of lonely page implantation attack detection method based on internet intersection search
CN103647767A (en) Website information display method and apparatus
CN113032655A (en) Method for extracting and fixing dark network electronic data
CN104036189A (en) Page distortion detecting method and black link database generating method
CN104077353B (en) A kind of method and device of detecting black chain
CN106446123A (en) Webpage verification code element identification method
CN108183902A (en) A kind of recognition methods of malicious websites and device
CN104063494B (en) Page altering detecting method and black chain data library generating method
Borgolte et al. Relevant change detection: a framework for the precise extraction of modified and novel web-based content as a filtering technique for analysis engines

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee after: Beijing Qizhi Business Consulting Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi Software (Beijing) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20220329

Address after: 100016 1773, 15 / F, 17 / F, building 3, No.10, Jiuxianqiao Road, Chaoyang District, Beijing

Patentee after: 360 Digital Security Technology Group Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Beijing Qizhi Business Consulting Co.,Ltd.