CN104063491B - Method and device for detecting page tampering - Google Patents

Method and device for detecting page tampering

Info

Publication number
CN104063491B
Authority
CN
China
Prior art keywords
page
black chain
characteristic
black
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410318916.1A
Other languages
Chinese (zh)
Other versions
CN104063491A (en)
Inventor
刘起
郭峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qizhi Business Consulting Co ltd
Beijing Qihoo Technology Co Ltd
360 Digital Security Technology Group Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201410318916.1A
Priority claimed from CN201110457654.3A (CN102446255B)
Publication of CN104063491A
Application granted
Publication of CN104063491B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

This application provides a method and a device for detecting page tampering. The method includes: generating a black chain feature database and deploying it on multiple servers, the black chain feature database including black chain feature data; obtaining feature information of the page currently under detection; determining the corresponding target server according to the feature information of the page; and matching the page currently under detection against the black chain feature database on the target server to judge whether the page contains black chain feature data from the database, and if so, determining that the current page is a tampered page. With as little manual intervention as possible, the application improves the efficiency and accuracy of page tampering detection, especially when many pages must be detected and a large amount of black chain feature data must be matched.

Description

Method and device for detecting page tampering
This patent application is a divisional application of the Chinese invention patent application filed on December 30, 2011, with application No. 201110457654.3, entitled "Method and device for detecting page tampering".
Technical field
This application relates to the technical field of computer security, and in particular to a method for detecting page tampering and a device for detecting page tampering.
Background
The World Wide Web has become a carrier of vast amounts of information. To extract and use this information efficiently, search engines, as tools that help people retrieve information, have become the entrance and guide through which users access the World Wide Web.
SEO (Search Engine Optimization) is a popular form of online marketing. Its main purpose is to increase the exposure of specific keywords and thereby the visibility of a website, raising its search engine ranking, increasing its traffic, and ultimately improving its sales or publicity capability. A website's SEO data indicate how much of the site's content is indexed by search engines; the more content is indexed, the more easily users can find it.
Exploiting this characteristic of search engines, some tools now provide black chain techniques. Black chains are a fairly common black-hat SEO technique. Generally, a black chain is a backlink obtained from another website by improper means. Most commonly, an attacker exploits vulnerabilities in website programs to obtain a WEBSHELL (a degree of permission, obtained anonymously by an intruder through a website port, to operate on the website server) on a website with high search engine weight or PR (PageRank), and then places links to the attacker's own website on the compromised website.
Black chains mainly target search engines. For example, a simple analysis of the top-ranked websites in a search result, checking their site structure, keyword distribution and external links, may reveal sites that rank very well and have millions of keyword-related pages, yet have an ordinary structure, an unsuitable keyword density and, most importantly, no outbound links at all. Checking their backlinks shows that the vast majority of their external links come from black chains. In SEO, ranking is determined mainly by high-quality external links, which should account for more than 50%, so placing black chains on high-weight websites benefits a site's ranking. In addition, black chains usually take the form of hidden links, so website administrators find it difficult to notice during routine inspection that black chains have been planted on their sites. At present, black chains are mostly used by highly profitable black (gray) industries, such as private game servers, medical services and niche high-profit businesses, and black chains themselves have become an industry. In practice, if a user has not taken security precautions, opening a page tampered with by black chains can easily lead to infection by viruses on the website.
In the prior art, black chains are usually detected manually, for example by webmasters, who match a large number of manually collected tampering keywords, such as hack, hacked by, lottery ticket, property experience, plug-in and private server, against the HTML text of web pages to judge whether a page has been tampered with by black chains. For example, common features of pages tampered with by black chains include text with which hackers show off. However, this manual detection depends heavily on manually collected tampering keywords and on periodic manual checks, so it is very inefficient.
Moreover, when many pages must be detected and a large amount of black chain feature data (such as tampering keywords) must be matched, a manual approach clearly cannot cope.
Therefore, a technical problem that those skilled in the art urgently need to solve is to provide a mechanism for detecting page tampering that, with as little manual intervention as possible, improves the efficiency and accuracy of page tampering detection, especially when many pages must be detected and a large amount of black chain feature data must be matched.
Summary of the invention
This application provides a method for detecting page tampering that, with as little manual intervention as possible, improves the efficiency and accuracy of page tampering detection, especially when many pages must be detected and a large amount of black chain feature data must be matched.
This application also provides a device for detecting page tampering, to ensure that the above method can be applied and implemented in practice.
To solve the above problems, this application discloses a method for detecting page tampering, including:
Generating a black chain feature database and deploying the black chain feature database on multiple servers, the black chain feature database including black chain feature data;
Obtaining feature information of the page currently under detection;
Determining the corresponding target server according to the feature information of the page;
Matching the page currently under detection against the black chain feature database on the target server, and judging whether the page contains black chain feature data from the black chain feature database; if so, determining that the current page is a tampered page.
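The final matching step can be sketched as follows. This is a minimal illustration only, assuming the black chain feature database can be represented as in-memory sets of strings; the class name BlackChainDB and the function is_tampered are hypothetical, and the sample keyword and URL are taken from examples given elsewhere in this document.

```python
from dataclasses import dataclass, field


@dataclass
class BlackChainDB:
    """Hypothetical in-memory stand-in for a black chain feature database."""
    keywords: set = field(default_factory=set)  # tampering keywords
    urls: set = field(default_factory=set)      # black chain URLs


def is_tampered(page_html: str, db: BlackChainDB) -> bool:
    """Return True if the page contains any black chain feature data."""
    return any(feature in page_html for feature in db.keywords | db.urls)


db = BlackChainDB(keywords={"hacked by"}, urls={"http://www.45u.com"})
print(is_tampered('<a href="http://www.45u.com">link</a>', db))
print(is_tampered("<p>ordinary content</p>", db))
```

Plain substring matching is used here only to keep the sketch short; a real deployment would more likely match compiled black chain rules against the page.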
Preferably, each server has a server identifier, the feature information includes page classification information, and the step of determining the corresponding target server according to the feature information of the page includes:
Extracting the server identifier corresponding to the classification information of the current page, according to a preset correspondence between page classification information and server identifiers;
Determining the server corresponding to that server identifier as the target server.
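A minimal sketch of this classification-based routing, assuming the preset correspondence is a simple lookup table. The category names, server identifiers and addresses below are all hypothetical placeholders.

```python
# Hypothetical preset correspondence: page classification -> server identifier.
CATEGORY_TO_SERVER_ID = {
    "news": "server-01",
    "games": "server-02",
    "medical": "server-03",
}

# Hypothetical registry: server identifier -> server address.
SERVER_REGISTRY = {
    "server-01": "10.0.0.1:8080",
    "server-02": "10.0.0.2:8080",
    "server-03": "10.0.0.3:8080",
}


def target_server_for_category(category: str) -> str:
    """Look up the server identifier for a category, then resolve the server."""
    server_id = CATEGORY_TO_SERVER_ID[category]
    return SERVER_REGISTRY[server_id]


print(target_server_for_category("games"))
```

A table lookup keeps pages of the same classification on the same server, which would allow each server to hold only the black chain feature data relevant to its classification.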
Preferably, the feature information includes the URL of the page, each server has a numeric identifier, and the step of determining the corresponding target server according to the feature information of the page includes:
Converting the URL of the page currently under detection into a numeric value using a preset algorithm;
Taking the server whose numeric identifier corresponds to that numeric value as the target server.
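The text does not name the preset algorithm. One plausible reading, shown here purely as an assumption, is to hash the URL and take the result modulo the number of servers, with servers numbered from 0. The function names and the choice of MD5 are illustrative, not taken from the source.

```python
import hashlib


def url_to_number(url: str) -> int:
    """Convert a URL to a number (assumption: first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def target_server_for_url(url: str, server_count: int) -> int:
    """Return the numeric identifier (0 .. server_count - 1) of the target server."""
    return url_to_number(url) % server_count


print(target_server_for_url("http://www.45u.com", 4))
```

Because the mapping is deterministic, repeated detection requests for the same URL always reach the same server, and requests for different URLs spread roughly evenly across servers.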
Preferably, the page classification information includes content classification information of the page, type classification information of the page, and attribute classification information of the page.
Preferably, the step of generating the black chain feature database includes:
Using existing black chain feature data to search for pages containing that black chain feature data, and taking them as feature pages;
Analyzing the layout of the black chain feature data in a feature page and, when the layout is found to be abnormal, extracting from that feature page the page elements containing the black chain feature data;
Generating a black chain rule from the page elements, matching the black chain rule against other feature pages, and extracting new black chain feature data from the matched feature pages;
Saving the black chain feature data to form the black chain feature database.
Preferably, the black chain feature data include tampering keywords and black chain URLs.
Preferably, the step of analyzing the layout of the black chain feature data in a feature page includes:
Judging whether the position of the page element of the black chain feature data is within a preset threshold range; if so, determining that the layout of the black chain feature data in the feature page is abnormal;
And/or
Judging whether the attribute of the page element of the black chain feature data is an invisible attribute; if so, determining that the layout of the black chain feature data in the feature page is abnormal;
And/or
Judging whether the attribute of the page element of the black chain feature data is an attribute hidden from the browser; if so, determining that the layout of the black chain feature data in the feature page is abnormal.
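The three layout checks can be sketched as predicates over a rendered page element. Everything here is an assumption made for illustration: the PageElement structure, the reading of the preset threshold range as a far off-screen coordinate range, and the mapping of "invisible attribute" to visibility:hidden or a non-positive size and of "hidden from the browser" to display:none.

```python
from dataclasses import dataclass, field


@dataclass
class PageElement:
    """Hypothetical rendered element: geometry in pixels plus computed style."""
    left: float
    top: float
    width: float
    height: float
    style: dict = field(default_factory=dict)


# Assumed threshold: coordinates at or below this value count as off-screen.
OFFSCREEN_LIMIT = -500.0


def position_abnormal(el: PageElement) -> bool:
    """First check: the element is positioned in the suspicious range."""
    return el.left <= OFFSCREEN_LIMIT or el.top <= OFFSCREEN_LIMIT


def invisible(el: PageElement) -> bool:
    """Second check: the element is rendered but cannot be seen."""
    return (el.style.get("visibility") == "hidden"
            or el.width <= 0 or el.height <= 0)


def hidden_from_browser(el: PageElement) -> bool:
    """Third check: the element is excluded from rendering altogether."""
    return el.style.get("display") == "none"


def layout_abnormal(el: PageElement) -> bool:
    """The checks are combined with and/or; any single hit marks the layout abnormal."""
    return position_abnormal(el) or invisible(el) or hidden_from_browser(el)


print(layout_abnormal(PageElement(-9999, 0, 100, 20)))
print(layout_abnormal(PageElement(10, 10, 100, 20)))
```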
Preferably, the step of generating a black chain rule from the page elements is:
Abstracting a regular expression, as the black chain rule, from the page elements containing the tampering keywords and/or black chain URLs.
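The text does not specify how the regular expression is abstracted. One simple possibility, shown here only as an assumption, is to keep the markup of the abnormal page element as a literal pattern and replace the concrete black chain URL and tampering keyword with capture groups. The function name abstract_rule and the sample markup are hypothetical.

```python
import re


def abstract_rule(element_html: str, url: str, keyword: str) -> "re.Pattern":
    """Generalize one abnormal page element into a reusable black chain rule."""
    # Escape the markup so it is matched literally.
    pattern = re.escape(element_html)
    # Replace the concrete URL and keyword with named capture groups.
    pattern = pattern.replace(re.escape(url), r"(?P<url>https?://[^\s\"'<>]+)")
    pattern = pattern.replace(re.escape(keyword), r"(?P<keyword>[^<]+)")
    return re.compile(pattern)


element = '<a href="http://www.45u.com" style="display:none">private server</a>'
rule = abstract_rule(element, "http://www.45u.com", "private server")

other = '<p><a href="http://spam.example/x" style="display:none">cheap lottery</a></p>'
match = rule.search(other)
print(match.group("url"), "|", match.group("keyword"))
```

This sketch assumes the URL and the keyword each occur once in the element and do not contain one another; a practical rule generator would also need to tolerate differences in attribute order and whitespace.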
Preferably, the method further includes:
Updating the black chain feature database at preset time intervals.
This application also discloses a device for detecting page tampering, including:
A database generation module, for generating a black chain feature database, the black chain feature database including black chain feature data;
A database deployment module, for deploying the black chain feature database on multiple servers;
A feature information acquisition module, for obtaining feature information of the page currently under detection;
A target server determination module, for determining the corresponding target server according to the feature information of the page;
A tampering detection module, for matching the page currently under detection against the black chain feature database on the target server and judging whether the page contains black chain feature data from the black chain feature database; if so, determining that the current page is a tampered page.
Preferably, each server has a server identifier, the feature information includes page classification information, and the target server determination module includes:
An identifier extraction submodule, for extracting the server identifier corresponding to the classification information of the current page, according to a preset correspondence between page classification information and server identifiers;
An identifier location submodule, for determining the server corresponding to that server identifier as the target server.
Preferably, the feature information includes the URL of the page, each server has a numeric identifier, and the target server determination module includes:
A URL conversion submodule, for converting the URL of the page currently under detection into a numeric value using a preset algorithm;
An identifier correspondence submodule, for taking the server whose numeric identifier corresponds to that numeric value as the target server.
Preferably, the database generation module includes:
A feature page search submodule, for using existing black chain feature data to search for pages containing that black chain feature data and taking them as feature pages;
A layout analysis submodule, for analyzing the layout of the black chain feature data in a feature page;
A page element extraction submodule, for extracting from the feature page, when the layout is found to be abnormal, the page elements containing the black chain feature data;
A black chain rule generation submodule, for generating a black chain rule from the page elements;
A black chain feature data extraction submodule, for matching the black chain rule against other feature pages, extracting new black chain feature data from the matched feature pages, and saving the black chain feature data to form the black chain feature database.
Preferably, the layout analysis submodule further includes:
A first judging unit, for judging whether the position of the page element of the black chain feature data is within a preset threshold range; if so, determining that the layout of the black chain feature data in the feature page is abnormal;
And/or
A second judging unit, for judging whether the attribute of the page element of the black chain feature data is an invisible attribute; if so, determining that the layout of the black chain feature data in the feature page is abnormal;
And/or
A third judging unit, for judging whether the attribute of the page element of the black chain feature data is an attribute hidden from the browser; if so, determining that the layout of the black chain feature data in the feature page is abnormal.
Preferably, the black chain feature data include tampering keywords and black chain URLs, and the black chain rule generation submodule includes:
A regular expression extraction unit, for abstracting a regular expression, as the black chain rule, from the page elements containing the tampering keywords and/or black chain URLs.
Preferably, the device further includes:
A database update module, for updating the black chain feature database at preset time intervals.
Compared with the prior art, this application has the following advantages:
By deploying the generated black chain feature database on multiple servers, this application spreads the processing load that would otherwise fall on a single server or client. When multiple concurrent page tampering detection requests are received, the server that handles each detection is determined from the feature information of the requested page, and that server performs the actual tampering detection. The efficiency and accuracy of page tampering detection are thereby effectively improved when many pages must be detected and a large amount of black chain feature data must be matched.
Furthermore, this application judges, according to the black chain feature database, whether the page currently under detection contains black chain feature data, and determines a page that contains black chain feature data to be a tampered page. In the embodiments of this application, the black chain features in the database need not all be collected manually; they can be collected automatically as follows. Known black chain feature data are combined with search engine technology, and a web crawler fetches the pages containing that black chain feature data as feature pages. The layout of the black chain feature data in these feature pages is analyzed. If the layout is abnormal, the page elements containing the black chain feature data are extracted from the abnormal feature page, and a general regular expression is formed as a black chain rule. The black chain rule is matched against other feature pages, and new black chain feature data are extracted from the matched feature pages. Collecting black chain feature data in this way requires no manual intervention and is very fast, and the collected data are highly accurate, so using them in page tampering detection effectively improves detection efficiency and accuracy.
In addition, the embodiments of this application combine black chain feature data with search engine technology: a web crawler fetches pages containing the black chain feature data, the layout of the pages containing the data is analyzed to judge whether a page has been tampered with, the page elements containing the black chain feature data are extracted from the tampered page, and a general regular expression is finally formed as a black chain rule. The application needs no manual intervention and no additional system. By matching regular expressions against pages as black chain rules to extract more black chain feature data, and in turn training more black chain rules, it adapts better to the current industrialization of black chains, reduces cost, finds tampered pages faster and more comprehensively, and effectively improves the efficiency of page tampering detection. Moreover, because the implementation is based on web crawler technology and the isolation sandbox technology of the browser kernel, the security, credibility and accuracy of page tampering detection are also effectively ensured.
Description of the drawings
Fig. 1 is a flowchart of an embodiment of a method for detecting page tampering according to this application;
Fig. 2 is a structural block diagram of an embodiment of a device for detecting page tampering according to this application.
Detailed description
To make the above objects, features and advantages of this application clearer and easier to understand, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Black chains are also known as "network psoriasis". As is well known, a search engine has a ranking system: websites that the search engine considers good appear higher in the search results and accordingly receive a higher click-through rate. A search engine uses many indicators to measure the quality of a website, and one very important indicator is the site's external links. If a website's external links are very good, its ranking in the search engine rises accordingly.
For example, suppose a newly opened website ranks very low in a search engine, and later a high-weight website (one with a good ranking and high quality) links to it. The search engine then reasons that, since a website with such high weight links to the new website, the new website's weight cannot be low, so the new website's ranking rises. If several high-weight websites all link to it, its ranking rises very quickly.
Conversely, a newly opened website with no background and no relationships will not have a high weight, so the search engine will not rank it highly and it will appear relatively low in the search results. Exploiting this characteristic of search engines, some tools now provide black chain techniques: an attacker invades high-weight websites and, after a successful intrusion, inserts links to the attacker's website into pages of the invaded website to achieve the effect of a link, and hides those links so that visitors cannot see any link on the pages of the invaded website.
However, a considerable share of the websites that currently use black chain techniques to raise their search ranking are dangerous ones, such as private game server sites, account-stealing trojan sites, phishing sites and advertising sites. A search engine would not normally rank such dangerous websites highly, but through black chains their ranking becomes very high. In this situation, users of the search engine are very likely to click and open these websites, and a user who has not taken security precautions can easily be infected by viruses on them.
Having recognized the seriousness of this problem, the inventors propose the following as one of the core ideas of the embodiments of this application. By deploying the generated black chain feature database on multiple servers, the application spreads the processing load that would otherwise fall on a single server or client. When multiple concurrent page tampering detection requests are received, the server that handles each detection is determined from the feature information of the requested page, and that server performs the actual tampering detection, so that the efficiency and accuracy of page tampering detection are effectively improved when many pages must be detected and a large amount of black chain feature data must be matched. Also, in the embodiments of this application, the black chain features in the database need not all be collected manually; they can be collected automatically as follows. Known black chain feature data are combined with search engine technology, and a web crawler fetches the pages containing that black chain feature data as feature pages. The layout of the black chain feature data in these feature pages is analyzed. If the layout is abnormal, the page elements containing the black chain feature data are extracted from the abnormal feature page and a general regular expression is formed as a black chain rule. The black chain rule is matched against other feature pages, and new black chain feature data are extracted from the matched feature pages. Collecting black chain feature data in this way requires no manual intervention and is very fast, and the collected data are highly accurate, so using them in page tampering detection effectively improves detection efficiency and accuracy.
Referring to Fig. 1, a flowchart of the steps of an embodiment of a method for detecting page tampering according to this application is shown. The method may specifically include the following steps:
Step 11: generate a black chain feature database and deploy the black chain feature database on multiple servers, the black chain feature database including black chain feature data;
In a specific implementation, the black chain feature data may include tampering keywords and black chain URLs, for example the tampering keyword "Legend private server release" and the black chain URL "http://www.45u.com".
In a preferred embodiment of this application, the black chain feature database can be generated by the following sub-steps:
Sub-step 111: use existing black chain feature data to search for pages containing that black chain feature data, and take them as feature pages;
Sub-step 112: analyze the layout of the black chain feature data in a feature page and, when the layout is found to be abnormal, extract from that feature page the page elements containing the black chain feature data;
Sub-step 113: generate a black chain rule from the page elements, match the black chain rule against other feature pages, and extract new black chain feature data from the matched feature pages;
Sub-step 114: save the black chain feature data to form the black chain feature database;
In a specific implementation, the existing black chain feature data may include tampering keywords and black chain URLs. Based on the existing black chain feature data, a web crawler fetches the pages that contain the black chain feature data, and these pages are taken as feature pages.
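The way sub-steps 111, 113 and 114 expand the database from seed data can be sketched as below. This is a simplified illustration: the pages are an in-memory dictionary rather than crawler output, the layout analysis of sub-step 112 is assumed to have already produced the rule, and the rule itself, the function name expand_feature_db and the sample pages are hypothetical.

```python
import re

# Hypothetical black chain rule, assumed to have been abstracted already
# from a page element with an abnormal layout.
RULE = re.compile(
    r'<a href="(?P<url>[^"]+)" style="display:none">(?P<keyword>[^<]+)</a>'
)


def expand_feature_db(pages: dict, seed_features: set, rule=RULE) -> set:
    """Grow the feature set by applying a rule to pages that contain seed data."""
    db = set(seed_features)
    # Sub-step 111: pages containing existing feature data become feature pages.
    feature_pages = [html for html in pages.values()
                     if any(seed in html for seed in seed_features)]
    # Sub-step 113: match the rule in feature pages and collect new feature data.
    for html in feature_pages:
        for match in rule.finditer(html):
            db.add(match.group("url"))
            db.add(match.group("keyword"))
    # Sub-step 114: the returned set stands in for the saved database.
    return db


pages = {
    "http://victim-a.example/": (
        '<a href="http://www.45u.com" style="display:none">private server</a>'
        '<a href="http://spam.example/" style="display:none">lottery</a>'
    ),
    "http://clean.example/": "<p>nothing hidden here</p>",
}
db = expand_feature_db(pages, {"http://www.45u.com"})
print(sorted(db))
```

Starting from one known black chain URL, the rule surfaces a second URL and two keywords planted alongside it, which is the self-expanding behaviour the sub-steps describe.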
As is well known, a search engine automatically retrieves web pages from the World Wide Web by means of a web crawler. A web crawler, also known as a Web Spider, finds web pages through the link addresses in web pages. Starting from some page of a website (usually the home page), it reads the content of the page, finds the other link addresses in it, and follows these link addresses to the next pages, repeating the process until all the pages of the website have been fetched. If the entire Internet is regarded as one website, a Web Spider can use this principle to fetch all the web pages on the Internet.
Current web crawlers can be divided into general crawlers and focused crawlers. A general crawler is based on breadth-first search: starting from the URLs (Uniform Resource Locators) of one or several initial pages, it obtains the URLs on those pages and, while fetching pages, keeps extracting new URLs from the current page and putting them into a queue until a stop condition of the system is met. A focused crawler is a program that downloads web pages automatically in order to fetch related page resources in a targeted way. According to a predetermined crawl target, it selectively visits web pages and related links on the World Wide Web to obtain the required information. Unlike a general crawler, a focused crawler does not aim for broad coverage; instead it targets pages related to a specific topic, preparing data resources for topic-oriented user queries.
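The breadth-first strategy of a general crawler can be sketched as follows. To keep the example self-contained it walks an in-memory link graph instead of fetching pages over the network; the function name, the get_links callback and the page limit used as the stop condition are all illustrative assumptions.

```python
from collections import deque


def crawl_breadth_first(seed_urls, get_links, max_pages=100):
    """Visit pages breadth-first, queueing newly discovered URLs."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:  # stop condition
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):            # URLs extracted from this page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph standing in for real pages.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(crawl_breadth_first(["a"], lambda url: graph.get(url, [])))
```

A focused crawler would differ mainly in the loop body: links would be queued only if they look relevant to the crawl target, for example if the page contains known black chain feature data.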
In existing black chain techniques, hidden links rely on a few fixed tricks. For example, because search engines do not recognize JavaScript well, a hidden div is output through JavaScript. In this case the links cannot be seen by a person looking directly at the page, yet the search engine treats them as valid links. The code works as follows: first, the opening part of the div is written by JavaScript, with display set to none; then a table is output that contains the black chains to be planted; finally, the closing part of the div is output by JavaScript.
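Once such a page has been rendered, the resulting markup contains links nested inside a container whose display is none. The sketch below finds those links with Python's standard html.parser. It assumes the markup after script execution is available as a string, handles only inline style attributes, and expects well-nested tags; the class name HiddenLinkFinder and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag and so must not be pushed on the stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class HiddenLinkFinder(HTMLParser):
    """Collect href values of links inside elements styled display:none."""

    def __init__(self):
        super().__init__()
        self.stack = []         # one flag per open element: is it hidden itself?
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        hidden = "display:none" in style
        # A link is hidden if it, or any open ancestor, is hidden.
        if tag == "a" and attrs.get("href") and (hidden or any(self.stack)):
            self.hidden_links.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append(hidden)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


rendered = (
    '<div style="display: none"><table><tr><td>'
    '<a href="http://www.45u.com">x</a>'
    '</td></tr></table></div>'
    '<a href="http://ok.example/">visible</a>'
)
finder = HiddenLinkFinder()
finder.feed(rendered)
print(finder.hidden_links)
```

Static parsing of the original page source would miss this trick, because the div exists only after the script has run; this is one reason to analyze the page as rendered by a browser kernel.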
It can quickly and efficiently discover page-out using the isolation sandbox technology of browser kernel to be tampered.Specifically, The isolation sandbox technology of browser kernel is browser kernel, such as IE or firefox, constructs the virtual execution of a safety Environment.User will be redirected to by any disk write operation made by browser in a specific temporary folder.This Sample, even if comprising virus in webpage, wooden horse, the rogue programs such as advertisement are also to be installed into temporary file after installing by force In folder, it will not cause damages to user equipment.Browser kernel is responsible for the explanation (such as HTML, JavaScript) to webpage grammer And render (display) webpage.So commonly referred browser kernel is namely downloaded the page, parses, executes, renders Engine, which determines how browser shows the content of webpage and the format information of the page.
Black chain feature can safely be analyzed using isolation sandbox technology according to the aforesaid operations characteristic of browser kernel Whether layout of the data in characteristics page is abnormal, specifically, can pass through the page of the analysis black chain characteristic Surface element position and attribute, to judge whether layout of the black chain characteristic in characteristics page be abnormal, for example, judging described black Whether not in preset threshold range, the page elements of the black chain characteristic are for the position of the page elements of chain characteristic No have a sightless attribute, and/or, whether the page elements of the black chain characteristic have the category hidden to browser Property, if so, judging that layout of the black chain characteristic in characteristics page is abnormal.If for example, detecting the hyperlink of some page It is sightless to connect, alternatively, the length, width and height of some html tag element are negative values in the page, then can determine that the layout of the page is different Often, it is the page being tampered.
When the layout is found to be abnormal, the page elements containing the tampering keyword and/or the black chain URL are extracted from the feature page with the abnormal layout; a regular expression is then abstracted from those page elements to serve as the black chain rule.
As is well known, a regular expression is a tool for text matching, usually composed of ordinary characters and metacharacters. Ordinary characters include uppercase and lowercase letters and digits, while metacharacters have special meanings. Matching a regular expression can be understood as finding, in a given string, the parts that conform to the given regular expression. A string may contain more than one part that satisfies the regular expression, in which case each such part is called a match. In this document, "match" may carry three senses: an adjectival sense, as in a string matching an expression; a verbal sense, as in matching a regular expression in a string; and a nominal sense, namely the "part of a string that satisfies the given regular expression" just mentioned.
The rules for constructing regular expressions are illustrated below by example.
Suppose we want to search for hi; the regular expression hi can then be used. This regular expression exactly matches a string consisting of two characters, the first being h and the second i. In practice, regular expressions can be made to ignore case. Many words contain the two consecutive characters hi, such as him, history, and high, so searching with hi also finds the hi inside those words. To find exactly the word hi, \bhi\b should be used. Here \b is a metacharacter of regular expressions that represents the beginning or end of a word, that is, a word boundary. Although English words are usually separated by spaces, punctuation, or line breaks, \b does not match any of these separator characters; it matches only a position. To find hi followed, not far behind, by Lucy, \bhi\b.*\bLucy\b should be used. Here . is another metacharacter, matching any character other than a newline. * is also a metacharacter, but it represents a quantity: it specifies that the content preceding * may be repeated any number of times for the whole expression to match. The meaning of \bhi\b.*\bLucy\b is now clear: first the word hi, then any number of arbitrary characters (but not newlines), and finally the word Lucy.
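The behavior of these three expressions can be checked with a short sketch using Python's standard `re` module; the sample strings and variable names are made up for illustration.

```python
import re

sample = "him history high hi"

# Plain "hi" also finds the letters inside him, history and high.
loose = re.findall(r"hi", sample)

# \b restricts the match to the standalone word.
exact = re.findall(r"\bhi\b", sample)

# "." does not match a newline, so hi and Lucy must be on the same line.
same_line = re.search(r"\bhi\b.*\bLucy\b", "hi, have you met Lucy yet?")
across_lines = re.search(r"\bhi\b.*\bLucy\b", "hi\nLucy")

# Case can be ignored with a flag.
ignore_case = re.search(r"\bhi\b", "Hi there", re.IGNORECASE)

print(len(loose), exact, bool(same_line), bool(across_lines), bool(ignore_case))
```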
For example, from the HTML fragment of page A, whose layout is abnormal, the following page element containing black chain feature data is extracted:
<script>document.write('<D'+'iv st'+'yle'+'="po'+'si'+'tio'+'n:a'+'bso'+'lu'+'te;l'+'ef'+'t:'+'-'+'10'+'00'+'0'+'p'+'x;'+'"'+'>');</script>××××<script>document.write('<'+'/d'+'i'+'v>');</script>
The regular expression generated from this page element as the black chain rule is:
<script.*>document\.write.*\(.*\+.*\+.*\+.*\+.*\+.*\).*</script>([\S\s]+)</div>
As another example, from the HTML fragment of page B, whose layout is abnormal, the following page element containing black chain feature data is extracted:
<a href="http://www.45u.com" style="margin-left:-83791;">
The regular expression generated from this page element as the black chain rule is:
<a\s*href\s*=["'].+["']\s*style=["'][\w+\-]+:-[0-9]+.*["'].*>.*</a>
Certainly, the method for the black chain rule of above-mentioned generation is solely for example, and those skilled in the art adopt according to actual conditions Generating mode with any black chain rule is all feasible, and the application is to this without limiting.
By matching the black chain rules against other feature pages, more black chain feature data can be extracted and more black chain rules trained, eventually forming a black chain feature database covering black chains across the whole web.
Because planting black chains has now become an industrial chain, the same tampering keywords and/or black chain URLs appear in large numbers in other tampered pages. Using regular expressions as black chain rules to match against pages, so as to extract more black chain feature data and train more black chain rules, suits the current industrialization of black chains: tampered pages can be discovered faster and in greater numbers, which effectively improves the efficiency of detecting page tampering.
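One way to picture this expansion step is the sketch below, which applies a rule to further pages and collects URLs that are not yet in the database. The function `harvest_new_urls`, the tightened variant of the page B rule, and the sample pages are all assumptions made for the illustration.

```python
import re

# Tightened variant of the page B rule (an assumption): character classes
# instead of greedy ".*" keep each match inside a single <a> element.
HIDDEN_LINK_RULE = re.compile(
    r"""<a\s*href\s*=["'][^"']+["']\s*style=["'][\w+\-]+:-[0-9]+[^"']*["'][^>]*>[^<]*</a>""",
    re.IGNORECASE,
)
HREF = re.compile(r"""href\s*=\s*["']\s*([^"'\s]+)""", re.IGNORECASE)


def harvest_new_urls(pages, rule, known_urls):
    """Return URLs found in rule-matched elements that are not yet known."""
    new_urls = set()
    for html in pages:
        for match in rule.finditer(html):
            for url in HREF.findall(match.group(0)):
                if url not in known_urls:
                    new_urls.add(url)
    return new_urls


pages = [
    '<p>news</p><a href="http://www.45u.com" style="margin-left:-83791;">x</a>',
    '<a href="http://spam.example/a" style="margin-left:-9999px;">cheap</a>'
    '<a href="http://ok.example/" style="color:red;">ok</a>',
]
found = harvest_new_urls(pages, HIDDEN_LINK_RULE, known_urls={"http://www.45u.com"})
print(found)
```

The newly found URLs would then be saved as black chain feature data and could seed the search for further feature pages.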
To handle the situation where a large number of pages must be detected and a large amount of black chain feature data must be matched, in the embodiments of the present application the generated black chain feature database needs to be deployed on multiple servers, for example on 10 back-end servers, with identical black chain feature database content on every server.
In a specific implementation, because black chain feature data is time-sensitive, an update of the black chain feature database can be initiated at preset time intervals; specifically, the update can be completed by repeating sub-steps S111-S114 above.
Step 12: obtain the characteristic information of the page under detection;
Step 13: determine the corresponding destination server according to the characteristic information of the page;
In a specific implementation, a server identifier can be set for each server on which the black chain feature database is deployed. The identifier may be set according to any rule and in any form, for example numeric ordering or character ordering; the present application does not restrict this.
As an example of a specific application of the embodiments of the present application, the characteristic information may include page classification information, in which case step 103 may specifically include the following sub-steps:
Sub-step S311: according to the preset correspondence between page classification information and server identifiers, extract the server identifier corresponding to the classification information of the current page;
Sub-step S312: determine the server corresponding to that server identifier as the destination server.
In a specific implementation, the page classification information may be content classification information of the page. For example, pages are divided by content into a game category, film category, novel category, video category, music category, shopping category, mailbox category, lifestyle category, bank category, travel category, and so on; each of these preset content categories corresponds to a server identifier, as shown in the following table:
With reference to the table above, if the content classification obtained for the page under detection is the game category, the destination server is determined to be the server identified by aaa; if the content classification obtained is the travel category, the destination server is determined to be the server identified by kkk.
In a specific application, the page classification information may also be classification information on page type. For example, pages are divided by page type into: HTML-type home page, Flash-type home page, imported block in the home page, HTML-type first-level page, second-level page corresponding to block content in an HTML-type page, third-level page corresponding to block content in an HTML-type page, general first-level page, general second-level page, list first-level page, and list second-level page; each of these preset page types corresponds to a server identifier, as shown in the following table:
With reference to the table above, if the type obtained for the page under detection is general first-level page, the destination server is determined to be the server identified by 777; if the type obtained is HTML-type home page, the destination server is determined to be the server identified by 111.
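Sub-steps S311 and S312 amount to a table lookup, which might be sketched as follows. Only the four correspondences stated in the text (game to aaa, travel to kkk, HTML-type home page to 111, general first-level page to 777) are filled in; the dictionary keys, the function name `destination_server`, and the error handling are assumptions.

```python
# Preset correspondence between page classification and server identifier.
# Only the entries mentioned in the text are listed; the rest are unknown here.
CONTENT_CATEGORY_TO_SERVER = {
    "game": "aaa",
    "travel": "kkk",
}
PAGE_TYPE_TO_SERVER = {
    "html_home_page": "111",
    "general_first_level_page": "777",
}


def destination_server(classification: str, table: dict) -> str:
    """S311: look up the server identifier; S312: return it as the destination."""
    if classification not in table:
        raise KeyError(f"no server configured for classification {classification!r}")
    return table[classification]


print(destination_server("game", CONTENT_CATEGORY_TO_SERVER))
print(destination_server("general_first_level_page", PAGE_TYPE_TO_SERVER))
```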
In practice, those skilled in the art may use any page classification information, for example attribute classification information or tag classification information of the page; the embodiments of the present application do not limit this.
In another preferred embodiment of the present application, the characteristic information may include the URL of the page and the servers have numeric identifiers, in which case step 103 may specifically include the following sub-steps:
Sub-step S321: convert the URL of the page under detection into a numeric value using a preset algorithm;
Sub-step S322: use the numeric value to select the server with the corresponding numeric identifier as the destination server.
For example, assume the black chain database is currently deployed on n servers. When the URL (Uniform Resource Locator, the web page address) of the page under detection is obtained, the URL is taken as input and a hash algorithm such as MD5 is called to obtain a character string (for example, a 32-character string); the string is then mapped to a numeric value using some mapping rule, and that value is taken as the identifier of the corresponding server among the n servers. If the value obtained is 2, the server identifier is 2, and the destination server is determined to be the server identified by 2.
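A minimal sketch of sub-steps S321 and S322 is shown below. The text leaves the mapping rule open, so taking the MD5 digest as an integer modulo n is an assumption, as are the function name `server_for_url` and the convention that servers are numbered 0 to n-1.

```python
import hashlib


def server_for_url(url: str, n_servers: int) -> int:
    """Map a URL to a numeric server identifier in the range 0..n_servers-1."""
    # S321: MD5 yields a 128-bit digest, i.e. a 32-character hex string.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # Assumed mapping rule: interpret the hex string as an integer, take it modulo n.
    value = int(digest, 16) % n_servers
    # S322: the value is used directly as the server's numeric identifier.
    return value


server_id = server_for_url("http://www.example.com/index.html", 10)
print(server_id)
```

The same URL always maps to the same server, while different URLs are spread roughly evenly across the n servers.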
Of course, the above method of determining the destination server from the characteristic information of the page is only an example; those skilled in the art may use any method according to actual conditions, for example converting the tag string of the page into a fixed value, and the present application places no limitation on this.
Step 14: match the black chain feature database on the destination server against the page under detection and judge whether the page contains black chain feature data from the black chain feature database; if so, judge that the current page is a tampered page.
In practice, if the page under detection contains no black chain feature data from the black chain feature database, it can be determined that the current page has not been tampered with.
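Step 14 can be pictured with the following sketch, in which the database holds tampering keywords, black chain URLs, and compiled black chain rules. The class `BlackChainFeatureDB`, the function `is_tampered`, the sample keyword, and the sample rule are hypothetical and serve only to illustrate the matching decision.

```python
import re
from dataclasses import dataclass, field


@dataclass
class BlackChainFeatureDB:
    """Hypothetical in-memory stand-in for the black chain feature database."""
    keywords: set = field(default_factory=set)
    urls: set = field(default_factory=set)
    rules: list = field(default_factory=list)


def is_tampered(page_html: str, db: BlackChainFeatureDB) -> bool:
    """Return True if the page contains any black chain feature data."""
    lowered = page_html.lower()
    if any(keyword.lower() in lowered for keyword in db.keywords):
        return True
    if any(url.lower() in lowered for url in db.urls):
        return True
    return any(rule.search(page_html) for rule in db.rules)


db = BlackChainFeatureDB(
    keywords={"example-spam-keyword"},  # placeholder keyword
    urls={"http://www.45u.com"},
    # Illustrative rule: an <a> element whose style has a large negative offset.
    rules=[re.compile(r"""<a\s[^>]*style=["'][^"']*:-[0-9]{4,}[^"']*["']""", re.I)],
)

print(is_tampered('<a href="http://ok.example/">home</a>', db))
print(is_tampered('<a href="http://x.example" style="left:-10000px">x</a>', db))
```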
By adopting an architecture that distributes the processing and use of the black chain feature database, the embodiments of the present invention can effectively spread the detection load across servers when concurrent detection requests for multiple pages are present, thereby effectively saving system resources.
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of combined actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
Referring to Figure 2, which shows a structural block diagram of an embodiment of a device for detecting page tampering according to the present application, the device may specifically include the following modules:
Database generation module 21, configured to generate a black chain feature database, the black chain feature database including black chain feature data;
Database deployment module 22, configured to deploy the black chain feature database on multiple servers;
Characteristic information acquisition module 23, configured to obtain the characteristic information of the page under detection;
Destination server determining module 24, configured to determine the corresponding destination server according to the characteristic information of the page;
Tampering detection module 25, configured to match the black chain feature database on the destination server against the page under detection and judge whether the page contains black chain feature data from the black chain feature database; if so, the current page is judged to be a tampered page.
In a preferred embodiment of the present application, the servers have server identifiers and the characteristic information includes page classification information, in which case the destination server determining module 24 may include the following submodules:
Identifier extraction submodule, configured to extract, according to the preset correspondence between page classification information and server identifiers, the server identifier corresponding to the classification information of the current page;
Identifier locating submodule, configured to determine the server corresponding to the server identifier as the destination server.
In another preferred embodiment of the present application, the characteristic information includes the URL of the page and the servers have numeric identifiers, in which case the destination server determining module 24 may include the following submodules:
URL conversion submodule, configured to convert the URL of the page under detection into a numeric value using a preset algorithm;
Identifier mapping submodule, configured to use the numeric value to select the server with the corresponding numeric identifier as the destination server.
In a specific implementation, the embodiments of the present application may also include a database update module, configured to update the black chain feature database at preset time intervals.
In a preferred embodiment of the present application, the database generation module 21 may include the following submodules:
Feature page search submodule, configured to use existing black chain feature data to search for pages containing that black chain feature data, such pages being feature pages;
Layout analysis submodule, configured to analyze the layout of the black chain feature data in the feature page;
Page element extraction submodule, configured to extract, when the layout is found to be abnormal, the page elements containing the black chain feature data from the feature page;
Black chain rule generation submodule, configured to generate black chain rules from the page elements;
Black chain feature data extraction submodule, configured to match the black chain rules against other feature pages, extract new black chain feature data from the matched feature pages, and save the black chain feature data to form the black chain feature database.
In a specific implementation, the black chain feature data may include tampering keywords and black chain URLs.
As an example of a specific application of the embodiments of the present application, the layout analysis submodule may include the following units:
First judging unit, configured to judge whether the position of the page element of the black chain feature data is within a preset threshold range and, if so, judge that the layout of the black chain feature data in the feature page is abnormal;
And/or
Second judging unit, configured to judge whether the attribute of the page element of the black chain feature data is an invisible attribute and, if so, judge that the layout of the black chain feature data in the feature page is abnormal;
And/or
Third judging unit, configured to judge whether the attribute of the page element of the black chain feature data is an attribute that hides it from the browser and, if so, judge that the layout of the black chain feature data in the feature page is abnormal.
In a specific application, the black chain rule generation submodule may include the following unit:
Regular expression abstraction unit, configured to abstract a regular expression, as the black chain rule, from the page elements containing the tampering keyword and/or black chain URL.
Since the device embodiment basically corresponds to the method embodiment shown in Figure 1 above, anything not described in detail in this embodiment can be found in the related description of the preceding embodiment and is not repeated here.
The present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
Finally, it should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
A method for detecting page tampering and a device for detecting page tampering provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (14)

1. A method for detecting page tampering, characterized by comprising:
generating a black chain feature database and deploying the black chain feature database on multiple servers, the black chain feature database including black chain feature data;
obtaining the characteristic information of the page under detection;
when multiple page tampering detection requests are received, determining the corresponding destination server according to the characteristic information of the page; wherein the servers have server identifiers, the characteristic information includes page classification information, and the step of determining the corresponding destination server according to the characteristic information of the page comprises: extracting, according to the preset correspondence between page classification information and server identifiers, the server identifier corresponding to the classification information of the current page; and determining the server corresponding to the server identifier as the destination server;
matching the black chain feature database on the destination server against the page under detection, and judging whether the page under detection contains black chain feature data from the black chain feature database; if so, judging that the current page is a tampered page.
2. The method of claim 1, characterized in that the characteristic information includes the URL of the page, the servers have numeric identifiers, and the step of determining the corresponding server identifier according to the characteristic information of the page comprises:
converting the URL of the page under detection into a numeric value using a preset algorithm;
using the numeric value to select the server with the corresponding numeric identifier as the destination server.
3. The method of claim 1, characterized in that the page classification information includes content classification information of the page, type classification information of the page, and attribute classification information of the page.
4. The method of claim 1, 2, or 3, characterized in that the step of generating the black chain feature database comprises:
using existing black chain feature data to search for pages containing that black chain feature data, such pages being feature pages;
analyzing the layout of the black chain feature data in the feature page and, when the layout is found to be abnormal, extracting the page elements containing the black chain feature data from the feature page;
generating black chain rules from the page elements, matching the black chain rules against other feature pages, and extracting new black chain feature data from the matched feature pages;
saving the black chain feature data to form the black chain feature database.
5. The method of claim 4, characterized in that the black chain feature data includes tampering keywords and black chain URLs.
6. The method of claim 4, characterized in that the step of analyzing the layout of the black chain feature data in the feature page comprises:
judging whether the position of the page element of the black chain feature data is within a preset threshold range and, if so, judging that the layout of the black chain feature data in the feature page is abnormal;
and/or
judging whether the attribute of the page element of the black chain feature data is an invisible attribute and, if so, judging that the layout of the black chain feature data in the feature page is abnormal;
and/or
judging whether the attribute of the page element of the black chain feature data is an attribute that hides it from the browser and, if so, judging that the layout of the black chain feature data in the feature page is abnormal.
7. The method of claim 5, characterized in that the step of generating black chain rules from the page elements is:
abstracting a regular expression, as the black chain rule, from the page elements containing the tampering keyword and/or black chain URL.
8. The method of claim 5, characterized by further comprising:
updating the black chain feature database at preset time intervals.
9. A device for detecting page tampering, characterized by comprising:
a database generation module, configured to generate a black chain feature database, the black chain feature database including black chain feature data;
a database deployment module, configured to deploy the black chain feature database on multiple servers;
a characteristic information acquisition module, configured to obtain the characteristic information of the page under detection;
a destination server determining module, configured to determine, when multiple page tampering detection requests are received, the corresponding destination server according to the characteristic information of the page; wherein the servers have server identifiers, the characteristic information includes page classification information, and the destination server determining module comprises:
an identifier extraction submodule, configured to extract, according to the preset correspondence between page classification information and server identifiers, the server identifier corresponding to the classification information of the current page;
an identifier locating submodule, configured to determine the server corresponding to the server identifier as the destination server;
a tampering detection module, configured to match the black chain feature database on the destination server against the page under detection and judge whether the page under detection contains black chain feature data from the black chain feature database; if so, the current page is judged to be a tampered page.
10. The device of claim 9, characterized in that the characteristic information includes the URL of the page, the servers have numeric identifiers, and the destination server determining module comprises:
a URL conversion submodule, configured to convert the URL of the page under detection into a numeric value using a preset algorithm;
an identifier mapping submodule, configured to use the numeric value to select the server with the corresponding numeric identifier as the destination server.
11. The device of claim 9 or 10, characterized in that the database generation module comprises:
a feature page search submodule, configured to use existing black chain feature data to search for pages containing that black chain feature data, such pages being feature pages;
a layout analysis submodule, configured to analyze the layout of the black chain feature data in the feature page;
a page element extraction submodule, configured to extract, when the layout is found to be abnormal, the page elements containing the black chain feature data from the feature page;
a black chain rule generation submodule, configured to generate black chain rules from the page elements;
a black chain feature data extraction submodule, configured to match the black chain rules against other feature pages, extract new black chain feature data from the matched feature pages, and save the black chain feature data to form the black chain feature database.
12. The device of claim 11, characterized in that the layout analysis submodule further comprises:
a first judging unit, configured to judge whether the position of the page element of the black chain feature data is within a preset threshold range and, if so, judge that the layout of the black chain feature data in the feature page is abnormal;
and/or
a second judging unit, configured to judge whether the attribute of the page element of the black chain feature data is an invisible attribute and, if so, judge that the layout of the black chain feature data in the feature page is abnormal;
and/or
a third judging unit, configured to judge whether the attribute of the page element of the black chain feature data is an attribute that hides it from the browser and, if so, judge that the layout of the black chain feature data in the feature page is abnormal.
13. The device of claim 12, characterized in that the black chain feature data includes tampering keywords and black chain URLs, and the black chain rule generation submodule comprises:
a regular expression abstraction unit, configured to abstract a regular expression, as the black chain rule, from the page elements containing the tampering keyword and/or black chain URL.
14. The device of claim 13, characterized by further comprising:
a database update module, configured to update the black chain feature database at preset time intervals.
CN201410318916.1A 2011-12-30 2011-12-30 A kind of method and device that the detection page is distorted Active CN104063491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410318916.1A CN104063491B (en) 2011-12-30 2011-12-30 A kind of method and device that the detection page is distorted

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201110457654.3A CN102446255B (en) 2011-12-30 2011-12-30 Method and device for detecting page tamper
CN201410318916.1A CN104063491B (en) 2011-12-30 2011-12-30 A kind of method and device that the detection page is distorted

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201110457654.3A Division CN102446255B (en) 2011-12-30 2011-12-30 Method and device for detecting page tamper

Publications (2)

Publication Number Publication Date
CN104063491A CN104063491A (en) 2014-09-24
CN104063491B true CN104063491B (en) 2018-07-24

Family

ID=51551205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410318916.1A Active CN104063491B (en) 2011-12-30 2011-12-30 A kind of method and device that the detection page is distorted

Country Status (1)

Country Link
CN (1) CN104063491B (en)

Also Published As

Publication number Publication date
CN104063491A (en) 2014-09-24

Similar Documents

Publication Publication Date Title
Vishwakarma et al. Detection and veracity analysis of fake news via scrapping and authenticating the web search
CN102446255B (en) Method and device for detecting page tamper
CN102436563B (en) Method and device for detecting page tampering
CN103544436B (en) System and method for distinguishing phishing websites
CN103544176B (en) Method and apparatus for generating the page structure template corresponding to multiple pages
CN103559235B (en) A kind of online social networks malicious web pages detection recognition methods
CN106685936B (en) Webpage tampering detection method and device
CN102591965B (en) Method and device for detecting black chain
CN105138907B (en) A kind of active probe is attacked the method and system of website
CN104156490A (en) Method and device for detecting suspicious fishing webpage based on character recognition
Zhang et al. Web phishing detection based on page spatial layout similarity
Yang et al. Scalable detection of promotional website defacements in black hat SEO campaigns
JP2014502753A (en) Web page information detection method and system
CN104036190A (en) Method and device for detecting page tampering
CN103647767A (en) Website information display method and apparatus
WO2014029318A1 (en) Method and apparatus for identifying webpage type
CN105868290A (en) Search result presentation method and apparatus
CN107786537A (en) Orphan-page injection attack detection method based on Internet cross-search
CN113032655A (en) Method for extracting and fixing dark network electronic data
CN104036189A (en) Page distortion detecting method and black link database generating method
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN104077353B (en) A kind of method and device of detecting black chain
WO2017000659A1 (en) Enriched uniform resource locator (url) identification method and apparatus
CN104063491B (en) A kind of method and device that the detection page is distorted
KR20120090131A (en) Method, system and computer readable recording medium for providing search results

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee after: Beijing Qizhi Business Consulting Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20220402

Address after: 100016 1773, 15 / F, 17 / F, building 3, No.10, Jiuxianqiao Road, Chaoyang District, Beijing

Patentee after: 360 Digital Security Technology Group Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Beijing Qizhi Business Consulting Co.,Ltd.