CN102790700B - Method and device for recognizing webpage crawler

Info

Publication number: CN102790700B
Application number: CN201110130432.0A
Authority: CN
Inventors: 叶润国; 肖小剑
Original assignee: Beijing Venus Information Security Technology Co Ltd; Beijing Venus Information Technology Co Ltd
Current assignee: Beijing Venus Information Security Technology Co Ltd; Venus Info Tech Inc; Beijing Venus Information Technology Co Ltd
Priority date: 2011-05-19
Filing date: 2011-05-19
Publication date: 2015-06-10
Anticipated expiration: 2031-05-19
Also published as: CN102790700A

Abstract

The invention discloses a method and a device for identifying web crawlers, belonging to the technical field of network security. The method comprises: computing the average response time of a Web server over all Web page requests; acquiring the Web page requests sent from a Web client to the Web server within a period of time; measuring the time interval between adjacent Web page requests and the response time of each Web page request; correcting the time interval between adjacent Web page requests according to the Web page request response time; determining whether each corrected time interval is greater than or equal to a predetermined adjacent-request time interval threshold δ; and determining whether the Web client is operating as a web crawler according to whether the determination results meet a preset condition. The method provided by the embodiments of the invention can detect hidden web crawlers simply and quickly, is widely applicable, and can buy valuable response time for subsequent security responses.

Description

A method and apparatus for identifying web crawlers

Technical field

The present invention relates to the technical field of network security, and in particular to a method and apparatus for identifying web crawlers.

Background art

Because Web services are convenient and easy to use, more and more network services have moved from the traditional dedicated-client and dedicated-server model (C/S model) to a browser and Web server model (B/S model), in which a standard Web browser serves as the client. Network services that adopt the B/S model are commonly called Web application systems. While bringing convenience, Web application systems also bring many security problems; the more common ones include web page Trojans, SQL injection attacks, and XSS attacks. The root cause of most of these problems is that the Web application system itself has defects in its program code, which introduce Web security vulnerabilities that hackers can exploit.

When a network attacker attacks a Web application system (sometimes also called a Web site), the attacker first needs to scan the whole Web application system for vulnerabilities, find a Web security vulnerability that can be exploited, and then attack that vulnerability to achieve a malicious purpose. For a brand-new Web application system, the attacker needs to use web crawler technology to scan the system, find the web pages that may have security problems, and then attempt attacks on those pages to confirm whether they are vulnerable.

Research on various common Web attacks shows that, in many Web attacks, the attack tools used exhibit Web crawler behavior. Examples include:

CC attack (DDoS): multiple proxies are used to concurrently access the more resource-intensive Web pages on a Web server, causing a denial of service on the Web server;

Botnet DDoS: a group of bots running Web crawlers continuously crawl the Web server, leaving the Web server no time to receive other Web requests;

Web vulnerability scanning (including SQL injection tools): a hacker uses a common vulnerability scanner to scan the Web server for vulnerabilities.

From the point of view of Web server defense, if these malicious web crawlers can be identified at an early stage and monitored continuously, flow control can be applied to them early, thereby keeping the Web server secure.

A common current web crawler identification method monitors the series of Web page requests sent by a Web client to judge whether the client is a web crawler. The basic detection idea is as follows: if the Web client is a crawler, the time interval between two consecutive Web page requests it sends is more likely to take a small value; if the Web client is a normal user, the interval is more likely to take a large value. By monitoring the time intervals of n Web page requests sent consecutively by the Web client and applying hypothesis testing, it can be declared with a certain confidence whether the client is a web crawler or a person browsing manually.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method and apparatus for identifying web crawlers that can detect well-disguised CC clients, thereby buying valuable response time for subsequent HTTP flow control.

To solve the above problem, the invention provides a method for identifying web crawlers, comprising the following steps:

A0, computing the average response time η of the Web server over all Web page requests;

A1, collecting the web page request sequence W sent from a Web client to the Web server within a period of time;

A2, calculating the time interval Vt_i between each pair of adjacent web page requests in the web page request sequence W and the response time μ_i of each Web page request;

A3, correcting each adjacent Web page request time interval Vt_i to Vt'_i based on the Web page request response time μ_i and the average Web page request response time η, wherein the correction rule is: if the Web page request response time μ_i is greater than the average Web page request response time η, the corrected adjacent Web page request time interval Vt'_i is the product of Vt_i and a penalty factor k that is less than 1;

A4, judging whether each corrected time interval Vt'_i is greater than or equal to the preset adjacent web page request time interval threshold δ; if so, the event element e_i corresponding to that corrected time interval is recorded as 0, otherwise as 1; the event elements e_i corresponding to the time intervals form an elementary event sequence E;

A5, matching the elementary event sequence E against hypotheses H₀ and H₁ respectively, wherein H₀ represents that the Web client is performing normal web page browsing and H₁ represents that the Web client is operating as a web crawler; if the gap between the degree to which the elementary event sequence E matches hypothesis H₁ and the degree to which it matches hypothesis H₀ is greater than a degree threshold, the Web client is judged to be operating as a web crawler; otherwise it is judged to be performing normal web page browsing.

Preferably, in the above method, the penalty factor k less than 1 in step A3 is 1-(μ_i-η)/μ_i.
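The preferred penalty factor can be simplified algebraically, which makes its effect easier to see. The short derivation below is only a restatement of the expression above for the case μ_i > η; it introduces no new symbols.

```latex
k = 1 - \frac{\mu_i - \eta}{\mu_i} = \frac{\eta}{\mu_i},
\qquad
Vt'_i = k \cdot Vt_i = \frac{\eta}{\mu_i}\, Vt_i
\quad (\mu_i > \eta \;\Rightarrow\; 0 < k < 1)
```

In other words, the slower a page is to respond relative to the average, the more its adjacent-request interval is shrunk before it is compared with δ.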

Preferably, in the above method, step A5 comprises:

A51, proposing two hypotheses H₀ and H₁, wherein H₀ represents that the Web client is performing normal web page browsing and H₁ represents that the Web client is operating as a web crawler;

A52, setting the probability Pr[e_i=0|H₀] that the interval between two adjacent web page requests produced during normal web page browsing is greater than or equal to δ to θ₀, and the probability Pr[e_i=1|H₀] that it is less than δ to 1-θ₀; setting the probability Pr[e_i=0|H₁] that the interval between two adjacent web page requests produced by a web crawler is greater than or equal to δ to θ₁, and the probability Pr[e_i=1|H₁] that it is less than δ to 1-θ₁; wherein θ₀ > θ₁, and the conditional random variables e_i|H_j are independent and identically distributed;

A53, calculating the likelihood ratio V(E) of producing the elementary event sequence E under the two hypotheses H₀ and H₁:

V(E) = \frac{\Pr[E \mid H_{1}]}{\Pr[E \mid H_{0}]} = \prod_{i=1}^{n-1} \frac{\Pr[e_{i} \mid H_{1}]}{\Pr[e_{i} \mid H_{0}]}

A54, comparing V(E) with two fixed thresholds η₀ and η₁, where η₀ < η₁: if V(E) ≥ η₁, the Web client is judged to be operating as a web crawler; if V(E) ≤ η₀, the Web client is judged to be performing normal web page browsing.
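Because each e_i takes only the values 0 or 1 and the e_i are independent and identically distributed, the product in A53 collapses to a closed form. The derivation below is a sketch; the count z (the number of elements of E equal to 0) is a symbol introduced here for illustration and does not appear in the original text.

```latex
V(E) = \prod_{i=1}^{n-1} \frac{\Pr[e_i \mid H_1]}{\Pr[e_i \mid H_0]}
     = \left(\frac{\theta_1}{\theta_0}\right)^{z}
       \left(\frac{1-\theta_1}{1-\theta_0}\right)^{\,n-1-z},
\qquad z = \#\{\, i : e_i = 0 \,\}
```

Since θ₀ > θ₁, the first factor is below 1 and the second is above 1, so every long interval pushes V(E) toward η₀ and every short interval pushes it toward η₁.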

Preferably, in the above method, in step A54:

The threshold η₀ is obtained for the case in which m consecutive web page requests from the Web client to the Web server all have adjacent web page request time intervals greater than or equal to the time interval threshold δ:

\eta_{0} = \prod_{i=1}^{m-1} \frac{\Pr[e_{i} = 0 \mid H_{1}]}{\Pr[e_{i} = 0 \mid H_{0}]} = \left(\frac{\Pr[e_{i} = 0 \mid H_{1}]}{\Pr[e_{i} = 0 \mid H_{0}]}\right)^{m-1}

The threshold η₁ is obtained for the case in which m consecutive web page requests from the Web client to the Web server all have adjacent web page request time intervals less than the time interval threshold δ:

\eta_{1} = \prod_{i=1}^{m-1} \frac{\Pr[e_{i} = 1 \mid H_{1}]}{\Pr[e_{i} = 1 \mid H_{0}]} = \left(\frac{\Pr[e_{i} = 1 \mid H_{1}]}{\Pr[e_{i} = 1 \mid H_{0}]}\right)^{m-1}

Here m is a positive integer.

Preferably, in the above method:

δ is 1 second, 2 seconds, or 3 seconds;

When δ is 3 seconds, θ₀ and θ₁ are 0.6 and 0.4 respectively.

The invention also discloses an apparatus for identifying web crawlers, comprising:

An acquiring unit, configured to obtain the web page request sequence W sent from a Web client to the Web server within a period of time, and to compute the average response time η of the Web server over all Web page requests;

A correction processing unit, configured to measure each adjacent Web page request time interval and each Web page request response time, correct the adjacent Web page request time intervals according to the Web page request response times, and judge whether each corrected adjacent web page request time interval is greater than or equal to a predetermined adjacent web page request time interval threshold δ; wherein the correction processing unit comprises a computing module, a correcting module, and a recording module;

The computing module is configured to measure, in the web page requests obtained by the acquiring unit, each adjacent Web page request time interval and each Web page request response time;

The correcting module is configured to correct the adjacent Web page request time intervals according to the computed average response time of Web page requests and the Web page request response times, wherein the correction rule is: if the Web page request response time μ_i is greater than the average Web page request response time η, the corrected adjacent Web page request time interval Vt'_i is the product of Vt_i and a penalty factor k that is less than 1;

The recording module is configured to judge whether each time interval is greater than or equal to the preset adjacent web page request time interval threshold δ; if so, the event element e_i corresponding to that time interval is recorded as 0, otherwise as 1; an elementary event sequence E comprising the event elements e_i corresponding to the time intervals is thereby obtained;

A recognition unit, configured to match the elementary event sequence E against hypotheses H₀ and H₁ respectively, wherein H₀ represents that Web client r is performing normal web page browsing and H₁ represents that Web client r is operating as a web crawler; if the gap between the degree to which the elementary event sequence E matches hypothesis H₁ and the degree to which it matches hypothesis H₀ is greater than a degree threshold, Web client r is judged to be operating as a web crawler; otherwise it is judged to be performing normal web page browsing.

Preferably, in the above apparatus, the penalty factor k less than 1 is 1-(μ_i-η)/μ_i.

Preferably, in the above apparatus, the recognition unit comprises:

A hypothesis module, configured to propose two hypotheses H₀ and H₁, wherein H₀ represents that Web client r is performing normal web page browsing and H₁ represents that Web client r is operating as a web crawler;

A setting module, configured to set the probability Pr[e_i=0|H₀] that the interval between two adjacent web page requests produced during normal web page browsing is greater than or equal to δ to θ₀, and the probability Pr[e_i=1|H₀] that it is less than δ to 1-θ₀; and to set the probability Pr[e_i=0|H₁] that the interval between two adjacent web page requests produced by a web crawler is greater than or equal to δ to θ₁, and the probability Pr[e_i=1|H₁] that it is less than δ to 1-θ₁; wherein θ₀ > θ₁, and the conditional random variables e_i|H_j are independent and identically distributed;

A likelihood ratio computing module, configured to calculate the likelihood ratio V(E) of producing the elementary event sequence E under the two hypotheses H₀ and H₁:

V(E) = \frac{\Pr[E \mid H_{1}]}{\Pr[E \mid H_{0}]} = \prod_{i=1}^{n-1} \frac{\Pr[e_{i} \mid H_{1}]}{\Pr[e_{i} \mid H_{0}]}

A judging module, configured to compare V(E) with two fixed thresholds η₀ and η₁, where η₀ < η₁: if V(E) ≥ η₁, Web client r is judged to be operating as a web crawler; if V(E) ≤ η₀, Web client r is judged to be performing normal web page browsing.

Preferably, in the above apparatus, the recognition unit further comprises:

A threshold setting module, configured to set the fixed thresholds η₀ and η₁. The threshold η₀ is obtained for the case in which m consecutive web page requests from Web client r to Web server s all have adjacent web page request time intervals greater than or equal to the time interval threshold δ:

\eta_{0} = \prod_{i=1}^{m-1} \frac{\Pr[e_{i} = 0 \mid H_{1}]}{\Pr[e_{i} = 0 \mid H_{0}]} = \left(\frac{\Pr[e_{i} = 0 \mid H_{1}]}{\Pr[e_{i} = 0 \mid H_{0}]}\right)^{m-1}

The threshold η₁ is obtained for the case in which m consecutive web page requests from Web client r to Web server s all have adjacent web page request time intervals less than the time interval threshold δ:

\eta_{1} = \prod_{i=1}^{m-1} \frac{\Pr[e_{i} = 1 \mid H_{1}]}{\Pr[e_{i} = 1 \mid H_{0}]} = \left(\frac{\Pr[e_{i} = 1 \mid H_{1}]}{\Pr[e_{i} = 1 \mid H_{0}]}\right)^{m-1}

Here m is a positive integer.

Preferably, in the above apparatus:

δ is 1 second, 2 seconds, or 3 seconds;

When δ is 3 seconds, θ₀ and θ₁ are 0.6 and 0.4 respectively.

Embodiments of the invention propose identifying the preparatory activity that precedes an attack, so that defenses can be prepared, or the preparatory activity stopped, before the attack is launched, thereby improving the security and reliability of the network. To identify such preparatory activity, embodiments of the invention correct the time intervals between consecutive web page requests according to the response times of the web page requests, and thereby identify CC clients that attack the Web server while imitating the access frequency of a normal Web client; that is, hidden web crawlers are identified. The preferred scheme adopts a rigorous mathematical model, can detect hidden web crawlers simply and quickly, is widely applicable, and can buy valuable response time for subsequent security responses.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the method for identifying web crawlers in Embodiment 1;

Fig. 2 is a schematic flowchart of step A4 of the method for identifying web crawlers in Embodiment 1.

Detailed description

The technical solution of the present invention is described in further detail below with reference to the drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another arbitrarily.

From the Web defender's point of view, if the abnormal behavior of scanning a Web application system with web crawler technology can be identified at the initial stage of an attacker's scan, the defender can respond to the hacker's attack in time, for example by stopping the hacker's web crawler from scanning the Web application system any further, or by recording its subsequent web access behavior and defending against the Web attacks it launches.

A web crawler is a running software module that automatically downloads web pages from a Web application system, automatically parses the hyperlinks in those pages, and then automatically fetches the next level of pages according to the extracted hyperlinks, until all pages of the Web application system have been downloaded. Because a web crawler fully imitates human web browsing behavior, identifying web crawlers accurately is very difficult.

In some technical solutions, whether a Web client r is operating as a web crawler is judged by observing the web page request sequence between Web client r and a Web server s over a period of time. Normal web browsing is distinguished from automated crawler behavior by analyzing the time interval between two adjacent web page requests. The main rationale is that when a person browses manually, switching from one Web page to another takes a relatively long time, generally more than 2 seconds, whereas a web crawler switches from one page to another automatically, so its switching time is clearly shorter than that of manual page switching.

The page-switching behavior of Web client r is analyzed based on the collected web page request sequence from Web client r to the Web server, using the sequential hypothesis testing method. Two hypotheses H₀ and H₁ are first proposed, where H₀ represents that Web client r is performing normal web page browsing and H₁ represents that Web client r is operating as a web crawler. Which hypothesis holds is then tested against the observed page-switching behavior, and when hypothesis H₁ is found to hold, Web client r is judged to be operating as a web crawler.

However, the applicant has found that the above technical solution, which judges whether a Web client is a web crawler by monitoring the series of Web page requests it sends, has limitations: it cannot detect a well-disguised type of CC client that is common at present. Such a CC client accesses Web pages at the request frequency of a normal Web client, but the Web pages it requests consume more of the Web server's computing resources (for example, text search page requests), so that the Web server's computing resources are exhausted and it cannot respond to the access requests of legitimate users.

Therefore, in addition to monitoring the time intervals of the consecutive Web page requests sent by a Web client, the applicant also monitors the Web server's response times to those Web page requests and corrects the consecutive-request time interval values according to the Web page request response times, so that a CC client that attacks the Web server while imitating the access frequency of a normal Web client can be detected accurately.

Embodiment 1

Based on the above idea, the present embodiment provides a method for identifying web crawlers, comprising:

Computing the average response time of the Web server over all Web page requests; monitoring the series of Web page requests sent from a Web client to the Web server within a period of time (that is, a set time period); measuring each adjacent web page request time interval and each Web page request response time; correcting the adjacent web page request time intervals according to the Web page request response times; finally, judging whether each corrected adjacent Web page request time interval is greater than or equal to a predetermined adjacent web page request time interval threshold δ, and judging whether the Web client is operating as a web crawler according to whether the judgment results meet a preset condition.

As shown in Fig. 1, the method specifically comprises the following steps:

A0, computing the average response time of the Web server over all Web page requests;

In the present embodiment, the average response time of the Web server over all Web page requests may be taken as the Web server's average response time over all Web page requests within a period of time. In other application scenarios, it may also be a preset value.

A1, collecting the web page request sequence from Web client r to Web server s within a period of time;

A2, calculating the time interval Vt_i between each pair of adjacent web page requests in the above web page request sequence and the response time μ_i of each Web page request;

Specifically, for the collected web page request sequence W comprising n web page requests from Web client r to Web server s (each element of the sequence is denoted w_i, where i takes each integer value from 1 to n inclusive), the time interval between each two adjacent web page requests is calculated, yielding an adjacent web page request time interval sequence T comprising (n-1) elements (each element of the sequence is denoted t_i, where i takes each integer value from 1 to n-1 inclusive);

A3, correcting each adjacent Web page request time interval Vt_i to Vt'_i based on the Web page request response time μ_i and the average Web page request response time η, wherein the correction rule is: if the Web page request response time μ_i is greater than the average Web page request response time η, the corrected adjacent Web page request time interval Vt'_i is the product of Vt_i and a penalty factor k that is less than 1;

A4, judging whether each corrected time interval Vt'_i is greater than or equal to the preset adjacent web page request time interval threshold δ; if so, the event element e_i corresponding to that corrected time interval is recorded as 0, otherwise as 1; the event elements e_i corresponding to the time intervals form an elementary event sequence E;

A5, matching the above elementary event sequence E against hypotheses H₀ and H₁ respectively, wherein H₀ represents that the Web client is performing normal web page browsing and H₁ represents that the Web client is operating as a web crawler; if the gap between the degree to which the elementary event sequence E matches hypothesis H₁ and the degree to which it matches hypothesis H₀ is greater than a degree threshold, the Web client is judged to be operating as a web crawler; otherwise it is judged to be performing normal web page browsing.

In the statement above that the gap between the degree to which the elementary event sequence E matches hypothesis H₁ and the degree to which it matches hypothesis H₀ is greater than a degree threshold, the degree may be a probability, a similarity, or the like, and the gap may be a ratio, a difference, or the like.

In practical applications, web crawlers may also be identified directly from the judgment results. For example, a preset condition may be that the number of e_i equal to 0 in the elementary event sequence E is greater than the number of e_i equal to 1 in E; when the judgment results meet this preset condition, Web client r is judged to be a normal host, otherwise a web crawler. As another example, a preset condition may be that the ratio of the number of e_i equal to 0 in the elementary event sequence E to the total number of elements in E is less than a ratio threshold; when the judgment results meet this preset condition, Web client r is judged to be a web crawler, otherwise a normal host.
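The two direct decision rules just described can be sketched in a few lines of Python. This is only an illustration: the function names and the `ratio_threshold` parameter are hypothetical, and the text does not say how an empty event sequence should be handled, so the sketch rejects it.

```python
from typing import Sequence


def is_crawler_by_majority(events: Sequence[int]) -> bool:
    """First rule: normal host if zeros outnumber ones, otherwise crawler."""
    zeros = sum(1 for e in events if e == 0)
    ones = len(events) - zeros
    return not (zeros > ones)


def is_crawler_by_ratio(events: Sequence[int], ratio_threshold: float) -> bool:
    """Second rule: crawler if the share of zeros in E is below a threshold."""
    if not events:
        # Assumption: an empty sequence carries no evidence, so refuse to decide.
        raise ValueError("event sequence E must not be empty")
    zeros = sum(1 for e in events if e == 0)
    return zeros / len(events) < ratio_threshold
```

Both rules ignore the order of events, which makes them cheaper than the likelihood-ratio test of step A5 but gives no built-in way to say "keep observing".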

In the present embodiment, step A1 needs to collect every successful web page request from Web client r to Web server s within a period of time. A successful web page request here refers to the following process: Web client r first sends a web page request message to Web server s, requesting a specified web page; after receiving the message, Web server s retrieves the requested web page and sends it to Web client r; if the requested web page is a dynamic page, Web server s must first execute the corresponding external program to obtain the page required by Web client r.

It should be noted that the web pages common today are multimedia pages containing both text and images, so one successful web page request involves fetching one HTML file object and several related image objects. One successful web page request therefore comprises the transmission of, and responses to, multiple HTTP request messages between Web client r and Web server s (and these HTTP request messages may be sent simultaneously), but only one of these HTTP request messages is used to fetch the HTML file object.

Therefore, when collecting web page requests from Web client r to Web server s, the method of the invention cannot simply treat every single HTTP request between Web client r and Web server s, together with its response, as one successful web page request; it must check the Content-Type field of the HTTP response message header to determine the type of object fetched. According to the HTTP protocol specification, if an HTTP request message fetches an HTML file object, the Content-Type field value of the corresponding HTTP response message header is "text/html". Therefore, in the present embodiment, when collecting web page requests from Web client r to Web server s, only a single HTTP request message and response message whose response header Content-Type field value is "text/html" are regarded as one successful web page request, so that fetching an image object from Web server s is not also counted as a web page request.
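A minimal sketch of this filtering step follows. The `HttpExchange` record and both function names are hypothetical, introduced only for illustration. The sketch also assumes that the header value may carry parameters such as a charset (for example "text/html; charset=UTF-8") and that media types compare case-insensitively, so it inspects only the media type before the first semicolon.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class HttpExchange:
    """One observed HTTP request/response pair (hypothetical record type)."""
    request_time: float   # seconds; when the request was observed
    response_time: float  # seconds; how long the server took to respond
    content_type: str     # value of the response's Content-Type header


def is_page_request(exchange: HttpExchange) -> bool:
    """True if the response carried an HTML document rather than an image etc."""
    # Assumption: drop parameters such as "; charset=UTF-8" before comparing.
    media_type = exchange.content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


def collect_page_requests(exchanges: Iterable[HttpExchange]) -> List[HttpExchange]:
    """Keep only the exchanges that count as web page requests."""
    return [x for x in exchanges if is_page_request(x)]
```

Filtering this way keeps image and other sub-resource fetches, which often arrive within milliseconds of the page itself, from being mistaken for rapid page switching.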

Suppose step A1 collects n web page requests from Web client r to Web server s within the specified time period; these n web page requests form the web page request sequence W (each element of W is denoted w_i, where i takes each integer value from 1 to n inclusive). Step A2 then calculates the adjacent web page request time interval sequence T from W as follows: given the time of occurrence of each web page request w_i in W, the time interval between two adjacent web page requests w_i and w_(i+1) is the difference between their times of occurrence; each element of the adjacent web page request time interval sequence T is such an interval, where i takes each integer value from 1 to (n-1) inclusive. Likewise, after the n web page requests are collected, the response time μ_i of each of the n web page requests is collected, and these response times can also form a response time sequence.
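The interval computation in step A2 can be sketched as follows, assuming the occurrence times of the n page requests are available as a list of timestamps in seconds, sorted in the order the requests occurred. The function name is hypothetical.

```python
from typing import List, Sequence


def adjacent_intervals(request_times: Sequence[float]) -> List[float]:
    """Return the n-1 gaps between consecutive request timestamps."""
    # Pair each timestamp with its successor and take the difference.
    return [later - earlier
            for earlier, later in zip(request_times, request_times[1:])]
```

For n timestamps the result has n-1 elements, matching the length of the sequence T described above; a sequence with fewer than two requests yields an empty list.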

In the present embodiment, step A3 corrects the adjacent Web page request time intervals Vt_i calculated in step A2 based on each Web page request response time μ_i and the average Web page request response time η. Specifically, when the Web page request response time μ_i is greater than the average Web page request response time η, the response is considered slow, and the interval is therefore penalized: the calculated adjacent Web page request time interval Vt_i is multiplied by a penalty factor k that is less than 1, and the product is the corrected adjacent Web page request time interval Vt'_i, which is smaller than the calculated time interval. The penalty factor k less than 1 may be 1-(μ_i-η)/μ_i, where μ_i is the Web page request response time and η is the average response time of all Web page requests.
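The correction of step A3 with the penalty factor k = 1-(μ_i-η)/μ_i can be sketched as below. The text does not state which request's response time is paired with which interval; this sketch assumes that the interval between w_i and w_(i+1) is corrected with the response time μ_i of the earlier request w_i. The function names are hypothetical.

```python
from typing import List, Sequence


def penalty_factor(mu: float, eta: float) -> float:
    """k = 1 - (mu - eta) / mu when mu > eta; otherwise no penalty (k = 1)."""
    if mu > eta:
        return 1.0 - (mu - eta) / mu  # algebraically equal to eta / mu
    return 1.0


def corrected_intervals(intervals: Sequence[float],
                        response_times: Sequence[float],
                        eta: float) -> List[float]:
    """Shrink each interval whose paired request responded slower than average."""
    # Assumption: interval i is paired with the response time of request i;
    # zip() stops at the shorter sequence, so a trailing mu_n is ignored.
    return [vt * penalty_factor(mu, eta)
            for vt, mu in zip(intervals, response_times)]
```

With η = 0.5 s, a 4-second gap following a request that took 2 s to answer is corrected to 1 second, so a client that paces expensive requests at human-like intervals still ends up below a 3-second threshold δ.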

In the present embodiment, step A4 generates the elementary event sequence E from the corrected adjacent web page request time interval sequence T. An adjacent web page request time interval threshold δ must be preset in order to judge whether two adjacent web page requests were sent by a web crawler or by normal web page browsing. The threshold δ is obtained from empirical data. Observation of the time interval between two adjacent web page requests sent during normal Web page browsing shows that the interval is usually 3 to 8 seconds, whereas observation of the time interval between two adjacent web page requests sent by the web crawlers in currently common Web site scanning tools shows that the interval is usually less than 1 second. Therefore, in implementing the method of the invention, the adjacent web page request time interval threshold δ may be taken as 1 second, 2 seconds, or 3 seconds.

Once the adjacent web page request time interval threshold δ has been determined, step A4 generates the elementary event sequence E from the corrected adjacent web page request time interval sequence T as follows: each element t_i of the corrected sequence T is examined; if t_i ≥ δ, the corresponding element of E is e_i = 0; otherwise e_i = 1.
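The thresholding rule of step A4 maps directly onto a one-line function. The sketch below uses a hypothetical function name and expects the corrected intervals and δ in the same unit (seconds here).

```python
from typing import List, Sequence


def event_sequence(corrected: Sequence[float], delta: float) -> List[int]:
    """e_i = 0 if the corrected interval is >= delta, otherwise e_i = 1."""
    return [0 if t >= delta else 1 for t in corrected]
```

For example, with δ = 3 seconds the corrected intervals 1.0, 4.0 and 3.0 give the events 1, 0 and 0; an interval exactly equal to δ counts as a long interval.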

In the present embodiment, step A5 uses the sequential hypothesis testing method to analyze the elementary event sequence E and thereby judge whether Web client r is operating as a web crawler. The specific steps, shown in Fig. 2, comprise:

A51, proposing two hypotheses H₀ and H₁, wherein H₀ represents that Web client r is performing normal web page browsing and H₁ represents that Web client r is operating as a web crawler;

A52, setting the probability that the interval between two adjacent web page requests produced during normal web page browsing is greater than or equal to δ to θ₀, that is, Pr[e_i=0|H₀]=θ₀, and the probability that it is less than δ to 1-θ₀, that is, Pr[e_i=1|H₀]=1-θ₀; setting the probability that the interval between two adjacent web page requests produced by a web crawler is greater than or equal to δ to θ₁, that is, Pr[e_i=0|H₁]=θ₁, and the probability that it is less than δ to 1-θ₁, that is, Pr[e_i=1|H₁]=1-θ₁; it is assumed that θ₀ > θ₁ and that the conditional random variables e_i|H_j are independent and identically distributed;

A53, calculating the likelihood ratio V(E) of producing the elementary event sequence E under the two hypotheses H₀ and H₁:

V(E) = \frac{\Pr[E \mid H_{1}]}{\Pr[E \mid H_{0}]} = \prod_{i=1}^{n-1} \frac{\Pr[e_{i} \mid H_{1}]}{\Pr[e_{i} \mid H_{0}]}

A54, given two fixed thresholds η₀ and η₁ (where η₀ < η₁), comparing V(E) with η₀ and η₁: if V(E) ≥ η₁, Web client r is judged to be operating as a web crawler; if V(E) ≤ η₀, Web client r is judged to be performing normal web page browsing; if η₀ < V(E) < η₁, the web page requests from Web client r to Web server s must continue to be observed before a judgment can be made; in that case, web page requests are collected for a further period of time and pooled with the web page requests collected previously, and the method returns to step A2.
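Steps A53 and A54 can be sketched as a small three-way classifier. The function names and the string labels are hypothetical; θ₀, θ₁, η₀ and η₁ are passed in as parameters because the text leaves their values to experience or experiment.

```python
from typing import Sequence


def likelihood_ratio(events: Sequence[int], theta0: float, theta1: float) -> float:
    """V(E) = product over i of Pr[e_i | H1] / Pr[e_i | H0] for i.i.d. events."""
    v = 1.0
    for e in events:
        if e == 0:
            v *= theta1 / theta0                  # long interval: favours H0
        else:
            v *= (1.0 - theta1) / (1.0 - theta0)  # short interval: favours H1
    return v


def classify(events: Sequence[int], theta0: float, theta1: float,
             eta0: float, eta1: float) -> str:
    """Return 'crawler', 'normal', or 'undecided' (keep observing)."""
    v = likelihood_ratio(events, theta0, theta1)
    if v >= eta1:
        return "crawler"
    if v <= eta0:
        return "normal"
    return "undecided"
```

An "undecided" result corresponds to the branch η₀ < V(E) < η₁, in which more requests are collected and the computation is repeated on the enlarged sequence. For long sequences an implementation might sum logarithms instead of multiplying ratios to avoid floating-point underflow; that is a numerical design choice and not something the text prescribes.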

In the present embodiment, step A51 proposes two hypotheses H₀ and H₁, where H₀ represents that Web client r is performing normal web page browsing and H₁ represents that Web client r is operating as a web crawler. The present embodiment then uses the observed elementary event sequence E to judge which hypothesis is more likely to hold.

In the present embodiment, step A52 assumes θ₀ > θ₁, which means that two adjacent web page requests with an interval greater than or equal to δ are more probable in normal web page browsing than in web crawling; this is the key point by which the present embodiment distinguishes normal web page browsing behavior from web crawler behavior. The values of θ₀ and θ₁ can be determined from experience or from experiments, in conjunction with the size of δ; when δ takes different values, the values of θ₀ and θ₁ also change.

In the present embodiment, step A53 uses the above formula to calculate the likelihood ratio V(E) of the elementary event sequence E under the two hypotheses H₀ and H₁; the main justification is that the conditional random variables e_i|H_j are independent and identically distributed.

Step A54 requires two fixed thresholds η₀ and η₁ to be given in advance (where η₀ < η₁). The lower threshold η₀ is used to judge whether Web client r is performing normal web page browsing, and the upper threshold η₁ is used to judge whether Web client r is operating as a web crawler.

In a specific implementation, the lower threshold η₀ and the upper threshold η₁ can be estimated as follows. Suppose that the operation of web client r can be judged to be normal web page browsing as soon as m consecutive web page requests observed from web client r to Web server s all have adjacent web page request time intervals greater than or equal to the web page request time interval threshold δ. The threshold η₀ can then take the value:

η_{0} = Π_{i = 1}^{m - 1} \frac{\Pr [e_{i} = 0 | H_{1}]}{\Pr [e_{i} = 0 | H_{0}]} = {(\frac{\Pr [e_{i} = 0 | H_{1}]}{\Pr [e_{i} = 0 | H_{0}]})}^{m - 1}

Suppose likewise that the operation of web client r can be judged to be a web crawler as soon as m consecutive web page requests observed from web client r to Web server s all have adjacent web page request time intervals less than the web page request time interval threshold δ. The threshold η₁ can then take the value:

η_{1} = Π_{i = 1}^{m - 1} \frac{\Pr [e_{i} = 1 | H_{1}]}{\Pr [e_{i} = 1 | H_{0}]} = {(\frac{\Pr [e_{i} = 1 | H_{1}]}{\Pr [e_{i} = 1 | H_{0}]})}^{m - 1}

Here m is a positive integer whose value can be set according to actual conditions; the same or different values of m may be used when obtaining η₀ and η₁. η₀ and η₁ can also be determined directly from practical experience or testing.

Several concrete examples are given below for further illustration:

In these examples, suppose that the adjacent web page request time interval threshold δ used to distinguish manual page switching from automatic page switching is 3 seconds (3000 milliseconds). Suppose that the probability that the interval between two adjacent web page requests produced during normal web page browsing is greater than or equal to 3 seconds is 0.6, so that the probability that the corrected interval between two adjacent web page requests from the web client is less than 3 seconds is 0.4. Suppose that the probability that the interval between two adjacent web page requests produced by a web crawler is greater than or equal to 3 seconds is 0.4, so that the probability that the corrected interval between two adjacent web page requests from the web client is less than 3 seconds is 0.6. Suppose that the operation of web client r can be judged to be normal web page browsing behavior as soon as 5 consecutive web page requests observed from web client r to Web server s all satisfy the condition "the adjacent web page request time interval is greater than or equal to the web page request time interval threshold δ" (i.e. m = 5); the threshold η₀ is then set to (0.4/0.6)^5 = 0.132. Suppose likewise that the operation of web client r can be judged to be a web crawler as soon as 5 consecutive web page requests observed from web client r to Web server s all satisfy the condition "the adjacent web page request time interval is less than the web page request time interval threshold δ"; the threshold η₁ is then set to (0.6/0.4)^5 = 7.59.
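The two threshold values in this example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `threshold` and the constant names are hypothetical, and the exponent is passed explicitly because the example raises the ratio to the power 5.

```python
def threshold(p_h1: float, p_h0: float, exponent: int) -> float:
    """Likelihood ratio of `exponent` identical events: (p_h1 / p_h0) ** exponent."""
    if not (0.0 < p_h0 <= 1.0 and 0.0 <= p_h1 <= 1.0):
        raise ValueError("probabilities must lie in [0, 1] and p_h0 must be non-zero")
    if exponent < 1:
        raise ValueError("exponent must be a positive integer")
    return (p_h1 / p_h0) ** exponent


# Probabilities assumed in the example (delta = 3 s).
P_E0_H0, P_E0_H1 = 0.6, 0.4  # interval >= delta (event 0) under H0 and H1
P_E1_H0, P_E1_H1 = 0.4, 0.6  # interval <  delta (event 1) under H0 and H1

eta0 = threshold(P_E0_H1, P_E0_H0, 5)  # lower threshold
eta1 = threshold(P_E1_H1, P_E1_H0, 5)  # upper threshold
print(round(eta0, 3), round(eta1, 2))  # prints 0.132 7.59
```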

For example, suppose that, according to step A1 of the web crawler identification method, 10 web page requests sent by a certain CC client to the protected Web server s have been collected; the initiation time and response time of these 10 web page requests are shown in Table 1.

Table 1: The 10 web page requests sent by a certain CC client to the protected Web server s

Assume that the measured average response time of the Web server to all web page requests is 10 milliseconds. According to step A2 of the web crawler identification method and the adjacent request interval correction step in A4, the corrected adjacent web page request time interval sequence T, which contains 9 elements, is calculated as shown in Table 2 (assuming the correction factor k = 1 - (μ_i - η)/μ_i).

Table 2: Corrected adjacent web page request time intervals for the 10 web page requests shown in Table 1

According to step A4 of the web crawler identification method and the preset adjacent web page request time interval threshold δ = 3000 milliseconds, the elementary event sequence E shown in Table 3 is obtained.

Table 3: Elementary event sequence

Element number	1	2	3	4	5	6	7	8	9
Elementary event	1	1	1	1	1	1	0	1	1

According to step A5 of the web crawler identification method, with the preset lower threshold η₀ = 0.132 and upper threshold η₁ = 7.59, the likelihood ratio of the elementary event sequence E is first calculated:

V(E) = (0.6/0.4) * (0.6/0.4) * (0.6/0.4) * (0.6/0.4) * (0.6/0.4) * (0.6/0.4) * (0.4/0.6) * (0.6/0.4) * (0.6/0.4) = 17.09, which is greater than the upper threshold η₁ (whose value is 7.59); therefore, the operation of this web client r is judged to be web crawler behavior.
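The product above can be evaluated mechanically from the elementary event sequence of Table 3. The following sketch uses the probabilities assumed in this example; the names `P_GIVEN_H0`, `P_GIVEN_H1` and `likelihood_ratio` are hypothetical.

```python
from math import prod

# Pr[e_i = value | H_j] as assumed in the worked example (delta = 3 s).
P_GIVEN_H0 = {0: 0.6, 1: 0.4}  # normal web page browsing
P_GIVEN_H1 = {0: 0.4, 1: 0.6}  # web crawler


def likelihood_ratio(events):
    """V(E) = product over i of Pr[e_i | H1] / Pr[e_i | H0]."""
    return prod(P_GIVEN_H1[e] / P_GIVEN_H0[e] for e in events)


E = [1, 1, 1, 1, 1, 1, 0, 1, 1]  # elementary event sequence of Table 3
v = likelihood_ratio(E)
is_crawler = v >= 7.59  # compare with the upper threshold eta1 of the example
```

Eight factors of 0.6/0.4 and one factor of 0.4/0.6 leave seven net factors of 1.5, so `v` exceeds the upper threshold and `is_crawler` is true.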

If the adjacent web page request time intervals were not corrected according to each web page request response time, the traditional (i.e. existing) decision method might judge this behavior to be normal web page browsing.

Embodiment 2

This embodiment describes a device for identifying web crawlers, which can implement the method for identifying web crawlers described in Embodiment 1. The device comprises:

An acquisition unit, for acquiring the web page requests from a web client to a Web server within a period of time, and for measuring the average response time of the Web server to all web page requests;

A correction processing unit, for measuring each adjacent web page request time interval and each web page request response time, correcting the adjacent web page request time intervals according to the web page request response times, and judging whether each corrected adjacent web page request time interval is greater than or equal to a predetermined adjacent web page request time interval threshold δ;

A recognition unit, for judging whether the operation of the web client is a web crawler according to whether the judgment results meet a preset condition.

In the present embodiment, the correction processing unit specifically comprises:

A computing module, for calculating the time interval between each pair of adjacent web page requests in the web page request sequence W acquired by the acquisition unit, and each web page request response time;

The correction module corrects the calculated adjacent web page request time interval t_i as follows. When the web page request response time μ_i is greater than the average web page request response time η, the response is considered slow, so the response time must be compensated for: the calculated adjacent web page request time interval t_i is multiplied by a penalty factor k that is less than 1, and the product is the corrected adjacent web page request time interval, which is smaller than the calculated interval. The penalty factor k less than 1 can be 1 - (μ_i - η)/μ_i, where μ_i is the web page request response time and η is the average response time of all web page requests.

A recording module, for judging whether each corrected time interval is greater than or equal to the preset adjacent web page request time interval threshold δ; if so, the event element e_i corresponding to that time interval is recorded as 0, otherwise it is recorded as 1; an elementary event sequence E comprising the event element e_i corresponding to each time interval is thereby obtained;
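The computing, correction and recording modules can be sketched together as follows. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, interval i is assumed to run between the start times of requests i and i+1 and to be corrected with the response time of request i, and no correction (k = 1) is assumed when the response time does not exceed the average.

```python
from typing import List, Sequence


def penalty_factor(mu_i: float, eta: float) -> float:
    """k = 1 - (mu_i - eta) / mu_i for slow responses, otherwise 1."""
    if mu_i > eta:
        return 1.0 - (mu_i - eta) / mu_i  # algebraically equal to eta / mu_i
    return 1.0  # assumption: responses at or below the average are not corrected


def elementary_events(start_ms: Sequence[float], response_ms: Sequence[float],
                      eta: float, delta: float) -> List[int]:
    """Build E: 0 if the corrected interval is >= delta, otherwise 1."""
    if len(start_ms) != len(response_ms):
        raise ValueError("one response time is required per request")
    events = []
    for i in range(len(start_ms) - 1):
        t_i = start_ms[i + 1] - start_ms[i]                    # computing module
        corrected = t_i * penalty_factor(response_ms[i], eta)  # correction module
        events.append(0 if corrected >= delta else 1)          # recording module
    return events


# Hypothetical data (milliseconds); these are not the values of Table 1.
starts = [0, 4000, 9000, 10000]
responses = [10, 2000, 10, 10]
print(elementary_events(starts, responses, eta=10, delta=3000))  # prints [0, 1, 1]
```

In this made-up trace the second interval is 5000 ms long, but the slow 2000 ms response shrinks it to about 25 ms after correction, so it is recorded as event 1 rather than event 0.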

The recognition unit judging whether the operation of the web client is a web crawler according to whether the judgment results meet a preset condition means the following:

The recognition unit matches the elementary event sequence E against hypotheses H₀ and H₁ respectively, where H₀ represents that the operation of web client r is normal web page browsing behavior and H₁ represents that the operation of web client r is a web crawler. If the gap between the degree to which the elementary event sequence E matches hypothesis H₁ and the degree to which it matches hypothesis H₀ is greater than a degree threshold, the operation of web client r is judged to be a web crawler; otherwise it is judged to be normal web page browsing behavior.

In the present embodiment, the recognition unit comprises:

V (E) = \frac{\Pr [E | H_{1}]}{\Pr [E | H_{0}]} = Π_{i = 1}^{n - 1} \frac{\Pr [e_{i} | H_{1}]}{\Pr [e_{i} | H_{0}]}

In the present embodiment, the recognition unit further comprises:

η_{0} = Π_{i = 1}^{m - 1} \frac{\Pr [e_{i} = 0 | H_{1}]}{\Pr [e_{i} = 0 | H_{0}]} = {(\frac{\Pr [e_{i} = 0 | H_{1}]}{\Pr [e_{i} = 0 | H_{0}]})}^{m - 1}

Here m is a positive integer. The threshold setting module can obtain web page requests through the acquisition unit, have them judged by the judging unit, and count the judgment results; if m consecutive web page requests all have adjacent web page request time intervals less than (or greater than or equal to) the web page request time interval threshold δ, the threshold η₁ (or η₀) is calculated. The module can of course also obtain and judge the web page requests directly.

Other implementation details may be the same as in Embodiment 1.

Of course, the present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art may make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the claims of the present invention.

Claims

1. A method for identifying web crawlers, characterized in that the method comprises the following steps:

2. The method as claimed in claim 1, characterized in that

the penalty factor k less than 1 in step A3 is 1 - (μ_i - η)/μ_i.

3. The method as claimed in claim 1 or 2, characterized in that step A5 comprises:

V (E) = \frac{\Pr [E | H_{1}]}{\Pr [E | H_{0}]} = Π_{i = 1}^{n - 1} \frac{\Pr [e_{i} | H_{1}]}{\Pr [e_{i} | H_{0}]}

A54. Comparing V(E) with two fixed thresholds η₀ and η₁ respectively, where η₀ < η₁: if V(E) ≥ η₁, the operation of the Web client is judged to be a web crawler; if V(E) ≤ η₀, the operation of the Web client is judged to be normal web page browsing.

4. The method as claimed in claim 3, characterized in that, in step A54:

η_{0} = Π_{i = 1}^{m - 1} \frac{\Pr [e_{i} = 0 | H_{1}]}{\Pr [e_{i} = 0 | H_{0}]} = {(\frac{\Pr [e_{i} = 0 | H_{1}]}{\Pr [e_{i} = 0 | H_{0}]})}^{m - 1}

η_{1} = Π_{i = 1}^{m - 1} \frac{\Pr [e_{i} = 1 | H_{1}]}{\Pr [e_{i} = 1 | H_{0}]} = {(\frac{\Pr [e_{i} = 1 | H_{1}]}{\Pr [e_{i} = 1 | H_{0}]})}^{m - 1}

where m is a positive integer.

5. A device for identifying web crawlers, characterized in that it comprises:

A recognition unit, for matching the elementary event sequence E against hypotheses H₀ and H₁ respectively, where H₀ represents that the operation of the Web client is normal web page browsing behavior and H₁ represents that the operation of the Web client is a web crawler; if the gap between the degree to which the elementary event sequence E matches hypothesis H₁ and the degree to which it matches hypothesis H₀ is greater than a degree threshold, the operation of Web client r is judged to be a web crawler; otherwise it is judged to be normal web page browsing behavior.

6. The device as claimed in claim 5, characterized in that

the penalty factor k less than 1 is 1 - (μ_i - η)/μ_i.

7. The device as claimed in claim 5 or 6, characterized in that the recognition unit comprises:

V (E) = \frac{\Pr [E | H_{1}]}{\Pr [E | H_{0}]} = Π_{i = 1}^{n - 1} \frac{\Pr [e_{i} | H_{1}]}{\Pr [e_{i} | H_{0}]}

8. The device as claimed in claim 7, characterized in that the recognition unit further comprises:

η_{0} = Π_{i = 1}^{m - 1} \frac{\Pr [e_{i} = 0 | H_{1}]}{\Pr [e_{i} = 0 | H_{0}]} = {(\frac{\Pr [e_{i} = 0 | H_{1}]}{\Pr [e_{i} = 0 | H_{0}]})}^{m - 1}

η_{1} = Π_{i = 1}^{m - 1} \frac{\Pr [e_{i} = 1 | H_{1}]}{\Pr [e_{i} = 1 | H_{0}]} = {(\frac{\Pr [e_{i} = 1 | H_{1}]}{\Pr [e_{i} = 1 | H_{0}]})}^{m - 1}

where m is a positive integer.