CN104902008A

CN104902008A - Crawler data processing method

Info

Publication number: CN104902008A
Application number: CN201510200123.4A
Authority: CN
Inventors: 严澜
Original assignee: Chengdu Chuan Hang Information Technology Co Ltd
Current assignee: Suzhou Chong Xing Mdt InfoTech Ltd
Priority date: 2015-04-26
Filing date: 2015-04-26
Publication date: 2015-09-09

Abstract

The invention discloses a data processing method for crawlers. The method comprises the following steps: 1) an administrator logs in and enters a whitelist in the back end using a blacklist/whitelist pool, and the validity period of the blacklist is set to X minutes; 2) it is determined whether the IP currently accessing the page is on the blacklist or the whitelist; 3) when the IP is on the whitelist, it is allowed to carry out normal page logic; 4) when the IP is on the blacklist, it is sent to the verification-code page, and after the IP fills in the verification code it is released from the blacklist and allowed to carry out normal page logic; and 5) when the IP is on neither the blacklist nor the whitelist, it enters a counter pool, where a cache counter is incremented by 1 before the counter expires, and the accumulated count is compared with a threshold: when the count is lower than the threshold, the IP is determined not to be a crawler, and when the count is greater than the threshold, the IP is determined to be a crawler.

Description

A data processing method for crawlers

Technical field

The present invention relates to the technical field of big data processing, and specifically to a data processing method for crawlers.

Background technology

Web crawlers are an important component of a search engine's crawling system. The main purpose of a crawler is to download pages from the Internet to local storage, forming a mirror backup of online content. Analysis of Apache logs showed that 40% of a certain system's bandwidth and server resources were consumed by crawlers. Setting aside the 10%-15% attributable to search-engine crawlers, an anti-crawler strategy can save 20%-25% of resources, which in effect is an indirect optimization of the web system.

A crawler request is issued by a mechanism such as httpClient or by a command such as curl or wget, whereas a normal user request generally goes through a browser. A crawler request generally does not execute the asynchronous JavaScript operations in the page, whereas a user request executes the asynchronous JavaScript operations provided by jQuery.

Summary of the invention

The object of the present invention is to provide a data processing method for crawlers that effectively distinguishes crawler requests from normal user requests, so that crawler requests can be blocked and system resources saved.

The object of the present invention is achieved through the following technical solution: a data processing method for crawlers, comprising the following steps:

Step 1: an administrator logs in and, using the blacklist/whitelist pool, enters the whitelist in the back end, and the validity period of the blacklist is set to X minutes;

Step 2: determine whether the IP currently accessing the page is on the blacklist or the whitelist;

Step 3: when the IP currently accessing the page is on the whitelist, allow the IP to carry out normal page logic;

Step 4: when the IP currently accessing the page is on the blacklist, send the IP to the verification-code page; after the IP fills in the verification code, release the IP from the blacklist and allow it to carry out normal page logic;

Step 5: when the IP currently accessing the page is on neither the blacklist nor the whitelist, screen it with a traffic statistics tool and enter it into the counter pool, where the cache counter is incremented by 1 before the counter expires. The accumulated count is then compared with the threshold. When the count is less than the threshold, the IP is judged not to be a crawler and is allowed to carry out normal page logic. When the count is greater than the threshold, the IP is placed on the blacklist and sent to the verification-code page; after the IP fills in the verification code, it is released from the blacklist; if the IP does not fill in the verification code, it is judged to be a crawler;

Step 6: when the IP leaves the page, an asynchronous JS request is issued and the cached count is decremented by 1.
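
The decision flow of Steps 2 to 6 can be sketched in Python roughly as follows. This is a minimal illustration, not the patented implementation: the names `Action`, `decide`, `captcha_solved`, and `page_exit` are hypothetical, the pools are plain in-memory sets and dictionaries without expiry, and a count exactly equal to the threshold is assumed to be allowed, since the text only covers "less than" and "greater than".

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"      # carry out normal page logic
    CAPTCHA = "captcha"  # serve the verification-code page


def decide(ip, whitelist, blacklist, counts, threshold):
    """Return the action for one page request from `ip` (Steps 2 to 5)."""
    if ip in whitelist:                 # Step 3: whitelisted IPs pass
        return Action.ALLOW
    if ip in blacklist:                 # Step 4: blacklisted IPs get the code page
        return Action.CAPTCHA
    counts[ip] = counts.get(ip, 0) + 1  # Step 5: add 1 to the counter
    if counts[ip] > threshold:          # assumption: equality is still allowed
        blacklist.add(ip)
        return Action.CAPTCHA
    return Action.ALLOW


def captcha_solved(ip, blacklist):
    """Steps 4 and 5: a filled-in verification code releases the IP."""
    blacklist.discard(ip)


def page_exit(ip, counts):
    """Step 6: the asynchronous JS request on leaving the page subtracts 1."""
    if counts.get(ip, 0) > 0:
        counts[ip] -= 1
```

A browser that calls `page_exit` after each page keeps its count near zero, while a client that never runs the page's JavaScript accumulates one count per request until it crosses the threshold.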

The rationale for the above method is as follows: a crawler request is issued by a mechanism such as httpClient or by a command such as curl or wget, whereas a normal user request generally goes through a browser. A crawler request generally does not execute the asynchronous JavaScript operations in the page, whereas a user request executes the asynchronous JavaScript operations provided by jQuery.

Therefore, the present invention uses asynchronous JavaScript operations to filter out crawler requests and place them on the blacklist. Moreover, because network IPs change, the same IP may be blacklisted one day and carry a normal user's request the next, so an IP cannot be fixed permanently as an abnormal-request IP. The current blacklist must therefore be updated in real time, with asynchronous JavaScript operations used to determine whether a currently blacklisted IP is still issuing crawler requests.

Web systems all communicate with the web container over HTTP, and each request creates at least one TCP connection between the client and the server. The netstat command shows the IPs currently connected to the server and the number of connections for each. The output of the command /bin/netstat -nat -n | grep 80 typically runs to hundreds or thousands of lines. When the number of connections for a single IP exceeds the threshold we have observed, its requests can be judged as coming from an abnormal user. Setting the threshold is critical: a large Internet café, or a school or company whose traffic leaves through a single IP, may be misjudged as making illegal requests. This strategy comprises two scheduled scripts, one that blocks IPs (tcpForbidCmd.sh) and one that releases IPs (tcpReleaseCmd.sh), run every 5 minutes and every 40 minutes respectively. The strategy in effect places a gate in front of the system, much as a 4-meter height barrier on a road prevents vehicles taller than 4 meters from passing. It can stop malicious crawlers, and crawlers newly written by novices, whose request frequency is irregular.
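
The per-IP connection count described above can be sketched in Python as follows. The sketch covers only the counting and threshold comparison; it does not reproduce tcpForbidCmd.sh or tcpReleaseCmd.sh, whose contents are not given. The function names `connections_per_ip` and `over_threshold` are hypothetical, and the parser assumes the usual Linux `netstat -nat` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State).

```python
from collections import Counter


def connections_per_ip(netstat_output, port=80):
    """Count TCP connections to local `port`, grouped by remote IP."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP connection row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state == "LISTEN":
            continue  # the listening socket is not a client connection
        # Match the port exactly, unlike a plain `grep 80`.
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        counts[foreign.rsplit(":", 1)[0]] += 1
    return counts


def over_threshold(counts, threshold):
    """Return the sorted IPs whose connection count exceeds `threshold`."""
    return sorted(ip for ip, n in counts.items() if n > threshold)
```

In practice the text passed to `connections_per_ip` would be the captured output of the netstat command, and the threshold would be tuned per system, as the paragraph above notes for shared IPs.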

The anti-crawler filtering mechanism of the present invention is also fairly simple to implement. An access counter can be kept in memcached or in local memory. Within the period before the cache entry expires (for example, 3 minutes), each access by an IP increments the counter by 1, and the cache key includes the IP. If the value read from the counter exceeds a threshold, the IP is probably problematic, so a verification-code page can be returned that requires the user to fill in a verification code. A crawler will of course not fill in the code and is therefore refused, which protects back-end resources. Setting the threshold is again important and differs from system to system. We improve this filtering mechanism to make it more accurate: an asynchronous JS request is added at the bottom of the web page, and this request decrements the counter. The counter for the IP is incremented on entering the page and decremented on leaving it, producing a difference. According to the earlier analysis, a crawler does not execute the asynchronous JS decrement request. Whether the IP is a crawler can thus be judged from the size of the resulting value.
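
A local-memory version of this counter might look like the Python sketch below. It is an illustration under stated assumptions rather than the patent's code: the class name `ExpiringCounter` is hypothetical, the expiry is fixed at the first access in each window, the value is clamped at zero, and the clock is injectable only so that expiry can be exercised without waiting.

```python
import time


class ExpiringCounter:
    """Per-IP counter whose entries expire `ttl_seconds` after first access."""

    def __init__(self, ttl_seconds=180, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # ip -> (value, expires_at)

    def _live(self, ip):
        """Return the unexpired entry for `ip`, dropping it if expired."""
        entry = self._entries.get(ip)
        if entry is None or entry[1] <= self.clock():
            self._entries.pop(ip, None)
            return None
        return entry

    def incr(self, ip):
        """Page entry: add 1, starting a new window if none is live."""
        value, expires_at = self._live(ip) or (0, self.clock() + self.ttl)
        self._entries[ip] = (value + 1, expires_at)
        return value + 1

    def decr(self, ip):
        """Page exit (asynchronous JS request): subtract 1, never below 0."""
        entry = self._live(ip)
        if entry is None:
            return 0
        value = max(entry[0] - 1, 0)
        self._entries[ip] = (value, entry[1])
        return value

    def get(self, ip):
        """Current difference between entries and exits for `ip`."""
        entry = self._live(ip)
        return entry[0] if entry else 0
```

For a browser, each `incr` is paired with a `decr`, so `get` stays near zero; for a client that never runs the page's JavaScript, `get` grows by one per request until the window expires.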

X minutes is set to 30 minutes.

The counter pool is set to expire after Y minutes, where Y minutes is set to 3 minutes.
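
The X-minute validity period of the blacklist can be sketched in Python as a pool whose entries lapse on their own. The class name `BlacklistPool` and its methods are hypothetical, and the sketch assumes the validity period is measured from the moment an IP is added.

```python
import time

BLACKLIST_TTL_SECONDS = 30 * 60  # X = 30 minutes


class BlacklistPool:
    """Blacklist whose entries lapse after a fixed validity period."""

    def __init__(self, ttl_seconds=BLACKLIST_TTL_SECONDS, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires = {}  # ip -> time at which the entry lapses

    def add(self, ip):
        """Blacklist `ip` for the validity period, starting now."""
        self._expires[ip] = self.clock() + self.ttl

    def release(self, ip):
        """Remove `ip` early, e.g. after a verification code is filled in."""
        self._expires.pop(ip, None)

    def __contains__(self, ip):
        expires_at = self._expires.get(ip)
        if expires_at is None:
            return False
        if expires_at <= self.clock():
            del self._expires[ip]  # lapsed: forget the entry
            return False
        return True
```

Letting entries lapse reflects the point made earlier that an IP blacklisted one day may belong to a normal user the next.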

The advantages of the present invention are as follows: the cost is low, and crawler requests are effectively distinguished from normal user requests, so that crawler requests can be blocked and system resources saved while misjudgment of normal user requests is avoided.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to an embodiment and the accompanying drawing, but embodiments of the present invention are not limited thereto.

Embodiment 1:

As shown in Figure 1.

A data processing method for crawlers comprises the following steps:

Step 2: determine whether the IP currently accessing the page is on the blacklist or the whitelist;

X minutes is set to 30 minutes.

The present invention can be well implemented as described above.

Claims

1. A data processing method for crawlers, characterized in that it comprises the following steps:

Step 2: determine whether the IP currently accessing the page is on the blacklist or the whitelist;

2. The data processing method for crawlers according to claim 1, characterized in that X minutes is set to 30 minutes.

3. The data processing method for crawlers according to claim 1, characterized in that the counter pool is set to expire after Y minutes, and Y minutes is set to 3 minutes.