CN102916935B

CN102916935B - Method for preventing crawling of website content

Info

Publication number: CN102916935B
Application number: CN201110222891.1A
Authority: CN
Inventors: 刘翔; 黄有富; 彭平源; 管燕卿
Original assignee: SHENZHEN HQEW CO Ltd
Current assignee: SHENZHEN HQEW CO Ltd
Priority date: 2011-08-04
Filing date: 2011-08-04
Publication date: 2017-08-25
Anticipated expiration: 2031-08-04
Also published as: CN102916935A

Abstract

The present invention provides a method for preventing the crawling of website content. First, rules for identifying crawling behavior are established. The web server then obtains client information and passes it to an anti-crawl system server. The anti-crawl system server verifies the information passed by the web server and returns a verification flag to the web server, which decides, according to that result, whether to execute the data query for the requested page or to output an access-denied prompt. By strictly defining the verification flow and guarding against crawling at the point where client requests are verified, the method effectively prevents website data from being crawled. In addition to the verification flow, a timed automatic update mechanism ensures that the data in the blacklist table and the client status table is updated promptly, so that the whole process runs more effectively and stably.

Description

Method for preventing crawling of website content

Technical field

The present invention relates to a method for preventing the crawling of website content.

Background

"Crawling", as used herein, refers to a way in which a program obtains data from other websites in a targeted manner according to specified rules.

In earlier years, search engine systems appeared on the Internet, which build a platform out of massive amounts of data by crawling website content. The technique obtains website addresses through various channels, fetches the content of web pages according to those addresses, and analyzes the fetched content to extract the corresponding data. Meanwhile, there is also data crawling by platforms other than search engines: competitors or other related enterprises crawl specific information content because it brings them commercial value.

Another kind of crawling is malicious. Enterprise websites and personal websites alike have competitors, and in order to paralyze a competitor's website, various technical methods of attack may be used. Among them, overloading the competitor's web server to the point of paralysis by crawling large amounts of data is an especially common method of attack.

Whether it is search engine indexing, crawling for commercial value, or crawling as a malicious attack, there are two main problems. First, data is stolen on a large scale, which can have a business impact on the operation of the website; some private data may also be exposed, with negative consequences for individuals or enterprises. Second, both normal crawling and malicious crawling affect the performance of the web server, directly or indirectly, and thereby reduce the stability of the website; malicious crawling in particular directly damages the interests of the website and the enterprise. For the crawled website, especially a website based mainly on original content, these kinds of crawling on the one hand occupy a large amount of its network resources and reduce the running speed and operating efficiency of the network, and on the other hand infringe its intellectual property, thereby damaging its interests.

Summary of the invention

An object of the present invention is to provide a method for preventing the crawling of website content that can quickly, stably, and effectively prevent website data from being crawled on a large scale.

The technical solution adopted by the present invention to solve the technical problem is as follows:

A method for preventing the crawling of website content comprises the following steps:

1. Establish rules for identifying crawling behavior;

2. The web server obtains client information and, after obtaining it, passes it to the anti-crawl system server;

3. The anti-crawl system server verifies the information passed by the web server and returns a verification flag to the web server, and the web server decides, according to that result, whether to execute the data query for the requested page or to output an access-denied prompt.

Specifically, the rule in step (1) consists of the number of requests a client IP makes to the server within a unit time and the paths those requests access.

Preferably, the client information in step (2) includes the IP address, the requested URL, and the time of the current request.

Specifically, in step (3) the anti-crawl system server sets a unit time and an upper limit on the number of requests; a client that exceeds that number of requests within the unit time is judged to be crawling.

Preferably, step (3) sets a restriction duration for blacklisting; during that time, all requests from the client are rejected.

Preferably, step (3) further comprises a blacklist table and a client status table, both stored in server memory.

The blacklist table stores the client IP address, the time at which the client was blacklisted, and the restriction duration.

The client status table stores information on every client that has made requests within the unit time, including the client IP address, the time of the first request, and the total number of requests.

Further, step (3) also provides a timed automatic update mechanism that updates the data in the blacklist table and the client status table at a predetermined interval.

Specifically, during the timed update, the records of all clients in the blacklist table are first retrieved and iterated over. For each record, it is determined, based on that record's restriction duration, whether the interval between the time of blacklisting and the current time is greater than or equal to the restriction duration. If so, the client's record is removed from the blacklist table; if not, nothing is done.

Specifically, during the timed update, the client status table data is first retrieved and iterated over, and it is determined whether the interval between the time of the first request and the current time is greater than or equal to the unit time. If so, the client is removed from the client status table; if not, nothing is done.

Preferably, the restriction duration of every client in the blacklist table defaults to the globally set restriction duration, and the restriction duration of a client can be changed by modifying its restriction duration value in the blacklist table.

Beneficial effects of the present invention: By strictly defining the verification flow and guarding against crawling at the point where client requests are verified, the method for preventing the crawling of website content proposed by the present invention effectively prevents website data from being crawled. In addition to the verification flow, a timed automatic update mechanism ensures that the data in the blacklist table and the client status table is updated promptly, so that the whole process runs more effectively and stably.

Brief description of the drawings

Fig. 1 is a schematic diagram of the network structure of the present invention.

Fig. 2 is a flow chart of request verification in the present invention.

Fig. 3 is a flow chart of the timed automatic update mechanism of the present invention.

Detailed description

The present invention is described in detail below with reference to the drawings and embodiments.

As shown in Fig. 1, the network structure of the present invention comprises a web server, an anti-crawl system server, and clients. A method for preventing the crawling of website content comprises the following steps:

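1. Establish rules for identifying crawling behavior;

2. The web server obtains client information and, after obtaining it, passes it to the anti-crawl system server;

3. The anti-crawl system server verifies the information passed by the web server and returns a verification flag to the web server, and the web server decides, according to that result, whether to execute the data query for the requested page or to output an access-denied prompt.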

The rule in step (1) consists of the number of requests a client IP makes to the server within a unit time and the paths those requests access.

The client information in step (2) includes the IP address, the requested URL, and the time of the current request.

In step (3), the anti-crawl system server sets a unit time and an upper limit on the number of requests; a client that exceeds that number of requests within the unit time is judged to be crawling.

Step (3) sets a restriction duration for blacklisting; during that time, all requests from the client are rejected.

Step (3) further comprises a blacklist table and a client status table, both stored in server memory.

Step (3) also provides a timed automatic update mechanism that updates the data in the blacklist table and the client status table at a predetermined interval.

During the timed update, the records of all clients in the blacklist table are first retrieved and iterated over. For each record, it is determined, based on that record's restriction duration, whether the interval between the time of blacklisting and the current time is greater than or equal to the restriction duration. If so, the client's record is removed from the blacklist table; if not, nothing is done.

During the timed update, the client status table data is first retrieved and iterated over, and it is determined whether the interval between the time of the first request and the current time is greater than or equal to the unit time. If so, the client is removed from the client status table; if not, nothing is done.

The restriction duration of every client in the blacklist table defaults to the globally set restriction duration, and the restriction duration of a client can be changed by modifying its restriction duration value in the blacklist table.

As shown in Fig. 2, which describes the verification flow of the present invention, the web program on the web server is written in the programming language used by the website's own platform, such as ASP.NET (C#), PHP, or ASP. The flow implements the following two functions. First, the web server obtains the client's information, including the IP address, the requested URL, and the time of the current request, and passes it to the anti-crawl system server. Second, the anti-crawl system server verifies the information passed by the web server and finally returns a verification flag to the web server, which decides, according to that result, whether to execute the data query for the requested page or to output an access-denied prompt.
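The web-server side of this flow can be sketched as follows. This is a minimal illustration in Python rather than one of the platform languages named above; `ClientInfo`, `handle_request`, `verify`, `run_page_query`, and `DENIED_MESSAGE` are hypothetical names, and the call to the anti-crawl system server is modeled as a plain callable because the source does not specify a transport.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ClientInfo:
    ip: str              # client IP address
    url: str             # requested URL
    requested_at: float  # time of the current request (seconds since the epoch)


# Hypothetical wording, modeled on the prompt quoted in the description.
DENIED_MESSAGE = "You have been blacklisted; please visit again later."


def handle_request(
    ip: str,
    url: str,
    verify: Callable[[ClientInfo], bool],
    run_page_query: Callable[[str], str],
) -> str:
    """Collect the client information, ask the anti-crawl server, then act."""
    info = ClientInfo(ip=ip, url=url, requested_at=time.time())
    if verify(info):                # flag returned by the anti-crawl system server
        return run_page_query(url)  # passed: execute the page's data query
    return DENIED_MESSAGE           # refused: skip the query, output the prompt
```

Keeping the decision in one place means the page's data query is never executed for a refused client, which is the point of verifying before querying.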

The anti-crawl system server has two main global settings. The first is the "unit time" and the permitted "request limit", which together specify how many requests are allowed within how long a period; a client that exceeds that number is judged to be crawling. For example, only 1000 requests might be allowed within 5 minutes; the specific values should be estimated from each website's actual data. The second is the "restriction duration" for blacklisting: during that time, all requests from the client are rejected, and the specific duration must be set according to the actual situation. In addition to these two settings, the anti-crawl system server also creates two data tables, the blacklist table and the client status table. The two tables could be stored either in database tables or in server memory; for performance reasons, the present invention stores both tables in server memory. The blacklist table mainly stores the client IP address, the time at which the client was blacklisted, and the "restriction duration". The client status table stores information on all clients that have made requests within the most recent period (the unit time), including the client IP address, the time of the first request, and the total number of requests.
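As a rough illustration, the global settings and the two in-memory tables might be modeled as below. The class and field names are assumptions made for this sketch, and the one-hour restriction duration is a placeholder, since the source gives no value for it; only the 5-minute and 1000-request figures come from the example in the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Settings:
    unit_time: float = 5 * 60             # "unit time" in seconds (5 minutes, as in the example)
    request_limit: int = 1000             # requests allowed per unit time (as in the example)
    restriction_duration: float = 3600.0  # placeholder value; not given in the source


@dataclass
class BlacklistEntry:
    ip: str                      # client IP address
    blacklisted_at: float        # time at which the client was blacklisted
    restriction_duration: float  # how long the client stays blacklisted


@dataclass
class ClientStatus:
    ip: str                  # client IP address
    first_request_at: float  # time of the first request in the current unit time
    request_count: int       # total number of requests so far


# Both tables live in server memory, keyed by client IP for fast lookup.
blacklist_table: Dict[str, BlacklistEntry] = {}
client_status_table: Dict[str, ClientStatus] = {}
```

Keying both tables by IP address keeps each lookup during verification to a single dictionary access, in line with the stated performance motivation for holding the tables in memory.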

The anti-crawl system server receives the information passed by the web server and matches the client's IP address against the records in the blacklist table. If a match exists, the request is to be refused, and a refusal flag is returned to the web server. On receiving the refusal, the web server does not execute the business data query and instead outputs a prompt message directly to the client, such as "You have been blacklisted; please visit again later". The client receives only this prompt and obtains no useful data.

If the client is not matched in the blacklist, its IP address is likewise matched against the client status table. If no record exists, the client's IP and the current time are added to the client status table, the client's request count in the table is set to 1, and a verification-passed flag is returned to the web server. On receiving the passed flag from the anti-crawl system server, the web server goes on to execute the business data query for the requested page and returns the result to the client.

If the client is matched in the client status table, it has already made requests within the unit time. The client's request count in the client status table is incremented by 1, and it is then determined whether the incremented count exceeds the "request limit" for the "unit time" described above. If it does, the client is blacklisted and removed from the client status table; blacklisting mainly stores the client's IP, the current system time, and the "restriction duration" value set as described above. A verification-failed flag is then returned to the web server, which outputs the prompt message to the client. If the incremented count does not exceed the "request limit" for the "unit time", a verification-passed flag is returned directly to the web server, which executes the data query for the requested page and returns the result to the client.
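The verification decision described above can be sketched as follows. `AntiCrawlVerifier` and its attribute names are hypothetical, and the tables are simplified to dictionaries of tuples. The sketch does not itself expire old entries: it assumes, as in the source, that a separate timed update removes them.

```python
from typing import Dict, Tuple


class AntiCrawlVerifier:
    """Sketch of the anti-crawl server's check; all names are illustrative."""

    def __init__(self, request_limit: int, restriction_duration: float) -> None:
        self.request_limit = request_limit
        self.restriction_duration = restriction_duration
        # ip -> (time blacklisted, restriction duration)
        self.blacklist: Dict[str, Tuple[float, float]] = {}
        # ip -> (time of first request, request count)
        self.status: Dict[str, Tuple[float, int]] = {}

    def verify(self, ip: str, now: float) -> bool:
        """Return True for "passed" and False for "refused"."""
        if ip in self.blacklist:
            return False                 # already blacklisted: refuse
        if ip not in self.status:
            self.status[ip] = (now, 1)   # first request in this unit time
            return True
        first_request_at, count = self.status[ip]
        count += 1
        if count > self.request_limit:
            # Over the limit: blacklist the client and drop its status record.
            self.blacklist[ip] = (now, self.restriction_duration)
            del self.status[ip]
            return False
        self.status[ip] = (first_request_at, count)
        return True
```

With a request limit of 3, for example, a client's first three requests pass, the fourth blacklists it, and every later request is refused until its blacklist record is removed.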

As shown in Fig. 3, which describes the flow of the timed automatic update mechanism of the present invention: apart from the verification flow, the data in both the blacklist table and the client status table is time-sensitive, so a timed automatic update mechanism must be established. Preferably, the present invention sets the automatic update interval to one minute, so that the data in both tables is updated every minute. First, the records of all clients in the blacklist table are retrieved and iterated over. For each record, it is determined, based on that record's "restriction duration", whether the interval between the time of blacklisting and the current time is greater than or equal to the "restriction duration". If so, the client's record is removed from the blacklist table; otherwise nothing is done. Likewise, every minute the client status table data is automatically retrieved and iterated over, and it is determined whether the interval between the time of the first request and the current time is greater than or equal to the "unit time". If so, the client is removed from the client status table; if not, nothing is done.
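One pass of the timed update might look like the sketch below. The function name `timed_update` and the tuple layout of the two tables are assumptions made for illustration; how the pass is scheduled once a minute (a timer thread, a cron-style job, and so on) is left open, as the source does not say.

```python
from typing import Dict, Tuple


def timed_update(
    blacklist: Dict[str, Tuple[float, float]],  # ip -> (time blacklisted, restriction duration)
    status: Dict[str, Tuple[float, int]],       # ip -> (time of first request, request count)
    unit_time: float,
    now: float,
) -> None:
    """Remove expired records from both tables in place."""
    # Iterate over a copy of the keys so records can be deleted while looping.
    for ip in list(blacklist):
        blacklisted_at, restriction_duration = blacklist[ip]
        if now - blacklisted_at >= restriction_duration:
            del blacklist[ip]  # restriction served: client may request again
    for ip in list(status):
        first_request_at, _count = status[ip]
        if now - first_request_at >= unit_time:
            del status[ip]     # unit time elapsed: counting starts afresh
```

Because each blacklist record carries its own restriction duration, the loop needs no global setting to decide when a record expires.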

In the blacklist table, the restriction duration of every client defaults to the globally set "restriction duration" value. If certain clients must be restricted for longer, only the "restriction duration" value of those clients in the blacklist table needs to be modified.

The term "blacklist" as used herein means the blacklist table. Technically, blacklisting a client means storing the client's data in the blacklist table; the blacklist table is a technical program object, whereas blacklisting is a business action.

The embodiments above merely describe preferred embodiments of the present invention and do not limit its scope. Without departing from the design spirit of the present invention, all modifications and improvements made to the technical solution of the present invention by a person of ordinary skill in the art shall fall within the scope of protection determined by the claims of the present invention.

Parts not covered by the present invention are the same as the prior art or can be implemented using the prior art.
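A per-client override of this kind could be sketched as follows. The helper names and the one-hour global default are hypothetical; the source only states that each record starts from the global value and can be changed individually.

```python
from typing import Dict, Tuple

GLOBAL_RESTRICTION_DURATION = 3600.0  # placeholder default, in seconds

# ip -> (time blacklisted, restriction duration)
Blacklist = Dict[str, Tuple[float, float]]


def add_to_blacklist(blacklist: Blacklist, ip: str, now: float) -> None:
    """Blacklist a client using the global default restriction duration."""
    blacklist[ip] = (now, GLOBAL_RESTRICTION_DURATION)


def set_restriction_duration(blacklist: Blacklist, ip: str, duration: float) -> None:
    """Change the restriction duration of one blacklisted client only."""
    blacklisted_at, _old_duration = blacklist[ip]
    blacklist[ip] = (blacklisted_at, duration)
```

Extending one client to a full day, for instance with `set_restriction_duration(table, ip, 86400.0)`, leaves every other record at the global default.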

Claims

1. A method for preventing the crawling of website content, comprising the following steps:

(1) establishing rules for identifying crawling behavior;

(2) the web server obtaining client information and, after obtaining it, passing it to the anti-crawl system server;

(3) the anti-crawl system server verifying the information passed by the web server and returning a verification flag to the web server, and the web server deciding, according to that result, whether to execute the data query for the requested page or to output an access-denied prompt;

wherein step (3) includes a blacklist table and a client status table, which are stored in server memory;

the blacklist table stores the client IP address, the time at which the client was blacklisted, and the restriction duration;

the client status table stores information on every client that has made requests within the unit time, including the client IP address, the time of the first request, and the total number of requests;

step (3) further provides a timed automatic update mechanism that updates the data in the blacklist table and the client status table at a predetermined interval; during the timed update, the records of all clients in the blacklist table are first retrieved and iterated over, and for each record it is determined, based on that record's restriction duration, whether the interval between the time of blacklisting and the current time is greater than or equal to the restriction duration; if so, the client's record is removed from the blacklist table; if not, nothing is done.

2. The method for preventing the crawling of website content according to claim 1, characterized in that: the rule in step (1) consists of the number of requests a client IP makes to the server within a unit time and the paths those requests access.

3. The method for preventing the crawling of website content according to claim 1, characterized in that: the client information in step (2) includes the IP address, the requested URL, and the time of the current request.

4. The method for preventing the crawling of website content according to claim 1, characterized in that: in step (3) the anti-crawl system server sets a unit time and an upper limit on the number of requests, and a client that exceeds that number of requests within the unit time is judged to be crawling.

5. The method for preventing the crawling of website content according to claim 1, characterized in that: step (3) sets a restriction duration for blacklisting, and during that time all requests from the client are rejected.

6. The method for preventing the crawling of website content according to claim 1, characterized in that: during the timed update, the client status table data is first retrieved and iterated over, and it is determined whether the interval between the time of the first request and the current time is greater than or equal to the unit time; if so, the client is removed from the client status table; if not, nothing is done.

7. The method for preventing the crawling of website content according to claim 1, characterized in that: the restriction duration of every client in the blacklist table defaults to the globally set restriction duration, and the restriction duration of a client can be changed by modifying its restriction duration value in the blacklist table.