CN102916935B - Method for preventing crawling of website content - Google Patents

Info

Publication number
CN102916935B
Authority
CN
China
Prior art keywords
client, crawl, blacklist, time, request
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110222891.1A
Other languages
Chinese (zh)
Other versions
CN102916935A (en)
Inventor
刘翔
黄有富
彭平源
管燕卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN HQEW CO Ltd
Original Assignee
SHENZHEN HQEW CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2011-08-04
Filing date
2011-08-04
Publication date
2017-08-25
Application filed by SHENZHEN HQEW CO Ltd
Priority to CN201110222891.1A
Publication of CN102916935A
Application granted
Publication of CN102916935B
Legal status: Active
Anticipated expiration

Abstract

The present invention provides a method for preventing website content from being crawled. Rules for identifying crawling behavior are established first. The web server obtains client information and passes it to an anti-crawl system server. The anti-crawl system server verifies the information passed by the web server and returns a verification flag to the web server, and the web server decides, according to that result, whether to execute the data query for the requested page or to output an access-denied prompt. By strictly defining the verification flow and screening client requests, the method effectively prevents website data from being crawled. In addition to the verification flow, a timed automatic update mechanism keeps the data in the blacklist table and the client status table up to date, so that the whole flow runs more effectively and stably.

Description

Method for preventing crawling of website content
Technical field
The present invention relates to a method for preventing website content from being crawled.
Background art
"Crawling", as used herein, refers to a program obtaining data from other websites in a targeted way according to specified rules.
In the early years of the internet, search engines emerged, which build a platform from large volumes of data by crawling website content. The technique obtains website addresses through various channels, fetches page content by address, and analyzes the fetched content to extract the corresponding data. There is also data crawling by non-search-engine platforms: competitors or other related enterprises crawl specific information to gain business value.
Another kind of crawling is malicious. Any website, corporate or personal, may have competitors, and a competitor may attack it by various technical means in order to paralyze it. A common attack is to crawl data in bulk so that the target's web server is overloaded and brought down.
Search-engine indexing, crawling for business value, and malicious attack crawling, as described above, raise two main problems. First, data is stolen on a large scale, which affects the website's business and may expose private data, harming individuals or enterprises. Second, whether the crawling is normal or a malicious attack, it directly or indirectly degrades the performance of the web server and thus the stability of the website; malicious attack crawling in particular directly damages the interests of the website and the enterprise. For crawled websites, especially those built mainly on original content, such crawling consumes a large share of the site's network resources, lowering its speed and operating efficiency, and it also infringes the site's intellectual property, harming its interests.
Summary of the invention
The object of the present invention is to provide a method for preventing website content from being crawled, which can quickly, stably, and effectively prevent large-scale crawling of website data.
The technical solution adopted by the present invention to solve this technical problem is as follows:
A method for preventing website content from being crawled comprises the following steps:
1. Rules for identifying crawling behavior are established first.
2. The web server obtains client information and passes it to the anti-crawl system server.
3. The anti-crawl system server verifies the information passed by the web server and returns a verification flag to the web server, and the web server decides according to that result whether to execute the data query for the requested page or to output an access-denied prompt.
Specifically, the rule in step (1) consists of the number of requests a client IP makes to the server within a unit time and the paths requested.
Preferably, the client information in step (2) includes the IP address, the requested URL, and the time of the current request.
Specifically, in step (3) the anti-crawl system server sets a unit time and a request limit; a client that exceeds that number of requests within the unit time is judged to be crawling.
Preferably, in step (3) a restriction duration is set for blacklisted clients; during that period, all requests from the client are rejected.
Preferably, step (3) further involves a blacklist table and a client status table, both stored in server memory.
The blacklist table stores the client IP address, the time the client was blacklisted, and the restriction duration.
The client status table stores information on every client that made requests within the unit time, including the client IP address, the time of the first request, and the total number of requests.
Further, step (3) also provides a timed automatic update mechanism, which updates the data in the blacklist table and the client status table at a preset interval.
Specifically, during the timed update, the records of all clients in the blacklist table are first retrieved and iterated over. For each record, its restriction duration is used to judge whether the interval between the time the client was blacklisted and the current time is greater than or equal to the restriction duration. If so, the client's record is removed from the blacklist table; if not, nothing is done.
Specifically, during the timed update, the client status table data is retrieved and iterated over, and it is judged whether the interval between the time of the first request and the current time is greater than or equal to the unit time. If so, the client is removed from the client status table; if not, nothing is done.
Preferably, the restriction duration of every client in the blacklist table defaults to the globally set restriction duration, and a client's restriction duration can be changed by modifying its restriction duration value in the blacklist table.
Beneficial effects of the present invention: by strictly defining the verification flow and screening client requests, the proposed method effectively prevents website data from being crawled. In addition to the verification flow, a timed automatic update mechanism keeps the data in the blacklist table and the client status table up to date, so that the whole flow runs more effectively and stably.
Brief description of the drawings
Fig. 1 is a schematic diagram of the network structure of the invention.
Fig. 2 is a flow chart of request verification in the invention.
Fig. 3 is a flow chart of the timed automatic update mechanism of the invention.
Embodiment
The present invention is described in detail below with reference to the drawings and embodiments.
Fig. 1 shows the schematic network structure of the invention, which includes the web server, the anti-crawl system server, and the client. A method for preventing website content from being crawled comprises the following steps:
1. Rules for identifying crawling behavior are established first.
2. The web server obtains client information and passes it to the anti-crawl system server.
3. The anti-crawl system server verifies the information passed by the web server and returns a verification flag to the web server, and the web server decides according to that result whether to execute the data query for the requested page or to output an access-denied prompt.
The rule in step (1) consists of the number of requests a client IP makes to the server within a unit time and the paths requested.
The client information in step (2) includes the IP address, the requested URL, and the time of the current request.
In step (3) the anti-crawl system server sets a unit time and a request limit; a client that exceeds that number of requests within the unit time is judged to be crawling.
In step (3) a restriction duration is set for blacklisted clients; during that period, all requests from the client are rejected.
Step (3) further involves a blacklist table and a client status table, both stored in server memory.
The blacklist table stores the client IP address, the time the client was blacklisted, and the restriction duration.
The client status table stores information on every client that made requests within the unit time, including the client IP address, the time of the first request, and the total number of requests.
Step (3) also provides a timed automatic update mechanism, which updates the data in the blacklist table and the client status table at a preset interval.
During the timed update, the records of all clients in the blacklist table are first retrieved and iterated over. For each record, its restriction duration is used to judge whether the interval between the time the client was blacklisted and the current time is greater than or equal to the restriction duration. If so, the client's record is removed from the blacklist table; if not, nothing is done.
During the timed update, the client status table data is retrieved and iterated over, and it is judged whether the interval between the time of the first request and the current time is greater than or equal to the unit time. If so, the client is removed from the client status table; if not, nothing is done.
The restriction duration of every client in the blacklist table defaults to the globally set restriction duration, and a client's restriction duration can be changed by modifying its restriction duration value in the blacklist table.
Fig. 2 shows the verification flow of the invention. The web-server-side program is written in whatever programming language the website's own platform uses, such as ASP.NET (C#), PHP, or ASP. The program implements two functions. First, it obtains the client's information, including the IP address, the requested URL, and the time of the current request, and passes it to the anti-crawl system server. Second, once the anti-crawl system server has verified the information passed by the web server and returned a verification flag, the web server decides according to that result whether to execute the data query for the requested page or to output an access-denied prompt.
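The two web-server-side functions described above can be sketched roughly as follows. This is a minimal illustration in Python rather than one of the platform languages named in the text, and every name in it (`collect_client_info`, `handle_request`, `DENIED_MESSAGE`, and the `verify` and `run_query` callables) is hypothetical. The call to the anti-crawl system server is abstracted as a plain function, since the patent does not specify the transport between the two servers.

```python
import time
from typing import Callable, Dict, Optional

# Hypothetical prompt text, modeled on the example message in the description.
DENIED_MESSAGE = "You have been blacklisted. Please try again later."


def collect_client_info(ip: str, url: str, now: Optional[float] = None) -> Dict[str, object]:
    """Function 1: gather the IP address, requested URL, and request time."""
    return {"ip": ip, "url": url, "time": time.time() if now is None else now}


def handle_request(
    ip: str,
    url: str,
    verify: Callable[[Dict[str, object]], bool],  # stands in for the anti-crawl system server
    run_query: Callable[[str], str],              # stands in for the page's data query
    now: Optional[float] = None,
) -> str:
    """Function 2: act on the verification flag returned by the anti-crawl server."""
    info = collect_client_info(ip, url, now)
    if verify(info):
        # Verification passed: execute the data query for the requested page.
        return run_query(url)
    # Verification failed: skip the query and return only the denial prompt.
    return DENIED_MESSAGE
```

Keeping the data query behind the verification call means a rejected client never triggers any business-data lookup, which is the load-saving property the description relies on.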
The anti-crawl system server has two main global settings. The first is the "unit time" and the permitted "request limit", that is, how many requests are allowed within a given period; a client that exceeds this number is judged to be crawling. For example, only 1000 requests might be allowed within 5 minutes; the specific values should be estimated from each website's actual data. The second is the "restriction duration" for blacklisted clients, during which all requests from the client are rejected; how long to restrict must be set according to actual conditions. Besides these two settings, the anti-crawl system server also creates two data tables, the blacklist table and the client status table. The two tables could be stored either in database tables or in server memory; for performance reasons, the present invention stores both in server memory. The blacklist table stores the client IP address, the time the client was blacklisted, and the "restriction duration". The client status table stores information on all clients that have made requests within the most recent period (the unit time), including the client IP address, the time of the first request, and the total number of requests.
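As a rough illustration of the settings and the two in-memory tables, the sketch below models each table as a Python dictionary keyed by client IP address. All identifiers are hypothetical. The 5-minute window and the 1000-request limit are the example values from the text; the one-hour restriction duration is an assumed placeholder, since the description leaves that value to the operator.

```python
from typing import Dict, NamedTuple

# Global settings. Only the first two values come from the example in the text.
UNIT_TIME_SECONDS = 5 * 60             # "unit time": 5 minutes
REQUEST_LIMIT = 1000                   # "request limit" within one unit time
RESTRICTION_DURATION_SECONDS = 3600.0  # "restriction duration": assumed value


class BlacklistEntry(NamedTuple):
    listed_at: float             # time the client was blacklisted (epoch seconds)
    restriction_duration: float  # how long the client stays blacklisted (seconds)


class ClientState(NamedTuple):
    first_request_at: float  # time of the first request in the current unit time
    request_count: int       # total requests seen in the current unit time


# Both tables are held in server memory, keyed by client IP address.
blacklist: Dict[str, BlacklistEntry] = {}
client_states: Dict[str, ClientState] = {}
```

Storing the restriction duration on each blacklist entry, rather than reading the global setting at check time, is what later allows a single client's restriction to be lengthened independently.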
When the anti-crawl system server receives the request passed by the web server, it first matches the client's IP address against the records in the blacklist table. If a record exists, the client is to be refused: the anti-crawl system server returns a refusal flag to the web server, and the web server, on receiving the refusal, does not execute the business data query but directly outputs a prompt to the client, such as "You have been blacklisted, please visit again later". The client receives only this prompt and obtains no useful data.
If the client is not found in the blacklist, its IP address is then matched against the client status table. If no record exists, the client's IP and the current time are added to the client status table, its request count in the table is set to 1, and a verification-passed flag is returned to the web server. On receiving the verification-passed flag from the anti-crawl system server, the web server continues with the business data query for the requested page and returns the result to the client.
If the client is found in the client status table, it has already made requests within the unit time, and its request count in the table is incremented by 1. The server then judges whether the incremented count is greater than the "request limit" for the "unit time" described above. If it is greater, the client is blacklisted and removed from the client status table; blacklisting stores the client IP, the current system time, and the "restriction duration" value set as described above. A verification-failed flag is then returned to the web server, which outputs the prompt to the client. If the incremented count is not greater than the "request limit" for the "unit time", a verification-passed flag is returned directly to the web server, which executes the data query for the requested page and returns the result to the client.
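The verification flow just described can be sketched as a small Python class. This is an illustrative reading of the text, not the patented implementation: the class name, method name, and tuple layouts are assumptions, and the current time is passed in explicitly so the logic is easy to test. As in the description, window expiry is not handled here; it is left to the separate timed update.

```python
from typing import Dict, Tuple


class AntiCrawlChecker:
    """Hypothetical in-memory verifier following the flow in the description."""

    def __init__(self, request_limit: int = 1000, restriction_duration: float = 3600.0):
        self.request_limit = request_limit
        self.restriction_duration = restriction_duration
        # ip -> (time blacklisted, restriction duration)
        self.blacklist: Dict[str, Tuple[float, float]] = {}
        # ip -> (time of first request, request count)
        self.client_states: Dict[str, Tuple[float, int]] = {}

    def verify(self, ip: str, now: float) -> bool:
        """Return True if the request passes verification, False if refused."""
        # 1. A blacklisted client is refused outright.
        if ip in self.blacklist:
            return False
        # 2. A client not yet in the status table is recorded with a count of 1.
        state = self.client_states.get(ip)
        if state is None:
            self.client_states[ip] = (now, 1)
            return True
        # 3. A known client has its count incremented and compared to the limit.
        first_request_at, count = state
        count += 1
        if count > self.request_limit:
            # Over the limit: move the client from the status table to the blacklist.
            del self.client_states[ip]
            self.blacklist[ip] = (now, self.restriction_duration)
            return False
        self.client_states[ip] = (first_request_at, count)
        return True
```

With `request_limit=3`, for example, the first three calls to `verify` for one IP return `True`, the fourth blacklists the client and returns `False`, and later calls keep returning `False` until the entry is purged.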
Fig. 3 shows the flow of the timed automatic update mechanism of the invention. Apart from the verification flow, the data in both the blacklist table and the client status table is time-sensitive, so a timed automatic update mechanism must be established. Preferably, the present invention sets the automatic update interval to one minute, so that both tables are updated every minute. First, the records of all clients in the blacklist table are retrieved and iterated over. For each record, its "restriction duration" is used to judge whether the interval between the time the client was blacklisted and the current time is greater than or equal to the "restriction duration"; if so, the client's record is removed from the blacklist table, and otherwise nothing is done. Likewise, every minute the client status table data is automatically retrieved and iterated over, and it is judged whether the interval between the time of the first request and the current time is greater than or equal to the "unit time"; if so, the client is removed from the client status table, and otherwise nothing is done.
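One pass of the timed update can be sketched as below, assuming the two tables are dictionaries keyed by IP whose values are `(time blacklisted, restriction duration)` and `(time of first request, request count)` tuples. The function name and its return value are hypothetical. The once-per-minute scheduling itself is left out; it could be driven by any periodic timer.

```python
from typing import Dict, Tuple


def purge_expired(
    blacklist: Dict[str, Tuple[float, float]],
    client_states: Dict[str, Tuple[float, int]],
    unit_time: float,
    now: float,
) -> Tuple[int, int]:
    """Run one update pass and return how many records were removed from each table."""
    # Blacklist: drop entries whose own restriction duration has elapsed.
    expired_blocks = [
        ip for ip, (listed_at, duration) in blacklist.items()
        if now - listed_at >= duration
    ]
    for ip in expired_blocks:
        del blacklist[ip]

    # Client status table: drop entries whose unit-time window has elapsed.
    expired_states = [
        ip for ip, (first_request_at, _count) in client_states.items()
        if now - first_request_at >= unit_time
    ]
    for ip in expired_states:
        del client_states[ip]

    return len(expired_blocks), len(expired_states)
```

The expired keys are collected before deletion because a Python dictionary cannot be resized while it is being iterated.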
In the blacklist table, the restriction duration of every client defaults to the global "restriction duration" setting. If a particular client needs to be restricted for longer, it is only necessary to modify that client's "restriction duration" value in the blacklist table.
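Changing one client's restriction could look like the following sketch, which assumes the blacklist is a dictionary mapping IP to a `(time blacklisted, restriction duration)` tuple. The function name is hypothetical.

```python
from typing import Dict, Tuple


def set_restriction_duration(
    blacklist: Dict[str, Tuple[float, float]], ip: str, new_duration: float
) -> bool:
    """Override the restriction duration of one blacklisted client.

    Returns False if the client is not currently blacklisted.
    """
    entry = blacklist.get(ip)
    if entry is None:
        return False
    listed_at, _old_duration = entry
    # Keep the original blacklisting time; only the duration changes.
    blacklist[ip] = (listed_at, new_duration)
    return True
```

Because the blacklisting time is left untouched, the new duration is still measured from the moment the client was first blacklisted.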
The blacklist referred to herein means the blacklist table. Technically, blacklisting a client means storing the client's data in the blacklist table: the blacklist table is an object in the program, whereas blacklisting is a business action.
The embodiments above merely describe preferred embodiments of the present invention and do not limit its scope. Without departing from the design spirit of the invention, all modifications and improvements made to its technical solution by those of ordinary skill in the art shall fall within the scope of protection determined by the claims.
Parts not covered by the present invention are the same as the prior art or can be implemented using the prior art.

Claims (7)

1. A method for preventing website content from being crawled, comprising the following steps:
(1) rules for identifying crawling behavior are established first;
(2) the web server obtains client information and passes it to the anti-crawl system server;
(3) the anti-crawl system server verifies the information passed by the web server and returns a verification flag to the web server, and the web server decides according to that result whether to execute the data query for the requested page or to output an access-denied prompt;
step (3) includes a blacklist table and a client status table, both stored in server memory;
the blacklist table stores the client IP address, the time the client was blacklisted, and the restriction duration;
the client status table stores information on every client that made requests within the unit time, including the client IP address, the time of the first request, and the total number of requests;
step (3) also provides a timed automatic update mechanism, which updates the data in the blacklist table and the client status table at a preset interval; during the timed update, the records of all clients in the blacklist table are first retrieved and iterated over, and for each record its restriction duration is used to judge whether the interval between the time the client was blacklisted and the current time is greater than or equal to the restriction duration; if so, the client's record is removed from the blacklist table; if not, nothing is done.
2. The method for preventing website content from being crawled as claimed in claim 1, characterized in that the rule in step (1) consists of the number of requests a client IP makes to the server within a unit time and the paths requested.
3. The method for preventing website content from being crawled as claimed in claim 1, characterized in that the client information in step (2) includes the IP address, the requested URL, and the time of the current request.
4. The method for preventing website content from being crawled as claimed in claim 1, characterized in that in step (3) the anti-crawl system server sets a unit time and a request limit, and a client that exceeds that number of requests within the unit time is judged to be crawling.
5. The method for preventing website content from being crawled as claimed in claim 1, characterized in that in step (3) a restriction duration is set for blacklisted clients, and during that period all requests from the client are rejected.
6. The method for preventing website content from being crawled as claimed in claim 1, characterized in that during the timed update the client status table data is retrieved and iterated over, and it is judged whether the interval between the time of the first request and the current time is greater than or equal to the unit time; if so, the client is removed from the client status table; if not, nothing is done.
7. The method for preventing website content from being crawled as claimed in claim 1, characterized in that the restriction duration of every client in the blacklist table defaults to the globally set restriction duration, and a client's restriction duration can be changed by modifying its restriction duration value in the blacklist table.
CN201110222891.1A 2011-08-04 2011-08-04 A kind of method of the anti-crawl of web site contents Active CN102916935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110222891.1A CN102916935B (en) 2011-08-04 2011-08-04 A kind of method of the anti-crawl of web site contents

Publications (2)

Publication Number Publication Date
CN102916935A CN102916935A (en) 2013-02-06
CN102916935B true CN102916935B (en) 2017-08-25

Family

ID=47615169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110222891.1A Active CN102916935B (en) 2011-08-04 2011-08-04 A kind of method of the anti-crawl of web site contents

Country Status (1)

Country Link
CN (1) CN102916935B (en)

Also Published As

Publication number Publication date
CN102916935A (en) 2013-02-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant