CN102916935B - Method for preventing crawling of website content - Google Patents
Method for preventing crawling of website content
- Publication number: CN102916935B
- Application number: CN201110222891.1A
- Authority: CN (China)
- Prior art keywords: client, crawl, blacklist, time, request
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The present invention provides a method for preventing crawling of website content. Rules for identifying crawling behavior are first established. The web server obtains client information and passes it to the anti-crawl system server. The anti-crawl system server verifies the information passed by the web server and returns a verification result flag to the web server, which decides, according to that result, whether to execute the data query for the requested page or to output an access-denied prompt. By strictly defining the verification flow and guarding at the point where client requests are verified, the proposed method effectively prevents website data from being crawled. In addition to the verification flow, a timed automatic update mechanism keeps the data in the blacklist table and the client status table up to date, so that the whole process runs more effectively and stably.
Description
Technical field
The present invention relates to a method for preventing crawling of website content.
Background art
"Crawling", as used herein, refers to a program obtaining data from other websites in a targeted way according to specified rules. In earlier years, search engine systems appeared on the internet; they build platforms holding massive amounts of data by crawling website content. The technique obtains website addresses through various channels, fetches page content from those addresses, and analyzes the fetched content to extract the corresponding data. There is also data crawling by platforms other than search engines: competitors or other related enterprises crawl specific information because it brings them commercial value.
Another kind of crawling is malicious. Any website, corporate or personal, has competitors, and a competitor may use various technical means to attack a website in order to paralyze it. Overloading the competitor's web server with a large volume of crawl requests until it is paralyzed is a particularly common attack.
Whether the crawling is search engine indexing, crawling for commercial value, or a malicious attack, it raises two main problems. First, data is stolen on a large scale, which can affect the website's business, and some private data may be exposed, harming individuals or enterprises. Second, both normal crawling and malicious crawling directly or indirectly affect the performance of the web server and so reduce the stability of the website; malicious crawling in particular directly damages the interests of the website and the enterprise. For the crawled website, especially one built mainly on original content, these kinds of crawling on the one hand consume a large share of its network resources and reduce its running speed and efficiency, and on the other hand infringe its intellectual property, thereby harming its interests.
Summary of the invention
The object of the present invention is to provide a method for preventing crawling of website content that quickly, stably, and effectively prevents large-scale crawling of website data.
The technical solution adopted by the present invention is as follows.
A method for preventing crawling of website content comprises the following steps:
1. Establish rules for identifying crawling behavior.
2. The web server obtains client information and passes it to the anti-crawl system server.
3. The anti-crawl system server verifies the information passed by the web server and returns a verification result flag to the web server; the web server decides, according to that result, whether to execute the data query for the requested page or to output an access-denied prompt.
Specifically, the rule in step (1) consists of the number of requests a client IP makes to the server within a unit time and the path that each request accesses.
Preferably, the client information in step (2) includes the IP address, the requested URL, and the time of the current request.
Specifically, in step (3) the anti-crawl system server sets a unit time and a request upper limit; a client that exceeds the request upper limit within the unit time is judged to be crawling.
Preferably, step (3) sets a limit duration for blacklisting; throughout that period, all requests from the client are rejected.
Preferably, step (3) further includes a blacklist table and a client status table, both stored in server memory.
The blacklist table stores the client IP address, the time at which the client was blacklisted, and the limit duration.
The client status table stores information on every client that has made requests within the unit time, including the client IP address, the time of its first request, and its total number of requests.
Further, step (3) also provides a timed automatic update mechanism that updates the data in the blacklist table and the client status table at a predetermined interval.
Specifically, during a timed update, the records of all clients in the blacklist table are first retrieved and iterated over. For each record, the interval between the time the client was blacklisted and the current time is compared with that record's limit duration. If the interval is greater than or equal to the limit duration, the client's record is removed from the blacklist table; otherwise nothing is done.
Specifically, during a timed update, the client status table data is likewise retrieved and iterated over. For each record, the interval between the time of the first request and the current time is compared with the unit time. If the interval is greater than or equal to the unit time, the client is removed from the client status table; otherwise nothing is done.
Preferably, the limit duration of every client in the blacklist table defaults to the globally set limit duration; the limit duration of an individual client can be changed by modifying the limit duration value in its blacklist table record.
Beneficial effects of the present invention: by strictly defining the verification flow and guarding at the point where client requests are verified, the proposed method effectively prevents website data from being crawled. In addition to the verification flow, a timed automatic update mechanism keeps the data in the blacklist table and the client status table up to date, so that the whole process runs more effectively and stably.
Brief description of the drawings
Fig. 1 is a schematic diagram of the network structure of the invention.
Fig. 2 is a flow chart of request verification in the invention.
Fig. 3 is a flow chart of the timed automatic update mechanism of the invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and embodiments.
Fig. 1 shows the network structure of the present invention, which comprises the web server, the anti-crawl system server, and the client. A method for preventing crawling of website content comprises the following steps:
1. Establish rules for identifying crawling behavior.
2. The web server obtains client information and passes it to the anti-crawl system server.
3. The anti-crawl system server verifies the information passed by the web server and returns a verification result flag to the web server; the web server decides, according to that result, whether to execute the data query for the requested page or to output an access-denied prompt.
The rule in step (1) consists of the number of requests a client IP makes to the server within a unit time and the path that each request accesses.
The client information in step (2) includes the IP address, the requested URL, and the time of the current request.
In step (3) the anti-crawl system server sets a unit time and a request upper limit; a client that exceeds the request upper limit within the unit time is judged to be crawling.
Step (3) sets a limit duration for blacklisting; throughout that period, all requests from the client are rejected.
Step (3) further includes a blacklist table and a client status table, both stored in server memory.
The blacklist table stores the client IP address, the time at which the client was blacklisted, and the limit duration.
The client status table stores information on every client that has made requests within the unit time, including the client IP address, the time of its first request, and its total number of requests.
Step (3) also provides a timed automatic update mechanism that updates the data in the blacklist table and the client status table at a predetermined interval.
During a timed update, the records of all clients in the blacklist table are first retrieved and iterated over. For each record, the interval between the time the client was blacklisted and the current time is compared with that record's limit duration. If the interval is greater than or equal to the limit duration, the client's record is removed from the blacklist table; otherwise nothing is done.
During a timed update, the client status table data is likewise retrieved and iterated over. For each record, the interval between the time of the first request and the current time is compared with the unit time. If the interval is greater than or equal to the unit time, the client is removed from the client status table; otherwise nothing is done.
The limit duration of every client in the blacklist table defaults to the globally set limit duration; the limit duration of an individual client can be changed by modifying the limit duration value in its blacklist table record.
Fig. 2 shows the verification flow of the present invention. The web program on the web server is written in whatever programming language the website's own platform uses, such as ASP.NET (C#), PHP, or ASP. The program implements two functions. The first is to obtain the client's information, including the IP address, the requested URL, and the time of the current request, and to pass it to the anti-crawl system server. The second comes after the anti-crawl system server has verified the information passed by the web server and returned the verification result flag: the web server decides, according to that result, whether to execute the data query for the requested page or to output an access-denied prompt.
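The two web-server functions can be sketched as follows. This is a minimal illustration in Python rather than one of the platform languages the text names; `handle_request`, `verify`, `run_query`, and `DENIAL_PROMPT` are hypothetical names, and the call to the anti-crawl system server is modeled as a plain function argument rather than a network call.

```python
import time
from typing import Callable

# Hypothetical prompt text, modeled on the example prompt in the description.
DENIAL_PROMPT = "You have been blacklisted; please try again later."


def handle_request(
    client_ip: str,
    url: str,
    verify: Callable[[str, str, float], bool],
    run_query: Callable[[str], str],
) -> str:
    """Web-server side of the flow.

    Function 1: gather the client information (IP, URL, request time)
    and pass it to the anti-crawl system server (here: `verify`).
    Function 2: act on the returned flag, either running the page's
    data query or returning the access-denied prompt.
    """
    request_time = time.time()  # time of the current request
    passed = verify(client_ip, url, request_time)
    if passed:
        return run_query(url)
    # Verification failed: the business data query is never executed.
    return DENIAL_PROMPT
```

Passing `verify` in as a parameter keeps the sketch independent of how the two servers actually communicate, which the description does not specify.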
The anti-crawl system server has two main global settings. The first sets the "unit time" and the permitted "request upper limit", that is, how many requests are allowed within how long a period; a client that exceeds that number is judged to be crawling. For example, only 1000 requests might be allowed within 5 minutes; the specific values should be estimated from each website's actual data. The second sets the "limit duration" for blacklisting; throughout that period, all requests from the client are rejected, and the length of the limit must be set according to actual conditions. Besides these two settings, the anti-crawl system server also creates two data tables, the blacklist table and the client status table. The two tables could be stored either in database tables or in server memory; for performance reasons, the present invention stores both in server memory. The blacklist table mainly stores the client IP address, the time at which the client was blacklisted, and the "limit duration". The client status table stores information on every client that has made requests within the most recent period (the unit time), including the client IP address, the time of its first request, and its total number of requests.
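One possible in-memory layout for the global settings and the two tables is sketched below. The class and field names are hypothetical; the 5-minute and 1000-request values come from the example in the text, while the one-hour limit duration is an assumed placeholder because the text gives no value.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AntiCrawlSettings:
    unit_time: float = 5 * 60       # "unit time" in seconds (5 minutes, per the text's example)
    request_limit: int = 1000       # "request upper limit" within one unit time
    limit_duration: float = 3600.0  # "limit duration" in seconds (assumed value)


@dataclass
class BlacklistEntry:
    ip: str                # client IP address
    listed_at: float       # time at which the client was blacklisted
    limit_duration: float  # how long this client stays blacklisted


@dataclass
class ClientStatus:
    ip: str                  # client IP address
    first_request_at: float  # time of the first request in the current unit time
    total_requests: int = 1  # total number of requests so far


@dataclass
class AntiCrawlTables:
    # Both tables live in server memory, keyed by client IP for O(1) lookup.
    blacklist: Dict[str, BlacklistEntry] = field(default_factory=dict)
    client_status: Dict[str, ClientStatus] = field(default_factory=dict)
```

Keying both dictionaries by IP address is a design choice of this sketch; it matches the flow's lookups, which always start from the client IP.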
The anti-crawl system server receives the information passed by the web server and first matches the client's IP address against the records in the blacklist table. If a match exists, the client is currently refused: the anti-crawl system server returns a refusal flag to the web server, and the web server, on receiving it, does not execute the business data query but directly returns a prompt to the client, such as "You have been blacklisted; please try again later." The client receives only this prompt and obtains no useful data.
If the client is not found in the blacklist, its IP address is then matched against the client status table. If no record exists, the client's IP and the current time are added to the client status table with the request count set to 1, and a verification-passed flag is returned to the web server; on receiving it, the web server goes on to execute the business data query for the requested page and returns the result to the client.
If the client is found in the client status table, it has already made requests within the unit time, and its request count in the client status table is incremented by 1. The incremented count is then compared with the "request upper limit" for the "unit time" described above. If the count is greater, the client is blacklisted and removed from the client status table; blacklisting stores the client's IP, the current system time, and the "limit duration" value set as described above. A verification-failed flag is then returned to the web server, which outputs the prompt to the client. If the incremented count does not exceed the "request upper limit" for the "unit time", a verification-passed flag is returned directly to the web server, which executes the data query for the requested page and returns the result to the client.
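The verification flow just described can be sketched as a single function. All names here are hypothetical, the tables are plain dictionaries, and the default `LIMIT_DURATION` is an assumed value. As in the text, this function does not expire old records itself; that is left to the timed update mechanism.

```python
from typing import Dict, List, Tuple

UNIT_TIME = 300.0        # seconds; from the text's 5-minute example
REQUEST_LIMIT = 1000     # from the text's example
LIMIT_DURATION = 3600.0  # seconds; assumed, the text gives no value

# Blacklist table: ip -> (time blacklisted, limit duration)
blacklist: Dict[str, Tuple[float, float]] = {}
# Client status table: ip -> [time of first request, total requests]
client_status: Dict[str, List[float]] = {}


def verify(ip: str, now: float,
           request_limit: int = REQUEST_LIMIT,
           limit_duration: float = LIMIT_DURATION) -> bool:
    """Return True for "verification passed", False for "refused"."""
    # 1. A blacklisted client is refused outright.
    if ip in blacklist:
        return False
    # 2. First request in this unit time: create a record with count 1.
    record = client_status.get(ip)
    if record is None:
        client_status[ip] = [now, 1]
        return True
    # 3. Known client: increment, then compare with the upper limit.
    record[1] += 1
    if record[1] > request_limit:
        # Blacklist the client and drop it from the client status table.
        blacklist[ip] = (now, limit_duration)
        del client_status[ip]
        return False
    return True
```

With `request_limit=3`, for instance, the first three calls for one IP pass and the fourth blacklists it, since the comparison is "greater than" rather than "greater than or equal to".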
Fig. 3 shows the flow of the timed automatic update mechanism of the present invention. Apart from the verification flow, the data in both the blacklist table and the client status table is time-sensitive, so a timed automatic update mechanism must be established. Preferably, the present invention sets the automatic update interval to one minute, so both tables are updated every minute. The records of all clients in the blacklist table are first retrieved and iterated over. For each record, the interval between the time the client was blacklisted and the current time is compared with that record's "limit duration"; if the interval is greater than or equal to the "limit duration", the client's record is removed from the blacklist table, and otherwise nothing is done. Likewise, every minute the client status table data is retrieved and iterated over. For each record, the interval between the time of the first request and the current time is compared with the "unit time"; if the interval is greater than or equal to the "unit time", the client is removed from the client status table, and otherwise nothing is done.
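A single pass of the timed update might look like the sketch below. The function name and the dictionary layout are hypothetical; the two removal conditions follow the text, and the scheduling itself (once per minute in the preferred embodiment) is left to whatever timer the host platform provides.

```python
from typing import Dict, Tuple


def purge_expired(
    blacklist: Dict[str, Tuple[float, float]],    # ip -> (time blacklisted, limit duration)
    client_status: Dict[str, Tuple[float, int]],  # ip -> (time of first request, total requests)
    now: float,
    unit_time: float,
) -> None:
    """One pass of the timed automatic update over both tables."""
    # Blacklist table: release clients whose limit duration has elapsed.
    # list(...) takes a snapshot so entries can be deleted while iterating.
    for ip, (listed_at, limit_duration) in list(blacklist.items()):
        if now - listed_at >= limit_duration:
            del blacklist[ip]
    # Client status table: drop records whose unit time has elapsed, so the
    # client's next request starts a fresh count.
    for ip, (first_request_at, _total) in list(client_status.items()):
        if now - first_request_at >= unit_time:
            del client_status[ip]
```

Because each blacklist record carries its own limit duration, the same pass handles clients with the default duration and clients whose duration was changed individually.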
In the blacklist table, the limit duration of every client defaults to the globally set "limit duration" value. If some client must be limited for longer, only the "limit duration" value in that client's blacklist table record needs to be changed.
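Changing one client's limit duration can be illustrated as follows. The names are hypothetical and the one-hour default is an assumed value; the point of the sketch is only that the duration is stored per record, so editing one record leaves every other client on the global default.

```python
from dataclasses import dataclass
from typing import Dict

DEFAULT_LIMIT_DURATION = 3600.0  # seconds; assumed global default


@dataclass
class BlacklistEntry:
    ip: str
    listed_at: float
    # Each record starts from the global default but holds its own copy.
    limit_duration: float = DEFAULT_LIMIT_DURATION


def set_limit_duration(blacklist: Dict[str, BlacklistEntry],
                       ip: str, new_duration: float) -> bool:
    """Change the limit duration of one blacklisted client.

    Returns False if the client is not in the blacklist table.
    """
    entry = blacklist.get(ip)
    if entry is None:
        return False
    entry.limit_duration = new_duration
    return True


def is_still_limited(entry: BlacklistEntry, now: float) -> bool:
    """True while the interval since blacklisting is shorter than the limit duration."""
    return now - entry.listed_at < entry.limit_duration
```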
The blacklist referred to herein means the blacklist table. Technically, blacklisting a client means storing the client's data in the blacklist table: the blacklist table is an object in the program, while blacklisting is a business action.
The embodiments above only describe preferred embodiments of the present invention and do not limit its scope. Without departing from the design spirit of the present invention, all modifications and improvements made to its technical solution by those of ordinary skill in the art shall fall within the scope of protection determined by the claims of the present invention.
Parts not covered by the present invention are the same as the prior art or can be implemented using the prior art.
Claims (7)
1. A method for preventing crawling of website content, comprising the following steps:
(1) establishing rules for identifying crawling behavior;
(2) the web server obtaining client information and passing it to the anti-crawl system server;
(3) the anti-crawl system server verifying the information passed by the web server and returning a verification result flag to the web server, and the web server deciding, according to that result, whether to execute the data query for the requested page or to output an access-denied prompt;
wherein step (3) includes a blacklist table and a client status table, both stored in server memory;
the blacklist table stores the client IP address, the time at which the client was blacklisted, and the limit duration;
the client status table stores information on every client that has made requests within the unit time, including the client IP address, the time of its first request, and its total number of requests;
step (3) also provides a timed automatic update mechanism that updates the data in the blacklist table and the client status table at a predetermined interval; during a timed update, the records of all clients in the blacklist table are first retrieved and iterated over, and for each record the interval between the time the client was blacklisted and the current time is compared with that record's limit duration; if the interval is greater than or equal to the limit duration, the client's record is removed from the blacklist table; otherwise nothing is done.
2. The method for preventing crawling of website content according to claim 1, characterized in that the rule in step (1) consists of the number of requests a client IP makes to the server within a unit time and the path that each request accesses.
3. The method for preventing crawling of website content according to claim 1, characterized in that the client information in step (2) includes the IP address, the requested URL, and the time of the current request.
4. The method for preventing crawling of website content according to claim 1, characterized in that in step (3) the anti-crawl system server sets a unit time and a request upper limit, and a client that exceeds the request upper limit within the unit time is judged to be crawling.
5. The method for preventing crawling of website content according to claim 1, characterized in that step (3) sets a limit duration for blacklisting, and throughout that period all requests from the client are rejected.
6. The method for preventing crawling of website content according to claim 1, characterized in that, during a timed update, the client status table data is first retrieved and iterated over, and for each record the interval between the time of the first request and the current time is compared with the unit time; if the interval is greater than or equal to the unit time, the client is removed from the client status table; otherwise nothing is done.
7. The method for preventing crawling of website content according to claim 1, characterized in that the limit duration of every client in the blacklist table defaults to the globally set limit duration, and the limit duration of a client can be changed by modifying the limit duration value in the blacklist table.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110222891.1A CN102916935B (en) | 2011-08-04 | 2011-08-04 | A kind of method of the anti-crawl of web site contents |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102916935A CN102916935A (en) | 2013-02-06 |
CN102916935B true CN102916935B (en) | 2017-08-25 |
Family
ID=47615169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110222891.1A Active CN102916935B (en) | 2011-08-04 | 2011-08-04 | A kind of method of the anti-crawl of web site contents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102916935B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104217136B (en) * | 2013-06-05 | 2017-05-03 | 北京齐尔布莱特科技有限公司 | Method and system for preventing web page text message from being captured automatically |
CN104506525B (en) * | 2014-12-22 | 2018-04-20 | 北京奇安信科技有限公司 | Prevent the method and protective device that malice captures |
CN107291742A (en) * | 2016-03-31 | 2017-10-24 | 北京小度信息科技有限公司 | The anti-grasping means of data and device |
CN105827619B (en) * | 2016-04-25 | 2019-02-15 | 无锡中科富农物联科技有限公司 | Crawler in the case of height access closes method |
CN109150928A (en) * | 2017-06-15 | 2019-01-04 | 北京京东尚科信息技术有限公司 | Method and apparatus for handling request |
CN107196811A (en) * | 2017-07-13 | 2017-09-22 | 上海幻电信息科技有限公司 | Video website door chain control system and method |
CN107483563A (en) * | 2017-07-31 | 2017-12-15 | 九次方大数据信息集团有限公司 | The data query method and apparatus and client and server of anti-reptile |
CN109150819B (en) * | 2018-01-15 | 2019-06-11 | 北京数安鑫云信息技术有限公司 | A kind of attack recognition method and its identifying system |
CN110365628B (en) * | 2018-04-11 | 2020-12-04 | 滴图(北京)科技有限公司 | Data request processing method and device |
CN110020940A (en) * | 2019-04-02 | 2019-07-16 | 中电科大数据研究院有限公司 | Processing method, device, equipment and the storage medium of credit list |
CN110442770B (en) * | 2019-08-08 | 2023-06-20 | 深圳市今天国际物流技术股份有限公司 | Data grabbing and storing method and device, computer equipment and storage medium |
CN111553776B (en) * | 2020-04-26 | 2023-08-08 | 成都新致云服信息技术有限公司 | Data processing method and device and electronic equipment |
CN111917787B (en) * | 2020-08-06 | 2023-07-21 | 北京奇艺世纪科技有限公司 | Request detection method, request detection device, electronic equipment and computer readable storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102045319A (en) * | 2009-10-21 | 2011-05-04 | 中国移动通信集团山东有限公司 | Method and device for detecting SQL (Structured Query Language) injection attack |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101902438B (en) * | 2009-05-25 | 2013-05-15 | 北京启明星辰信息技术股份有限公司 | Method and device for automatically identifying web crawlers |
CN102088477A (en) * | 2010-11-25 | 2011-06-08 | 互动在线(北京)科技有限公司 | Website content anti-acquisition system and method thereof |
Also Published As
Publication number | Publication date |
---|---|
CN102916935A (en) | 2013-02-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |