CN104902008A - Crawler data processing method - Google Patents
- Publication number
- CN104902008A (application number CN201510200123.4A)
- Authority
- CN
- China
- Prior art keywords
- page
- blacklist
- crawler
- verification code
- whitelist
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/08—Network architectures or network communication protocols for network security for authentication of entities
- H04L63/0815—Network architectures or network communication protocols for network security for authentication of entities providing single-sign-on or federations
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/145—Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1458—Denial of Service
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1095—Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
Abstract
The invention discloses a crawler data processing method. The method comprises the following steps: 1) an administrator logs in and enters a whitelist in the back end using a blacklist/whitelist pool, and the validity period of the blacklist is set to X minutes; 2) it is determined whether the IP currently accessing the page is on the blacklist or the whitelist; 3) when the IP is on the whitelist, the IP is allowed to carry out normal page logic; 4) when the IP is on the blacklist, the IP is sent to the verification-code (CAPTCHA) page, and after the verification code has been entered from that IP, the IP is released from the blacklist and allowed to carry out normal page logic; and 5) when the IP is on neither the blacklist nor the whitelist, it enters a counter pool, where a cache counter is incremented by 1 as long as the counter has not expired, and the accumulated count is compared with a threshold: when the count is lower than the threshold, the IP is determined not to be a crawler, and when the count is greater than the threshold, the IP is determined to be a crawler.
Description
Technical field
The present invention relates to the technical field of big data processing, and specifically to a data processing method for crawlers.
Background art
Web crawlers are an important component of a search engine's crawling system. The main purpose of a crawler is to download pages from the Internet in order to form a local mirror or backup of the networked content. Analysis of Apache logs showed that 40% of the bandwidth and server resources of a certain system were consumed by crawlers. Leaving aside the 10%-15% attributable to search-engine crawlers, an anti-crawler strategy can save 20%-25% of the resources, which is in effect an indirect way of optimizing the web system.
Crawler requests come from mechanisms such as httpClient or from commands such as curl and wget, whereas normal user requests generally come through a browser. Crawler requests generally do not execute the asynchronous JavaScript in the page, whereas user requests do execute the asynchronous JavaScript provided through jQuery.
Summary of the invention
The object of the present invention is to provide a data processing method for crawlers that effectively distinguishes crawler requests from normal user requests, so that crawler requests can be blocked and system resources saved.
This object is achieved through the following technical solution: a data processing method for crawlers, comprising the following steps:
Step 1: an administrator logs in and enters the whitelist in the back end using the blacklist/whitelist pool, and the validity period of the blacklist is set to X minutes;
Step 2: determine whether the IP currently accessing the page is on the blacklist or the whitelist;
Step 3: when the IP is on the whitelist, allow this IP to carry out normal page logic;
Step 4: when the IP is on the blacklist, send this IP to the verification-code (CAPTCHA) page; once the verification code has been entered from this IP, release the IP so that it is no longer on the blacklist, and allow it to carry out normal page logic;
Step 5: when the IP is on neither the blacklist nor the whitelist, screen it with a traffic statistics tool: the IP enters the counter pool, where the cache counter is incremented by 1 as long as the cache counter has not expired, and the accumulated count is then compared with the threshold. When the count is less than the threshold, the IP is judged not to be a crawler and is allowed to carry out normal page logic. When the count is greater than the threshold, the IP is placed on the blacklist and sent to the verification-code page; once the verification code has been entered from this IP, the IP is released from the blacklist, and if the verification code is not filled in, the IP is judged to be a crawler;
Step 6: when this IP leaves the page, an asynchronous JS request decrements the cached count by 1.
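The decision flow of steps 2 to 6 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `Action`, `handle_request`, `verification_passed` and `page_left` are hypothetical, plain sets and a dict stand in for the blacklist/whitelist pool and the counter pool, and expiry is left out. The text does not say what happens when the count equals the threshold, so the sketch assumes that case is still allowed.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"          # proceed with normal page logic
    CHALLENGE = "challenge"  # serve the verification-code page


def handle_request(ip, whitelist, blacklist, counters, threshold):
    """Decide how to treat one page request from `ip` (steps 2 to 5)."""
    if ip in whitelist:                      # step 3
        return Action.ALLOW
    if ip in blacklist:                      # step 4
        return Action.CHALLENGE
    # Step 5: on neither list, so count the visit and compare with the threshold.
    counters[ip] = counters.get(ip, 0) + 1
    if counters[ip] > threshold:             # assumption: equality is still allowed
        blacklist.add(ip)
        return Action.CHALLENGE
    return Action.ALLOW


def verification_passed(ip, blacklist):
    """Steps 4 and 5: a filled-in verification code releases the IP."""
    blacklist.discard(ip)


def page_left(ip, counters):
    """Step 6: the asynchronous JS request decrements the cached count."""
    if counters.get(ip, 0) > 0:
        counters[ip] -= 1
```

With a threshold of 2, the third uncompensated request from the same IP is the first one that is sent to the verification-code page.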
The principle behind the above method is as follows. Crawler requests come from mechanisms such as httpClient or from commands such as curl and wget, whereas normal user requests generally come through a browser. Crawler requests generally do not execute the asynchronous JavaScript in the page, whereas user requests do execute the asynchronous JavaScript provided through jQuery.
The present invention therefore uses asynchronous JavaScript operations to filter out crawler requests and places the requesting IPs on the blacklist. At the same time, because network IPs change, an IP that was blacklisted one day may carry a normal user's login request the next day, so no IP can be fixed permanently as an abnormal-request IP. The current blacklist therefore has to be maintained in real time, while asynchronous JavaScript operations are used to determine whether a currently blacklisted IP is still issuing crawler requests.
Both the web system and the web container communicate over HTTP, and each request produces at least one TCP connection between client and server. The netstat command shows the IPs currently connected to the server and the number of connections for each. The command /bin/netstat -nat -n | grep 80 typically returns several hundred or several thousand entries. When the number of connections for a single IP exceeds the threshold we have observed, the requests can be judged to come from an abnormal user. Setting the threshold is critical: IPs shared by a large Internet cafe, a school, or a company may also be misjudged as illegitimate. This strategy uses two scheduled scripts, one that bans IPs (tcpForbidCmd.sh) and one that releases them (tcpReleaseCmd.sh), run every 5 minutes and every 40 minutes respectively. The strategy amounts to setting a threshold for our system, much as a road in a highway traffic system may have a 4-meter height barrier that vehicles taller than 4 meters cannot pass. It guards against malicious crawlers and against crawlers written by novices with irregular request frequencies.
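As a rough illustration of the connection-count check, the sketch below tallies connections per remote IP from netstat-style text and lists the IPs above a threshold. It assumes the six-column layout printed by Linux net-tools (`Proto Recv-Q Send-Q Local Foreign State`); the function names, the sample addresses and the threshold are hypothetical, and the contents of tcpForbidCmd.sh and tcpReleaseCmd.sh are not given in the text, so they are not reproduced here.

```python
from collections import Counter

# Hypothetical sample in the assumed net-tools column layout.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address      Foreign Address      State
tcp        0      0 0.0.0.0:80         0.0.0.0:*            LISTEN
tcp        0      0 192.0.2.10:80      198.51.100.7:51234   ESTABLISHED
tcp        0      0 192.0.2.10:80      198.51.100.7:51235   ESTABLISHED
tcp        0      0 192.0.2.10:80      198.51.100.7:51236   TIME_WAIT
tcp        0      0 192.0.2.10:80      203.0.113.9:40000    ESTABLISHED
tcp        0      0 192.0.2.10:22      203.0.113.9:40001    ESTABLISHED
"""


def connections_per_ip(netstat_text, port=80):
    """Count TCP connections to local `port`, grouped by remote IP."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a six-column tcp row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state == "LISTEN":
            continue
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        counts[remote.rsplit(":", 1)[0]] += 1
    return counts


def over_threshold(counts, threshold):
    """Return the remote IPs whose connection count exceeds `threshold`."""
    return sorted(ip for ip, n in counts.items() if n > threshold)
```

Matching on the local port column is stricter than `grep 80`, which would also match any line that merely contains the digits 80.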
The anti-crawler filtering mechanism of the present invention is also fairly simple to implement. An access counter can be kept in memcached or in local memory. During the period before the cache entry expires (for example 3 minutes), each visit from an IP increments the counter by 1; the cache key includes the IP. If the value obtained from the counter exceeds a threshold, the IP is probably problematic, so a verification-code page can be returned that requires the user to fill in a verification code. A crawler will certainly not fill in the verification code and is therefore refused, which protects the back-end resources. Setting the threshold is again very important, and it differs from system to system. We refine this filtering mechanism to make it more accurate: we add an asynchronous JS request at the bottom of the web page whose purpose is to decrement the counter. The count for an IP is incremented when a page is entered and decremented when the page is left, which yields a difference. According to the analysis above, a crawler does not execute the asynchronous JS decrement request, so the size of the resulting value indicates whether the IP is a crawler.
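A local-memory version of such a counter might look like the sketch below. It is an in-process stand-in, not a memcached client: the class name `ExpiringCounter`, the key prefix `visits:` and the injectable `clock` parameter are assumptions made for illustration. The expiry time is fixed when the key is first created, so later increments do not extend it.

```python
import time


class ExpiringCounter:
    """Per-IP visit counter whose entries expire `ttl_seconds` after creation."""

    def __init__(self, ttl_seconds=180, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._entries = {}          # key -> (count, expiry time)

    def _live(self, key):
        """Return the entry for `key`, dropping it first if it has expired."""
        entry = self._entries.get(key)
        if entry is None or entry[1] <= self.clock():
            self._entries.pop(key, None)
            return None
        return entry

    def incr(self, ip):
        """Called on every page request; the cache key includes the IP."""
        key = "visits:" + ip
        entry = self._live(key)
        if entry is None:
            entry = (0, self.clock() + self.ttl)
        self._entries[key] = (entry[0] + 1, entry[1])
        return entry[0] + 1

    def decr(self, ip):
        """Called by the asynchronous JS request; never goes below zero."""
        key = "visits:" + ip
        entry = self._live(key)
        if entry is None:
            return 0
        value = max(entry[0] - 1, 0)
        self._entries[key] = (value, entry[1])
        return value
```

The default of 180 seconds mirrors the 3-minute example in the text; a shared cache would be needed instead once more than one web server handles requests.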
X minutes is set to 30 minutes.
The counter pool is set to expire after Y minutes, with Y minutes set to 3 minutes.
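One way to read the X-minute setting is that a blacklist entry lapses on its own after X minutes unless it is released earlier. The sketch below follows that reading, which is an interpretation rather than something the text spells out; `BlacklistPool` and its methods are hypothetical names.

```python
import time


class BlacklistPool:
    """Blacklist whose entries lapse after `valid_minutes` (X in the text)."""

    def __init__(self, valid_minutes=30, clock=time.monotonic):
        self.valid_seconds = valid_minutes * 60
        self.clock = clock          # injectable so expiry can be tested
        self._expires = {}          # ip -> time at which the entry lapses

    def add(self, ip):
        self._expires[ip] = self.clock() + self.valid_seconds

    def release(self, ip):
        """Remove the IP early, for example after a verification code is entered."""
        self._expires.pop(ip, None)

    def __contains__(self, ip):
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if expiry <= self.clock():
            del self._expires[ip]   # lapsed entry is cleaned up lazily
            return False
        return True
```

Because lapsed entries are removed on lookup, an IP that was blacklisted one day is treated as unknown again the next day, which matches the concern about changing network IPs.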
The advantages of the invention are that it is low in cost and effectively distinguishes crawler requests from normal user requests, so that crawler requests can be blocked and system resources saved, while the misjudgment of normal user requests is avoided.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the present invention.
Detailed description
The present invention is described in further detail below with reference to an embodiment and the accompanying drawing, but embodiments of the present invention are not limited thereto.
Embodiment 1:
As shown in Fig. 1, a data processing method for crawlers comprises the following steps:
Step 1: an administrator logs in and enters the whitelist in the back end using the blacklist/whitelist pool, and the validity period of the blacklist is set to X minutes;
Step 2: determine whether the IP currently accessing the page is on the blacklist or the whitelist;
Step 3: when the IP is on the whitelist, allow this IP to carry out normal page logic;
Step 4: when the IP is on the blacklist, send this IP to the verification-code (CAPTCHA) page; once the verification code has been entered from this IP, release the IP so that it is no longer on the blacklist, and allow it to carry out normal page logic;
Step 5: when the IP is on neither the blacklist nor the whitelist, screen it with a traffic statistics tool: the IP enters the counter pool, where the cache counter is incremented by 1 as long as the cache counter has not expired, and the accumulated count is then compared with the threshold. When the count is less than the threshold, the IP is judged not to be a crawler and is allowed to carry out normal page logic. When the count is greater than the threshold, the IP is placed on the blacklist and sent to the verification-code page; once the verification code has been entered from this IP, the IP is released from the blacklist, and if the verification code is not filled in, the IP is judged to be a crawler;
Step 6: when this IP leaves the page, an asynchronous JS request decrements the cached count by 1.
The principle behind the above method is as follows. Crawler requests come from mechanisms such as httpClient or from commands such as curl and wget, whereas normal user requests generally come through a browser. Crawler requests generally do not execute the asynchronous JavaScript in the page, whereas user requests do execute the asynchronous JavaScript provided through jQuery.
The present invention therefore uses asynchronous JavaScript operations to filter out crawler requests and places the requesting IPs on the blacklist. At the same time, because network IPs change, an IP that was blacklisted one day may carry a normal user's login request the next day, so no IP can be fixed permanently as an abnormal-request IP. The current blacklist therefore has to be maintained in real time, while asynchronous JavaScript operations are used to determine whether a currently blacklisted IP is still issuing crawler requests.
Both the web system and the web container communicate over HTTP, and each request produces at least one TCP connection between client and server. The netstat command shows the IPs currently connected to the server and the number of connections for each. The command /bin/netstat -nat -n | grep 80 typically returns several hundred or several thousand entries. When the number of connections for a single IP exceeds the threshold we have observed, the requests can be judged to come from an abnormal user. Setting the threshold is critical: IPs shared by a large Internet cafe, a school, or a company may also be misjudged as illegitimate. This strategy uses two scheduled scripts, one that bans IPs (tcpForbidCmd.sh) and one that releases them (tcpReleaseCmd.sh), run every 5 minutes and every 40 minutes respectively. The strategy amounts to setting a threshold for our system, much as a road in a highway traffic system may have a 4-meter height barrier that vehicles taller than 4 meters cannot pass. It guards against malicious crawlers and against crawlers written by novices with irregular request frequencies.
The anti-crawler filtering mechanism of the present invention is also fairly simple to implement. An access counter can be kept in memcached or in local memory. During the period before the cache entry expires (for example 3 minutes), each visit from an IP increments the counter by 1; the cache key includes the IP. If the value obtained from the counter exceeds a threshold, the IP is probably problematic, so a verification-code page can be returned that requires the user to fill in a verification code. A crawler will certainly not fill in the verification code and is therefore refused, which protects the back-end resources. Setting the threshold is again very important, and it differs from system to system. We refine this filtering mechanism to make it more accurate: we add an asynchronous JS request at the bottom of the web page whose purpose is to decrement the counter. The count for an IP is incremented when a page is entered and decremented when the page is left, which yields a difference. According to the analysis above, a crawler does not execute the asynchronous JS decrement request, so the size of the resulting value indicates whether the IP is a crawler.
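The effect of the decrement request can be seen in a small simulation. This is only a sketch under simplifying assumptions: every browser page view is assumed to fire the asynchronous request before the next page is loaded, no counter expires during the run, and the function names, sample addresses and the threshold of 10 are hypothetical.

```python
def visit_pages(counters, ip, pages, runs_js):
    """Simulate `pages` page views from `ip` and return the residual count."""
    for _ in range(pages):
        counters[ip] = counters.get(ip, 0) + 1   # server side: page entered
        if runs_js:
            counters[ip] -= 1                    # async JS request: page left
    return counters.get(ip, 0)


def looks_like_crawler(residual, threshold):
    """A large residual means the increments were never compensated."""
    return residual > threshold


counters = {}
browser_residual = visit_pages(counters, "198.51.100.7", 20, runs_js=True)
crawler_residual = visit_pages(counters, "203.0.113.9", 20, runs_js=False)
```

After 20 page views the browser's count is back at zero while the client that never runs the script has accumulated 20, so only the latter crosses a threshold of 10.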
X minutes is set to 30 minutes.
The counter pool is set to expire after Y minutes, with Y minutes set to 3 minutes.
The present invention can be implemented well as described above.
Claims (3)
1. A data processing method for crawlers, characterized by comprising the following steps:
Step 1: an administrator logs in and enters the whitelist in the back end using the blacklist/whitelist pool, and the validity period of the blacklist is set to X minutes;
Step 2: determine whether the IP currently accessing the page is on the blacklist or the whitelist;
Step 3: when the IP is on the whitelist, allow this IP to carry out normal page logic;
Step 4: when the IP is on the blacklist, send this IP to the verification-code (CAPTCHA) page; once the verification code has been entered from this IP, release the IP so that it is no longer on the blacklist, and allow it to carry out normal page logic;
Step 5: when the IP is on neither the blacklist nor the whitelist, screen it with a traffic statistics tool: the IP enters the counter pool, where the cache counter is incremented by 1 as long as the cache counter has not expired, and the accumulated count is then compared with the threshold. When the count is less than the threshold, the IP is judged not to be a crawler and is allowed to carry out normal page logic. When the count is greater than the threshold, the IP is placed on the blacklist and sent to the verification-code page; once the verification code has been entered from this IP, the IP is released from the blacklist, and if the verification code is not filled in, the IP is judged to be a crawler;
Step 6: when this IP leaves the page, an asynchronous JS request decrements the cached count by 1.
2. The data processing method for crawlers according to claim 1, characterized in that X minutes is set to 30 minutes.
3. The data processing method for crawlers according to claim 1, characterized in that the counter pool is set to expire after Y minutes, with Y minutes set to 3 minutes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510200123.4A CN104902008A (en) | 2015-04-26 | 2015-04-26 | Crawler data processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510200123.4A CN104902008A (en) | 2015-04-26 | 2015-04-26 | Crawler data processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104902008A (en) | 2015-09-09 |
Family
ID=54034404
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510200123.4A Pending CN104902008A (en) | 2015-04-26 | 2015-04-26 | Crawler data processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104902008A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105306465A (en) * | 2015-10-30 | 2016-02-03 | 新浪网技术(中国)有限公司 | Website secure access realization method and apparatus |
CN105827619A (en) * | 2016-04-25 | 2016-08-03 | 无锡中科富农物联科技有限公司 | Crawler blocking method under large visitor volume condition |
CN105930727A (en) * | 2016-04-25 | 2016-09-07 | 无锡中科富农物联科技有限公司 | Web-based crawler identification algorithm |
CN106657057A (en) * | 2016-12-20 | 2017-05-10 | 北京金堤科技有限公司 | Anti-crawler system and method |
CN106713241A (en) * | 2015-11-16 | 2017-05-24 | 腾讯科技(深圳)有限公司 | Identity verification method, device and system |
WO2017084337A1 (en) * | 2015-11-16 | 2017-05-26 | 腾讯科技(深圳)有限公司 | Identity verification method, apparatus and system |
CN107634947A (en) * | 2017-09-18 | 2018-01-26 | 北京京东尚科信息技术有限公司 | Limitation malice logs in or the method and apparatus of registration |
CN108777687A (en) * | 2018-06-05 | 2018-11-09 | 掌阅科技股份有限公司 | Reptile hold-up interception method, electronic equipment, storage medium based on user behavior portrait |
CN110401654A (en) * | 2019-07-23 | 2019-11-01 | 广州市百果园信息技术有限公司 | A kind of method, apparatus of business access, system, equipment and storage medium |
CN110581859A (en) * | 2019-09-18 | 2019-12-17 | 成都安恒信息技术有限公司 | Anti-crawling insect method based on page embedded points |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103428196A (en) * | 2012-12-27 | 2013-12-04 | 北京安天电子设备有限公司 | URL white list-based WEB application intrusion detecting method and apparatus |
CN103475637A (en) * | 2013-04-24 | 2013-12-25 | 携程计算机技术(上海)有限公司 | Network access control method and system based on IP access behaviors |
CN104202291A (en) * | 2014-07-11 | 2014-12-10 | 西安电子科技大学 | Anti-phishing method based on multi-factor comprehensive assessment method |
Non-Patent Citations (1)
Title |
---|
码迷: "给网站加入优雅的实时反爬虫策略" (Adding an elegant real-time anti-crawler strategy to a website), 《码迷》 * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105306465A (en) * | 2015-10-30 | 2016-02-03 | 新浪网技术(中国)有限公司 | Website secure access realization method and apparatus |
CN105306465B (en) * | 2015-10-30 | 2019-01-18 | 新浪网技术(中国)有限公司 | Web portal security accesses implementation method and device |
US10547624B2 (en) | 2015-11-16 | 2020-01-28 | Tencent Technology (Shenzhen) Company Limited | Identity authentication method, apparatus, and system |
CN106713241A (en) * | 2015-11-16 | 2017-05-24 | 腾讯科技(深圳)有限公司 | Identity verification method, device and system |
WO2017084337A1 (en) * | 2015-11-16 | 2017-05-26 | 腾讯科技(深圳)有限公司 | Identity verification method, apparatus and system |
US11258810B2 (en) | 2015-11-16 | 2022-02-22 | Tencent Technology (Shenzhen) Company Limited | Identity authentication method, apparatus, and system |
CN105827619B (en) * | 2016-04-25 | 2019-02-15 | 无锡中科富农物联科技有限公司 | Crawler in the case of height access closes method |
CN105827619A (en) * | 2016-04-25 | 2016-08-03 | 无锡中科富农物联科技有限公司 | Crawler blocking method under large visitor volume condition |
CN105930727A (en) * | 2016-04-25 | 2016-09-07 | 无锡中科富农物联科技有限公司 | Web-based crawler identification algorithm |
CN105930727B (en) * | 2016-04-25 | 2018-11-09 | 无锡中科富农物联科技有限公司 | Reptile recognition methods based on Web |
CN106657057A (en) * | 2016-12-20 | 2017-05-10 | 北京金堤科技有限公司 | Anti-crawler system and method |
CN106657057B (en) * | 2016-12-20 | 2020-09-29 | 北京金堤科技有限公司 | Anti-crawler system and method |
CN107634947A (en) * | 2017-09-18 | 2018-01-26 | 北京京东尚科信息技术有限公司 | Limitation malice logs in or the method and apparatus of registration |
CN108777687A (en) * | 2018-06-05 | 2018-11-09 | 掌阅科技股份有限公司 | Reptile hold-up interception method, electronic equipment, storage medium based on user behavior portrait |
CN110401654A (en) * | 2019-07-23 | 2019-11-01 | 广州市百果园信息技术有限公司 | A kind of method, apparatus of business access, system, equipment and storage medium |
CN110581859A (en) * | 2019-09-18 | 2019-12-17 | 成都安恒信息技术有限公司 | Anti-crawling insect method based on page embedded points |
CN110581859B (en) * | 2019-09-18 | 2021-11-26 | 成都安恒信息技术有限公司 | Anti-crawling insect method based on page embedded points |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104902008A (en) | Crawler data processing method | |
CN109831465B (en) | Website intrusion detection method based on big data log analysis | |
CN103297435B (en) | A kind of abnormal access behavioral value method and system based on WEB daily record | |
CN105117484A (en) | Internet public opinion monitoring method and system | |
CN106101104A (en) | A kind of malice domain name detection method based on domain name mapping and system | |
US10079770B2 (en) | Junk information filtering method and apparatus | |
CN104994117A (en) | Malicious domain name detection method and system based on DNS (Domain Name Server) resolution data | |
CN102077201A (en) | System and method for dynamic and real-time categorization of webpages | |
CN106484709A (en) | A kind of auditing method of daily record data and audit device | |
CN103888490A (en) | Automatic WEB client man-machine identification method | |
CN102541884B (en) | Method and device for database optimization | |
CN105224691B (en) | A kind of information processing method and device | |
CN104184601B (en) | The acquisition methods and device of user's online hours | |
CN106254137A (en) | The alarm root-cause analysis system and method for supervisory systems | |
CN110519263B (en) | Anti-swipe method, device, apparatus, and computer-readable storage medium | |
CN103118035A (en) | Website access request parameter legal range analysis method and device | |
CN106649031A (en) | Monitoring data obtaining method and device, and computer | |
CN103001972A (en) | Identification method and identification device and firewall for DDOS (distributed denial of service) attack | |
CN111428108A (en) | Anti-crawler method, device and medium based on deep learning | |
Shin et al. | A grand spread estimator using a graphics processing unit | |
CN111049837A (en) | Malicious website identification and interception technology based on communication operator network transport layer | |
CN106067879A (en) | The detection method of information and device | |
CN104811418B (en) | The method and device of viral diagnosis | |
CN102945254A (en) | Method for detecting abnormal data among TB-level mass audit data | |
CN108494635A (en) | A kind of network flow detection system based on cloud computing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20170621 Address after: 610000 Chengdu high tech Zone, Sichuan Tianyi street, No. 3, building 38 Applicant after: Chengdu Chuan Hang Information technology company limited Applicant after: Suzhou Chong Xing Mdt InfoTech Ltd Address before: 610000 Chengdu high tech Zone, Sichuan Tianyi street, No. 3, building 38 Applicant before: Chengdu Chuan Hang Information technology company limited |
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20150909 |