CN104902008A - Crawler data processing method - Google Patents

Publication number
CN104902008A
Authority
CN
China
Prior art keywords
page
blacklist
crawler
verification code
whitelist
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510200123.4A
Other languages
Chinese (zh)
Inventor
严澜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Chong Xing Mdt InfoTech Ltd
Original Assignee
Chengdu Chuan Hang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-04-26
Filing date
2015-04-26
Publication date
2015-09-09
Application filed by Chengdu Chuan Hang Information Technology Co Ltd
Priority to CN201510200123.4A
Publication of CN104902008A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L63/0815 Network architectures or network communication protocols for network security for authentication of entities providing single-sign-on or federations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1458 Denial of Service
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes

Abstract

The invention discloses a data processing method for handling crawlers. The method comprises the following steps: 1) an administrator logs in and uses a blacklist/whitelist pool to enter a whitelist in the back end, and the validity period of a blacklist entry is set to X minutes; 2) it is determined whether the IP currently logging in to the page is on the blacklist or the whitelist; 3) when the IP is on the whitelist, the IP is allowed to carry out normal page logic; 4) when the IP is on the blacklist, the IP is sent to the verification code page, and after the IP fills in the verification code it is released from the blacklist and allowed to carry out normal page logic; and 5) when the IP is on neither the blacklist nor the whitelist, it enters a counter pool, where a cache counter is incremented by 1 as long as the cache counter has not expired, and the accumulated count is compared with a threshold: when the count is lower than the threshold, the IP is determined not to be a crawler, and when the count is greater than the threshold, the IP is determined to be a crawler.

Description

A data processing method for crawlers
Technical field
The present invention relates to the technical field of big data processing, and specifically to a data processing method for crawlers.
Background
Web crawlers are an important component of a search engine's crawling system. The main purpose of a crawler is to download pages from the Internet to local storage, forming a mirror backup of online content. Analysis of Apache logs showed that 40% of one system's bandwidth and server resources were consumed by crawlers. Setting aside the 10%-15% attributable to search-engine crawlers, applying an anti-crawler strategy can save 20%-25% of resources, which in effect is an indirect optimization of the web system.
Crawler requests come from mechanisms like httpClient or from commands such as curl and wget, whereas normal user requests generally go through a browser. Crawler requests generally do not execute the asynchronous JavaScript operations in a page, whereas user requests do execute the asynchronous JavaScript operations provided by jQuery.
Summary of the invention
The object of the present invention is to provide a data processing method for crawlers that effectively distinguishes crawler requests from normal user requests, so that crawler requests can be blocked and system resources saved.
The object of the present invention is achieved through the following technical solution: a data processing method for crawlers, comprising the following steps:
Step 1: an administrator logs in and uses the blacklist/whitelist pool to enter the whitelist in the back end, and the validity period of a blacklist entry is set to X minutes;
Step 2: determine whether the IP currently logging in to the page is on the blacklist or the whitelist;
Step 3: when the IP is on the whitelist, allow the IP to carry out normal page logic;
Step 4: when the IP is on the blacklist, send the IP to the verification code page; after the IP fills in the verification code, release the IP from the blacklist and allow it to carry out normal page logic;
Step 5: when the IP is on neither the blacklist nor the whitelist, screen it with a traffic statistics tool and enter the counter pool. The cache counter is incremented by 1 as long as the cache counter has not expired, and the accumulated count is then compared with a threshold. When the count is less than the threshold, the IP is judged not to be a crawler and is allowed to carry out normal page logic. When the count is greater than the threshold, the IP is placed on the blacklist and sent to the verification code page; after the IP fills in the verification code, it is released from the blacklist; if the IP does not fill in the verification code, it is judged to be a crawler;
Step 6: when the IP leaves the page, an asynchronous JS request decrements the cached count by 1.
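The decision flow of steps 2 to 6 can be sketched roughly as follows. This is an illustrative sketch only, not the patent's implementation: the names `Action`, `decide`, `captcha_solved`, and `page_left` are hypothetical, expiry of the blacklist and of the counter is deliberately left out, and treating a count equal to the threshold as acceptable is an assumption, since the text only covers "less than" and "greater than".

```python
from enum import Enum


class Action(Enum):
    SERVE = "serve normal page"
    CAPTCHA = "show verification code page"


def decide(ip, whitelist, blacklist, counts, threshold):
    """Return the action for one page request and update the state in place.

    whitelist and blacklist are sets of IP strings; counts maps IP -> count.
    All names here are hypothetical and only illustrate steps 2 to 5.
    """
    if ip in whitelist:                  # step 3: whitelisted IPs pass
        return Action.SERVE
    if ip in blacklist:                  # step 4: blacklisted IPs get the code page
        return Action.CAPTCHA
    counts[ip] = counts.get(ip, 0) + 1   # step 5: add 1 to the counter
    if counts[ip] > threshold:           # assumption: equal to threshold still passes
        blacklist.add(ip)
        return Action.CAPTCHA
    return Action.SERVE


def captcha_solved(ip, blacklist):
    """Steps 4 and 5: a filled-in verification code releases the IP."""
    blacklist.discard(ip)


def page_left(ip, counts):
    """Step 6: the asynchronous JS request on leaving the page subtracts 1."""
    if counts.get(ip, 0) > 0:
        counts[ip] -= 1
```

In this sketch a browser that calls `page_left` after every page keeps its count near zero, while a client that never runs the page's JavaScript only ever increments and eventually crosses the threshold.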
The principle behind the above method is as follows. Crawler requests come from mechanisms like httpClient or from commands such as curl and wget, whereas normal user requests generally go through a browser. Crawler requests generally do not execute the asynchronous JavaScript operations in a page, whereas user requests do execute the asynchronous JavaScript operations provided by jQuery.
The present invention therefore uses asynchronous JavaScript operations to filter out crawler requests and places the IPs making them on the blacklist. Because network IPs change, the same IP may be on the blacklist one day and be the source of a normal user login request the next, so no IP can be treated as a permanently fixed abnormal-request IP. The current blacklist must therefore be maintained in real time, with asynchronous JavaScript operations used to determine whether a currently blacklisted IP is still issuing crawler requests.
A web system communicates with its web container over the HTTP protocol, and each request creates at least one TCP connection between the client and the server. With the netstat command, the IPs currently connected to the server and their connection counts can be viewed. The command /bin/netstat -nat -n | grep 80 generally returns several hundred or several thousand connections. When the number of connections for a single IP exceeds the threshold established by observation, the requests can be judged to come from an abnormal user. Setting the threshold is critical, because IPs shared by a large Internet cafe, a school, or a company may also be misjudged as illegitimate. This strategy comprises two scheduled scripts, one that periodically bans IPs (tcpForbidCmd.sh) and one that periodically releases IPs (tcpReleaseCmd.sh), executed every 5 minutes and every 40 minutes respectively. The strategy amounts to setting an entry barrier for the system, much like a 4-meter height-limit bar on a road in a highway traffic system: vehicles taller than 4 meters cannot pass. It can block malicious crawlers and newly hand-written crawlers with irregular request frequencies.
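The per-IP connection count described above can be illustrated with a small parser. This is a hedged sketch, not the content of tcpForbidCmd.sh (which the source does not show): the function names `connections_per_ip` and `suspicious_ips` are hypothetical, and the column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, with `address:port` notation) is an assumption based on the usual Linux `netstat -nat` output.

```python
from collections import Counter


def connections_per_ip(netstat_output, server_port=80):
    """Count TCP connections per remote IP from `netstat -nat`-style text.

    Assumes Linux-style columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state == "LISTEN":
            continue  # listening sockets have no remote peer
        if local.rsplit(":", 1)[-1] != str(server_port):
            continue  # only connections to the web server port
        remote_ip = foreign.rsplit(":", 1)[0]
        counts[remote_ip] += 1
    return counts


def suspicious_ips(counts, threshold):
    """Return the IPs whose connection count exceeds the observed threshold."""
    return sorted(ip for ip, n in counts.items() if n > threshold)
```

Unlike a plain `grep 80`, which also matches any line that merely contains the digits 80, the sketch compares the local port exactly. The threshold remains a site-specific value, for the shared-IP reasons given in the text.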
The anti-crawler filtering mechanism of the present invention is also fairly simple to implement. An access counter can be kept in memcached or in local memory. Within the period before the cache entry expires (for example 3 minutes), each access from an IP increments the counter by 1; the cache key includes the IP. If the value obtained from the counter exceeds a threshold, the IP is probably problematic, so a verification code page can be returned, requiring the user to fill in a verification code. A crawler will of course not fill in the verification code and is therefore rejected, which protects back-end resources. The choice of threshold is also very important and differs from system to system. Improving this filtering mechanism makes it more accurate: an asynchronous JS request is added at the bottom of the web page, and this asynchronous request decrements the counter. The value for an IP is incremented when a page is entered and decremented when the page is left, producing a difference. According to the preceding analysis, a crawler does not execute the asynchronous JS decrement request, so whether an IP is a crawler can be judged from the size of the resulting value.
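The increment-on-entry, decrement-on-exit counter can be sketched with a local in-memory store, one of the two options the text mentions. The class name `ExpiringCounter` and its methods are hypothetical, the injected `clock` parameter exists only to make the sketch testable, and starting the expiry window at an IP's first access is an assumption; a memcached-backed version would differ in detail.

```python
import time


class ExpiringCounter:
    """Per-IP counter whose entries expire ttl_seconds after they are created.

    Hypothetical in-memory stand-in for the cache counter described above.
    """

    def __init__(self, ttl_seconds=180, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # ip -> [count, expiry timestamp]

    def _live_entry(self, ip):
        """Return the entry for ip, dropping it first if it has expired."""
        entry = self._entries.get(ip)
        if entry is not None and self.clock() >= entry[1]:
            del self._entries[ip]
            entry = None
        return entry

    def page_in(self, ip):
        """Called on every page access: add 1 and return the new value."""
        entry = self._live_entry(ip)
        if entry is None:
            entry = [0, self.clock() + self.ttl]
            self._entries[ip] = entry
        entry[0] += 1
        return entry[0]

    def page_out(self, ip):
        """Called by the asynchronous JS request: subtract 1, never below 0."""
        entry = self._live_entry(ip)
        if entry is None:
            return 0
        entry[0] = max(entry[0] - 1, 0)
        return entry[0]

    def value(self, ip):
        entry = self._live_entry(ip)
        return entry[0] if entry else 0
```

A client that never sends the decrement request accumulates one count per page access, which is the difference the text relies on to separate crawlers from browsers.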
X is set to 30 minutes.
The counter pool is set to expire after Y minutes, and Y is set to 3 minutes.
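One way to read the X-minute setting is as a validity period after which a blacklist entry lapses on its own. The sketch below follows that reading, which is an interpretation rather than something the text spells out; `BlacklistPool`, `add`, and `release` are hypothetical names, and the injected `clock` is there only so the behavior can be checked without waiting.

```python
import time


class BlacklistPool:
    """Blacklist whose entries stay valid for valid_minutes (X in the text)."""

    def __init__(self, valid_minutes=30, clock=time.monotonic):
        self.valid_seconds = valid_minutes * 60
        self.clock = clock
        self._expiry = {}  # ip -> timestamp at which the entry lapses

    def add(self, ip):
        """Blacklist an IP for the configured validity period."""
        self._expiry[ip] = self.clock() + self.valid_seconds

    def release(self, ip):
        """Remove an IP early, e.g. after it fills in the verification code."""
        self._expiry.pop(ip, None)

    def __contains__(self, ip):
        expiry = self._expiry.get(ip)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._expiry[ip]  # entry has lapsed; forget it
            return False
        return True
```

With this design an IP that was blacklisted yesterday is not blocked today unless it misbehaves again, which matches the concern about changing network IPs.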
The advantages of the present invention are low cost and the ability to effectively distinguish crawler requests from normal user requests, so that crawler requests can be blocked and system resources saved, while misjudgment of normal user requests is avoided.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the present invention.
Detailed description
The present invention is described in further detail below with reference to an embodiment and the accompanying drawing, but embodiments of the present invention are not limited thereto.
Embodiment 1:
As shown in Fig. 1.
A data processing method for crawlers, comprising steps 1 to 6 exactly as set out under Summary of the invention above. The operating principle, the netstat-based connection-count strategy, and the counter-based anti-crawler filtering mechanism are likewise as described there. In this embodiment X is set to 30 minutes, and the counter pool is set to expire after Y minutes, with Y set to 3 minutes.
As described above, the present invention can be implemented well.

Claims (3)

1. A data processing method for crawlers, characterized in that it comprises the following steps:
Step 1: an administrator logs in and uses the blacklist/whitelist pool to enter the whitelist in the back end, and the validity period of a blacklist entry is set to X minutes;
Step 2: determine whether the IP currently logging in to the page is on the blacklist or the whitelist;
Step 3: when the IP is on the whitelist, allow the IP to carry out normal page logic;
Step 4: when the IP is on the blacklist, send the IP to the verification code page; after the IP fills in the verification code, release the IP from the blacklist and allow it to carry out normal page logic;
Step 5: when the IP is on neither the blacklist nor the whitelist, screen it with a traffic statistics tool and enter the counter pool; the cache counter is incremented by 1 as long as the cache counter has not expired, and the accumulated count is then compared with a threshold; when the count is less than the threshold, the IP is judged not to be a crawler and is allowed to carry out normal page logic; when the count is greater than the threshold, the IP is placed on the blacklist and sent to the verification code page; after the IP fills in the verification code, it is released from the blacklist; if the IP does not fill in the verification code, it is judged to be a crawler;
Step 6: when the IP leaves the page, an asynchronous JS request decrements the cached count by 1.
2. The data processing method for crawlers according to claim 1, characterized in that X is set to 30 minutes.
3. The data processing method for crawlers according to claim 1, characterized in that the counter pool is set to expire after Y minutes, and Y is set to 3 minutes.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510200123.4A CN104902008A (en) 2015-04-26 2015-04-26 Crawler data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510200123.4A CN104902008A (en) 2015-04-26 2015-04-26 Crawler data processing method

Publications (1)

Publication Number Publication Date
CN104902008A true CN104902008A (en) 2015-09-09

Family

ID=54034404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510200123.4A Pending CN104902008A (en) 2015-04-26 2015-04-26 Crawler data processing method

Country Status (1)

Country Link
CN (1) CN104902008A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103428196A (en) * 2012-12-27 2013-12-04 北京安天电子设备有限公司 URL white list-based WEB application intrusion detecting method and apparatus
CN103475637A (en) * 2013-04-24 2013-12-25 携程计算机技术(上海)有限公司 Network access control method and system based on IP access behaviors
CN104202291A (en) * 2014-07-11 2014-12-10 西安电子科技大学 Anti-phishing method based on multi-factor comprehensive assessment method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
码迷: "给网站加入优雅的实时反爬虫策略" ("Adding an elegant real-time anti-crawler strategy to a website"), 《码迷》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105306465A (en) * 2015-10-30 2016-02-03 新浪网技术(中国)有限公司 Website secure access realization method and apparatus
CN105306465B (en) * 2015-10-30 2019-01-18 新浪网技术(中国)有限公司 Website secure access realization method and apparatus
US10547624B2 (en) 2015-11-16 2020-01-28 Tencent Technology (Shenzhen) Company Limited Identity authentication method, apparatus, and system
CN106713241A (en) * 2015-11-16 2017-05-24 腾讯科技(深圳)有限公司 Identity verification method, device and system
WO2017084337A1 (en) * 2015-11-16 2017-05-26 腾讯科技(深圳)有限公司 Identity verification method, apparatus and system
US11258810B2 (en) 2015-11-16 2022-02-22 Tencent Technology (Shenzhen) Company Limited Identity authentication method, apparatus, and system
CN105827619B (en) * 2016-04-25 2019-02-15 无锡中科富农物联科技有限公司 Crawler blocking method under large visitor volume condition
CN105827619A (en) * 2016-04-25 2016-08-03 无锡中科富农物联科技有限公司 Crawler blocking method under large visitor volume condition
CN105930727A (en) * 2016-04-25 2016-09-07 无锡中科富农物联科技有限公司 Web-based crawler identification algorithm
CN105930727B (en) * 2016-04-25 2018-11-09 无锡中科富农物联科技有限公司 Web-based crawler identification method
CN106657057A (en) * 2016-12-20 2017-05-10 北京金堤科技有限公司 Anti-crawler system and method
CN106657057B (en) * 2016-12-20 2020-09-29 北京金堤科技有限公司 Anti-crawler system and method
CN107634947A (en) * 2017-09-18 2018-01-26 北京京东尚科信息技术有限公司 Method and apparatus for restricting malicious login or registration
CN108777687A (en) * 2018-06-05 2018-11-09 掌阅科技股份有限公司 Crawler interception method, electronic device, and storage medium based on user behavior profiles
CN110401654A (en) * 2019-07-23 2019-11-01 广州市百果园信息技术有限公司 Service access method, apparatus, system, device, and storage medium
CN110581859A (en) * 2019-09-18 2019-12-17 成都安恒信息技术有限公司 Anti-crawler method based on page embedded points
CN110581859B (en) * 2019-09-18 2021-11-26 成都安恒信息技术有限公司 Anti-crawler method based on page embedded points

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20170621

Address after: 610000 Chengdu high tech Zone, Sichuan Tianyi street, No. 3, building 38

Applicant after: Chengdu Chuan Hang Information technology company limited

Applicant after: Suzhou Chong Xing Mdt InfoTech Ltd

Address before: 610000 Chengdu high tech Zone, Sichuan Tianyi street, No. 3, building 38

Applicant before: Chengdu Chuan Hang Information technology company limited

RJ01 Rejection of invention patent application after publication

Application publication date: 20150909