CN106331108A - Crawler realization method and system capable of breaking through IP limit - Google Patents

Info

Publication number
CN106331108A
Authority
CN
China
Prior art keywords
task
crawler
capture
longest waiting time
scheduling server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610729927.8A
Other languages
Chinese (zh)
Inventor
周灏
董超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Silver respectful
Original Assignee
Beijing Liangkebang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Liangkebang Information Technology Co Ltd
Priority to CN201610729927.8A
Publication of CN106331108A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/02: Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/50: Network services
    • H04L67/60: Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Abstract

A crawler realization method capable of breaking through an IP limit comprises the following steps: 1) a crawler scheduling server issues a capture task, wherein the capture task comprises a task ID, the URL and all parameters of an HTTP request, and a longest waiting time; 2) after receiving the capture task, a client immediately initiates the HTTP request to capture the corresponding page; 3) after page capture is finished, a detection module checks whether the longest waiting time has been exceeded; if not, step 4) is carried out; otherwise, step 1) is carried out; and 4) a sending module sends the captured data to the crawler scheduling server and labels it with the task ID, wherein the captured data is the character string returned by the HTTP response. The invention also provides a crawler realization system capable of breaking through the IP limit.

Description

A crawler implementation method and system for breaking through IP limits
Technical field
The invention belongs to the technical field of web crawlers, and more particularly relates to a crawler implementation method and system for breaking through IP limits.
Background
A web crawler (also known as a web spider or web robot, and more often called a web chaser within the FOAF community) is a program or script that automatically captures information from the web according to certain rules.
Capturing users' credit data from the Internet is an important means of credit rating. For example, transaction records captured from the Alipay website can indirectly reflect a user's economic strength. Capturing this information, however, runs into deliberately imposed technical obstacles.
To prevent crawlers from capturing information, some websites impose IP limits. For example, if a single IP may access the site only 100 times per minute, a crawler server can initiate only 100 network requests per minute; the 101st request is refused by the target server.
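The kind of per-IP limit described above can be illustrated with a minimal sketch. The patent does not say how target websites implement their limits; the fixed-window counter below, the class name `FixedWindowLimiter`, and the example IP addresses are assumptions chosen only to reproduce the "100 accesses per minute, 101st refused" behaviour.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Hypothetical per-IP limit: at most `limit` requests per `window` seconds."""

    def __init__(self, limit=100, window=60):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (ip, window index) -> requests accepted

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` (seconds) is accepted."""
        key = (ip, int(now // self.window))
        if self.counts[key] >= self.limit:
            return False  # limit reached for this IP in the current window
        self.counts[key] += 1
        return True


# One crawler server (one IP) sends 101 requests within the same minute.
limiter = FixedWindowLimiter(limit=100, window=60)
results = [limiter.allow("203.0.113.7", now=0.5) for _ in range(101)]
accepted = sum(results)  # the first 100 are accepted, the last one is refused
```

Because the counter is keyed by IP address, a request from a different address in the same minute is still accepted, which is the property the invention relies on.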
The most common solution is to add servers and thereby increase the number of IP addresses. For example, if a single IP may access the site only 100 times per minute, increasing the number of IPs to 500 allows 50,000 requests per minute. This solves the problem, but the cost is huge and the approach is uneconomical.
Summary of the invention
The technical problem solved by the invention is to overcome the deficiencies of the prior art by providing a crawler implementation method for breaking through IP limits that costs little and still allows a crawler to capture information despite IP limits.
The technical solution of the invention is a crawler implementation method for breaking through IP limits, comprising the following steps:
(1) a crawler scheduling server issues a capture task, the capture task comprising a task ID, the URL and all parameters of an HTTP request, and a longest waiting time;
(2) after receiving the capture task, a client immediately initiates the HTTP request to capture the corresponding page;
(3) after the page has been captured, it is checked whether the longest waiting time has been exceeded; if it has not been exceeded, step (4) is performed; otherwise, step (1) is performed;
(4) the captured data is sent to the crawler scheduling server and labelled with the task ID, the captured data being the character string returned by the HTTP response.
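Steps (2) to (4) on the client side can be sketched as follows. This is a minimal illustration and not the patent's implementation: the names `CaptureTask` and `run_task`, the dictionary used for the submission, and the choice to measure the waiting time from the moment the client starts the task are all assumptions. The HTTP request and the upload to the scheduling server are passed in as plain functions so the sketch runs without a network.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CaptureTask:
    """Hypothetical task record: the fields named in step (1)."""
    task_id: str
    url: str
    max_wait: float                      # longest waiting time, in seconds
    params: dict = field(default_factory=dict)


def run_task(task, fetch, submit, clock=time.monotonic):
    """Run steps (2)-(4) for one task.

    Returns True if the result was submitted, or False if the longest
    waiting time was exceeded (the server is then expected to issue a
    task again, as in the "otherwise" branch of step (3)).
    """
    started = clock()
    body = fetch(task.url, task.params)            # step (2): HTTP request
    if clock() - started > task.max_wait:          # step (3): deadline check
        return False
    submit({"task_id": task.task_id, "data": body})  # step (4): labelled result
    return True


# Usage with stand-ins for the real HTTP call and the real upload.
sent = []
task = CaptureTask(task_id="t-1", url="https://example.com/page",
                   max_wait=30.0, params={"q": "1"})
ok = run_task(task,
              fetch=lambda url, params: "<html>ok</html>",
              submit=sent.append)
```

Injecting `clock` makes the timeout branch easy to exercise without actually waiting.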
The invention issues page-capture tasks to clients (for example, an app installed on a user's mobile phone) and breaks through the limit by means of the huge number of IPs that the clients provide. It therefore costs little and still allows a crawler to capture information despite IP limits.
A crawler implementation system for breaking through IP limits is also provided, the system comprising:
a crawler scheduling server, configured to issue a capture task, the capture task comprising a task ID, the URL and all parameters of an HTTP request, and a longest waiting time;
a client, configured to immediately initiate the HTTP request to capture the corresponding page after receiving the capture task;
a detection module, configured to check, after the page has been captured, whether the longest waiting time has been exceeded;
a sending module, configured to send the captured data to the crawler scheduling server and label it with the task ID, the captured data being the character string returned by the HTTP response.
Brief description of the drawings
Fig. 1 is a flow chart of the crawler implementation method for breaking through IP limits according to the invention.
Detailed description
As shown in Fig. 1, the crawler implementation method for breaking through IP limits comprises the following steps:
(1) a crawler scheduling server issues a capture task, the capture task comprising a task ID, the URL and all parameters of an HTTP request, and a longest waiting time;
(2) after receiving the capture task, a client immediately initiates the HTTP request to capture the corresponding page;
(3) after the page has been captured, it is checked whether the longest waiting time has been exceeded; if it has not been exceeded, step (4) is performed; otherwise, step (1) is performed;
(4) the captured data is sent to the crawler scheduling server and labelled with the task ID, the captured data being the character string returned by the HTTP response.
The invention issues page-capture tasks to clients (for example, an app installed on a user's mobile phone) and breaks through the limit by means of the huge number of IPs that the clients provide. It therefore costs little and still allows a crawler to capture information despite IP limits.
Further, in step (1), the longest waiting time is 30 seconds.
In addition, a verification step follows step (4): two clients send the same HTTP request within the longest waiting time and submit the returned results to the crawler scheduling server. If the submitted results are identical, the capture is judged authentic and valid; if they differ, the capture is judged invalid and the capture task is re-issued.
In addition, the following is performed after step (4): each time the crawler scheduling server receives a submission, it looks up the current task ID in the cache. If another client has already submitted a result for that task, the server compares the two submissions for consistency; otherwise, it records the current submission in the cache.
In addition, when the two submissions are compared: if they are consistent, the capture is trusted and the result is written to the database; if they are inconsistent, the capture task is voided, and a new capture task is generated and re-issued.
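The cache lookup and comparison described above can be sketched on the server side as follows. The class name `SubmissionChecker`, the in-memory dictionaries standing in for the cache and the database, and the use of exact string equality to decide whether two submissions are "consistent" are assumptions made for illustration; the patent does not specify how consistency is determined.

```python
class SubmissionChecker:
    """Hypothetical server-side check that pairs two submissions per task ID."""

    def __init__(self):
        self.cache = {}     # task_id -> first submitted result
        self.database = {}  # task_id -> verified result
        self.reissue = []   # task IDs whose capture task must be regenerated

    def on_submit(self, task_id, data):
        """Handle one client submission and report what happened to it."""
        if task_id not in self.cache:
            # No other client has submitted yet: record this submission.
            self.cache[task_id] = data
            return "cached"
        first = self.cache.pop(task_id)
        if first == data:
            # Two consistent submissions: the capture is trusted.
            self.database[task_id] = data
            return "verified"
        # Inconsistent submissions: void the task and mark it for re-issue.
        self.reissue.append(task_id)
        return "invalid"


checker = SubmissionChecker()
first_status = checker.on_submit("t-1", "<html>A</html>")   # from client 1
second_status = checker.on_submit("t-1", "<html>A</html>")  # from client 2
```

With exact equality, pages that embed per-request content such as timestamps would never match, so a real deployment would probably need a looser comparison.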
In addition, in step (1), the crawler scheduling server issues tasks evenly.
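The patent does not say how tasks are issued evenly. One simple possibility, shown here purely as an assumption, is round-robin assignment; the function name `assign_round_robin` and the task and client identifiers are hypothetical.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(task_ids, client_ids):
    """Pair each task with a client, cycling through the clients in order."""
    if not client_ids:
        raise ValueError("at least one client is required")
    clients = cycle(client_ids)
    return [(task_id, next(clients)) for task_id in task_ids]


assignments = assign_round_robin([f"t-{i}" for i in range(6)],
                                 ["c-1", "c-2", "c-3"])
load = Counter(client for _, client in assignments)  # tasks per client
```

Spreading tasks this way keeps the number of requests sent from any single client IP low, which fits the aim of staying under a per-IP limit.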
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiment method can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above embodiment method; the storage medium may be ROM/RAM, a magnetic disk, an optical disc, a memory card, and so on. Accordingly, corresponding to the method of the invention, the invention also includes a crawler implementation system for breaking through IP limits, generally expressed as functional modules corresponding to the steps of the method. The system using the method comprises:
a crawler scheduling server, configured to issue a capture task, the capture task comprising a task ID, the URL and all parameters of an HTTP request, and a longest waiting time;
a client, configured to immediately initiate the HTTP request to capture the corresponding page after receiving the capture task;
a detection module, configured to check, after the page has been captured, whether the longest waiting time has been exceeded;
a sending module, configured to send the captured data to the crawler scheduling server and label it with the task ID, the captured data being the character string returned by the HTTP response.
The above are only preferred embodiments of the invention and do not limit the invention in any form. Any simple modification, equivalent change, or variation made to the above embodiments according to the technical essence of the invention still falls within the protection scope of the technical solution of the invention.

Claims (7)

1. A crawler implementation method for breaking through IP limits, characterized in that the method comprises the following steps:
(1) a crawler scheduling server issues a capture task, the capture task comprising a task ID, the URL and all parameters of an HTTP request, and a longest waiting time;
(2) after receiving the capture task, a client immediately initiates the HTTP request to capture the corresponding page;
(3) after the page has been captured, it is checked whether the longest waiting time has been exceeded; if it has not been exceeded, step (4) is performed; otherwise, step (1) is performed;
(4) the captured data is sent to the crawler scheduling server and labelled with the task ID, the captured data being the character string returned by the HTTP response.
2. The crawler implementation method for breaking through IP limits according to claim 1, characterized in that the longest waiting time in step (1) is 30 seconds.
3. The crawler implementation method for breaking through IP limits according to claim 1 or 2, characterized in that a verification step follows step (4): two clients send the same HTTP request within the longest waiting time and submit the returned results to the crawler scheduling server; if the submitted results are identical, the capture is judged authentic and valid; if they differ, the capture is judged invalid and the capture task is re-issued.
4. The crawler implementation method for breaking through IP limits according to claim 1 or 2, characterized in that the following is performed after step (4): each time the crawler scheduling server receives a submission, it looks up the current task ID in the cache; if another client has already submitted a result, the server compares the two submissions for consistency; otherwise, it records the current submission in the cache.
5. The crawler implementation method for breaking through IP limits according to claim 4, characterized in that, when the two submissions are compared: if they are consistent, the capture is trusted and the result is written to the database; if they are inconsistent, the capture task is voided, and a new capture task is generated and re-issued.
6. The crawler implementation method for breaking through IP limits according to claim 1, characterized in that, in step (1), the crawler scheduling server issues tasks evenly.
7. A crawler implementation system for breaking through IP limits, characterized in that the system comprises:
a crawler scheduling server, configured to issue a capture task, the capture task comprising a task ID, the URL and all parameters of an HTTP request, and a longest waiting time;
a client, configured to immediately initiate the HTTP request to capture the corresponding page after receiving the capture task;
a detection module, configured to check, after the page has been captured, whether the longest waiting time has been exceeded;
a sending module, configured to send the captured data to the crawler scheduling server and label it with the task ID, the captured data being the character string returned by the HTTP response.
CN201610729927.8A 2016-08-25 2016-08-25 Crawler realization method and system capable of breaking through IP limit Pending CN106331108A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610729927.8A CN106331108A (en) 2016-08-25 2016-08-25 Crawler realization method and system capable of breaking through IP limit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610729927.8A CN106331108A (en) 2016-08-25 2016-08-25 Crawler realization method and system capable of breaking through IP limit

Publications (1)

Publication Number Publication Date
CN106331108A true CN106331108A (en) 2017-01-11

Family

ID=57791114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610729927.8A Pending CN106331108A (en) 2016-08-25 2016-08-25 Crawler realization method and system capable of breaking through IP limit

Country Status (1)

Country Link
CN (1) CN106331108A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109298987A (en) * 2017-07-25 2019-02-01 北京国双科技有限公司 Method and device for detecting the operating status of a web crawler
CN110912769A (en) * 2019-11-12 2020-03-24 中移(杭州)信息技术有限公司 CDN cache hit rate statistical method, system, network device and storage medium
CN111162930A (en) * 2019-12-09 2020-05-15 杭州安恒信息技术股份有限公司 Delay response control method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101635718A (en) * 2009-08-26 2010-01-27 中兴通讯股份有限公司 Network crawler system and method for acquiring resource as well as network resource gripping device
CN103037010A (en) * 2012-12-26 2013-04-10 人民搜索网络股份公司 Distributed network crawler system and catching method thereof
CN103559219A (en) * 2013-10-18 2014-02-05 北京京东尚科信息技术有限公司 Distributed web crawler capture task dispatching method, dispatching-side device and capture nodes
CN103873597A (en) * 2014-04-15 2014-06-18 厦门市美亚柏科信息股份有限公司 Distributed webpage downloading method and system
CN106503017A (en) * 2015-09-08 2017-03-15 摩贝(上海)生物科技有限公司 Task capture system and method for a distributed crawler system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡炜: "分布式Web_Crawler系统研究与实现", 《中国优秀硕士学位论文全文数据库(电子期刊)信息科技辑》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109298987A (en) * 2017-07-25 2019-02-01 北京国双科技有限公司 Method and device for detecting the operating status of a web crawler
CN110912769A (en) * 2019-11-12 2020-03-24 中移(杭州)信息技术有限公司 CDN cache hit rate statistical method, system, network device and storage medium
CN110912769B (en) * 2019-11-12 2021-08-10 中移(杭州)信息技术有限公司 CDN cache hit rate statistical method, system, network device and storage medium
CN111162930A (en) * 2019-12-09 2020-05-15 杭州安恒信息技术股份有限公司 Delay response control method
CN111162930B (en) * 2019-12-09 2022-11-11 杭州安恒信息技术股份有限公司 Delay response control method

Similar Documents

Publication Publication Date Title
US11128621B2 (en) Method and apparatus for accessing website
CN103765423B (en) Gathering transaction data associated with locally stored data files
CN102868719B (en) A kind of Network Access Method based on buffer memory and server
CN103348346B (en) For detecting the method and system of new browser window
US10693858B2 (en) CDN-based access control method and related device
US11610182B2 (en) System and method for electronic lead verification
CN106302595B (en) Method and equipment for carrying out health check on server
CN105407074A (en) Authentication method, apparatus and system
CN104036160A (en) Web browsing method, device and browser
CN101848374A (en) Wireless video monitoring system and wireless video monitoring method thereof
US10165062B2 (en) Method and apparatus for implementing action instruction based on barcode
CN103841111A (en) Method for preventing data from being submitted repeatedly and server
WO2015126880A1 (en) Uploading a form attachment
CN103095530A (en) Method and system for sensitive information monitoring and leakage prevention based on front-end gateway
CN106331108A (en) Crawler realization method and system capable of breaking through IP limit
CN104023046B (en) Mobile terminal recognition method and device
CN103973635A (en) Page access control method, and related device and system
CN104683290A (en) Method and device for monitoring phishing and terminal
WO2015085735A1 (en) Information requesting method and system
WO2016003042A1 (en) Unmanned book lending method using smart phone
CN106411978A (en) Resource caching method and apparatus
WO2014183494A1 (en) Method, apparatus, and system of opening a web page
CN106874753A (en) The method and device at the abnormal interface of identification
CN105095303B (en) Quick link pushing method and quick link pushing device
CN101420490A (en) Data reading method and device

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Dong Chao

Inventor before: Zhou Hao

Inventor before: Dong Chao

TA01 Transfer of patent application right

Effective date of registration: 20200415

Address after: No.13, row 13, shagelong village, Jiangzhuang Township, Yuanyang County, Xinxiang City, Henan Province

Applicant after: Silver respectful

Address before: 100080 Haidian District Danleng street Beijing City No. 1 Internet Financial Center 11 1102

Applicant before: BEIJING LIANGKEBANG INFORMATION TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20170111
