Background technology
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically captures information from the web according to certain rules.
Capturing users' credit data from the Internet is an important means of credit rating. For example, transaction records captured from the Alipay website can indirectly reflect a user's economic strength. However, capturing this information runs into deliberately imposed technical obstacles.
Some websites impose IP restrictions to prevent crawlers from capturing information. For example, if a single IP address is limited to 100 accesses per minute, a crawler server can initiate only 100 network requests per minute; the 101st request is refused by the target server.
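The restriction described above can be illustrated with a minimal sketch of a per-IP limiter. The fixed one-minute window, the class name `PerIpLimiter`, and the example address are assumptions made for illustration; real websites may count requests differently.

```python
from collections import defaultdict


class PerIpLimiter:
    """Hypothetical fixed-window limiter: at most `limit` requests per IP per window."""

    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        # (ip, window index) -> number of requests accepted in that window
        self.counts = defaultdict(int)

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` (in seconds) is accepted."""
        window = int(now // self.window_seconds)
        key = (ip, window)
        if self.counts[key] >= self.limit:
            return False  # e.g. the 101st request within the same minute
        self.counts[key] += 1
        return True


limiter = PerIpLimiter(limit=100, window_seconds=60)
# 101 requests from one address, all within the first minute.
results = [limiter.allow("203.0.113.7", now=i * 0.5) for i in range(101)]
```

In this sketch the first 100 calls are accepted and the 101st is refused, while a request from a different address, or one made in the next minute, is accepted again.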
The most common solution is to add servers and thereby increase the number of IP addresses. For example, if a single IP address is limited to 100 accesses per minute, increasing the number of IP addresses to 500 allows 50,000 requests per minute (100 × 500). This solves the problem, but at a cost so high that it is uneconomical.
Summary of the invention
The technical problem solved by the present invention is to overcome the deficiencies of the prior art by providing a crawler implementation method that breaks through IP restrictions, which costs little and allows a crawler to capture information despite IP restrictions.
The technical solution of the present invention is a crawler implementation method that breaks through IP restrictions, the method comprising the following steps:
(1) the crawler scheduling server issues a crawl task, the crawl task comprising a task ID, the HTTP request URL and all of its parameters, and a maximum waiting time;
(2) after receiving the crawl task, the client immediately initiates the HTTP request and captures the corresponding page;
(3) once the page has been captured, it is checked whether the maximum waiting time has been exceeded; if it has not been exceeded, step (4) is performed, otherwise step (1) is performed;
(4) the captured data is sent to the crawler scheduling server together with the task ID, the captured data being the character string returned in the HTTP response.
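Steps (2) to (4) on the client side can be sketched as follows. This is an illustrative sketch only: the names `CrawlTask`, `http_fetch`, and `run_task`, the use of a GET request, and the dictionary used as the submission format are assumptions, not part of the described method. The time limit carried by the task defaults to 30 seconds here.

```python
import time
import urllib.parse
import urllib.request
from dataclasses import dataclass, field


@dataclass
class CrawlTask:
    """Hypothetical task record: ID, request URL, parameters, time limit."""
    task_id: str
    url: str
    params: dict = field(default_factory=dict)
    max_wait_seconds: float = 30.0


def http_fetch(task):
    """Default fetcher (assumes GET): return the response body as a string."""
    query = urllib.parse.urlencode(task.params)
    full_url = f"{task.url}?{query}" if query else task.url
    with urllib.request.urlopen(full_url, timeout=task.max_wait_seconds) as resp:
        return resp.read().decode("utf-8", errors="replace")


def run_task(task, fetch=http_fetch, clock=time.monotonic):
    """Fetch the page, check the time limit, and build the submission.

    Returns a dict to send to the scheduling server, or None when the time
    limit was exceeded, in which case the server is expected to reissue the task.
    """
    started = clock()
    body = fetch(task)                      # step (2): capture the page
    elapsed = clock() - started
    if elapsed > task.max_wait_seconds:     # step (3): time limit check
        return None
    return {"task_id": task.task_id, "data": body}  # step (4): tag with task ID


# Usage with a stand-in fetcher, so the example needs no network access.
task = CrawlTask("t-1", "https://example.com/page", {"q": "1"})
submission = run_task(task, fetch=lambda t: "<html>ok</html>")
```

Passing the fetcher and the clock as parameters is a design choice made for this sketch so that the time limit check can be exercised without a real HTTP request.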
The present invention issues page-capturing tasks to clients (for example, an app installed on a user's mobile phone) and uses the very large number of IP addresses that the clients provide to break through the restriction. It therefore costs little and allows a crawler to capture information despite IP restrictions.
A crawler implementation system that breaks through IP restrictions is also provided, the system comprising:
a crawler scheduling server, configured to issue a crawl task, the crawl task comprising a task ID, the HTTP request URL and all of its parameters, and a maximum waiting time;
a client, configured to immediately initiate the HTTP request and capture the corresponding page after receiving the crawl task;
a detection module, configured to check, once the page has been captured, whether the maximum waiting time has been exceeded;
a sending module, configured to send the captured data to the crawler scheduling server together with the task ID, the captured data being the character string returned in the HTTP response.
Detailed description of the invention
As shown in Fig. 1, the crawler implementation method that breaks through IP restrictions comprises the following steps:
(1) the crawler scheduling server issues a crawl task, the crawl task comprising a task ID, the HTTP request URL and all of its parameters, and a maximum waiting time;
(2) after receiving the crawl task, the client immediately initiates the HTTP request and captures the corresponding page;
(3) once the page has been captured, it is checked whether the maximum waiting time has been exceeded; if it has not been exceeded, step (4) is performed, otherwise step (1) is performed;
(4) the captured data is sent to the crawler scheduling server together with the task ID, the captured data being the character string returned in the HTTP response.
The present invention issues page-capturing tasks to clients (for example, an app installed on a user's mobile phone) and uses the very large number of IP addresses that the clients provide to break through the restriction. It therefore costs little and allows a crawler to capture information despite IP restrictions.
Further, in step (1), the maximum waiting time is 30 seconds.
In addition, a verification step is included after step (4): two clients send the same HTTP request within the maximum waiting time and submit the returned results to the crawler scheduling server. If the submitted results are identical, the capture is judged authentic and valid; if the submitted results differ, the capture is judged invalid and the crawl task is reissued.
In addition, after step (4), each time the crawler scheduling server receives a submission, it looks up the current task ID in its cache. If another client has already submitted a result for that task, the two submissions are compared for consistency; otherwise the current submission is recorded in the cache.
In addition, when the two submissions are compared: if they are consistent, the capture is trusted and the result is written to the database; if they are inconsistent, the crawl task is voided, and a new crawl task is generated and issued.
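The cache lookup and comparison on the scheduling server can be sketched as below. The class name `SubmissionChecker`, the in-memory dictionaries standing in for the cache and the database, and the returned status strings are assumptions for illustration only.

```python
class SubmissionChecker:
    """Hypothetical server-side check that compares two submissions per task."""

    def __init__(self):
        self.cache = {}     # task_id -> data from the first submission
        self.database = {}  # task_id -> verified data (stand-in for a real database)
        self.reissued = []  # task IDs that were voided and must be issued again

    def submit(self, task_id, data):
        """Handle one client submission and report what happened to it."""
        if task_id not in self.cache:
            # No other client has submitted yet: record this submission.
            self.cache[task_id] = data
            return "cached"
        first = self.cache.pop(task_id)
        if first == data:
            # Both submissions agree: the capture is trusted.
            self.database[task_id] = data
            return "verified"
        # Submissions disagree: void the task so it can be regenerated.
        self.reissued.append(task_id)
        return "voided"


checker = SubmissionChecker()
first_status = checker.submit("t-1", "<html>A</html>")
second_status = checker.submit("t-1", "<html>A</html>")
checker.submit("t-2", "<html>A</html>")
mismatch_status = checker.submit("t-2", "<html>B</html>")
```

Here `t-1` ends up in the stand-in database because both submissions match, while `t-2` is queued for reissue because they differ.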
In addition, in step (1), the crawler scheduling server distributes tasks evenly.
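The text does not specify how tasks are spread across clients. One simple possibility, shown here purely as an assumption, is round-robin assignment; the function name `assign_evenly` and the client identifiers are hypothetical.

```python
from collections import Counter
from itertools import cycle


def assign_evenly(task_ids, client_ids):
    """Pair each task with a client in round-robin order (an assumed policy)."""
    if not client_ids:
        raise ValueError("at least one client is required")
    clients = cycle(client_ids)
    return [(task_id, next(clients)) for task_id in task_ids]


assignments = assign_evenly(range(10), ["c1", "c2", "c3"])
# Count how many tasks each client received.
load = Counter(client for _, client in assignments)
```

With round-robin assignment, the number of tasks per client differs by at most one, which keeps the request rate from any single IP address low.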
Those of ordinary skill in the art will appreciate that all or part of the steps of the method in the above embodiment can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method in the above embodiment; the storage medium may be a ROM/RAM, magnetic disk, optical disc, memory card, or the like. Accordingly, corresponding to the method of the present invention, the present invention also includes a crawler implementation system that breaks through IP restrictions, generally expressed as functional modules corresponding to the steps of the method. The system using the method comprises:
a crawler scheduling server, configured to issue a crawl task, the crawl task comprising a task ID, the HTTP request URL and all of its parameters, and a maximum waiting time;
a client, configured to immediately initiate the HTTP request and capture the corresponding page after receiving the crawl task;
a detection module, configured to check, once the page has been captured, whether the maximum waiting time has been exceeded;
a sending module, configured to send the captured data to the crawler scheduling server together with the task ID, the captured data being the character string returned in the HTTP response.
The above is merely a preferred embodiment of the present invention and does not limit the present invention in any form. Any simple modification, equivalent change, or variation made to the above embodiment in accordance with the technical essence of the present invention still falls within the protection scope of the technical solution of the present invention.