Method and system for verifying data captured by crowdsourced web crawlers
Technical field
The invention belongs to the technical field of web crawlers, and more particularly relates to a method
and system for verifying data captured by crowdsourced web crawlers.
Background art
A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF
community) is a program or script that automatically captures web information according to certain rules.
Capturing users' credit data from the Internet is an important means of credit rating. For example,
transaction records captured from the Alipay website can indirectly reflect a user's economic strength.
However, capturing such information runs into deliberately imposed technical obstacles.
To prevent crawlers from capturing information, some websites impose IP restrictions. For example, if a
single IP address is limited to 100 accesses per minute, a crawler server can initiate only 100 network
requests per minute, and its 101st request is refused by the target server.
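The restriction described above can be illustrated with a minimal sketch of a per-IP limiter. The text
does not say how any particular website implements its limit; the fixed-window counting scheme and the
names PerIpRateLimiter and allow are assumptions made only for illustration.

```python
from collections import defaultdict


class PerIpRateLimiter:
    """Fixed-window limiter: at most `limit` requests per IP per window."""

    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        # Maps (ip, window index) -> number of requests accepted in that window.
        self.counts = defaultdict(int)

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` (seconds) is accepted."""
        key = (ip, int(now // self.window_seconds))
        if self.counts[key] >= self.limit:
            return False  # e.g. the 101st request within the same minute
        self.counts[key] += 1
        return True
```

With the default values, 100 requests from one address in the same minute are accepted and the next one
is refused, while a request from a different address is unaffected.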
The most common solution is to add servers and thereby increase the number of IP addresses. For example,
if a single IP address is limited to 100 accesses per minute, increasing the number of IP addresses to
500 makes 50,000 requests per minute possible. Although this solves the problem, the cost is huge and
the approach is uneconomical.
In an earlier patent application, the applicant proposed a crowdsourcing approach in which a large number
of clients (Internet-connected smart devices such as personal computers and mobile phones) help capture
data, thereby breaking through the IP restriction.
A new challenge follows, however: ensuring that the data captured by crowdsourced web crawlers is true
and reliable requires an effective mechanism.
Summary of the invention
The technical problem solved by the invention is to overcome the deficiencies of the prior art by
providing a method for verifying data captured by crowdsourced web crawlers that can ensure the captured
data is true and reliable.
The technical solution of the invention is as follows. In this method for verifying data captured by
crowdsourced web crawlers, a server serves as the inspection center for the capture results of the
crawler clients, and each crawler client uploads the page content it has captured to the inspection
center. The inspection center compares the content uploaded by multiple crawler clients. If the results
are identical, it adds credit to each crawler client; if the results differ, it issues the task once
more and checks these crawler clients again to tell the good from the bad, and then adds or deducts
credit accordingly. The credit score represents the reliability of a crawler client, and crawler clients
with high credit scores are preferentially selected to perform capture tasks.
In the invention, the inspection center compares the content uploaded by multiple crawler clients. If the
results are identical, it adds credit to each crawler client; if the results differ, it issues the task
once more and checks these crawler clients again to tell the good from the bad, and then adds or deducts
credit accordingly. Because the credit score represents the reliability of a crawler client and crawler
clients with high credit scores are preferentially selected to complete capture tasks, the data captured
by crowdsourced web crawlers can be ensured to be true and reliable.
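The comparison step can be sketched as follows. The text only states that the uploaded content is
compared for identity; comparing SHA-256 digests rather than the raw pages, and the names fingerprint
and uploads_agree, are assumptions made for illustration.

```python
import hashlib


def fingerprint(page_content: str) -> str:
    """Return a SHA-256 digest of the page content (illustrative choice)."""
    return hashlib.sha256(page_content.encode("utf-8")).hexdigest()


def uploads_agree(uploads: dict) -> bool:
    """Return True if every crawler client uploaded identical content.

    `uploads` maps a crawler ID to the page content that client uploaded.
    An empty mapping is treated as "no agreement".
    """
    digests = {fingerprint(content) for content in uploads.values()}
    return len(digests) == 1
```

Exact comparison treats any difference as a disagreement, so pages with legitimately varying parts
(timestamps, for example) would need normalization first; the text does not address that case.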
A system for verifying data captured by crowdsourced web crawlers is also provided. The system includes:
a server configured to serve as the inspection center for the capture results of the crawler clients; and
crawler clients, each configured to upload the page content it has captured to the inspection center;
wherein the inspection center compares the content uploaded by multiple crawler clients; if the results
are identical, it adds credit to each crawler client; if the results differ, it issues the task once
more and checks these crawler clients again to tell the good from the bad, and then adds or deducts
credit accordingly; the credit score represents the reliability of a crawler client, and crawler clients
with high credit scores are preferentially selected to complete capture tasks.
Brief description of the drawings
Fig. 1 is a flow chart of the method for verifying data captured by crowdsourced web crawlers according
to the invention.
Detailed description of the invention
As shown in Fig. 1, in this method for verifying data captured by crowdsourced web crawlers, a server
serves as the inspection center for the capture results of the crawler clients, and each crawler client
uploads the page content it has captured to the inspection center. The inspection center compares the
content uploaded by multiple crawler clients. If the results are identical, it adds credit to each
crawler client; if the results differ, it issues the task once more and checks these crawler clients
again to tell the good from the bad, and then adds or deducts credit accordingly. The credit score
represents the reliability of a crawler client, and crawler clients with high credit scores are
preferentially selected to complete capture tasks.
In the invention, the inspection center compares the content uploaded by multiple crawler clients. If the
results are identical, it adds credit to each crawler client; if the results differ, it issues the task
once more and checks these crawler clients again to tell the good from the bad, and then adds or deducts
credit accordingly. Because the credit score represents the reliability of a crawler client and crawler
clients with high credit scores are preferentially selected to complete capture tasks, the data captured
by crowdsourced web crawlers can be ensured to be true and reliable.
Further, the method comprises the following steps:
(1) the inspection center maintains a crawler list, each entry of which contains a crawler ID and the
crawler's credit score;
(2) the inspection center distributes one capture task to multiple crawlers with different IP addresses
and at the same time opens a time window to wait for the crawlers to upload data;
(3) data uploaded by the crawlers is received within the time window and is evaluated when the time
window closes.
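Steps (1) to (3) can be sketched as follows. This is a minimal illustration, not the claimed
implementation: the names Crawler, pick_crawlers and TimeWindow are hypothetical, storing the IP address
in each list entry is an assumption (the text lists only the crawler ID and credit score), and the
injectable clock exists only to make the sketch easy to exercise.

```python
import time
from dataclasses import dataclass


@dataclass
class Crawler:
    crawler_id: str
    ip: str          # assumption: kept alongside the ID so step (2) can pick distinct IPs
    credit: int = 0


def pick_crawlers(crawler_list, count):
    """Pick up to `count` crawlers, highest credit first, at most one per IP address."""
    chosen, seen_ips = [], set()
    for crawler in sorted(crawler_list, key=lambda c: c.credit, reverse=True):
        if len(chosen) >= count:
            break
        if crawler.ip not in seen_ips:
            chosen.append(crawler)
            seen_ips.add(crawler.ip)
    return chosen


class TimeWindow:
    """Collects uploads from the assigned crawlers until the deadline passes."""

    def __init__(self, assigned_ids, duration, clock=time.monotonic):
        self.assigned = set(assigned_ids)
        self.clock = clock
        self.deadline = clock() + duration
        self.uploads = {}

    def is_open(self):
        return self.clock() < self.deadline

    def receive(self, crawler_id, content):
        """Store an upload if it arrives in time from an assigned crawler."""
        if self.is_open() and crawler_id in self.assigned:
            self.uploads[crawler_id] = content
            return True
        return False
```

Once is_open() returns False, the collected uploads dictionary is what the evaluation at window close
would operate on.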
Further, if a crawler has failed to report data when the time window closes, its credit score is reduced
by 5. For crawlers that report data before the time window closes: if the reported content is entirely
consistent, the credit score of each crawler that completed the task is increased by 1; if the reported
content is inconsistent, the capture task is treated as failed and the crawlers that performed the task
are placed on a watch list. The server then reissues the task, and at the same time the inspection
center itself captures the page from the website and compares the results submitted by the crawlers
against its own result. When the next time window closes, a crawler whose captured result is consistent
with that of the inspection center has its credit score increased by 1, and a crawler whose result is
inconsistent is judged to have made a false report and has its credit score reduced by 50.
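The scoring rules above can be sketched as two functions, one per time window. The values -5, +1 and -50
come from the text; the function names, the dictionary-based data layout, and the two choices marked as
assumptions in the comments are illustrative only.

```python
NO_REPORT_PENALTY = -5
CONSISTENT_REWARD = 1
FALSE_REPORT_PENALTY = -50


def evaluate_first_round(credits, assigned, uploads):
    """Score one closed time window; return the crawler IDs for the watch list.

    credits:  dict crawler ID -> credit score (updated in place)
    assigned: IDs of the crawlers that were given the task
    uploads:  dict crawler ID -> content reported before the window closed
    """
    for cid in assigned:
        if cid not in uploads:
            credits[cid] += NO_REPORT_PENALTY
    if len(set(uploads.values())) <= 1:
        # All reported content is consistent: reward every reporter.
        for cid in uploads:
            credits[cid] += CONSISTENT_REWARD
        return set()
    # Inconsistent reports: the task fails and reporters are not scored yet.
    # Assumption: only crawlers that actually reported are watch-listed.
    return set(uploads)


def evaluate_recheck(credits, watch_list, uploads, reference):
    """Score the reissued task against the inspection center's own capture."""
    for cid in watch_list:
        if cid not in uploads:
            # Assumption: the -5 no-report rule also applies to the re-check.
            credits[cid] += NO_REPORT_PENALTY
        elif uploads[cid] == reference:
            credits[cid] += CONSISTENT_REWARD
        else:
            credits[cid] += FALSE_REPORT_PENALTY  # judged a false report
```

The large gap between +1 and -50 means a single false report cancels many honest results, which matches
the stated aim of preferring crawlers with high credit scores.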
Those of ordinary skill in the art will appreciate that all or part of the steps of the method of the
above embodiment can be carried out by a program instructing the relevant hardware. The program can be
stored in a computer-readable storage medium and, when executed, performs the steps of the method of the
above embodiment; the storage medium may be a ROM/RAM, a magnetic disk, an optical disc, a memory card,
or the like. Accordingly, corresponding to the method of the invention, the invention also includes a
system for verifying data captured by crowdsourced web crawlers, which is generally expressed as
functional modules corresponding to the steps of the method. The system using the method includes:
a server configured to serve as the inspection center for the capture results of the crawler clients; and
crawler clients, each configured to upload the page content it has captured to the inspection center;
wherein the inspection center compares the content uploaded by multiple crawler clients; if the results
are identical, it adds credit to each crawler client; if the results differ, it issues the task once
more and checks these crawler clients again to tell the good from the bad, and then adds or deducts
credit accordingly; the credit score represents the reliability of a crawler client, and crawler clients
with high credit scores are preferentially selected to complete capture tasks.
The above is only a preferred embodiment of the invention and does not limit the invention in any form.
Any simple modification, equivalent change, or variation made to the above embodiment in accordance with
the technical essence of the invention still falls within the scope of protection of the technical
solution of the invention.