CN106326447A - Detection method and system of data captured by crowd sourcing network crawlers

Info

Publication number: CN106326447A
Application number: CN201610737578.4A
Authority: CN
Inventors: 周灏; 董超
Original assignee: Beijing Liangkebang Information Technology Co Ltd
Current assignee: Silver respectful
Priority date: 2016-08-26
Filing date: 2016-08-26
Publication date: 2017-01-11
Anticipated expiration: 2036-08-26
Also published as: CN106326447B

Abstract

The invention discloses a detection method for data captured by crowdsourced web crawlers, which ensures that the captured data is authentic and reliable. In the method, a server acts as the inspection center for the capture results of crawler clients. Each crawler client uploads the page content it has captured to the inspection center, which compares the content uploaded by multiple crawler clients. If the results are the same, a credit score is added for each crawler client. If the results differ, the task is issued again and these crawler clients are rechecked to distinguish reliable clients from unreliable ones, and credit scores are then added or deducted accordingly. The credit score indicates the reliability of a crawler client, and crawler clients with high credit scores are preferentially chosen to complete capture tasks. The invention also provides a detection system for data captured by crowdsourced web crawlers.

Description

Detection method and system for data captured by crowdsourced web crawlers

Technical field

The invention belongs to the technical field of web crawlers, and more particularly relates to a detection method and system for data captured by crowdsourced web crawlers.

Background

A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically captures web information according to certain rules.

Capturing users' credit data from the Internet is an important means of credit rating. For example, transaction records captured from the Alipay website can indirectly reflect a user's economic strength. However, capturing this information runs into deliberately imposed technical obstacles.

To prevent crawlers from capturing information, some websites impose IP restrictions. For example, if a single IP address is limited to 100 accesses per minute, a crawler server can initiate only 100 network requests per minute, and the 101st request is refused by the target server.
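The restriction described above can be pictured with a minimal sketch of a per-IP fixed-window counter. This is only an assumption about how a target site might enforce such a limit; the class name `FixedWindowLimiter`, its parameters, and the example IP address are hypothetical and do not come from the patent.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Per-IP fixed-window request counter (illustrative assumption only)."""

    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        # Maps (ip, window index) to the number of requests accepted so far.
        self.counts = defaultdict(int)

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` (seconds) is accepted."""
        window = int(now // self.window_seconds)
        key = (ip, window)
        if self.counts[key] >= self.limit:
            return False  # limit reached in this window: request refused
        self.counts[key] += 1
        return True


# 101 requests from one IP inside the same minute.
limiter = FixedWindowLimiter(limit=100, window_seconds=60)
results = [limiter.allow("203.0.113.7", now=0.5) for _ in range(101)]
```

In this sketch the first 100 calls are accepted and the 101st is refused, while a request from a different IP address, or from the same IP in the next minute, is accepted again.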

The most common solution is to add servers and thereby increase the number of IP addresses. For example, with a limit of 100 accesses per minute per IP address, increasing the number of IP addresses to 500 makes it possible to reach 500 × 100 = 50,000 requests per minute. This solves the problem, but the cost is huge and the approach is uneconomical.

In an earlier patent application, the applicant proposed a crowdsourcing approach that lets a large number of clients (smart devices connected to the Internet, such as personal computers and mobile phones) help capture data, thereby breaking through the IP restriction.

This approach brings its own challenge, however: an effective mechanism is needed to ensure that the data captured by crowdsourced web crawlers is authentic and reliable.

Summary of the invention

The technical problem solved by the invention is to overcome the deficiencies of the prior art and provide a detection method for data captured by crowdsourced web crawlers that can ensure the captured data is authentic and reliable.

The technical solution of the invention is as follows. In this detection method for data captured by crowdsourced web crawlers, a server acts as the inspection center for the capture results of crawler clients. Each crawler client uploads the page content it has captured to the inspection center, and the inspection center compares the content uploaded by multiple crawler clients. If the results are the same, a credit score is added for each crawler client. If the results differ, the task is issued again and these crawler clients are rechecked to distinguish reliable clients from unreliable ones, after which credit scores are added or deducted accordingly. The credit score represents the reliability of a crawler client, and crawler clients with high credit scores are preferentially selected to complete capture tasks.

In the invention, the inspection center compares the content uploaded by multiple crawler clients. If the results are the same, a credit score is added for each crawler client. If the results differ, the task is issued again and these crawler clients are rechecked to distinguish reliable clients from unreliable ones, after which credit scores are added or deducted accordingly. Because the credit score represents the reliability of a crawler client and clients with high credit scores are preferentially selected to complete capture tasks, the invention can ensure that the data captured by crowdsourced web crawlers is authentic and reliable.
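The comparison step can be sketched as follows. The patent does not say how uploaded content is compared, so exact equality of SHA-256 digests is an assumption made here for illustration, and the function names `content_digest` and `uploads_agree` are hypothetical. Real pages often contain dynamic fragments (timestamps, session tokens), which an exact comparison would treat as a mismatch.

```python
import hashlib


def content_digest(page_content: str) -> str:
    """Digest of one uploaded page (assumed comparison key, not from the patent)."""
    return hashlib.sha256(page_content.encode("utf-8")).hexdigest()


def uploads_agree(uploads: dict) -> bool:
    """Return True if all crawler clients uploaded identical content.

    `uploads` maps a crawler ID to the page content that crawler uploaded.
    An empty mapping is treated as "no agreement".
    """
    digests = {content_digest(content) for content in uploads.values()}
    return len(digests) == 1
```

Comparing digests rather than full pages keeps the check cheap when many clients upload large pages, at the cost of only detecting exact matches.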

A detection system for data captured by crowdsourced web crawlers is also provided. The system includes:

a server, configured to act as the inspection center for the capture results of crawler clients;

crawler clients, configured to upload captured page content to the inspection center;

wherein the inspection center compares the content uploaded by multiple crawler clients. If the results are the same, a credit score is added for each crawler client. If the results differ, the task is issued again and these crawler clients are rechecked to distinguish reliable clients from unreliable ones, after which credit scores are added or deducted accordingly. The credit score represents the reliability of a crawler client, and crawler clients with high credit scores are preferentially selected to complete capture tasks.
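Preferential selection of high-credit crawler clients might look like the sketch below. The patent only states that clients with high credit scores are preferred; ranking by descending score and breaking ties by crawler ID is an assumption, and `select_crawlers` is a hypothetical name.

```python
def select_crawlers(credit_scores: dict, count: int) -> list:
    """Pick up to `count` crawler IDs, highest credit score first.

    `credit_scores` maps a crawler ID to its current credit score.
    Ties are broken by crawler ID so the result is deterministic
    (an assumption; the patent does not specify tie-breaking).
    """
    ranked = sorted(credit_scores.items(), key=lambda item: (-item[1], item[0]))
    return [crawler_id for crawler_id, _score in ranked[:count]]


chosen = select_crawlers({"a": 3, "b": 10, "c": -50, "d": 10}, count=2)
```

Here `chosen` contains the two clients with score 10, and the client that was penalized to -50 would be picked last.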

Brief description of the drawings

Fig. 1 is a flow chart of the detection method for data captured by crowdsourced web crawlers according to the invention.

Detailed description of the embodiments

As shown in Fig. 1, in this detection method for data captured by crowdsourced web crawlers, a server acts as the inspection center for the capture results of crawler clients. Each crawler client uploads the page content it has captured to the inspection center, and the inspection center compares the content uploaded by multiple crawler clients. If the results are the same, a credit score is added for each crawler client. If the results differ, the task is issued again and these crawler clients are rechecked to distinguish reliable clients from unreliable ones, after which credit scores are added or deducted accordingly. The credit score represents the reliability of a crawler client, and crawler clients with high credit scores are preferentially selected to complete capture tasks.

Further, the method comprises the following steps:

(1) The inspection center maintains a crawler list; each entry in the list contains a crawler ID and that crawler's credit score.

(2) The inspection center distributes a capture task to multiple crawlers with different IP addresses and at the same time opens a time window to wait for the crawlers to upload data.

(3) The data uploaded by the crawlers is received during the time window and evaluated when the time window closes.
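Steps (1) to (3) can be sketched as follows. This is a simplified, single-process illustration under several assumptions: the `ip` field is added to each list entry so that distinct IP addresses can be enforced (the patent's list entry names only the crawler ID and credit score), uploads are represented as `(crawler_id, arrival_time, content)` tuples instead of network messages, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CrawlerEntry:
    crawler_id: str
    ip: str          # assumed field, used only to pick crawlers with distinct IPs
    credit: int = 0


def pick_distinct_ips(crawlers, count):
    """Step (2): pick up to `count` crawlers, at most one per IP address."""
    chosen, seen_ips = [], set()
    for entry in crawlers:
        if len(chosen) >= count:
            break
        if entry.ip not in seen_ips:
            chosen.append(entry)
            seen_ips.add(entry.ip)
    return chosen


def collect_in_window(assigned_ids, uploads, opened_at, window_seconds):
    """Step (3): keep uploads that arrive while the time window is open.

    Returns (received, missing): content per reporting crawler, and the
    assigned crawlers that had not reported when the window closed.
    """
    deadline = opened_at + window_seconds
    received = {}
    for crawler_id, arrival_time, content in uploads:
        if crawler_id in assigned_ids and opened_at <= arrival_time <= deadline:
            received[crawler_id] = content
    missing = [cid for cid in assigned_ids if cid not in received]
    return received, missing


# Step (1): the crawler list maintained by the inspection center.
crawler_list = [
    CrawlerEntry("c1", "198.51.100.1"),
    CrawlerEntry("c2", "198.51.100.1"),  # same IP as c1, so it is skipped
    CrawlerEntry("c3", "198.51.100.2"),
]
assigned = [entry.crawler_id for entry in pick_distinct_ips(crawler_list, 2)]
received, missing = collect_in_window(
    assigned,
    uploads=[("c1", 5.0, "page"), ("c3", 40.0, "page")],
    opened_at=0.0,
    window_seconds=30.0,
)
```

In this run `c1` reports inside the 30-second window, while the upload from `c3` arrives after the window has closed and is therefore counted as missing.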

Further, if a crawler has not reported data by the time the time window closes, its credit score is reduced by 5. For the crawlers that reported data before the time window closed: if the reported content is entirely consistent, the credit score of each crawler that completed the task is increased by 1; if the reported content is inconsistent, the capture task is treated as failed and the crawlers that performed the task are placed on a watch list. The server then reissues the task, and at the same time the inspection center itself captures the page from the website and compares its own result with the results resubmitted by the crawlers. When the next time window closes, a crawler whose captured result is consistent with that of the inspection center has its credit score increased by 1, and a crawler whose result is inconsistent is judged to have made a false report and has its credit score reduced by 50.
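The scoring rules above can be sketched as two small functions. The score changes (-5, +1, -50) come from the description; everything else is an assumption for illustration, including the function names, the use of exact string equality for "consistent", and applying the -5 rule to a watched crawler that stays silent during the recheck, a case the description does not address.

```python
MISSED_PENALTY = -5          # no report before the time window closes
AGREE_REWARD = 1             # report consistent with the others / the center
FALSE_REPORT_PENALTY = -50   # report contradicts the inspection center


def evaluate_first_round(credits, assigned, received):
    """Apply first-round rules in place; return the watch list to recheck."""
    for cid in assigned:
        if cid not in received:
            credits[cid] += MISSED_PENALTY
    reporters = [cid for cid in assigned if cid in received]
    if not reporters:
        return []
    if len({received[cid] for cid in reporters}) == 1:
        for cid in reporters:
            credits[cid] += AGREE_REWARD
        return []
    # Inconsistent content: the task fails and every reporter is watched.
    return reporters


def evaluate_recheck(credits, watch_list, received, reference):
    """Compare resubmitted results with the center's own capture."""
    for cid in watch_list:
        if cid not in received:
            credits[cid] += MISSED_PENALTY  # assumption: reuse the -5 rule
        elif received[cid] == reference:
            credits[cid] += AGREE_REWARD
        else:
            credits[cid] += FALSE_REPORT_PENALTY


credits = {"a": 0, "b": 0, "c": 0}
watch = evaluate_first_round(credits, ["a", "b", "c"], {"a": "X", "b": "Y"})
evaluate_recheck(credits, watch, {"a": "X", "b": "Y"}, reference="X")
```

In this run crawler `c` never reports and loses 5 points, `a` and `b` disagree and are rechecked, and after the recheck `a` gains 1 point while `b` loses 50 for contradicting the inspection center's own capture.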

Those of ordinary skill in the art will appreciate that all or part of the steps of the method in the above embodiment can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method in the above embodiment. The storage medium may be ROM/RAM, a magnetic disk, an optical disc, a memory card, and so on. Accordingly, corresponding to the method of the invention, the invention also includes a detection system for data captured by crowdsourced web crawlers. The system is generally represented as functional modules corresponding to the steps of the method. The system using the method includes:

a server, configured to act as the inspection center for the capture results of crawler clients;

crawler clients, configured to upload captured page content to the inspection center;

wherein the inspection center compares the content uploaded by multiple crawler clients. If the results are the same, a credit score is added for each crawler client. If the results differ, the task is issued again and these crawler clients are rechecked to distinguish reliable clients from unreliable ones, after which credit scores are added or deducted accordingly. The credit score represents the reliability of a crawler client, and crawler clients with high credit scores are preferentially selected to complete capture tasks.

The above is only a preferred embodiment of the invention and does not limit the invention in any form. Any simple modification, equivalent change, or variation made to the above embodiment according to the technical essence of the invention still falls within the protection scope of the technical solution of the invention.

Claims

1. A detection method for data captured by crowdsourced web crawlers, characterized in that: a server acts as the inspection center for the capture results of crawler clients; each crawler client uploads the page content it has captured to the inspection center; the inspection center compares the content uploaded by multiple crawler clients; if the results are the same, a credit score is added for each crawler client; if the results differ, the task is issued again and these crawler clients are rechecked to distinguish reliable clients from unreliable ones, after which credit scores are added or deducted accordingly; the credit score represents the reliability of a crawler client, and crawler clients with high credit scores are preferentially selected to complete capture tasks.

2. The detection method for data captured by crowdsourced web crawlers according to claim 1, characterized in that the method includes the following steps:

(1) the inspection center maintains a crawler list, each entry in the list containing a crawler ID and that crawler's credit score;

(2) the inspection center distributes a capture task to multiple crawlers with different IP addresses and at the same time opens a time window to wait for the crawlers to upload data;

3. The detection method for data captured by crowdsourced web crawlers according to claim 2, characterized in that: if a crawler has not reported data by the time the time window closes, its credit score is reduced by 5; for the crawlers that reported data before the time window closed, if the reported content is entirely consistent, the credit score of each crawler that completed the task is increased by 1; if the reported content is inconsistent, the capture task is treated as failed and the crawlers that performed the task are placed on a watch list; the server reissues the task, and at the same time the inspection center itself captures the page from the website and compares its own result with the results resubmitted by the crawlers; when the next time window closes, a crawler whose captured result is consistent with that of the inspection center has its credit score increased by 1, and a crawler whose result is inconsistent is judged to have made a false report and has its credit score reduced by 50.

4. A detection system for data captured by crowdsourced web crawlers, characterized in that the system includes:

wherein the inspection center compares the content uploaded by multiple crawler clients; if the results are the same, a credit score is added for each crawler client; if the results differ, the task is issued again and these crawler clients are rechecked to distinguish reliable clients from unreliable ones, after which credit scores are added or deducted accordingly; the credit score represents the reliability of a crawler client, and crawler clients with high credit scores are preferentially selected to complete capture tasks.