CN106326447A - Detection method and system of data captured by crowd sourcing network crawlers - Google Patents

Detection method and system of data captured by crowd sourcing network crawlers

Info

Publication number
CN106326447A
Authority
CN
China
Prior art keywords
crawler
credit score
client
result
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610737578.4A
Other languages
Chinese (zh)
Other versions
CN106326447B (en)
Inventor
周灏
董超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Silver respectful
Original Assignee
Beijing Liangkebang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Liangkebang Information Technology Co Ltd
Priority to CN201610737578.4A
Publication of CN106326447A
Application granted
Publication of CN106326447B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention discloses a detection method for data captured by crowdsourced web crawlers, which ensures that the captured data is true and reliable. In the method, a server acts as the inspection center for the capture results of crawler clients. The crawler clients upload captured page content to the inspection center, which compares the content uploaded by multiple crawler clients. If the results are the same, a credit score is added to each crawler client; if the results differ, the task is issued once more and the crawler clients are rechecked to distinguish good clients from bad ones, after which credit scores are added or deducted accordingly. The credit score indicates the reliability of a crawler client, and crawler clients with high credit scores are preferentially chosen to complete capture tasks. The invention also provides a detection system for data captured by crowdsourced web crawlers.

Description

Detection method and system for data captured by crowdsourced web crawlers
Technical field
The invention belongs to the technical field of web crawlers, and more particularly relates to a detection method and system for data captured by crowdsourced web crawlers.
Background art
A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically captures web information according to certain rules.
Capturing users' credit data from the Internet is an important means of credit rating. For example, transaction records captured from the Alipay website can indirectly reflect a user's economic strength. However, capturing this information runs into deliberately imposed technical obstacles.
To prevent crawlers from capturing information, some websites impose IP restrictions. For example, if a single IP is limited to 100 accesses per minute, a crawler server can initiate only 100 network requests per minute; the 101st request will be refused by the target server.
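The restriction described above can be illustrated with a fixed-window counter. This is only a sketch under assumptions: the patent does not say how target websites implement the limit, and the class name `FixedWindowLimiter`, the window scheme, and the example IP address are all hypothetical.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` requests per IP in each `window_seconds` window.

    Hypothetical illustration of a per-IP limit; not taken from the patent.
    """

    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        # (ip, window index) -> number of requests accepted in that window
        self._counts = defaultdict(int)

    def allow(self, ip, now):
        """Return True if the request is accepted, False if it is refused."""
        window = int(now // self.window_seconds)
        key = (ip, window)
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 101 requests from one IP inside the same minute (timestamps 0.0 s to 50.0 s).
results = [limiter.allow("203.0.113.7", now=i * 0.5) for i in range(101)]
print(results[99], results[100])  # True False
```

The first 100 requests are accepted and the 101st is refused, matching the example in the text; a request from the same IP in the next minute, or from a different IP, is accepted again.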
The most common solution is to add servers and thereby increase the number of IP addresses. For example, if a single IP is limited to 100 accesses per minute, increasing the number of IPs to 500 allows 50,000 requests per minute. Although this solves the problem, the cost is huge and the approach is highly uneconomical.
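The arithmetic behind this example is a simple product of the per-IP limit and the number of IP addresses. The short sketch below reproduces it; the function names are hypothetical, and `ips_needed` is an added convenience for the reverse calculation, not something the patent describes.

```python
import math

PER_IP_LIMIT = 100  # requests per minute allowed for one IP (figure from the text)
IP_COUNT = 500      # number of IP addresses after adding servers (figure from the text)


def total_requests_per_minute(per_ip_limit, ip_count):
    """Aggregate request rate when every IP is used up to its limit."""
    return per_ip_limit * ip_count


def ips_needed(target_per_minute, per_ip_limit):
    """Smallest number of IPs that reaches the target request rate."""
    return math.ceil(target_per_minute / per_ip_limit)


capacity = total_requests_per_minute(PER_IP_LIMIT, IP_COUNT)
print(capacity)  # 50000
```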
In an earlier patent application, the applicant provided a crowdsourcing approach in which a large number of clients (smart devices connected to the Internet, such as personal computers and mobile phones) help capture data, thereby breaking through the IP restriction.
But a challenge follows: ensuring that the data captured by crowdsourced web crawlers is true and reliable requires an effective mechanism.
Summary of the invention
The technical problem solved by the present invention is to overcome the deficiencies of the prior art by providing a detection method for data captured by crowdsourced web crawlers, which can ensure that the data captured by crowdsourced web crawlers is true and reliable.
The technical solution of the present invention is as follows. In this detection method for data captured by crowdsourced web crawlers, a server acts as the inspection center for the capture results of crawler clients. Each crawler client uploads the page content it captured to the inspection center, and the inspection center compares the content uploaded by multiple crawler clients. If the results are identical, a credit score is added to each crawler client. If the results differ, the task is issued once more and these crawler clients are checked again to distinguish good from bad, after which credit scores are added or deducted accordingly. The credit score represents the reliability of a crawler client, and crawler clients with high credit scores are preferentially selected to complete capture tasks.
In the present invention, the inspection center compares the content uploaded by multiple crawler clients. If the results are identical, a credit score is added to each crawler client; if the results differ, the task is issued once more and these crawler clients are checked again to distinguish good from bad, after which credit scores are added or deducted accordingly. Because the credit score represents the reliability of a crawler client and crawler clients with high credit scores are preferentially selected to complete capture tasks, the invention can ensure that the data captured by crowdsourced web crawlers is true and reliable.
A detection system for data captured by crowdsourced web crawlers is also provided. The system includes:
a server, configured to act as the inspection center for the capture results of crawler clients;
crawler clients, configured to upload captured page content to the inspection center;
wherein the inspection center compares the content uploaded by multiple crawler clients. If the results are identical, a credit score is added to each crawler client; if the results differ, the task is issued once more and these crawler clients are checked again to distinguish good from bad, after which credit scores are added or deducted accordingly. The credit score represents the reliability of a crawler client, and crawler clients with high credit scores are preferentially selected to complete capture tasks.
Brief description of the drawings
Fig. 1 is a flow chart of the detection method for data captured by crowdsourced web crawlers according to the present invention.
Detailed description of the embodiments
As shown in Fig. 1, in this detection method for data captured by crowdsourced web crawlers, a server acts as the inspection center for the capture results of crawler clients. Each crawler client uploads the page content it captured to the inspection center, and the inspection center compares the content uploaded by multiple crawler clients. If the results are identical, a credit score is added to each crawler client. If the results differ, the task is issued once more and these crawler clients are checked again to distinguish good from bad, after which credit scores are added or deducted accordingly. The credit score represents the reliability of a crawler client, and crawler clients with high credit scores are preferentially selected to complete capture tasks.
In the present invention, the inspection center compares the content uploaded by multiple crawler clients. If the results are identical, a credit score is added to each crawler client; if the results differ, the task is issued once more and these crawler clients are checked again to distinguish good from bad, after which credit scores are added or deducted accordingly. Because the credit score represents the reliability of a crawler client and crawler clients with high credit scores are preferentially selected to complete capture tasks, the invention can ensure that the data captured by crowdsourced web crawlers is true and reliable.
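The preferential selection of high-credit crawler clients could look roughly like the sketch below. The patent does not specify a selection rule, so sorting by credit score and taking the top entries is an assumption, and the names `Crawler` and `pick_crawlers` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Crawler:
    crawler_id: str
    credit: int = 0  # reliability of this crawler client


def pick_crawlers(crawlers: list[Crawler], count: int) -> list[Crawler]:
    """Return up to `count` crawlers, highest credit score first.

    Assumption: a plain descending sort; ties keep their original order.
    """
    ranked = sorted(crawlers, key=lambda c: c.credit, reverse=True)
    return ranked[:count]


pool = [Crawler("a", 3), Crawler("b", 10), Crawler("c", -50), Crawler("d", 7)]
chosen = pick_crawlers(pool, 2)
print([c.crawler_id for c in chosen])  # ['b', 'd']
```

A strict top-N rule like this never gives new or low-scored clients a chance to recover; whether that is desirable is a design choice the patent leaves open.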
Further, the method comprises the following steps:
(1) The inspection center maintains a crawler list; each entry in the crawler list contains a crawler ID and the crawler's credit score;
(2) The inspection center distributes a capture task to multiple crawlers with different IP addresses and at the same time opens a time window, waiting for the crawlers to upload data;
(3) Within the time window, the data uploaded by the crawlers is received, and it is evaluated when the time window closes.
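Steps (1) to (3) can be sketched as a small task object that accepts uploads only while its time window is open. This is a minimal illustration under assumptions: the names `CaptureTask`, `upload`, and `evaluate` are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and comparing SHA-256 digests of the page content is one possible way to check that uploads agree, not something the patent prescribes.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field

# Step (1): crawler list, crawler ID -> credit score (example values).
crawler_list = {"c1": 0, "c2": 0, "c3": 0}


@dataclass
class CaptureTask:
    url: str
    assigned: set[str]        # step (2): crawler IDs the task was sent to
    opened_at: float          # time at which the window was opened
    window_seconds: float
    uploads: dict[str, str] = field(default_factory=dict)  # ID -> digest

    def is_open(self, now: float) -> bool:
        return now < self.opened_at + self.window_seconds

    def upload(self, crawler_id: str, page: str, now: float) -> bool:
        """Step (3): record an upload; late or unassigned uploads are ignored."""
        if crawler_id not in self.assigned or not self.is_open(now):
            return False
        digest = hashlib.sha256(page.encode("utf-8")).hexdigest()
        self.uploads[crawler_id] = digest
        return True

    def evaluate(self) -> tuple[set[str], bool]:
        """At window close: (crawlers that never reported, all uploads agree)."""
        missing = self.assigned - set(self.uploads)
        consistent = len(set(self.uploads.values())) <= 1
        return missing, consistent


task = CaptureTask("https://example.com/page", set(crawler_list), 0.0, 30.0)
task.upload("c1", "<html>A</html>", now=5.0)
task.upload("c2", "<html>A</html>", now=12.0)
task.upload("c3", "<html>A</html>", now=31.0)  # arrives after the window closed
missing, consistent = task.evaluate()
print(missing, consistent)  # {'c3'} True
```

Exact digest comparison treats any byte difference as a mismatch, so pages with dynamic content (timestamps, ads) would likely need normalization first; the patent does not discuss this.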
Further, if a crawler has failed to report data when the time window closes, its credit score is reduced by 5. For crawlers that reported data before the time window closed: if the reported content is all consistent, each crawler that completed this task gains 1 credit point; if the reported content is inconsistent, the capture task is treated as failed and the crawlers that performed this task are placed on a watch list. The server then re-issues the task, and at the same time the inspection center itself also captures from the website and compares the results re-submitted by the crawlers against its own result. When the next time window closes, a crawler whose captured result is consistent with the inspection center's gains 1 credit point; an inconsistent result is judged to be a false report, and the crawler loses 50 credit points.
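The scoring rules above (+1, -5, -50) can be written down as two small functions, one per round. This is a sketch under assumptions: the function names are hypothetical, uploads are compared as plain strings, and two cases the text does not address are handled arbitrarily here (a crawler that stays silent during the recheck is left unscored, and a crawler is taken off the watch list once it has been rechecked).

```python
MISSED_PENALTY = -5          # no data reported when the window closes
AGREE_REWARD = 1             # consistent result
FALSE_REPORT_PENALTY = -50   # recheck result differs from the inspection center


def score_first_round(credits, assigned, uploads, watch_list):
    """Apply first-round rules; return True if the task must be re-issued."""
    for cid in assigned:
        if cid not in uploads:
            credits[cid] += MISSED_PENALTY
    reporters = [cid for cid in assigned if cid in uploads]
    if len({uploads[cid] for cid in reporters}) <= 1:
        for cid in reporters:
            credits[cid] += AGREE_REWARD
        return False
    # Inconsistent reports: the task fails and all reporters are watched.
    watch_list.update(reporters)
    return True


def score_recheck(credits, uploads, reference, watch_list):
    """Compare re-submitted results with the inspection center's own capture."""
    for cid, content in uploads.items():
        if content == reference:
            credits[cid] += AGREE_REWARD
        else:
            credits[cid] += FALSE_REPORT_PENALTY
        watch_list.discard(cid)  # assumption: rechecked crawlers leave the list


credits = {"c1": 0, "c2": 0, "c3": 0, "c4": 0}
watch_list = set()
# First round: c4 is silent, c3 disagrees with c1 and c2.
reissue = score_first_round(
    credits, ["c1", "c2", "c3", "c4"], {"c1": "A", "c2": "A", "c3": "B"}, watch_list
)
# Recheck: the inspection center captured "A" itself.
score_recheck(credits, {"c1": "A", "c2": "A", "c3": "B"}, "A", watch_list)
print(reissue, credits)
# True {'c1': 1, 'c2': 1, 'c3': -50, 'c4': -5}
```

The asymmetry between the small reward and the large false-report penalty means one confirmed false report outweighs fifty successful tasks, which pushes unreliable clients to the bottom of any credit-ordered selection quickly.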
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiment method can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, includes the steps of the above embodiment method; the storage medium may be a ROM/RAM, a magnetic disk, an optical disc, a memory card, etc. Therefore, corresponding to the method of the present invention, the present invention also includes a detection system for data captured by crowdsourced web crawlers, which is generally expressed in the form of functional modules corresponding to the steps of the method. The system using this method comprises:
a server, configured to act as the inspection center for the capture results of crawler clients;
crawler clients, configured to upload captured page content to the inspection center;
wherein the inspection center compares the content uploaded by multiple crawler clients. If the results are identical, a credit score is added to each crawler client; if the results differ, the task is issued once more and these crawler clients are checked again to distinguish good from bad, after which credit scores are added or deducted accordingly. The credit score represents the reliability of a crawler client, and crawler clients with high credit scores are preferentially selected to complete capture tasks.
The above are only preferred embodiments of the present invention and do not limit the present invention in any form. Any simple modification, equivalent change, or variation made to the above embodiments according to the technical essence of the present invention still falls within the protection scope of the technical solution of the present invention.

Claims (4)

1. A detection method for data captured by crowdsourced web crawlers, characterized in that: a server acts as the inspection center for the capture results of crawler clients; each crawler client uploads the page content it captured to the inspection center; the inspection center compares the content uploaded by multiple crawler clients; if the results are identical, a credit score is added to each crawler client; if the results differ, the task is issued once more and these crawler clients are checked again to distinguish good from bad, after which credit scores are added or deducted accordingly; the credit score represents the reliability of a crawler client, and crawler clients with high credit scores are preferentially selected to complete capture tasks.
2. The detection method for data captured by crowdsourced web crawlers according to claim 1, characterized in that the method comprises the following steps:
(1) the inspection center maintains a crawler list, and each entry in the crawler list contains a crawler ID and the crawler's credit score;
(2) the inspection center distributes a capture task to multiple crawlers with different IP addresses and at the same time opens a time window, waiting for the crawlers to upload data;
(3) within the time window, the data uploaded by the crawlers is received, and it is evaluated when the time window closes.
3. The detection method for data captured by crowdsourced web crawlers according to claim 2, characterized in that: if a crawler has failed to report data when the time window closes, its credit score is reduced by 5; for crawlers that reported data before the time window closed: if the reported content is all consistent, each crawler that completed this task gains 1 credit point; if the reported content is inconsistent, the capture task is treated as failed and the crawlers that performed this task are placed on a watch list; the server re-issues the task, and at the same time the inspection center itself also captures from the website and compares the results re-submitted by the crawlers against its own result; when the next time window closes, a crawler whose captured result is consistent with the inspection center's gains 1 credit point, and an inconsistent result is judged to be a false report, with the crawler losing 50 credit points.
4. A detection system for data captured by crowdsourced web crawlers, characterized in that the system comprises:
a server, configured to act as the inspection center for the capture results of crawler clients;
crawler clients, configured to upload captured page content to the inspection center;
wherein the inspection center compares the content uploaded by multiple crawler clients; if the results are identical, a credit score is added to each crawler client; if the results differ, the task is issued once more and these crawler clients are checked again to distinguish good from bad, after which credit scores are added or deducted accordingly; the credit score represents the reliability of a crawler client, and crawler clients with high credit scores are preferentially selected to complete capture tasks.
CN201610737578.4A 2016-08-26 2016-08-26 Detection method and system for data captured by crowdsourced web crawlers Active CN106326447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610737578.4A CN106326447B (en) 2016-08-26 2016-08-26 Detection method and system for data captured by crowdsourced web crawlers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610737578.4A CN106326447B (en) 2016-08-26 2016-08-26 Detection method and system for data captured by crowdsourced web crawlers

Publications (2)

Publication Number Publication Date
CN106326447A 2017-01-11
CN106326447B 2019-06-21

Family

ID=57790974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610737578.4A Active CN106326447B (en) 2016-08-26 2016-08-26 Detection method and system for data captured by crowdsourced web crawlers

Country Status (1)

Country Link
CN (1) CN106326447B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228431A (en) * 2018-01-04 2018-06-29 北京中关村科金技术有限公司 Method and system for configurable crawler quality monitoring

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1601528A (en) * 2003-09-25 2005-03-30 微软公司 Systems and methods for client-based web crawling
CN101051313A (en) * 2007-05-09 2007-10-10 崔志明 Integrated data source finding method for deep layer net page data source
CN102364473A (en) * 2011-11-09 2012-02-29 中国科学院自动化研究所 Netnews search system and method based on geographic information and visual information
EP2447893A2 (en) * 2010-10-29 2012-05-02 Fujitsu Limited Technique for stateless distributed parallel crawling of interactive client-server applications
CN103544314A (en) * 2013-11-04 2014-01-29 北京中搜网络技术股份有限公司 Searching data quality statistical method
CN103605764A (en) * 2013-11-26 2014-02-26 Tcl集团股份有限公司 Web crawler system and web crawler multitask executing and scheduling method
CN103793503A (en) * 2014-01-24 2014-05-14 北京理工大学 Opinion mining and classification method based on web texts
CN103955529A (en) * 2014-05-12 2014-07-30 中国科学院计算机网络信息中心 Internet information searching and aggregating presentation method
CN104484405A (en) * 2014-12-15 2015-04-01 北京国双科技有限公司 Method and device for carrying out crawling task
CN104899323A (en) * 2015-06-19 2015-09-09 成都国腾实业集团有限公司 Crawler system used for IDC harmful information monitoring platform
CN104951512A (en) * 2015-05-27 2015-09-30 中国科学院信息工程研究所 Public sentiment data collection method and system based on Internet
CN105117501A (en) * 2015-10-09 2015-12-02 广州神马移动信息科技有限公司 Web crawler scheduling method and web crawler system applying same
CN105279249A (en) * 2015-09-30 2016-01-27 北京奇虎科技有限公司 Method and device for determining confidence of point of interest data in website
CN105279272A (en) * 2015-10-30 2016-01-27 南京未来网络产业创新有限公司 Content aggregation method based on distributed web crawlers
CN105447088A (en) * 2015-11-06 2016-03-30 杭州掘数科技有限公司 Volunteer computing based multi-tenant professional cloud crawler
CN105556545A (en) * 2013-03-15 2016-05-04 美国结构数据有限公司 Apparatus, systems, and methods for crowdsourcing domain specific intelligence
CN105740262A (en) * 2014-12-10 2016-07-06 深圳先进技术研究院 Crowdsourcing mode-based data acquisition sharing system and method
CN105893416A (en) * 2015-12-01 2016-08-24 乐视网信息技术(北京)股份有限公司 Data service system

Also Published As

Publication number Publication date
CN106326447B (en) 2019-06-21

Similar Documents

Publication Publication Date Title
JP6503357B2 (en) Approve payment by reading QR code generated by separate user or device
CN105426415A (en) Management method, device and system of website access request
CN104767775B (en) Web application information push method and system
CN102870118B (en) Access method, device and system to user behavior
US9692852B2 (en) Uploading a form attachment
CN103425486B (en) Use the method and system of the remote card Content Management of sync server end script
CN106294648A (en) A kind of processing method and processing device for page access path
CN107347076A (en) The detection method and device of SSRF leaks
CN109639740A (en) A kind of login state sharing method and device based on device id
CN108073703A (en) A kind of comment information acquisition methods, device, equipment and storage medium
CN106326447A (en) Detection method and system of data captured by crowd sourcing network crawlers
US20160162984A1 (en) Processing unstructured messages
CN106686104B (en) Method and equipment for operation and maintenance of target server
CN107391714A (en) A kind of screenshot method, capture server, sectional drawing service system and medium
CN105227532B (en) A kind of blocking-up method and device of malicious act
CN110276202A (en) A kind of detection method and device of unserializing loophole
CN110035075A (en) Detection method, device, computer equipment and the storage medium of fishing website
CN105281963A (en) nginx server vulnerability detection method and device
CN102215146B (en) Webpage downloading monitoring method and device
CN108924159A (en) The verification method and device in a kind of message characteristic identification library
CN108322427A (en) A kind of method and apparatus carrying out air control to access request
CN108134803A (en) A kind of URL attack guarding methods and device
CN107133779A (en) A kind of active method, system and the browser plug-in for collecting resume of multi-domain communication
US10102512B1 (en) Systems and methods for financial data transfer
US9680917B2 (en) Method, apparatus, and system of opening a web page

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Dong Chao

Inventor before: Zhou Hao

Inventor before: Dong Chao

TR01 Transfer of patent right

Effective date of registration: 20200415

Address after: No.13, row 13, shagelong village, Jiangzhuang Township, Yuanyang County, Xinxiang City, Henan Province

Patentee after: Silver respectful

Address before: 100080 Haidian District Danleng street Beijing City No. 1 Internet Financial Center 11 1102

Patentee before: BEIJING LIANGKEBANG INFORMATION TECHNOLOGY Co.,Ltd.