CN102932448A - Distributed network crawler URL (uniform resource locator) duplicate removal system and method - Google Patents

Publication number
CN102932448A
Authority
CN
China
Prior art keywords
url
crawler
deduplication
child node
central server
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210425213XA
Other languages
Chinese (zh)
Other versions
CN102932448B (en)
Inventor
刘述
徐贵宝
江文学
何宝宏
高强
赵劲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Research Institute of Telecommunications Transmission Ministry of Industry and Information Technology
Original Assignee
Research Institute of Telecommunications Transmission Ministry of Industry and Information Technology
Application filed by Research Institute of Telecommunications Transmission Ministry of Industry and Information Technology
Priority to CN201210425213.XA
Publication of CN102932448A
Application granted
Publication of CN102932448B
Legal status: Active

Abstract

The invention provides a distributed network crawler URL (uniform resource locator) duplicate removal system and method. The system comprises crawler collection sub-nodes, a central server and a database server. The method comprises the following steps that: each crawler collection sub-node registers to the central server; the crawler collection sub-node obtains a URL from a database waiting queue and obtains new URL information from the URL; the crawler collection sub-node performs first-level duplicate removal on the newly obtained URL; if the first-level duplicate removal fails, the URL is discarded; if the first-level duplicate removal succeeds, the newly obtained URL is added to a local URL abstract table and sent to the central server; the central server performs second-level duplicate removal on the newly obtained URL; if the second-level duplicate removal succeeds, the URL is added to a global URL abstract table; and the crawler collection sub-node adds the link of the URL to the waiting queue. According to the system and method provided by the invention, the duplicate removal task concentrated on the central node can be decomposed to the crawler collection sub-nodes through the first-level duplicate removal by a hierarchical duplicate removal mechanism, and the central server maintains a global duplicate removal table through a second-level duplicate removal mode, thus the system expansion is remarkably facilitated, and the system design, deployment and operation are very flexible and convenient.

Description

URL deduplication system and method for a distributed web crawler
Technical field
The present invention relates to the Internet field, and in particular to a URL deduplication system and method for a distributed web crawler based on a distributed architecture.
Background art
At present, distributed web crawlers can be divided into three modes according to how their nodes communicate: master-slave mode, autonomous mode and hybrid mode.
In master-slave mode, one host acts as the control node and manages all the hosts that run crawlers. A crawler only needs to receive tasks from the control node and submit newly generated tasks back to it; it does not need to communicate with other crawlers, so this mode is simple to implement and easy to manage. The control node, in turn, must communicate with every crawler, and it needs an address list that stores information about all crawlers in the system. When the number of crawlers in the system changes, the coordinator needs to update the data in the address list, and this process is transparent to the crawlers. However, as the number of crawled web pages grows, the control node can become the bottleneck of the whole system and degrade the performance of the entire distributed crawler system.
In autonomous mode there is no coordinator in the system, and all crawlers must communicate with one another, which makes each crawler more complex than in master-slave mode. Each crawler in this mode maintains an address list that stores the locations of all crawlers in the system, and each communication sends data directly to the crawler that needs it. When the number of crawlers in the system changes, the address list of every crawler must be updated.
Hybrid mode is a compromise that combines features of the two modes above. All crawlers can communicate with one another, and all of them have a task-distribution function. One special crawler, however, centrally distributes the tasks that could not be assigned by the crawlers' own task distribution. Each crawler in this mode only needs to maintain the address list for its own collection range, while the special crawler must keep, in addition to the address list for its own collection range, the address list needed for centralized distribution.
URL deduplication is a core part of a web crawler system. Four URL deduplication approaches are currently in common use in web crawlers: database-based, in-memory hash-based, disk path-based and Bloom filter-based. Database-based deduplication works by traversing every record in the database. It is very stable but very inefficient, and database performance drops sharply once the number of records exceeds one million, so its time cost becomes the bottleneck of the whole distributed system. In-memory hash-based deduplication uses a hash table held in memory, but it consumes too much memory. Disk path-based deduplication MD5-encodes each link and creates a corresponding path on disk. It is time-efficient, but it is not suitable for a large-scale distributed crawler system that crawls the whole web. Bloom filter-based deduplication is a fairly common scheme. It is fast and saves space, but it has a certain false-positive rate, and it is also an in-memory approach.
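The Bloom filter approach mentioned above can be illustrated with a minimal sketch. The class name `BloomFilter`, the bit-array size, the number of hash functions and the use of salted MD5 to derive bit positions are all assumptions made for illustration; the source does not specify any of them.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter for URL strings (illustrative sketch only)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url: str):
        # Derive one bit position per hash index by salting MD5 (an assumed choice).
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


seen = BloomFilter()
seen.add("http://example.com/a")
print("http://example.com/a" in seen)  # an added URL is always reported as seen
```

A URL that was never added can occasionally be reported as seen, which is the false-positive behaviour the text refers to; a URL that was added is never missed.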
An ordinary distributed web crawler may choose the more stable database-based deduplication, but its efficiency is too low for large data volumes. In master-slave mode, deduplication is concentrated on the central control node: each collection sub-node only processes the tasks assigned to it, and deduplication has already been done centrally before tasks are assigned, which places high demands on the central control node.
Summary of the invention
In view of the above problems in the prior art, the object of the present invention is to provide a new URL deduplication system and method for a distributed web crawler.
To achieve this object, the present invention adopts the following technical solution:
A URL deduplication system for a distributed web crawler, characterized by comprising crawler collection sub-nodes, a central server and a database server, wherein:
The crawler collection sub-nodes are deployed on multiple servers within the network and perform first-level deduplication;
The central server performs second-level deduplication; each crawler collection sub-node must register with the central server and obtain authentication; the central server maintains an address list that records the address information of the currently active crawler collection sub-nodes;
The database server distributes crawl tasks and stores collection results; each crawler collection sub-node obtains tasks from the waiting queue on the database server and, after processing, stores its collection results in the database server.
Further, in the above URL deduplication system for a distributed web crawler, each crawler collection sub-node internally stores and maintains a local URL digest table.
Further, in the above URL deduplication system for a distributed web crawler, the central server stores and maintains a global URL digest table.
In another aspect, the present invention provides a URL deduplication method for a distributed web crawler, for use with any of the above URL deduplication systems, characterized by comprising the following steps:
Step 1: the crawler collection sub-node registers with the central server;
Step 2: the crawler collection sub-node requests a new task from the database server, that is, it obtains a URL from the database waiting queue and obtains new URL information from that URL;
Step 3: the crawler collection sub-node performs first-level deduplication on the newly obtained URL information locally. If first-level deduplication fails, the URL information is discarded and the sub-node obtains a task again, returning to step 2. If first-level deduplication passes, the sub-node adds the URL information to its local URL digest table and proceeds to step 4;
Step 4: the crawler collection sub-node sends the URL information to the central server;
Step 5: the central server performs second-level deduplication on the URL information. If second-level deduplication fails, the URL information is discarded, and the central server notifies the crawler collection sub-node to obtain a task again, returning to step 2. If second-level deduplication passes, proceed to step 6;
Step 6: the central server adds the URL information to the global URL digest table and notifies the crawler collection sub-node;
Step 7: the crawler collection sub-node adds the link of this URL information to the waiting queue.
Further, in the above URL deduplication method for a distributed web crawler, step 1 also comprises: the central server establishing and storing a global URL digest table, and each crawler collection sub-node establishing and storing a local URL digest table.
Further, in any of the above URL deduplication methods, in the first-level deduplication of step 3, the URL information parsed from the new link is compared with the URL information recorded in the local URL digest table stored in the crawler collection sub-node. If identical URL information is found, first-level deduplication fails; if no identical URL information is found, first-level deduplication passes.
Further, in any of the above URL deduplication methods, in the second-level deduplication of step 5, the URL information parsed from the new link is compared with the URL information recorded in the global URL digest table stored in the central server. If identical URL information is found, second-level deduplication fails; if no identical URL information is found, second-level deduplication passes.
Further, in any of the above URL deduplication methods, the central server aggregates the global URL digest table through the second-level deduplication mechanism, and each crawler collection sub-node learns from its interaction with the central server through that mechanism, so that the URL digest table it maintains converges toward that of the central server, which reduces the communication traffic with the central server.
Further, in any of the above URL deduplication methods, the following step is also included before step 1:
Step 0: deploy multiple crawler collection sub-nodes on multiple servers within the network.
With the URL deduplication system and method provided by the present invention, the hierarchical deduplication mechanism decomposes the deduplication task that was originally concentrated on the central node (central server) across the crawler collection sub-nodes, so the load on the central node does not become excessive. The central node also no longer performs task distribution, so communication between the central node and the sub-nodes becomes very simple and no complex communication protocol needs to be designed. Because the central server monitors the state of each crawler collection sub-node, the system is easy to expand, and its design, deployment and operation become flexible and convenient, with good cost-effectiveness.
Brief description of the drawings
Fig. 1 is a schematic diagram of the architecture of a URL deduplication system for a distributed web crawler;
Fig. 2 is a flow block diagram of a URL deduplication method for a distributed web crawler.
Detailed description of the embodiments
To make the purpose, technical solution and advantages of the present invention clearer, the present invention is further described below with reference to embodiments and the accompanying drawings. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
An embodiment of the present invention provides a URL deduplication system for a distributed web crawler, whose architecture is shown schematically in Fig. 1. The system comprises crawler collection sub-nodes (crawler sub-nodes), a central server and a database server, wherein:
The crawler collection sub-nodes are deployed on multiple servers within the network and perform first-level deduplication;
The central server performs second-level deduplication; each crawler collection sub-node must register with the central server and obtain authentication; the central server maintains an address list that records the address information of the currently active crawler collection sub-nodes;
The database server distributes crawl tasks and stores collection results; each crawler collection sub-node obtains tasks from the waiting queue on the database server and, after processing, stores its collection results in the database server.
In this embodiment, multiple crawler collection sub-nodes are deployed on multiple servers within the network, and each sub-node has its own unique identifier that represents its "identity". The distributed crawler URL deduplication system of this embodiment has no central control node: each crawler collection sub-node obtains URL information by itself from the database server rather than receiving crawl tasks from a central control node. The crawler collection sub-nodes therefore do not need to communicate with one another and can operate independently. A sub-node only needs to "register" with the central server when it joins the system, which makes it easy for nodes to join and leave and gives the system good scalability.
The central server in this embodiment acts as the aggregation node for distributed deduplication, and each crawler collection sub-node must first register with the central server when it joins the system. The central server does not distribute tasks; it is only responsible for second-level URL deduplication, so the demands on it are lower. In this role the central server is merely a data aggregation center: it communicates with each crawler collection sub-node to complete the overall deduplication task.
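The registration step and the address list of active sub-nodes can be sketched as follows. This is a minimal in-process illustration, not the patent's implementation: the names `CentralServer`, `register` and `unregister`, the use of a random UUID as the unique identifier, and the example addresses are all hypothetical, and authentication is omitted.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CentralServer:
    # Address list: node identifier -> address of each currently active sub-node.
    active_nodes: dict = field(default_factory=dict)

    def register(self, address: str) -> str:
        """Record a joining sub-node and return its unique identifier (assumed UUID)."""
        node_id = uuid.uuid4().hex
        self.active_nodes[node_id] = address
        return node_id

    def unregister(self, node_id: str) -> None:
        """Remove a sub-node that leaves the system; unknown ids are ignored."""
        self.active_nodes.pop(node_id, None)


server = CentralServer()
node_a = server.register("10.0.0.11:9000")  # hypothetical address
node_b = server.register("10.0.0.12:9000")  # hypothetical address
server.unregister(node_a)
print(sorted(server.active_nodes.values()))
```

Because sub-nodes never talk to each other, joining or leaving only touches this one table on the central server.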
This embodiment uses the database server to distribute crawl tasks and store collection results. Placing the waiting queue in the database makes the crawl process easy to control. The database server thus serves to distribute and relay tasks, and the crawler collection sub-nodes can interact through it.
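One way to picture a waiting queue kept in a database is the sketch below. It uses an in-memory SQLite database purely for illustration; the table name `waiting_queue`, the function names `enqueue` and `fetch_task`, and the choice of SQLite are assumptions, since the source does not name a database product or schema. A real multi-node deployment would also need proper locking so that two sub-nodes cannot take the same row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE waiting_queue ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " url TEXT NOT NULL UNIQUE)"
)


def enqueue(url: str) -> None:
    # INSERT OR IGNORE keeps the queue free of duplicate links.
    with conn:
        conn.execute("INSERT OR IGNORE INTO waiting_queue (url) VALUES (?)", (url,))


def fetch_task():
    # Take the oldest URL and delete it so it is handed out only once.
    with conn:
        row = conn.execute(
            "SELECT id, url FROM waiting_queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("DELETE FROM waiting_queue WHERE id = ?", (row[0],))
        return row[1]


enqueue("http://example.com/")
enqueue("http://example.com/")      # ignored: already queued
enqueue("http://example.com/news")
print(fetch_task())
```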
Further, each crawler collection sub-node internally stores and maintains a local URL digest table, which is used for first-level deduplication.
Further, the central server stores and maintains a global URL digest table, which is used for second-level deduplication.
The deduplication scheme of the present invention is carried out on digest tables held in memory. Each collection sub-node internally maintains such a local URL digest table for first-level deduplication. When a new link has been obtained from the waiting queue and filtered, it enters the first-level deduplication stage, in which the collection node first checks the link against its local digest table. If first-level deduplication passes, the collection node sends the digest of the link to the central server for second-level deduplication and also writes the URL digest into its local digest table. If first-level deduplication fails, second-level deduplication is not needed. The central server also maintains a global digest table, which is obtained by "aggregating" the data of the sub-nodes' local digest tables and is therefore more complete. A link is treated as new and added to the work queue for processing if and only if second-level deduplication passes.
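An in-memory digest table of the kind described above might look like the following sketch. The source does not say which digest function the tables use, so MD5 here is an assumption, as are the names `url_digest`, `DigestTable` and `check_and_add`.

```python
import hashlib


def url_digest(url: str) -> str:
    # MD5 is an assumed digest function; any fixed-length hash would do.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


class DigestTable:
    """In-memory table of URL digests, usable as a local or a global table."""

    def __init__(self) -> None:
        self._digests = set()

    def check_and_add(self, digest: str) -> bool:
        """Return True (deduplication passes) if the digest is new, and record it."""
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True

    def __len__(self) -> int:
        return len(self._digests)


local_table = DigestTable()
d = url_digest("http://example.com/page")
print(local_table.check_and_add(d))  # first sighting passes
print(local_table.check_and_add(d))  # second sighting fails
```

Storing fixed-length digests instead of full URLs keeps the memory cost per entry constant regardless of URL length.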
In another aspect, an embodiment of the present invention provides a URL deduplication method for a distributed web crawler that uses any of the above URL deduplication systems. The flow block diagram of the method is shown in Fig. 2, and the method comprises the following steps:
Step 1: the crawler collection sub-node (collection sub-node) registers with the central server;
Step 2: the crawler collection sub-node requests a new task from the database server, that is, it obtains a URL from the database waiting queue and obtains new URL information from that URL;
Step 3: the crawler collection sub-node performs first-level deduplication on the newly obtained URL information locally. If first-level deduplication fails, the URL information is discarded and the sub-node obtains a task again, returning to step 2. If first-level deduplication passes, the sub-node adds the URL information to its local URL digest table and proceeds to step 4;
Step 4: the crawler collection sub-node sends (uploads) the URL information to the central server;
Step 5: the central server performs second-level deduplication on the URL information. If second-level deduplication fails, the URL information is discarded, and the central server notifies the crawler collection sub-node to obtain a task again, returning to step 2. If second-level deduplication passes, proceed to step 6;
Step 6: the central server adds the URL information to the global URL digest table and notifies the crawler collection sub-node;
Step 7: the crawler collection sub-node adds the link of this URL information to the waiting queue.
In the above process, one complete deduplication pass comprises two levels of deduplication. The specific steps are as follows. First, the collection sub-node obtains a new link from the database waiting queue. Because the links in the database server waiting queue are guaranteed to be unique, the sub-node requests and processes the link and parses out new URLs; each new URL is added to the waiting queue only after it has been deduplicated. The collection sub-node first performs first-level deduplication on a new URL locally, that is, it checks whether the URL is already in the local URL digest table. If it is, deduplication fails, the URL is discarded, and the sub-node goes on to obtain and process a new URL. If the local URL digest table does not contain the digest, deduplication passes and the collection sub-node adds the URL to its local digest table. The collection sub-node then sends the digest of the URL to the central server, which performs second-level deduplication. If the digest already exists on the central server, the link has already been processed, and the central server notifies the crawler to obtain a new URL. If second-level deduplication passes, the central server adds the URL digest to its own deduplication table and the collection sub-node then adds the link to the waiting queue, which completes one round of deduplication.
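The two-level flow described above can be sketched as a small in-process simulation. Everything here is a simplification under stated assumptions: the central server and the sub-nodes are plain Python objects rather than networked processes, the waiting queue is a `deque` standing in for the database queue, `extract_links` is a hypothetical stand-in for fetching and parsing a page, MD5 is an assumed digest function, and registration (step 1) is omitted.

```python
import hashlib
from collections import deque


def digest(url: str) -> str:
    # MD5 is an assumed digest function; the source does not name one.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


class CentralServer:
    def __init__(self) -> None:
        self.global_table = set()

    def second_level(self, url_digest: str) -> bool:
        """Steps 5-6: pass only if the digest is new globally, then record it."""
        if url_digest in self.global_table:
            return False
        self.global_table.add(url_digest)
        return True


class SubNode:
    def __init__(self, server, queue, extract_links) -> None:
        self.server = server
        self.queue = queue
        self.extract_links = extract_links  # hypothetical page parser
        self.local_table = set()

    def run_once(self):
        """Process one queued URL; return it, or None if the queue is empty."""
        if not self.queue:
            return None
        url = self.queue.popleft()                # step 2: obtain a task
        for link in self.extract_links(url):
            d = digest(link)
            if d in self.local_table:             # step 3: first level fails
                continue
            self.local_table.add(d)               # step 3: first level passes
            if self.server.second_level(d):       # steps 4-6: ask the server
                self.queue.append(link)           # step 7: enqueue the link
        return url


# A tiny made-up link graph standing in for real pages.
WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/a"],
    "http://example.com/c": [],
}

queue = deque(["http://example.com/"])
server = CentralServer()
nodes = [SubNode(server, queue, lambda u: WEB.get(u, [])) for _ in range(2)]

crawled = []
while queue:
    for node in nodes:
        done = node.run_once()
        if done is not None:
            crawled.append(done)

print(crawled)
```

In this run the second sub-node finds `/b` new locally but is rejected by the central server, while the first sub-node later rejects `/a` locally without contacting the server at all, so each page is crawled once.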
Further, in the above method, in step 1 the central server establishes and stores a global URL digest table, and each crawler collection sub-node establishes and stores a local URL digest table.
Further, in the above method, in the first-level deduplication of step 3, the URL information parsed from the new link is compared with the URL information recorded in the local URL digest table stored in the crawler collection sub-node. If identical URL information is found, first-level deduplication fails; if no identical URL information is found, first-level deduplication passes.
Further, in the above method, in the second-level deduplication of step 5, the URL information parsed from the new link is compared with the URL information recorded in the global URL digest table stored in the central server. If identical URL information is found, second-level deduplication fails; if no identical URL information is found, second-level deduplication passes.
Further, in any of the above URL deduplication methods, the central server aggregates the global URL digest table through the second-level deduplication mechanism, and each crawler collection sub-node learns from its interaction with the central server through that mechanism, so that the URL digest table it maintains converges toward that of the central server, which reduces the communication traffic with the central server.
Initially, the global deduplication table maintained by the central server (the global URL digest table) is the union of the sub-nodes' deduplication tables (the local URL digest tables). After some time, the deduplication tables of the crawler collection sub-nodes converge, so second-level deduplication is needed less and less often. In the ideal case a result can be obtained from first-level deduplication alone, which greatly improves efficiency.
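The drop in traffic to the central server as a local table fills up can be shown with a toy count. This is only an illustration of the trend the text describes; the function name `count_central_requests` and the sample link names are made up, and plain strings stand in for URL digests.

```python
def count_central_requests(links, local_table):
    """Count how many links must be sent to the central server."""
    requests = 0
    for link in links:
        if link in local_table:
            continue            # resolved locally by first-level deduplication
        local_table.add(link)
        requests += 1           # digest goes to the server for second-level deduplication
    return requests


local_table = set()
batch = ["u1", "u2", "u1", "u3", "u2"]

first_pass = count_central_requests(batch, local_table)
second_pass = count_central_requests(batch, local_table)
print(first_pass, second_pass)
```

On the first pass only the three distinct links reach the server; on the second pass the local table already knows all of them, so no request is sent.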
Further, in the above method, the following step is also included before step 1:
Step 0: deploy multiple crawler collection sub-nodes on multiple servers within the network.
With the URL deduplication system and method provided by the above embodiments, the hierarchical deduplication mechanism decomposes the deduplication task that was originally concentrated on the central node (central server) across the crawler collection sub-nodes, so the load on the central node does not become excessive. The central node also no longer performs task distribution, so communication between the central node and the sub-nodes becomes very simple and no complex communication protocol needs to be designed. Because the central server monitors the state of the sub-nodes, the system is easy to expand, and its design, deployment and operation become flexible and convenient, with good cost-effectiveness.
The distributed web crawler architecture and its deduplication scheme can meet the needs of a variety of targeted information collection tasks, and they are easy to deploy, flexible to configure and highly scalable.
The above embodiments express only some implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the claims. It should be noted that a person of ordinary skill in the art can make variations and improvements without departing from the inventive concept, and these all fall within the scope of protection of the present invention. The scope of protection of this patent shall therefore be determined by the appended claims.

Claims (9)

1. A URL deduplication system for a distributed web crawler, characterized by comprising crawler collection sub-nodes, a central server and a database server, wherein:
The crawler collection sub-nodes are deployed on multiple servers within the network and perform first-level deduplication;
The central server performs second-level deduplication; each crawler collection sub-node must register with the central server and obtain authentication; the central server maintains an address list that records the address information of the currently active crawler collection sub-nodes;
The database server distributes crawl tasks and stores collection results; each crawler collection sub-node obtains tasks from the waiting queue on the database server and, after processing, stores its collection results in the database server.
2. The URL deduplication system for a distributed web crawler according to claim 1, characterized in that each crawler collection sub-node internally stores and maintains a local URL digest table.
3. The URL deduplication system for a distributed web crawler according to claim 1 or 2, characterized in that the central server stores and maintains a global URL digest table.
4. A URL deduplication method for a distributed web crawler, for use with the system of any one of claims 1-3, characterized by comprising the following steps:
Step 1: the crawler collection sub-node registers with the central server;
Step 2: the crawler collection sub-node requests a new task from the database server, that is, it obtains a URL from the database waiting queue and obtains new URL information from that URL;
Step 3: the crawler collection sub-node performs first-level deduplication on the newly obtained URL information locally; if first-level deduplication fails, the URL information is discarded and the sub-node obtains a task again, returning to step 2; if first-level deduplication passes, the sub-node adds the URL information to its local URL digest table and proceeds to step 4;
Step 4: the crawler collection sub-node sends the URL information to the central server;
Step 5: the central server performs second-level deduplication on the URL information; if second-level deduplication fails, the URL information is discarded, and the central server notifies the crawler collection sub-node to obtain a task again, returning to step 2; if second-level deduplication passes, proceed to step 6;
Step 6: the central server adds the URL information to the global URL digest table and notifies the crawler collection sub-node;
Step 7: the crawler collection sub-node adds the link of this URL information to the waiting queue.
5. The URL deduplication method for a distributed web crawler according to claim 4, characterized in that step 1 further comprises: the central server establishing and storing a global URL digest table; and each crawler collection sub-node establishing and storing a local URL digest table.
6. The URL deduplication method for a distributed web crawler according to claim 5, characterized in that, in the first-level deduplication of step 3, the URL information parsed from the new link is compared with the URL information recorded in the local URL digest table stored in the crawler collection sub-node; if identical URL information is found, first-level deduplication fails, and if no identical URL information is found, first-level deduplication passes.
7. The URL deduplication method for a distributed web crawler according to claim 6, characterized in that, in the second-level deduplication of step 5, the URL information parsed from the new link is compared with the URL information recorded in the global URL digest table stored in the central server; if identical URL information is found, second-level deduplication fails, and if no identical URL information is found, second-level deduplication passes.
8. The URL deduplication method for a distributed web crawler according to claim 7, characterized in that the central server aggregates the global URL digest table through the second-level deduplication mechanism, and each crawler collection sub-node learns from its interaction with the central server through that mechanism, so that the URL digest table it maintains converges toward that of the central server, which reduces the communication traffic with the central server.
9. The URL deduplication method for a distributed web crawler according to any one of claims 4-8, characterized in that, before step 1, the method further comprises the following step:
Step 0: deploy multiple crawler collection sub-nodes on multiple servers within the network.
CN201210425213.XA 2012-10-30 2012-10-30 URL deduplication system and method for a distributed web crawler Active CN102932448B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210425213.XA CN102932448B (en) 2012-10-30 2012-10-30 URL deduplication system and method for a distributed web crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210425213.XA CN102932448B (en) 2012-10-30 2012-10-30 URL deduplication system and method for a distributed web crawler

Publications (2)

Publication Number Publication Date
CN102932448A CN102932448A (en) 2013-02-13
CN102932448B CN102932448B (en) 2016-04-27

Family

ID=47647145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210425213.XA Active CN102932448B (en) 2012-10-30 2012-10-30 URL deduplication system and method for a distributed web crawler

Country Status (1)

Country Link
CN (1) CN102932448B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080235163A1 (en) * 2007-03-22 2008-09-25 Srinivasan Balasubramanian System and method for online duplicate detection and elimination in a web crawler
CN102271331A (en) * 2010-06-02 2011-12-07 中国移动通信集团广东有限公司 Method and system for detecting reliability of service provider (SP) site
CN102314463A (en) * 2010-07-07 2012-01-11 北京瑞信在线系统技术有限公司 Distributed crawler system and webpage data extraction method for the same
CN102073683A (en) * 2010-12-22 2011-05-25 四川大学 Distributed real-time news information acquisition system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
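吴小惠 (Wu Xiaohui): "Improvement of the URL deduplication strategy for distributed web crawlers" (分布式网络爬虫URL去重策略的改进), Journal of Pingdingshan University (《平顶山学院学报》) *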

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103475688A (en) * 2013-05-24 2013-12-25 北京网秦天下科技有限公司 Distributed method and distributed system for downloading website data
CN104063506B (en) * 2014-07-08 2017-04-12 百度在线网络技术(北京)有限公司 Method and device for identifying repeated web pages
CN104063506A (en) * 2014-07-08 2014-09-24 百度在线网络技术(北京)有限公司 Method and device for identifying repeated web pages
CN104408182A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Method and device for processing web crawler data on distributed system
CN106528567A (en) * 2015-09-11 2017-03-22 北京国双科技有限公司 Method and device for updating web crawler cluster information
CN106528567B (en) * 2015-09-11 2019-11-12 北京国双科技有限公司 Method and device for updating web crawler cluster information
CN106506673B (en) * 2016-11-25 2019-08-02 国信优易数据有限公司 Large-scale distributed data management system and method
CN106506673A (en) * 2016-11-25 2017-03-15 国信优易数据有限公司 Large-scale distributed data management system and method
CN107066526A (en) * 2017-02-23 2017-08-18 武汉智寻天下科技有限公司 Web crawler system and method
CN107329969A (en) * 2017-05-23 2017-11-07 合肥智权信息科技有限公司 Data information update system and method based on repeated verification
CN109359231A (en) * 2017-12-29 2019-02-19 广州Tcl智能家居科技有限公司 Information crawling method, server and storage medium for a distributed web crawler
CN108415941A (en) * 2018-01-29 2018-08-17 湖北省楚天云有限公司 Web crawler method, apparatus and electronic device
CN111064713A (en) * 2019-02-15 2020-04-24 腾讯科技(深圳)有限公司 Node control method and related device in distributed system
CN111064713B (en) * 2019-02-15 2021-05-25 腾讯科技(深圳)有限公司 Node control method and related device in distributed system

Also Published As

Publication number Publication date
CN102932448B (en) 2016-04-27

Similar Documents

Publication Publication Date Title
CN102932448B (en) The URL re-scheduling system and method for a kind of distributed network reptile
CN104618693A (en) Cloud computing based online processing task management method and system for monitoring video
CN105635283A (en) Organization and management and using method and system for cloud manufacturing service
CN102902669B Distributed information crawling method based on an Internet system
CN104966006A (en) Intelligent face identification system based on cloud variation platform
CN106304230B Wireless ad hoc networking method and device based on instant routing
CN101840432A (en) Data mining device based on Deep Web deep dynamic data and method thereof
CN103886508A (en) Mass farmland data monitoring method and system
CN107133273A Transit route data processing method and server cluster based on big data
CN102325061A (en) Method for monitoring network, equipment and system
CN103744365B (en) Bridging module for communication between room control terminal and upper computer and method thereof
CN104301131A (en) Fault management method and device
CN109074287A (en) Infrastructure resources state
CN104866528B (en) Multi-platform collecting method and system
CN103699556A (en) Digital local chronicle information system for compiling local chronicle and geographical information
CN102802166A (en) Improved Zigbee network lamination method
CN101692737B (en) Light weight data synchronization system and method
CN104331322B Process migration method and apparatus
Zhao et al. An integrated processing platform for traffic sensor data and its applications in intelligent transportation systems
CN102148757B Multi-core system message distribution method and device
CN104202181B Command-line-based network equipment performance management device and method
CN203951500U Wireless sensor network based on P2P technology
CN103634411B Marketing data real-time broadcasting system and method with state consistency
CN106878029A Network data auditing system and method
CN105468684A (en) Sensitive word filtering system and communication method thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant