CN102932448B

CN102932448B - URL deduplication system and method for a distributed web crawler

Info

Publication number: CN102932448B
Application number: CN201210425213.XA
Authority: CN
Inventors: 刘述; 徐贵宝; 江文学; 何宝宏; 高强; 赵劲
Original assignee: Research Institute of Telecommunications Transmission Ministry of Industry and Information Technology
Current assignee: China Academy of Information and Communications Technology CAICT
Priority date: 2012-10-30
Filing date: 2012-10-30
Publication date: 2016-04-27
Anticipated expiration: 2032-10-30
Also published as: CN102932448A

Abstract

A URL deduplication system and method for a distributed web crawler. The system comprises crawler collection sub-nodes, a central server, and a database server. The method comprises: a crawler collection sub-node registers with the central server; the sub-node obtains a URL from the database waiting queue and extracts new URL information from that URL; the sub-node performs first-level deduplication on the newly obtained URL and discards the URL if first-level deduplication does not pass; if first-level deduplication passes, the sub-node adds the newly obtained URL to its local URL digest table and sends it to the central server; the central server performs second-level deduplication on the newly obtained URL and, if second-level deduplication passes, adds the URL to the global URL digest table; the sub-node then adds the link of this URL to the waiting queue. Through this hierarchical deduplication mechanism, the deduplication work that was formerly concentrated on the central node is distributed to the crawler collection sub-nodes by first-level deduplication, while the central server maintains a global deduplication table by second-level deduplication. The system is therefore easy to scale, and its design, deployment, and operation become flexible and convenient.

Description

URL deduplication system and method for a distributed web crawler

Technical field

The present invention relates to the Internet field, and in particular to a URL deduplication system and method for a distributed web crawler based on a distributed architecture.

Background technology

Current distributed web crawlers can be divided by communication mode into three kinds: master-slave mode, autonomous mode, and hybrid mode.

In master-slave mode, one host acts as the control node and manages all hosts that run crawlers. A crawler only needs to receive tasks from the control node and submit newly generated tasks back to it; it does not need to communicate with other crawlers during this process, so this mode is simple to implement and easy to manage. The control node, however, must communicate with every crawler, and it needs an address list to hold the information of all crawlers in the system. When the number of crawlers in the system changes, the coordinator (the control node) updates the data in the address list, and this process is transparent to the crawlers. But as the number of crawled pages grows, the control node becomes the bottleneck of the whole system and degrades the performance of the entire distributed crawler system.

In autonomous mode the system has no coordinator, and all crawlers must communicate with each other, which makes each crawler more complex than in master-slave mode. Each crawler maintains an address list that stores the locations of all crawlers in the system, and on each communication it sends data directly to the crawler that needs it. When the number of crawlers in the system changes, the address list of every crawler must be updated.

Hybrid mode is a compromise that combines features of the two modes above. All crawlers can communicate with each other and all have a task-assignment function. One special crawler, however, centrally distributes the tasks that could not be assigned after the crawlers' own task assignment. In this mode each crawler only needs to maintain the address list for its own collection range, while the special crawler must keep, in addition to the address list for its own collection range, the address list of tasks that require centralized distribution.

URL deduplication is a core part of a web crawler system. Current web crawlers mainly use four URL deduplication approaches: database-based, in-memory-hash-based, disk-path-based, and Bloom-filter-based. Database-based deduplication works by traversing the records in the database. It is very stable, but its efficiency is low: once the number of records exceeds 1,000,000, database performance declines sharply, so the time cost of database-based deduplication becomes the bottleneck of the whole distributed system. In-memory-hash-based deduplication uses a hash table held in memory, but it consumes too much memory. Disk-path-based deduplication MD5-encodes each link and creates a corresponding path on disk. It has good time efficiency, but it is not suitable for a large-scale distributed crawler system that crawls the whole web. Finally, Bloom-filter-based deduplication is a fairly common scheme. It is fast and space-efficient, but it has a certain false-positive rate, and it is also a memory-based approach.
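For illustration only, the Bloom-filter approach mentioned above can be sketched as follows. This is a minimal sketch and not part of the claimed scheme; the class name `BloomFilter`, the bit-array size, the number of hash functions, and the use of salted SHA-256 are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by several hash functions (illustrative)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url: str):
        # Derive num_hashes bit positions by salting the URL with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )
```

A URL that was added is always reported as present, while a URL that was never added is usually, but not always, reported as absent; this is the false-positive rate referred to above.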

A typical distributed web crawler chooses the more stable database-based deduplication, but its efficiency is too low for large data volumes. In master-slave mode, deduplication is concentrated on the central control node: each collection sub-node only processes the tasks assigned to it, and deduplication is performed centrally before tasks are assigned, which places very high demands on the central control node.

Summary of the invention

In view of the above problems in the prior art, the object of the present invention is to provide a new URL deduplication system and method for a distributed web crawler.

To achieve the above object, the present invention adopts the following technical solution:

A URL deduplication system for a distributed web crawler, characterized in that it comprises: crawler collection sub-nodes, a central server, and a database server, wherein:

The crawler collection sub-nodes are deployed on multiple servers in the network and are used for first-level deduplication;

The central server is used for second-level deduplication; the crawler collection sub-nodes must register with the central server to obtain identity authentication; the central server maintains an address list that records the address information of the currently active crawler collection sub-nodes;

The database server is used for distributing crawl tasks and storing collection results; the crawler collection sub-nodes obtain tasks from the waiting queue of the database server and, after processing the collection results, store them in the database server.
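The registration and address list described above could, for example, be modeled as follows. This is a hedged sketch: the class name `NodeRegistry`, the use of a random UUID as the sub-node's identity, and the `host:port` address format are assumptions, since the text does not specify how identities are issued.

```python
import uuid


class NodeRegistry:
    """Address list kept by the central server (illustrative sketch)."""

    def __init__(self):
        # Maps node identity -> address of each currently active sub-node.
        self.address_list = {}

    def register(self, address: str) -> str:
        # Issue an identity to a joining sub-node and record its address.
        node_id = uuid.uuid4().hex
        self.address_list[node_id] = address
        return node_id

    def unregister(self, node_id: str) -> None:
        # Remove a sub-node that leaves the system; unknown ids are ignored.
        self.address_list.pop(node_id, None)

    def is_active(self, node_id: str) -> bool:
        return node_id in self.address_list
```

A sub-node would call `register` once when it joins, keep the returned identity for later requests, and call `unregister` when it exits.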

Further, in the above URL deduplication system for a distributed web crawler, each crawler collection sub-node internally stores and maintains a local URL information digest table.

Further, in the above URL deduplication system for a distributed web crawler, the central server stores and maintains a global URL information digest table.

In another aspect, the present invention provides a URL deduplication method for a distributed web crawler, for use with any of the above URL deduplication systems, characterized in that it comprises the following steps:

Step 1: the crawler collection sub-node registers with the central server;

Step 2: the crawler collection sub-node requests a new task from the database server, that is, the sub-node obtains a URL from the database waiting queue and extracts new URL information from that URL;

Step 3: the crawler collection sub-node performs first-level deduplication locally on the newly obtained URL information; if first-level deduplication does not pass, the sub-node discards this URL information, obtains a new task, and returns to step 2; if first-level deduplication passes, the sub-node adds the URL information to its local URL information digest table and proceeds to step 4;

Step 4: the crawler collection sub-node sends the URL information to the central server;

Step 5: the central server performs second-level deduplication on the URL information; if second-level deduplication does not pass, the URL information is discarded, the central server notifies the crawler collection sub-node to obtain a new task, and the flow returns to step 2; if second-level deduplication passes, the flow proceeds to step 6;

Step 6: the central server adds this URL information to the global URL digest table and notifies the crawler collection sub-node;

Step 7: the crawler collection sub-node adds the link of this URL information to the waiting queue.
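Steps 1 to 7 can be sketched in a single process as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the names `CentralServer`, `CrawlerNode`, and `fetch_links` are hypothetical, plain Python sets stand in for the digest tables, a `deque` stands in for the database waiting queue, and a direct method call stands in for network communication.

```python
from collections import deque


class CentralServer:
    def __init__(self):
        self.registered = set()    # identities of registered sub-nodes
        self.global_table = set()  # stand-in for the global URL digest table

    def register(self, node_id):                 # Step 1
        self.registered.add(node_id)

    def second_level_check(self, node_id, url):  # Steps 5 and 6
        if node_id not in self.registered:
            raise PermissionError("sub-node is not registered")
        if url in self.global_table:
            return False                         # duplicate: not passed
        self.global_table.add(url)
        return True                              # passed: notify the sub-node


class CrawlerNode:
    def __init__(self, node_id, server, waiting_queue, fetch_links):
        self.node_id = node_id
        self.server = server
        self.queue = waiting_queue       # stand-in for the database waiting queue
        self.fetch_links = fetch_links   # hypothetical: page URL -> list of links
        self.local_table = set()         # stand-in for the local URL digest table
        server.register(node_id)

    def run_once(self):
        if not self.queue:
            return []
        url = self.queue.popleft()                     # Step 2
        accepted = []
        for new_url in self.fetch_links(url):
            if new_url in self.local_table:            # Step 3: not passed
                continue
            self.local_table.add(new_url)              # Step 3: passed
            if self.server.second_level_check(self.node_id, new_url):  # Steps 4-6
                self.queue.append(new_url)             # Step 7
                accepted.append(new_url)
        return accepted
```

In this sketch a link already recorded by another sub-node passes first-level deduplication on the current sub-node but is rejected by the central server, so it is never enqueued twice.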

Further, in the above URL deduplication method for a distributed web crawler, step 1 also comprises: the central server creates and stores a global URL information digest table, and each crawler collection sub-node creates and stores a local URL information digest table.

Further, in any of the above URL deduplication methods for a distributed web crawler, in the first-level deduplication of step 3, the URL information parsed from the new link is compared with the URL information recorded in the local URL information digest table stored in the crawler collection sub-node; if identical URL information is found, first-level deduplication does not pass; if no identical URL information is found, first-level deduplication passes.

Further, in any of the above URL deduplication methods for a distributed web crawler, in the second-level deduplication of step 5, the URL information parsed from the new link is compared with the URL information recorded in the global URL information digest table stored in the central server; if identical URL information is found, second-level deduplication does not pass; if no identical URL information is found, second-level deduplication passes.

Further, in any of the above URL deduplication methods for a distributed web crawler, the central server aggregates the global URL information digest table through the second-level deduplication mechanism, and each crawler collection sub-node learns interactively from the central server through the same mechanism, so that the URL information digest table it maintains converges toward that of the central server, which can reduce the traffic with the central server.

Further, in any of the above URL deduplication methods for a distributed web crawler, the following step is also included before step 1:

Step 0: multiple crawler collection sub-nodes are deployed on multiple servers in the network.

With the URL deduplication system and method for a distributed web crawler provided by the present invention, the hierarchical deduplication mechanism distributes the deduplication work formerly concentrated on the central node (central server) to the crawler collection sub-nodes, so the load on the central node does not become excessive. The central node also no longer performs task assignment, so communication between the central node and the sub-nodes becomes very simple and no complex communication protocol needs to be designed. Because the central server monitors the status of each crawler collection sub-node, the system is easy to scale, its design, deployment, and operation become flexible and convenient, and it is cost-effective.

Brief description of the drawings

Fig. 1 is a schematic diagram of the architecture of a URL deduplication system for a distributed web crawler;

Fig. 2 is a flow chart of a URL deduplication method for a distributed web crawler.

Detailed description of the embodiments

To make the object, technical solution, and advantages of the present invention clearer, the present invention is further described below with reference to embodiments and the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

As shown in Fig. 1, the URL deduplication system for a distributed web crawler provided by an embodiment of the present invention comprises: crawler collection sub-nodes (crawler sub-nodes), a central server, and a database server, wherein:

In this embodiment, multiple crawler collection sub-nodes are deployed on multiple servers in the network, and each sub-node uses its own unique identifier to represent its "identity". The distributed web crawler URL deduplication system of this embodiment has no central control node: each crawler collection sub-node does not receive crawl tasks from a central control node but obtains URL information from the database server by itself. The sub-nodes therefore do not need to communicate with one another and can work independently. A sub-node only needs to "register" with the central server when it joins the system, which makes it easy for new nodes to join and leave and gives the system good scalability.

The central server in this embodiment serves as the aggregation node for distributed deduplication, and each crawler collection sub-node must register with it when first joining the system. The central server does not distribute tasks and is responsible only for second-level URL deduplication, so the demands placed on it are low. The central server now acts only as a data aggregation hub: it communicates with each crawler collection sub-node to complete the overall data deduplication task.

In this embodiment the database server is used for distributing crawl tasks and storing collection results. Keeping the waiting queue in the database makes it convenient to control the crawling process. The database server here plays the role of task assignment and relay, and the crawler collection sub-nodes can interact with one another through it.
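As a hedged illustration of keeping the waiting queue and collection results in a database, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table names `waiting` and `results`, the class name `WaitingQueue`, and the single-connection setup are assumptions; a real multi-node deployment would need a shared database server and a locking strategy that the text does not describe.

```python
import sqlite3


class WaitingQueue:
    """FIFO waiting queue and result store backed by a database (illustrative)."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS waiting ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results ("
            "url TEXT NOT NULL, content TEXT NOT NULL)"
        )

    def push(self, url: str) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute("INSERT INTO waiting (url) VALUES (?)", (url,))

    def pop(self):
        # Take the oldest waiting URL, or return None when the queue is empty.
        with self.conn:
            row = self.conn.execute(
                "SELECT id, url FROM waiting ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self.conn.execute("DELETE FROM waiting WHERE id = ?", (row[0],))
            return row[1]

    def store_result(self, url: str, content: str) -> None:
        with self.conn:
            self.conn.execute(
                "INSERT INTO results (url, content) VALUES (?, ?)", (url, content)
            )
```

Because every sub-node reads from and writes to the same queue, the sub-nodes interact only through the database and never address one another directly.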

Further, each crawler collection sub-node internally stores and maintains a local URL information digest table for first-level deduplication.

Further, the central server stores and maintains a global URL information digest table for second-level deduplication.

The deduplication scheme of the present invention is carried out with digest tables held in memory. Each collection sub-node internally maintains such a local URL digest table for first-level deduplication. After a new link has been obtained by way of the waiting queue and filtered, it enters the first-level deduplication stage, in which the collection node first checks the link against its local digest table. If first-level deduplication passes, the collection node sends the digest of this link to the central server for second-level deduplication and at the same time writes the URL digest into its local digest table. If first-level deduplication does not pass, second-level deduplication is not needed. The central server also maintains a global digest table, which is obtained by "aggregating" the data of the local digest tables of the collection sub-nodes and therefore contains more complete information. A link is considered new and is added to the work queue for processing if and only if second-level deduplication passes.
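An in-memory digest table of the kind described above might look like the following sketch. The choice of MD5 as the digest function is an assumption made for illustration (the text does not name the digest algorithm used for the table), as are the names `url_digest` and `DigestTable`.

```python
import hashlib


def url_digest(url: str) -> bytes:
    # Assumed digest function: a fixed 16-byte MD5 digest of the URL text.
    return hashlib.md5(url.encode("utf-8")).digest()


class DigestTable:
    """In-memory table of URL digests used for deduplication (illustrative)."""

    def __init__(self):
        self._digests = set()

    def check_and_add(self, url: str) -> bool:
        """Return True if deduplication passes (URL is new), recording its digest."""
        digest = url_digest(url)
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True

    def __len__(self) -> int:
        return len(self._digests)
```

Storing a fixed-length digest rather than the full URL keeps the memory cost per entry constant regardless of URL length, and only the digest needs to be sent to the central server.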

An embodiment of another aspect of the present invention provides a URL deduplication method for a distributed web crawler that uses any of the above URL deduplication systems. As shown in the flow chart of Fig. 2, the method comprises the following steps:

Step 1: the crawler collection sub-node (collection sub-node) registers with the central server;

Step 4: the crawler collection sub-node sends (uploads) the URL information to the central server;

In the above process, one complete deduplication pass comprises two levels of deduplication, as follows. First, the collection sub-node obtains a new link from the database waiting queue. Because links in the database server's waiting queue are guaranteed to be unique, the sub-node requests this link directly and parses out new URLs; each new URL is then deduplicated before it is added to the waiting queue. The collection sub-node first performs first-level deduplication on a new URL locally, that is, it checks whether the URL is already in the local deduplication URL digest table. If it is, deduplication does not pass, the URL is discarded, and the sub-node obtains a new URL to process. If the local URL digest table does not contain this digest, deduplication passes, and the collection sub-node adds the URL to its local deduplication digest table. The collection sub-node then sends the digest of this URL to the central server, which performs second-level deduplication. If the digest already exists in the central server, the link has already been processed, and the central server notifies the crawler to obtain a new URL. If second-level deduplication passes, the central server adds the URL digest to its own deduplication table, and the collection sub-node then adds the link to the waiting queue. This completes one round of deduplication.

Further, in the above method, in step 1 the central server creates and stores a global URL information digest table, and each crawler collection sub-node creates and stores a local URL information digest table.

Further, in the above method, in the first-level deduplication of step 3, the URL information parsed from the new link is compared with the URL information recorded in the local URL information digest table stored in the crawler collection sub-node; if identical URL information is found, first-level deduplication does not pass; if no identical URL information is found, first-level deduplication passes.

Further, in the above method, in the second-level deduplication of step 5, the URL information parsed from the new link is compared with the URL information recorded in the global URL information digest table stored in the central server; if identical URL information is found, second-level deduplication does not pass; if no identical URL information is found, second-level deduplication passes.

Initially, the global deduplication table (global URL information digest table) maintained by the central server is the union of the deduplication tables (local URL digest tables) of the sub-nodes. After a period of time, the deduplication tables of the crawler collection sub-nodes tend toward consistency, so second-level deduplication is needed less and less often. Ideally a result can be obtained by first-level deduplication alone, which greatly improves efficiency.
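The falling number of second-level checks can be illustrated with a small simulation. It is a simplified model under stated assumptions, not a measurement of the described system: the function name `simulate` is hypothetical, each sub-node is assumed to re-encounter the same links in every round, and every link that passes first-level deduplication is counted as one message to the central server.

```python
def simulate(rounds, discovered):
    """Count messages sent to the central server in each round.

    discovered maps a sub-node name to the links it encounters every round.
    """
    global_table = set()
    local_tables = {node: set() for node in discovered}
    sent_per_round = []
    for _ in range(rounds):
        sent = 0
        for node, urls in discovered.items():
            for url in urls:
                if url in local_tables[node]:
                    continue              # settled by first-level deduplication
                local_tables[node].add(url)
                sent += 1                 # one second-level check is requested
                global_table.add(url)     # no effect if the URL is already known
        sent_per_round.append(sent)
    return sent_per_round, global_table
```

With two sub-nodes that each see three links, two of them shared, this model sends six messages in the first round and none afterwards, while the global table ends up as the union of the local tables.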

Further, in the above method, the following step is also included before step 1:

Step 0: multiple crawler collection sub-nodes are deployed on multiple servers in the network.

With the URL deduplication system and method for a distributed web crawler provided by the above embodiments, the hierarchical deduplication mechanism distributes the deduplication work formerly concentrated on the central node (central server) to the crawler collection sub-nodes, so the load on the central node does not become excessive. The central node also no longer performs task assignment, so communication between the central node and the sub-nodes becomes very simple and no complex communication protocol needs to be designed. Because the central server monitors the status of the sub-nodes, the system is easy to scale, its design, deployment, and operation become flexible and convenient, and it is cost-effective.

The distributed web crawler architecture and its data deduplication scheme according to the present invention can meet the needs of a variety of targeted information collection tasks, and are easy to deploy, flexible to configure, and highly scalable.

The above embodiments express only some implementations of the present invention. Their description is relatively specific and detailed, but it should not be understood as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and all of these fall within the scope of protection of the present invention. The scope of protection of this patent shall therefore be determined by the appended claims.

Claims

1. A URL deduplication system for a distributed web crawler, characterized in that it comprises: crawler collection sub-nodes, a central server, and a database server, wherein:

The crawler collection sub-nodes are deployed on multiple servers in the network and are used to perform first-level deduplication with a local URL information digest table; when first-level deduplication passes, a sub-node adds the corresponding URL information to the local URL information digest table and sends this URL information to the central server;

The central server is used to perform second-level deduplication on the received URL information with a global URL information digest table and, when second-level deduplication passes, to add the URL information to the global URL information digest table; the crawler collection sub-nodes must register with the central server to obtain identity authentication; the central server maintains an address list that records the address information of the currently active crawler collection sub-nodes;

The database server is used for distributing crawl tasks and storing collection results; the crawler collection sub-nodes obtain tasks from the waiting queue of the database server and, after processing the collection results, store them in the database server;

Each crawler collection sub-node internally stores and maintains a local URL information digest table;

The central server stores and maintains a global URL information digest table;

Each crawler collection sub-node learns interactively from the central server, so that the local URL information digest table it maintains converges toward the global URL information digest table.

2. A URL deduplication method for a distributed web crawler, for use with the system according to claim 1, characterized in that it comprises the following steps:

Step 1: the crawler collection sub-node registers with the central server;

Step 2: the crawler collection sub-node requests a new task from the database server, that is, the sub-node obtains a URL from the waiting queue of the database server and extracts new URL information from that URL;

Step 6: the central server adds this URL information to the global URL information digest table and notifies the crawler collection sub-node;

Step 7: the crawler collection sub-node adds the link of this URL information to the waiting queue;

Step 1 also comprises: the central server creates and stores a global URL information digest table, and each crawler collection sub-node creates and stores a local URL information digest table;

In the first-level deduplication of step 3, the newly obtained URL information is compared with the URL information recorded in the local URL information digest table stored in the crawler collection sub-node; if identical URL information is found, first-level deduplication does not pass; if no identical URL information is found, first-level deduplication passes;

In the second-level deduplication of step 5, the URL information received from the crawler collection sub-node is compared with the URL information recorded in the global URL information digest table stored in the central server; if identical URL information is found, second-level deduplication does not pass; if no identical URL information is found, second-level deduplication passes;

The central server aggregates the global URL information digest table through the second-level deduplication mechanism, and each crawler collection sub-node learns interactively from the central server through the same mechanism, so that the URL information digest table it maintains converges toward that of the central server, which can reduce the traffic with the central server.

3. The URL deduplication method for a distributed web crawler according to claim 2, characterized in that the following step is also included before step 1: