CN103310012B

CN103310012B - A distributed web crawler system

Info

Publication number: CN103310012B
Application number: CN201310274951.3A
Authority: CN
Inventors: 王宝会; 于雷; 王丽华; 王新河; 尹科
Original assignee: Beihang University
Current assignee: Huike Education Technology Group Co ltd
Priority date: 2013-07-02
Filing date: 2013-07-02
Publication date: 2016-09-28
Anticipated expiration: 2033-07-02
Also published as: CN103310012A

Abstract

A distributed web crawler system for the field of network information gathering, comprising a management portal, a central node server, and distributed child node servers. The management portal is the web interface that the crawler system provides to administrators; through it an administrator can view the logs of the central node server and the distributed child node servers, add topics, update the URL seeds of a topic, configure a topic's crawl frequency parameters, and control the state of the crawler. The central node server and the distributed child node server crawlers form the main body of the system and carry out topic operations, data extractor learning, page analysis, and storage of target pages. The invention enables a single crawler to accommodate crawling for different topics and improves the speed and quality of web page crawling, addressing the problem that these could not previously meet user requirements.

Description

A distributed web crawler system

Technical field

The present invention relates to a distributed web crawler system and belongs to the field of network information gathering.

Background

The rapid development of the Internet has brought explosive growth in the volume of web information, and traditional general-purpose search engines have become increasingly important as tools for retrieving Internet information. Because of their own limitations, however, such as low network coverage and a high miss rate, they cannot provide users with the most comprehensive information. To overcome these shortcomings, topic-specific search engines emerged; their goal is to give users the most accurate results in their field of interest while consuming limited bandwidth and hardware resources.

A topic crawler is the foundation of a topic-specific search engine, and the speed and quality with which it fetches web pages are important indicators of search engine quality. It is a system that automatically downloads web pages within a restricted domain, screening and acquiring pages according to a best-first priority order and their relevance to the topic. Unlike a general-purpose crawler, a topic crawler does not pursue high coverage but selectively fetches topic-related pages. Its advantages are low resource usage, convenient index database updates, and accurately cached pages.

However, existing techniques can neither judge the relevance of a page to a topic nor accommodate crawling for different topics within one crawler system, so the speed and quality of web page crawling cannot meet user requirements.

Summary of the invention

Technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a distributed web crawler system in which one crawler accommodates crawling for different topics, improving the speed and quality of web page crawling, which previously could not meet user requirements.

Technical solution of the invention: a distributed web crawler system comprising a management portal, a central node server, and distributed child node servers. The management portal is the web interface that the crawler system provides to administrators; through it an administrator can view the logs of the central node server and the distributed child node servers, add topics, update the URL seeds of a topic, configure a topic's crawl frequency parameters, and control the state of the crawler. The central node server and the distributed child node server crawlers form the main body of the system and carry out topic operations, data extractor learning, page analysis, and storage of target pages.

(1) The central node server comprises a URL controller, an extractor module, and a topic control module.

The topic control module receives data sent by the management portal through the management interface, including topic description data, add and delete operation data, and data controlling the topic crawl frequency. It handles operations on topics, including describing, adding, and deleting topics; controls the topic crawl frequency; edits the seed queue of each topic; and sends the topic seed queue to the extractor module and the URL controller module.

The extractor module, after receiving the topic seed queue, first uses a basic analyzer to classify the web pages that the seed queue's URL addresses represent, dividing them into Deep Web pages and data-intensive pages. It then performs extraction analysis on the two kinds of pages separately, finds the data extractor corresponding to each type, records the correspondence between each URL address and its data extractor, and sends the record to the URL controller.

The URL controller receives the seed queue sent by the topic control module and the URL address and extractor records sent by the extractor module, and merges the two: each URL address is paired with its data extractor, and a URL address without a corresponding extractor is paired with the generic extractor. All URL addresses are then queued, and tasks are sent to the distributed child nodes of the crawler by the task partitioning method.

(2) The distributed child node server comprises a child node URL controller, a data extractor, a search controller, and a web page fetcher.

The child node URL controller receives the seed URLs and the corresponding data extractor information sent by the central node server. After receiving the URLs it first checks the URL addresses for duplicates, then places the URL addresses that have not already been collected into the queue, and sends the URL addresses in the queue and the corresponding data extractor information to the data extractor and the web page fetcher.

The data extractor performs page analysis on the Deep Web pages from the child node URL queue and extracts the URLs in the page to form new URLs, which are equivalent to the objects obtained after form submission, and passes them to the web page fetcher. After receiving a page sent by the search controller, it uses the extractor corresponding to the page's URL address to extract the content and the URL addresses in the page, then sends the URL addresses to the URL address library to await collection.

The web page fetcher receives the URL addresses sent by the URL controller and the data extractor and then fetches the web pages; the fetched pages are provided to the search controller.

The search controller analyzes the collected pages it receives; pages that meet the requirements are saved to the page library, and the others are passed to the data extractor.

The task partitioning method uses weighted least-connection scheduling and is implemented as follows:

(1) Compute the ratio α_i of the difference between the seed's PR and the minimum PR in the URL queue to the difference between the maximum and minimum PR:

α_{i} = \frac{PR_{i} - \mathrm{low}(PR)}{\mathrm{top}(PR) - \mathrm{low}(PR)}

where PR is the page rank, i = 0, 1, ..., n-1, and low(PR) and top(PR) are the minimum and maximum PR in the queue, respectively.

(2) Compute the weight of the search depth. The depth of a body-text target page is 3, and the weight influence factor β_i of the search depth is the reciprocal of the page's own depth L_i:

β_i=1/L_i

(3) Compute the crawl frequency factor using the Sigmoid function, which is smooth, uniform, and strictly monotonic, with an output range of (0.5, 1). The normalized crawl frequency x_i is computed as:

x_{i} = \frac{F_{i} - \mathrm{low}(F)}{\mathrm{top}(F) - \mathrm{low}(F)}

where F_i is the crawl frequency of the seed, and low(F) and top(F) are the minimum and maximum frequency in the queue, respectively.

The crawl frequency influence factor γ_i is computed as:

γ_{i} = \frac{1}{1 + e^{-a x_{i}}}

The value of a is greater than 1. It is a weighting factor applied to the linearly normalized result, and its purpose is to amplify the result of the first step.

(4) Judging from the Sigmoid function curve, a is set to 2.5 in the system of the invention. It follows that the priority weight of a seed is the arithmetic mean of the three influence factors:

Q_{i} = \frac{α_{i} + β_{i} + γ_{i}}{3}

(5) The seeds are then sorted in descending order of Q_i. The URL queue in a child node inherits the URL weight algorithm of the central node. Because a crawler guided by an extractor only crawls within the website defined by its seed, the crawl frequency and website importance factors in the Q value remain constant and only the search depth factor changes, so Q is computed as:

Q = Q_{prev} - \frac{β_{prev} - β}{3}

where Q_prev is the weight passed down from the parent URL, β_prev is the search depth factor of the parent URL, and β is the search depth factor of the target URL. Because a child node queue holds many URLs, binary sorting is used, trading space for efficiency. Theoretical analysis and practical tests show that the URL weights are distributed evenly and smoothly between 0 and 1. This avoids the situation in which the sharp decay of one factor lets a single factor decide the ordering, takes into account the three main factors of target object, crawl strategy, and search depth, and expresses priority differences at a fine granularity. One special case of the algorithm is a request passed in from the search front end: its priority is the highest, its Q value is set to 1, and the value does not decay as it is passed on.

(6) The processing capacity of each child node is represented by a weight. The default weight is 1, and the system administrator can set server weights dynamically. When scheduling new connections, weighted least-connection scheduling tries to keep each server's number of established connections proportional to its weight. The algorithm proceeds as follows: assume a group of servers S = {S0, S1, ..., Sn-1}, where W(Si) is the weight of server Si and C(Si) is its current number of connections. The total number of current connections across all servers is CSUM = Σ C(Si) (i = 0, 1, ..., n-1).

The current new connection request is sent to server Sm, that is, the seed is dispatched to Sm, if and only if Sm satisfies the following condition:

\begin{matrix} \frac{C(S_{m}) / CSUM}{W(S_{m})} = \min \frac{C(S_{i}) / CSUM}{W(S_{i})} \\ \Rightarrow \frac{C(S_{m})}{W(S_{m})} = \min \frac{C(S_{i})}{W(S_{i})} \end{matrix}, (i = 0, 1, ..., n - 1)

where W(Si) is not 0. The logs of the child nodes are fed back to the central node at regular intervals, and the connection count C(Si) of each child server is obtained by reading the logs. The method compares the ratio of each child node's connection count to its preset weight, finds the child node with the lowest load, and assigns it the new crawl task.
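As a worked illustration of the selection condition, assume three child nodes with hypothetical weights and connection counts (these numbers are chosen for illustration and do not come from the source):

W = (1, 2, 1), \quad C = (3, 4, 3), \quad CSUM = 3 + 4 + 3 = 10

\frac{C(S_{0}) / CSUM}{W(S_{0})} = 0.3, \quad \frac{C(S_{1}) / CSUM}{W(S_{1})} = 0.2, \quad \frac{C(S_{2}) / CSUM}{W(S_{2})} = 0.3

\frac{C(S_{0})}{W(S_{0})} = 3, \quad \frac{C(S_{1})}{W(S_{1})} = 2, \quad \frac{C(S_{2})}{W(S_{2})} = 3

Both forms select S1. Dividing every ratio by the same CSUM does not change which ratio is smallest, which is why the condition simplifies to comparing C(Si)/W(Si) directly.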

The advantages of the invention over the prior art are:

(1) The invention is designed for a search engine covering multiple topics within a domain, comprising a series of topic search subsystems (such as air tickets and hotels) that share one crawler, so that one crawler accommodates crawling for different topics. The invention provides a new task partitioning algorithm for task distribution in a distributed crawler. Existing documents on this architecture give only summary descriptions and do not solve problems such as URL allocation and algorithm compatibility that may arise when multiple topics coexist; the invention solves these problems.

(2) The architecture of the invention uses a multi-topic strategy based on classification annotation, solving the problem of adaptive compatibility among multiple topics in one crawler system. The two-level weighted task partitioning algorithm solves goal-oriented, load-balanced URL allocation and improves the scalability of the system.

(3) The improved URL storage strategy proposed by the invention supports efficient URL lookup, insertion, and duplicate detection. The topic search system built on this system gives users a rich topic-specific input interface and returns accurate structured content, and its crawler uses an architecture based on data extractors, for which existing documents give only summary descriptions without solving the multi-topic URL allocation and algorithm compatibility problems.

Brief description of the drawings

Fig. 1 is a schematic diagram of the overall architecture of the distributed crawler system of the invention;

Fig. 2 is an architecture diagram of the central node server in the invention;

Fig. 3 is an architecture diagram of the distributed child node server in the invention.

Detailed description of the invention

The system of the invention uses a distributed architecture based on data extractors, consisting of one central master node and distributed crawler servers that work together. The overall architecture is shown in Fig. 1.

As shown in Fig. 1, the invention mainly consists of the following modules:

1. Management portal

The management portal is the web interface that the crawler system provides to administrators. Through it an administrator can view the logs of the central and child servers, add topics, update the URL seeds of a topic, configure parameters such as a topic's crawl frequency, and control the state of the crawler. The central node and the distributed crawlers form the main body of the system and carry out topic operations, data extractor learning, page analysis, and storage of target pages.

2. Central node server

The central master node of the crawler is the control hub. It mainly comprises a URL controller, an extractor module, and a topic control module, as shown in Fig. 2. The functions of the three modules are described below:

(1) Topic control module

The topic control module handles operations on topics, including describing, adding, and deleting topics; controlling the topic crawl frequency; and editing the seed queue of each topic. The seed queue is selected from the authority pages of the corresponding topic, that is, the most representative pages in the topic that can serve as starting points for a series of target information. For a hotel search topic crawler, for example, the authority page is the query page containing a form on a room booking site, or the start page of its hotel information list. A general-purpose search engine is first used to search the topic's descriptive text, yielding an extended page set for the topic. Because the number of pages is limited, the seed queue of authority pages is then obtained by manual screening.

(2) Extractor module

Using a content-based web page analysis algorithm and starting from the URL seeds, this module trains data extractors for the authoritative websites that the seeds represent. Seeds that meet the requirements of the previous module fall into two classes: Deep Web pages and data-intensive pages. A basic classifier based on memorized features can distinguish the two kinds of pages. For Deep Web pages, a domain-specific, instance-based query probing method, improved with a travel-domain dictionary, is used to match the most complete interface inputs. For the structural features of data-intensive pages, a strategy of page blocking and directory discovery is used to extract the URLs of lower-level pages. Through this process, the data extractor suited to each URL seed (parser path and search depth) is found; during crawling on the child nodes, this model guides the page parsing of the target site that the seed represents.

(3) URL controller

The URL controller is mainly responsible for ordering the URL queue in the central node and for partitioning tasks according to the load feedback from each child node. Because a two-level URL allocation strategy is used, the central node server stores only seed URLs. The sorting algorithm determines priority from the topic crawl frequency and the weight of the website each seed represents, and computes the concurrency needed per unit of time. Task partitioning uses weighted least-connection scheduling.

The central node server works as follows:

(1) The topic control module receives data sent by the management portal through the management interface (topic description data; add and delete operation data; data controlling the topic crawl frequency). The module handles operations on topics, including describing, adding, and deleting topics; controlling the topic crawl frequency; and editing the seed queue of each topic. It sends the topic seed queue to the extractor module and the URL controller module. (Note: a seed queue is a URL address queue, that is, a group of URL addresses. The URL addresses in a seed queue are special, however, in that they are generally the home page URLs of the target sites to be collected, or the home page URLs of the site sections to be collected.)

(2) After receiving the topic seed queue, the extractor module first uses the basic analyzer to classify the web pages that the seed queue's URL addresses represent (classification first accesses the page each URL address points to and fetches its content, then performs the subsequent classification), dividing them into Deep Web pages and data-intensive pages. It then performs extraction analysis on the two kinds of pages separately (see the description of the extractor module above for details) and finds the data extractor corresponding to each type. It records the correspondence between each URL address and its data extractor and sends the record to the URL controller.

(3) The URL controller receives the seed queue sent by the topic control module and the URL address and extractor records sent by the extractor module, and merges the two: each URL address is paired with its data extractor, and a URL address without a corresponding extractor is paired with the generic extractor. All URL addresses are then queued, and tasks are sent to the distributed child nodes of the crawler by the task partitioning method.
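The merge performed in step (3) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the dictionary representation of the extractor records, and the `GENERIC_EXTRACTOR` label are all assumptions made for the example.

```python
GENERIC_EXTRACTOR = "generic"  # hypothetical label for the fallback extractor


def merge_seed_records(seed_queue, extractor_records):
    """Pair every seed URL with its extractor, falling back to the generic one.

    seed_queue: list of URL strings from the topic control module.
    extractor_records: dict mapping URL -> extractor name from the extractor module.
    """
    return [(url, extractor_records.get(url, GENERIC_EXTRACTOR)) for url in seed_queue]


# Hypothetical data: only the first seed has a trained extractor.
records = {"http://hotels.example/search": "deep_web_form"}
merged = merge_seed_records(
    ["http://hotels.example/search", "http://news.example/"], records
)
```

Keeping the seed queue as the driving list preserves the order set by the topic control module; the subsequent ordering and partitioning are separate steps.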

3. Distributed child node server

As shown in Fig. 3, the distributed child node servers are the concrete executors of crawling. Each mainly comprises a child node URL controller, a data extractor, a search controller, and a web page fetcher.

The distributed child node of the crawler works as follows:

(1) The child node URL controller receives the seed URLs and the corresponding data extractor information sent by the central node server. After receiving the URLs it first checks the URL addresses for duplicates (there are many URL deduplication methods for Internet crawlers; since this is not the focus here, no method is described in detail, and any common URL deduplication method may be applied), then places the URL addresses that have not already been collected into the queue. It sends the URL addresses in the queue and the corresponding data extractor information to the data extractor module and the web page fetcher module.

(2) The data extractor performs page analysis on the Deep Web pages from the child node URL queue and extracts the URLs in the page to form new URLs, which are equivalent to the objects obtained after form submission, and passes them to the web page fetcher.

(3) The web page fetcher module receives the URL addresses sent in the previous two steps and then fetches the web pages.

(4) The pages fetched by the web page fetcher are provided to the search controller.

(5) The search controller receives the collected pages and analyzes them; pages that meet the requirements are saved to the page library, and the others are passed to the data extractor.

(6) After receiving a page sent by the search controller, the data extractor uses the extractor corresponding to the page's URL address to extract the content and the URL addresses in the page, then sends the URL addresses to the URL address library to await collection.

Each module is described in detail below.

(1) Child node URL controller

The child node URL controller receives the seed URLs distributed by the central node and the URLs extracted from web pages and stores them in the URL database. Storage uses a Trie data structure, which allows duplicate detection and fast insertion of newly added URLs. The URL database applies an update policy on a per-website basis, which ensures that content updates are not blocked by duplicate detection. Valid URLs that pass detection are prioritized by the two-level weighted transmission sorting algorithm, which combines the weight passed down from the parent page with the depth in the search strategy, and are then passed to the web page fetcher.
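The source names a Trie as the storage structure but gives no further detail. The sketch below is one plausible reading, assuming a character-level trie built from nested dictionaries; the class name, the `"$end"` marker, and the absence of URL normalization are assumptions for illustration only.

```python
class UrlTrie:
    """Character trie that detects duplicates and inserts in a single pass."""

    _END = "$end"  # multi-character key, so it cannot collide with a single character

    def __init__(self):
        self._root = {}

    def add(self, url):
        """Insert url; return True if it was new, False if it was already stored."""
        node = self._root
        for ch in url:
            node = node.setdefault(ch, {})  # walk down, creating nodes as needed
        if node.get(self._END):
            return False
        node[self._END] = True
        return True


seen = UrlTrie()
first = seen.add("http://hotels.example/list?page=2")   # new URL
second = seen.add("http://hotels.example/list?page=2")  # duplicate
prefix = seen.add("http://hotels.example/list")         # shared prefix, still new
```

Because URLs from one website share long prefixes, lookup and insertion cost depends on the URL length rather than on the number of stored URLs. The per-website update policy described above is not modeled here.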

(2) Data extractor

For Deep Web objects from the URL queue, the data extractor uses the query probing algorithm trained by the central node and, through pattern matching on concrete parameter inputs, forms new URLs, which are equivalent to the objects obtained after form submission, and passes them to the web page fetcher. The module's other input is pages that have been judged by the search controller; the URL search strategy guarantees that these are data-intensive pages. Using the trained page block discovery algorithm, the module extracts the two classes of URLs of interest, pagination links and lower-level data page links, and sends them to the URL database.

(3) Web page fetcher

This is a multi-threaded parallel module responsible for collecting pages over the HTTP protocol. The basic steps are:

a. Extract the target site address and port number from the page URL, and establish a network connection to that address and port.

b. Assemble the HTTP request header from the page URL and send it to the target site. If no response message is received within a certain time, stop fetching the page and abandon it; otherwise continue to the next step.

c. Analyze the response message. If the returned status code is 2xx, the correct page has been returned, so proceed to the next step. If the status code is 301 or 302, the page has been redirected, so extract the new target URL from the response header and return to the previous step. If any other status code is returned, the page connection has failed, so stop fetching the page and abandon it.

d. Extract page information such as the date, length, and page type from the response header.

e. Read the content of the page. For longer pages, read in blocks and then splice the blocks together to ensure the integrity of the page content.
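Steps a and c above can be sketched as pure functions that involve no network access. This is a hedged illustration: the function names, the tuple return values, and the `MAX_REDIRECTS` limit are assumptions (the source states no redirect limit), and `headers` is assumed to be a plain dictionary keyed by `"Location"`.

```python
from urllib.parse import urljoin, urlsplit

MAX_REDIRECTS = 5  # assumption: the source does not specify a redirect limit


def target_address(url):
    """Step a: extract the host and port from a page URL."""
    parts = urlsplit(url)
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default_port


def next_action(url, status, headers, redirects=0):
    """Step c: decide what to do with a response.

    Returns ("read", url), ("refetch", new_url) or ("abandon", url).
    """
    if 200 <= status < 300:
        return ("read", url)  # correct page: continue with steps d and e
    if status in (301, 302):
        location = headers.get("Location")
        if location and redirects < MAX_REDIRECTS:
            # Resolve relative redirects against the current URL, then go back to step b.
            return ("refetch", urljoin(url, location))
    return ("abandon", url)  # failed or unusable response


address = target_address("http://hotels.example:8080/list")
action = next_action("http://hotels.example/old", 302, {"Location": "/new"})
```

Separating the decision logic from the socket handling lets each worker thread reuse it, and the timeout in step b would be applied where the connection is actually opened.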

(4) Search controller

The search strategy of the invention is a best-first search strategy improved for the concrete application. Analysis of Deep Web pages and directory-block pages shows that most target objects are body-text pages and that the crawl depth does not exceed 3 levels. This module judges the fetched page content according to the search strategy: body-text pages that satisfy the search depth are stored in the page library to await structuring by the index module; otherwise the page is passed to the corresponding data extractor for page analysis and URL extraction.

The task partitioning algorithm is as follows:

In a distributed crawler system, balanced assignment of crawl tasks is one of the key issues affecting system performance and resource allocation. At present, distributed crawler systems mostly use task partitioning strategies that are centralized or based on two-level hash mapping. These two strategies only solve the problem of even distribution and do not consider the effect of URL priority or the load on child nodes. The task partitioning strategy of a topic crawler should take into account both URL queue ordering and load-based balanced scheduling across child nodes. For this system architecture, the task partitioning algorithm comprises a two-level URL weighted transmission sorting algorithm and a hash-based weighted least-connection URL scheduling method.

A two-level weighted transmission sorting algorithm is designed for the URL queues in the central node and the child nodes. At the central node level, the URL queue consists mainly of URL seeds of different topics, and the seed attributes that affect crawl quality are website importance, crawl frequency, and search depth. A seed is the authority page through which a topic is represented on the corresponding website, and its PageRank can be mapped to the website's influence in that topic area. PageRank is evaluated with the PageRank algorithm based on network topology. The page PR values given by Google are integers with a theoretical range of 0 to 10, but statistics show that most pages have a PR value below 7. For uniform normalization, therefore, the influence factor of website importance is computed with a linear function, namely the ratio α_i of the difference between the seed's PR and the minimum PR in the URL queue to the difference between the maximum and minimum PR:

α_{i} = \frac{PR_{i} - \mathrm{low}(PR)}{\mathrm{top}(PR) - \mathrm{low}(PR)}

Search depth refers to the level assigned to a page in the best-first strategy. There are 3 levels: a seed with a Hidden Web form has depth 1, a data-intensive page with a directory block structure has depth 2, and a body-text target page has depth 3. The weight influence factor β_i of the search depth is the reciprocal of the page's own depth L_i:

β_i=1/L_i

The crawl frequency is an influence factor that the administrator sets, according to the needs of the search front end, in correspondence with the update interval: the shorter the update interval, the higher the crawl frequency and the higher the seed's priority. Crawl frequency is divided into seed frequency and topic frequency. Setting the topic frequency according to the nature of the topic is required, whereas the seed frequency need not be set, in which case its value inherits the topic frequency. Sample crawl intervals are widely spread and jump sharply. For example, the ticket transfer module requires high immediacy, so its topic interval is set to 15 min, whereas hotel reservation prices change little, so its interval is set in days. If the frequency were normalized purely by a linear function, one value would dominate and the rest would decay rapidly, making this factor the sole determinant of the ordering, which is clearly unreasonable. After study and comparison, the result is first obtained with a linear normalization function, then weighted, and finally processed with the Sigmoid function. The Sigmoid function is smooth, uniform, and strictly monotonic, with an output range of (0.5, 1). The computation is as follows:

x_{i} = \frac{F_{i} - \mathrm{low}(F)}{\mathrm{top}(F) - \mathrm{low}(F)}

where F_i is the crawl frequency of the seed; low(F) and top(F) are the minimum and maximum frequency in the queue, respectively; x_i is the normalized crawl frequency; and γ_i is the crawl frequency influence factor:

γ_{i} = \frac{1}{1 + e^{-a x_{i}}}

The value of a is greater than 1. It is a weighting factor applied to the linearly normalized result, and its purpose is to amplify the result of the first step. Judging from the Sigmoid function curve, a is set to 2.5 in the system. It follows that the priority weight of a seed is the arithmetic mean of the three influence factors:

Q_{i} = \frac{α_{i} + β_{i} + γ_{i}}{3}
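The three factors and their mean can be sketched in a few lines. The seed tuples, URLs, and helper names below are hypothetical, and returning 0.0 when all values in the queue are equal is an assumption, since the source does not say how that case is handled.

```python
import math


def normalize(value, low, top):
    """Linear min-max normalization; returns 0.0 if all values are equal (assumption)."""
    if top == low:
        return 0.0
    return (value - low) / (top - low)


def seed_priority(pr, depth, freq, pr_range, freq_range, a=2.5):
    """Q_i = (alpha_i + beta_i + gamma_i) / 3 for one seed."""
    alpha = normalize(pr, *pr_range)         # website importance factor
    beta = 1.0 / depth                       # search depth factor, depth in {1, 2, 3}
    x = normalize(freq, *freq_range)         # normalized crawl frequency
    gamma = 1.0 / (1.0 + math.exp(-a * x))   # Sigmoid-smoothed frequency factor
    return (alpha + beta + gamma) / 3.0


# Hypothetical seeds: (url, PR, depth, crawls per day)
seeds = [
    ("http://travel.example/guide", 2, 3, 4.0),
    ("http://hotels.example/list", 8, 2, 1.0),
    ("http://tickets.example/search", 6, 1, 96.0),
]
pr_range = (min(s[1] for s in seeds), max(s[1] for s in seeds))
freq_range = (min(s[3] for s in seeds), max(s[3] for s in seeds))
ranked = sorted(
    seeds,
    key=lambda s: seed_priority(s[1], s[2], s[3], pr_range, freq_range),
    reverse=True,
)
```

With a = 2.5 and x_i in [0, 1], γ_i stays between 0.5 and about 0.92, so the 15-minute ticket seed ranks first without the frequency factor swamping the other two.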

The seeds are then sorted in descending order of Q_i. Because the number of seeds in the central node queue is limited, insertion sort is used, which saves memory and takes about as long as other sorting algorithms. The URL queue in a child node inherits the URL weight algorithm of the central node. Analysis shows that a crawler guided by an extractor only crawls within the website defined by its seed, so the crawl frequency and website importance factors in the Q value are constant and only the search depth factor changes. Q is computed as:

Q = Q_{prev} - \frac{β_{prev} - β}{3}
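Here Q_prev is the weight passed down from the parent URL, β_prev is the parent's search depth factor, and β is the target URL's search depth factor. The sketch below applies the formula and keeps a child node queue ordered with binary insertion, which is how the binary sorting mentioned in the summary is interpreted here. The function names, the list-of-tuples queue, and the example URLs are assumptions.

```python
import bisect


def child_priority(q_prev, depth_prev, depth):
    """Q = Q_prev - (beta_prev - beta) / 3, with beta = 1 / depth."""
    beta_prev = 1.0 / depth_prev
    beta = 1.0 / depth
    return q_prev - (beta_prev - beta) / 3.0


def enqueue(queue, url, q):
    """Binary-insert url into a list ordered by descending Q (stored as -Q)."""
    bisect.insort(queue, (-q, url))


# Hypothetical parent: a depth-1 form seed with Q = 0.86
queue = []
q_list = child_priority(0.86, depth_prev=1, depth=2)    # directory-block page
q_text = child_priority(q_list, depth_prev=2, depth=3)  # body-text page
enqueue(queue, "http://hotels.example/detail/17", q_text)
enqueue(queue, "http://hotels.example/list?page=2", q_list)
next_url = queue[0][1]  # highest-priority URL
```

Only the depth term changes between parent and child, so the child weight follows from the parent weight without recomputing the importance and frequency factors.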

The central node queue is partitioned with the weighted least-connection scheduling (Weighted Least-Connection Scheduling) algorithm. The processing capacity of each child node is represented by a weight. The default weight is 1, and the system administrator can set the weights of the child node servers dynamically. When scheduling new connections, weighted least-connection scheduling tries to keep each server's number of established connections proportional to its weight. Tasks are distributed as follows: assume a group of child node servers S = {S0, S1, ..., Sn-1}, where W(Si) is the weight of server Si and C(Si) is the current number of connections of child node server Si. The total number of current connections across all child node servers is CSUM = Σ C(Si) (i = 0, 1, ..., n-1).

The current new connection request is sent to child node server Sm, that is, the seed is dispatched to Sm, if and only if Sm satisfies the following condition:

\begin{matrix} \frac{C(S_{m}) / CSUM}{W(S_{m})} = \min \frac{C(S_{i}) / CSUM}{W(S_{i})} \\ \Rightarrow \frac{C(S_{m})}{W(S_{m})} = \min \frac{C(S_{i})}{W(S_{i})} \end{matrix}, (i = 0, 1, ..., n - 1)

where W(Si) is not 0. The logs of the child nodes are fed back to the central node at regular intervals, and the connection count C(Si) of each child server is obtained by reading the logs. The invention compares the ratio of each child node's connection count to its preset weight, finds the child node with the lowest load, and assigns it the new crawl task.
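The selection rule can be sketched as below. The `ChildNode` class and function names are hypothetical, and incrementing the connection count locally after each dispatch is a simplifying assumption; in the described system C(Si) comes from the periodically fed-back logs.

```python
from dataclasses import dataclass


@dataclass
class ChildNode:
    name: str
    weight: float = 1.0   # W(S_i); default 1, adjustable by the administrator
    connections: int = 0  # C(S_i); read from log feedback in the real system


def pick_node(nodes):
    """Return the node that minimizes C(S_i) / W(S_i); zero-weight nodes are skipped."""
    candidates = [n for n in nodes if n.weight > 0]
    if not candidates:
        raise ValueError("no child node with a non-zero weight")
    return min(candidates, key=lambda n: n.connections / n.weight)


def dispatch(seeds, nodes):
    """Assign each seed, in priority order, to the currently least-loaded node."""
    assignment = {}
    for seed in seeds:
        node = pick_node(nodes)
        node.connections += 1  # assumption: count updated locally between log reports
        assignment[seed] = node.name
    return assignment


nodes = [ChildNode("s0", 1.0), ChildNode("s1", 2.0), ChildNode("s2", 0.0)]
plan = dispatch(["seed-a", "seed-b", "seed-c", "seed-d", "seed-e", "seed-f"], nodes)
```

After six dispatches, s1 (weight 2) holds four connections and s0 (weight 1) holds two, so the connection counts stay proportional to the weights, while the zero-weight node receives nothing.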

Parts of the invention that are not described in detail belong to techniques well known in the art.

The above is a further detailed description of the invention in connection with specific preferred embodiments, and the specific embodiments of the invention should not be regarded as limited to this description. A person of ordinary skill in the technical field of the invention may make some simple deductions or substitutions without departing from the inventive concept, and all of these should be regarded as falling within the scope of patent protection determined by the submitted claims.

Claims

1. a distributed network crawler system, it is characterised in that including: management door, Centroid server, distributed child node server；Management door is the web interface that manager is provided by crawler system, Centroid server and the daily record of distributed child node server can be checked, interpolation theme is set, update the URL seed of certain theme, the crawl frequency parameter of configuration theme, controls the state of reptile；Centroid server and distributed child node server are the main body of system as reptile, complete theme operation, the study of data pick-up device, page analysis and the storage of target pages；

Topic control module: receives the data sent by the management portal through the management interface, including topic description data, addition and deletion operation data, and data controlling the topic crawl frequency; carries out the operations on topics, including describing, adding, and deleting topics; controls the topic crawl frequency; edits the seed queue of each topic, and sends the topic seed queues to the extractor module and the URL controller module;

Extractor module: after receiving a topic seed queue, first uses the basic analyzer to classify the web pages represented by the URL addresses in the seed queue into Deep Web pages and data-intensive web pages; then processes the two kinds of pages separately to find the data extractor corresponding to each type; then records each URL address together with its corresponding data extractor, and sends the records to the URL controller;

URL controller: receives the seed queue sent by the topic control module and the records of URL addresses and corresponding data extractors sent by the extractor module, and integrates the two: each URL address is matched with its corresponding data extractor on the distributed child node servers, and a URL address that has no corresponding data extractor is mapped to the general extractor; all URL addresses are queued, and tasks are sent to each distributed child node server by the task splitting method;

(2) the distributed child node server includes a child node URL controller, a data extractor, a search controller, and a web page fetcher;

Child node URL controller: receives the seed URLs and the corresponding data extractor information sent by the central node server; after receiving a URL, first checks the URL address for duplicates, then places the URL addresses that have not already been collected into the queue, and sends the URL addresses in the queue and the corresponding data extractor information to the data extractor and the web page fetcher;

Data extractor: performs page analysis on the Deep Web pages from the child node URL queue and extracts the URLs in each page to form new URLs, equivalent to the object obtained after a form is submitted, and passes them to the web page fetcher; after receiving a page sent by the search controller, uses the extractor corresponding to the page's URL address to extract the content and the URL addresses in the page, and then places the URL addresses into the URL address base to await collection;

Web page fetcher: receives the URL addresses sent by the child node URL controller and the data extractor, fetches the web pages, and supplies the fetched pages to the search controller;

Search controller: analyzes the collected pages it receives; pages that meet the requirements are saved into the page pool, and all other pages are passed to the data extractor;

The task splitting method is implemented as follows:

(21) calculate the ratio α_i of the difference between the PR of the seed and the minimum PR in the URL queue to the difference between the maximum PR and the minimum PR:

\alpha_{i} = \frac{PR_{i} - low(PR)}{top(PR) - low(PR)}

PR is the web page rank, i = 0, 1, ..., n-1, n is the number of child nodes, and top(PR) and low(PR) are the maximum and minimum PR respectively;

(22) calculate the weight of the search depth; the target page depth of the context guide is 3, and the weight influence factor β_i of the search depth is the reciprocal of the seed's own depth L_i:

β_i=1/L_i

(23) calculate the crawl frequency factor using the Sigmoid function; the crawl frequency x_i is calculated as follows:

where F_i is the crawl frequency of the seed; Top(F) and Low(F) are the maximum and minimum frequency in the queue respectively;

The crawl frequency influence factor γ_i is calculated as:

The value of a is greater than 1; it is a weighting factor applied to the linearly smoothed result, and its purpose is to amplify the result of the previous step;

(24) judging from the Sigmoid function curve, the priority weight of a seed is the arithmetic mean Q_i of the 3 influence factors:

Q_{i} = \frac{\alpha_{i} + \beta_{i} + \gamma_{i}}{3}

(25) the seeds are then sorted in descending order of Q_i; when the value of Q varies only with the search depth factor while the 2 factors of crawl frequency and website importance remain constant, Q is simply calculated as follows:

where Q_prev is the weight passed down from the parent URL; β_prev is the search depth factor of the parent URL; β is the search depth factor of the target URL;

(26) the new crawl task is assigned as follows: assume a group of child node servers S = {S_0, S_1, ..., S_{n-1}}, where W(S_i) denotes the weight of child node server S_i and C(S_i) denotes the current connection count of child node server S_i; the sum of the current connection counts of all child node servers is CSUM = ΣC(S_i), i = 0, 1, ..., n-1;

The current new connection request is sent to child node server S_m, and the seed is forwarded to it, if and only if S_m satisfies the following condition:

\frac{C(S_{m})}{W(S_{m})} = \min_{i} \frac{C(S_{i})}{W(S_{i})}, \quad i = 0, 1, \ldots, n - 1

where W(S_i) is not 0; the logs of the child nodes are fed back to the central node server at regular intervals, and the child node server connection count C(S_i) is obtained by reading these logs; the ratio of each distributed child node server's connection count to its prior weight is compared, the child node with the minimum load is obtained, and the new crawl task is assigned to it.
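Steps (21) to (25) of the task splitting method can be illustrated with a short Python sketch. Only the PR ratio, the depth reciprocal, and the arithmetic mean of three factors are stated in the text. The exact expressions for x_i and γ_i are not reproduced here, so the sketch assumes min-max scaling for x_i and a Sigmoid of a·x_i for γ_i. The depth update in `child_priority` likewise only follows from assuming that the PR and frequency factors stay fixed between parent and target URL. All names (`Seed`, `priorities`, `child_priority`) and sample values are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Seed:
    url: str
    pr: float     # page rank PR_i
    depth: int    # search depth L_i, at least 1
    freq: float   # crawl frequency F_i


def min_max(value: float, low: float, top: float) -> float:
    """Scale value linearly into [0, 1]; 0.0 if all values are equal."""
    return 0.0 if top == low else (value - low) / (top - low)


def priorities(seeds: List[Seed], a: float = 2.0) -> List[float]:
    """Priority weight Q_i of each seed; a > 1 as required by step (23)."""
    prs = [s.pr for s in seeds]
    freqs = [s.freq for s in seeds]
    result = []
    for s in seeds:
        alpha = min_max(s.pr, min(prs), max(prs))       # step (21)
        beta = 1.0 / s.depth                            # step (22)
        x = min_max(s.freq, min(freqs), max(freqs))     # ASSUMED form of x_i
        gamma = 1.0 / (1.0 + math.exp(-a * x))          # ASSUMED form of gamma_i
        result.append((alpha + beta + gamma) / 3.0)     # step (24)
    return result


def child_priority(q_prev: float, beta_prev: float, beta: float) -> float:
    """Step (25), ASSUMING only the depth factor differs from the parent URL."""
    return q_prev + (beta - beta_prev) / 3.0


# Hypothetical seeds.
seeds = [
    Seed("http://example.com/a", pr=0.9, depth=1, freq=10.0),
    Seed("http://example.com/b", pr=0.1, depth=3, freq=1.0),
    Seed("http://example.com/c", pr=0.5, depth=2, freq=5.0),
]
q = priorities(seeds)
# Sort in descending order of Q_i, as in step (25).
ranked = sorted(zip(seeds, q), key=lambda pair: pair[1], reverse=True)
ranked_urls = [seed.url for seed, _ in ranked]
```

Under these assumptions the shallow, high-PR, frequently crawled seed is queued first, and a URL one level deeper than its parent inherits the parent's weight reduced by one third of the drop in the depth factor.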