CN106776768A - URL crawling method and system for a distributed crawler engine - Google Patents
URL crawling method and system for a distributed crawler engine
- Publication number
- CN106776768A CN106776768A CN201611037722.XA CN201611037722A CN106776768A CN 106776768 A CN106776768 A CN 106776768A CN 201611037722 A CN201611037722 A CN 201611037722A CN 106776768 A CN106776768 A CN 106776768A
- Authority
- CN
- China
- Prior art keywords
- url
- tasks
- url tasks
- noise
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
Abstract
A URL crawling method for a distributed crawler engine, comprising the following steps. S100: collect URL tasks and store them. S200: based on a multitask partitioning strategy using website hash values, assign the set of URLs sharing a domain name to the same crawl node for crawling, and collect the crawl results. S300: perform distributed parallel clustering on the original web pages corresponding to the crawled URL tasks, and remove noise URL tasks. S400: eliminate duplicate URL tasks from the denoised URL tasks. S500: judge whether the number of layers crawled by the deduplicated URL tasks is below a preset value; if so, return to step S300, otherwise perform step S600. S600: merge the original web pages corresponding to the URL tasks crawled at each layer. URL tasks are distributed to different crawl nodes by domain name, so that different crawl nodes process URL tasks of different domain names, lightening the task load of each crawl node.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a URL crawling method and system for a distributed crawler engine.
Background technology
With the explosive growth of Internet information, the information a user is interested in is submerged in a mass of irrelevant information, and using a search engine has become the most convenient way for people to obtain it. As one of the basic components of a search engine, a web crawler must face the Internet directly and collect information from it without interruption, providing the search engine with its data source. Whether the retrieved information is accurate is closely related to the web crawler. But the Internet is enormous in scale: websites are numerous and web pages number in the hundreds of billions, and data of such magnitude places high demands on the design and implementation of a web crawler; building a distributed web crawler system is an effective solution. A web crawler is a robot program that starts downloading page documents from specified URL addresses, extracts the URL addresses they contain, and then continues crawling from the extracted URLs.
A traditional distributed crawler engine is mainly master-slave: a dedicated master server maintains the queue of URLs to be fetched and is responsible for distributing URLs to the different slave servers, while the slave servers perform the actual page fetching. Besides maintaining the URL queue and distributing URLs, the master server must also balance the load of each slave server, in case some slaves become too idle or overworked. Under this pattern the master server tends to become the system bottleneck.
The Chinese patent with application No. 201210090259.0 discloses a URL deduplication method for a distributed web crawler system. By introducing virtual crawler nodes it realizes an efficient multitask partitioning strategy that adapts well to the dynamic change of the actual crawl nodes in a distributed crawler system, and on the basis of that strategy it uses a distributed URL deduplication scheme to avoid the repeated crawling caused when actual crawl nodes change. When tasks are repartitioned the scale of change is small, the crawler system keeps running stably, the partitioning strategy adapts dynamically, and the load of the actual crawl nodes can be balanced; but it cannot solve the problem of the crawler engine's URL-fetching efficiency under high concurrency.
The Chinese patent with application No. 201210425213.X discloses a URL deduplication system and method for a distributed web crawler. The system includes crawler collection child nodes, a central server and a database server. In the method, a crawler collection child node registers with the central server; the child node obtains a URL from the database waiting queue and extracts new URLs from it; the child node performs first-level deduplication on each new URL and discards it if the check fails; if the check passes, the new URL is added to the local URL digest and sent to the central server; the central server performs second-level deduplication, and if that passes the URL is added to the global URL digest; the child node then adds the URL's links to the waiting queue. Through this two-level deduplication mechanism, the deduplication work that would otherwise concentrate on the central node is decomposed: each crawler collection child node performs first-level deduplication, while the central server maintains a global deduplication table through second-level deduplication. This method still cannot solve the crawler engine's URL-fetching efficiency under high concurrency, nor the load-balancing problem of distributed crawler tasks.
Summary of the invention
An object of the present invention is to propose a URL crawling method and system for a distributed crawler engine that can improve the crawler engine's URL-fetching efficiency under high concurrency and balance the load of distributed crawler tasks, solving the low efficiency and load imbalance of existing crawler engines.
To achieve these goals, the technical solution adopted in the present invention is:
A URL crawling method for a distributed crawler engine, comprising the following steps:
S100: collect URL tasks and store them;
S200: based on a multitask partitioning strategy using website hash values, assign the set of URLs sharing a domain name to the same crawl node for crawling, and collect the crawl results;
S300: perform distributed parallel clustering on the original web pages corresponding to the crawled URL tasks, and remove noise URL tasks;
S400: eliminate duplicate URL tasks from the denoised URL tasks;
S500: judge whether the number of layers crawled by the deduplicated URL tasks is below a preset value; if so, return to step S300, otherwise perform step S600;
S600: merge the original web pages corresponding to the URL tasks crawled at each layer.
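The S100-S600 loop can be sketched as a driver function. This is only an illustrative outline, not the patent's implementation; the callables fetch, denoise and dedup are hypothetical placeholders for the components the steps describe.

```python
def crawl_pipeline(seeds, max_depth, fetch, denoise, dedup):
    """Sketch of steps S100-S600: crawl layer by layer until the preset
    depth is reached or the frontier empties, then merge all layers."""
    frontier = list(seeds)                     # S100: collected URL tasks
    pages_by_layer = []
    depth = 0
    while frontier and depth < max_depth:      # S500: depth check
        pages = [fetch(u) for u in frontier]   # S200: fetch on crawl nodes
        pages_by_layer.append(pages)
        new_urls = denoise(pages)              # S300: remove noise URLs
        frontier = dedup(new_urls)             # S400: drop duplicates
        depth += 1
    # S600: merge the pages crawled at every layer.
    return [p for layer in pages_by_layer for p in layer]
```

With a denoise step that yields no new URLs, the loop terminates after one layer and returns that layer's pages.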
In step S300, before distributed parallel clustering is performed, noise URL tasks are first removed preliminarily using the DOM tree structure of the web page, including:
S301: split the page using HTML tags such as <td>, <p> and <div>, and remove tags that are relevant to rendering but irrelevant to URL tasks;
S302: locate noise links using the link-text ratio; if the text ratio of a node is higher than 1/4, judge the link where that node resides to be an initial noise link and remove it.
In step S300, the distributed parallel clustering and removal of noise URL tasks comprise the following steps:
S311: perform domain-name resolution on the original web pages using map/reduce and divide them into preliminary blocks;
S312: run single-pass clustering on each block and divide the clustering result into multiple families using map/reduce;
S313: using the existing noise samples, compute the similarity of the multiple families of the clustering result with map/reduce, and remove noise URL tasks according to the similarity values;
S314: store the denoised URL tasks using map/reduce.
In step S400, eliminating the URL tasks repeated in the crawl results includes:
S401: add the collected URL tasks to the access list, record each URL task's access time, and set its repetition count to 1;
S402: compare the denoised URL tasks against the access list and the cache list in turn; if a URL task is found in the access list or cache list, discard it and update its access time and repetition count; if it is found in neither, look it up among the stored URL tasks;
S403: judge whether the URL task already exists among the stored URL tasks; if so, discard it, otherwise update the access list and the URL task set.
During distributed parallel clustering, the similarity is measured as the longest common subsequence length divided by the mean of the two URL task lengths.
The invention also discloses a URL crawling system for a distributed crawler engine, including:
an acquisition module for collecting and storing URL tasks;
a sorting module for assigning, based on a multitask partitioning strategy using website hash values, the set of URLs sharing a domain name to the same crawl node for crawling, and collecting the crawl results;
a denoising module for performing distributed parallel clustering on the original web pages corresponding to the crawled URL tasks and removing noise URL tasks;
a deduplication module for eliminating duplicate URL tasks from the denoised URL tasks;
a judging module for judging whether the number of layers crawled by the deduplicated URL tasks is below a preset value, passing control to the denoising module if so and to the merging module otherwise;
a merging module for merging the original web pages corresponding to the URL tasks crawled at each layer.
Before distributed parallel clustering, the denoising module also preliminarily removes noise URL tasks using the DOM tree structure of the web page, and includes:
a splitting unit for splitting the page using HTML tags such as <td>, <p> and <div> and removing tags that are relevant to rendering but irrelevant to URL tasks;
a comparing unit for locating noise links using the link-text ratio; if the text ratio of a node is higher than 1/4, the link where that node resides is judged to be an initial noise link and removed.
For the distributed parallel clustering and removal of noise URL tasks, the denoising module includes:
a blocking unit for performing domain-name resolution on the original web pages using map/reduce and dividing them into preliminary blocks;
a clustering unit for running single-pass clustering on each block and dividing the clustering result into multiple families using map/reduce;
a computing unit for computing, from the existing noise samples, the similarity of the multiple families of the clustering result using map/reduce and removing noise URL tasks according to the similarity values;
a storage unit for storing the denoised URL tasks using map/reduce.
To eliminate the URL tasks repeated in the crawl results, the deduplication module includes:
an establishing unit for adding the collected URL tasks to the access list, recording each URL task's access time, and setting its repetition count to 1;
a matching unit for comparing the denoised URL tasks against the access list and the cache list in turn; if a URL task is found in the access list or cache list, it is discarded and its access time and repetition count are updated; if it is found in neither, it is looked up among the stored URL tasks;
a duplicate-checking unit for judging whether the URL task already exists among the stored URL tasks, discarding it if so and otherwise updating the access list and the URL task set.
During distributed parallel clustering, the denoising module measures similarity as the longest common subsequence length divided by the mean of the two URL task lengths.
The beneficial effects of the present invention are:
First, URL tasks are distributed to different crawl nodes by domain name, so that different crawl nodes process URL tasks of different domain names, lightening the task load of each crawl node.
Second, each crawl node processes its URL tasks layer by layer, removes noise URL tasks by distributed parallel clustering, and eliminates duplicate URL tasks. This achieves efficient URL processing, solves the crawler engine's URL-fetching efficiency problem under high concurrency, and balances the load of the nodes.
Brief description of the drawings
Fig. 1 is a flow chart of a URL crawling method for a distributed crawler engine according to the invention;
Fig. 2 is a block diagram of a URL crawling system for a distributed crawler engine according to the invention.
Reference numerals: acquisition module-100, sorting module-200, denoising module-300, deduplication module-400, judging module-500, merging module-600.
Specific embodiment
The present invention will now be described in detail with reference to the specific embodiments shown in the drawings. These embodiments do not limit the invention, however, and structural, methodological or functional transformations made by one of ordinary skill in the art according to these embodiments are all contained within the protection scope of the invention.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are also apparent: HDFS is highly fault-tolerant and is designed to be deployed on low-cost machines, and it provides high-throughput access to application data, making it especially suitable for applications with very large data sets. HDFS relaxes a subset of the POSIX requirements to enable streaming access to file system data. HDFS was originally developed as the infrastructure of the Apache Nutch search engine project and is now part of the Apache Hadoop Core project.
Referring to Fig. 1, the invention discloses a URL crawling method for a distributed crawler engine that stores its data in the HDFS file system and comprises the following steps:
S100: collect URL tasks and store them. In this process, the URL seeds are uploaded into the "in" folder on HDFS and the crawled layer count is set to 0; uploading the URL seeds realizes the collection of URL tasks.
S200: based on a multitask partitioning strategy using website hash values, assign the set of URLs sharing a domain name to the same crawl node for crawling, and collect the crawl results.
In step S200, a CrawlerDriver function is called to fetch the web pages corresponding to the URL tasks in the "in" folder, and the results are stored in the "doc" folder on HDFS. The CrawlerDriver function executes as follows: the URL tasks to be fetched are extracted from the "in" folder, and the corresponding original web pages are downloaded and stored under the "doc" folder. An original web page is stored as a key-value pair whose key is the URL and whose value is the corresponding HTML of the page.
To improve the crawl efficiency of each node, the multitask partitioning strategy based on website hash values assigns the set of URLs sharing a domain name to the same crawl node as far as possible. The strategy is: compute the hash value of the domain-name part of each URL task to be fetched, then divide the URL task set in the "in" folder into different subsets according to those hash values; the URL tasks in each subset are all crawled on the same crawl node according to the map tasks of map/reduce, and reduce tasks are then called to aggregate the results crawled on all crawl nodes onto HDFS.
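The domain-hash partition can be sketched as follows. This is a minimal Python illustration: the function name partition_by_domain is made up here, and MD5 is an assumption, since the patent does not name the hash function.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse

def partition_by_domain(urls, num_nodes):
    """Group URL tasks into per-node subsets so that every URL sharing a
    domain name lands on the same crawl node (sketch of the hash partition)."""
    buckets = defaultdict(list)
    for url in urls:
        domain = urlparse(url).netloc
        # A stable hash of the domain only, so the same domain always
        # maps to the same node regardless of path or query string.
        h = int(hashlib.md5(domain.encode("utf-8")).hexdigest(), 16)
        buckets[h % num_nodes].append(url)
    return dict(buckets)
```

In a real map/reduce job each bucket would become the input of one map task, with a reduce step collecting the fetched pages back onto HDFS.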
S300: perform distributed parallel clustering on the original web pages corresponding to the crawled URL tasks, and remove noise URL tasks.
In step S300, the stored URL tasks can be optimized with an OptimizerDriver function. Because web pages contain a large number of noise URL tasks, the stored URL tasks are clustered, and the noise URL tasks are removed using noise samples collected from the DOM trees of the pages.
In step S300, the distributed parallel clustering and removal of noise URL tasks comprise the following steps:
S311: perform domain-name resolution on the original web pages using map/reduce and divide them into preliminary blocks;
S312: run single-pass clustering on each block and divide the clustering result into multiple families using map/reduce. Specifically, in this step the map tasks on the crawl nodes are called to run single-pass clustering on each block, and reduce tasks are called to collect the clustering results;
S313: using the existing noise samples, compute the similarity of the multiple families of the clustering result with map/reduce, and remove noise URL tasks according to the similarity values. In this process each family is examined for noise URLs according to the computed similarities: a URL task in a family whose similarity value deviates markedly from those of the other URL tasks is judged to be a noise URL and removed, after which reduce tasks aggregate the denoised URL tasks onto HDFS.
The similarity used in the clustering is measured as the longest common subsequence (LCS) length divided by the mean of the two URL lengths. The LCS of two URLs is solved with a dynamic programming algorithm whose transfer equation is:
c[i, j] = 0, if i = 0 or j = 0;
c[i, j] = c[i-1, j-1] + 1, if i, j > 0 and xi = yj;
c[i, j] = max(c[i, j-1], c[i-1, j]), if i, j > 0 and xi ≠ yj;
where c[i, j] records the length of the longest common subsequence of the strings Xi = {x1, x2, ..., xi} and Yj = {y1, y2, ..., yj}, and Xi and Yj are the first i and j characters of the two URLs url1 and url2 being compared.
The LCS is computed as follows: a two-dimensional array c[i][j] records the LCS length of the url1 prefix Xi and the url2 prefix Yj; a double loop continuously updates c[i][j] as the prefixes grow, and when the loop ends the LCS length of url1 and url2 is c[m, n]. The LCS measures the structural similarity between URLs well and is suitable for large-scale URL similarity computation.
The ratio average is computed and compared with a preset threshold to judge whether a family contains noise URL tasks; the ratio average of two URLs is the LCS length divided by the mean of the two URL lengths.
S314: store the denoised URL tasks using map/reduce.
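The LCS dynamic program described above translates directly into code; url_similarity below sketches the LCS-over-mean-length measure, with the function names chosen here for illustration.

```python
def lcs_length(a, b):
    """Longest common subsequence length via the standard DP recurrence:
    c[i][j] = c[i-1][j-1] + 1 when characters match,
    else max(c[i-1][j], c[i][j-1])."""
    m, n = len(a), len(b)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[m][n]

def url_similarity(url1, url2):
    """LCS length divided by the mean of the two URL lengths."""
    return lcs_length(url1, url2) / ((len(url1) + len(url2)) / 2)
```

Two URLs that differ only in their last path segment score close to 1, which is why the measure captures structural similarity rather than exact equality.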
In step S300, before distributed parallel clustering is performed, noise URL tasks are first removed preliminarily using the DOM tree structure of the web page. This reduces the amount of data processed during clustering, improves data-processing efficiency, and shortens processing time. The preliminary removal using the DOM tree structure of the web page includes:
S301: split the page using HTML tags such as <td>, <p> and <div>, and remove tags that are relevant to rendering but irrelevant to URL tasks. In this step the page is cleaned: rendering-related content such as CSS is removed and the page is split by HTML tags such as <td>, <p> and <div>, removing tags relevant to rendering but irrelevant to URL tasks so that the structure of the DOM tree is clear;
S302: locate noise links using the link-text ratio; if the text ratio of a node is higher than 1/4, the link where that node resides is judged to be an initial noise link and removed.
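A sketch of the S302 test using Python's html.parser. Interpreting "text ratio above 1/4" as the share of a block's text that sits inside <a> tags is an assumption; the patent's wording is ambiguous on what the ratio compares.

```python
from html.parser import HTMLParser

class LinkTextRatio(HTMLParser):
    """Accumulate total text length and anchor-text length for a block."""
    def __init__(self):
        super().__init__()
        self.in_link = 0
        self.total = 0
        self.linked = 0
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
    def handle_data(self, data):
        text = data.strip()
        self.total += len(text)
        if self.in_link:
            self.linked += len(text)

def is_noise_block(html_block, threshold=0.25):
    """Flag a block whose text is dominated by link text (or is all markup)."""
    p = LinkTextRatio()
    p.feed(html_block)
    if p.total == 0:
        return True
    return p.linked / p.total > threshold
```

Navigation bars score near 1.0 on this ratio while article paragraphs score near 0, which is the contrast the preliminary filter exploits.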
S400: eliminate the URL tasks repeated in the denoised URL tasks.
In this step, the stored duplicate URL tasks are eliminated, and the remainder are saved into the "in" folder to wait for the next round of crawling. To solve the low lookup efficiency of existing URL deduplication algorithms on massive URL tasks, a cache-based URL deduplication strategy is preferred. The cache class of the caching system contains two queues, the access list and the cache list, and the URL class contains three fields: the URL string, the URL repetition count, and the URL access time. The URL task queue in the cache list is ordered from high to low by a weight computed from the URL's repetition count and access time. In the weight computation, rep denotes the repetition count, t denotes the time index, t_current denotes the URL's current access time, t_init denotes the initial time set when the program starts, and min and max denote the lower and upper bounds of the t index of the URLs in the cache list.
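The weight formula itself appears only as an image in the original and is not reproduced here. The sketch below assumes a hypothetical form, repetition count scaled by min-max-normalized recency, chosen only to match the stated behaviour that more repeats and more recent access both yield a larger weight.

```python
def url_weight(rep, t_current, t_init, t_min, t_max):
    """Hypothetical cache weight: NOT the patent's formula.
    rep: repetition count; t_current - t_init: time index of this URL;
    t_min, t_max: bounds of the time index across the cache list."""
    t = t_current - t_init
    # Min-max normalize the time index so recency lies in [0, 1].
    if t_max == t_min:
        recency = 1.0
    else:
        recency = (t - t_min) / (t_max - t_min)
    # More repeats and fresher access both increase the weight.
    return rep * (1.0 + recency)
```

Sorting the cache list by this weight puts the most frequently repeated and most recently active URLs first, so duplicate checks hit early in the scan.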
Here, the higher a URL's repetition count, the larger its weight; and the closer a URL's access time is to the current time, the more active the URL, so the weight assigned to it is also larger. A high repetition count means the URL is a common link that is more likely to be met in subsequent duplicate checks, so checking URLs in weight order greatly raises the probability of a hit. The caching system can update the accessed URL tasks at any time according to changes in the URL weights, and the strategy for removing repeated URL tasks is mainly completed by the following steps:
S401: add the collected URL tasks to the access list, record each URL task's access time, and set its repetition count to 1. In step S401, after the crawler system starts, the caching system first initializes its queues: the URL tasks under the "in" folder are added to the access list, their access times are recorded, and their repetition counts are set to 1;
S402: compare the denoised URL tasks against the access list and the cache list in turn; if a URL task is found in the access list or cache list, discard it and update its access time and repetition count; if it is found in neither, look it up among the stored URL tasks;
S403: judge whether the URL task already exists among the stored URL tasks. In step S403 the caching system is checked for the URL task: if it exists, the URL task is discarded and its access time and repetition count are updated; if it does not exist, the URL task is added under the "in" folder for the next round of crawling.
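The S401-S403 flow can be sketched as a single pass over the incoming URL tasks. Collapsing the access list and cache list into one dictionary and representing the stored tasks as a set are simplifying assumptions made for this illustration.

```python
def deduplicate(urls, access_list, stored):
    """Sketch of S401-S403: drop URLs already seen in the access list
    (bumping their repeat count) or already stored; keep the rest."""
    kept = []
    for url in urls:
        if url in access_list:
            access_list[url] += 1   # S402: repeated hit, update the count
            continue
        if url in stored:           # S403: already fetched in an earlier round
            continue
        access_list[url] = 1        # S401: first sighting, count starts at 1
        stored.add(url)
        kept.append(url)            # goes back to the "in" folder for crawling
    return kept
```

The kept URLs are what the next crawl round would pick up, while the counts in the access list feed the weight ordering described above.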
S500: judge whether the number of layers crawled by the deduplicated URL tasks is below a preset value; if so, return to step S300, otherwise perform step S600.
S600: merge the original web pages corresponding to the URL tasks crawled at each layer.
In the URL crawling method of the distributed crawler engine described in the above embodiment, URL tasks are distributed to different crawl nodes by domain name, so that different crawl nodes process URL tasks of different domain names, lightening the task load of each crawl node. Each crawl node processes its URL tasks layer by layer, removes noise URL tasks by distributed parallel clustering, and eliminates duplicate URL tasks. This achieves efficient URL processing, solves the crawler engine's URL-fetching efficiency problem under high concurrency, and balances the load of the nodes.
Referring to Fig. 2, an embodiment of the present invention also discloses a URL crawling system for a distributed crawler engine, including:
an acquisition module 100 for collecting and storing URL tasks;
a sorting module 200 for assigning, based on a multitask partitioning strategy using website hash values, the set of URLs sharing a domain name to the same crawl node for crawling, and collecting the crawl results;
a denoising module 300 for performing distributed parallel clustering on the original web pages corresponding to the crawled URL tasks and removing noise URL tasks;
a deduplication module 400 for eliminating duplicate URL tasks from the denoised URL tasks;
a judging module 500 for judging whether the number of layers crawled by the deduplicated URL tasks is below a preset value, passing control to the denoising module if so and to the merging module otherwise;
a merging module 600 for merging the original web pages corresponding to the URL tasks crawled at each layer.
In an improved embodiment of the URL crawling system described above, before distributed parallel clustering the denoising module also preliminarily removes noise URL tasks using the DOM tree structure of the web page, and includes:
a splitting unit for splitting the page using HTML tags such as <td>, <p> and <div> and removing tags that are relevant to rendering but irrelevant to URL tasks;
a comparing unit for locating noise links using the link-text ratio; if the text ratio of a node is higher than 1/4, the link where that node resides is judged to be an initial noise link and removed.
In another improved embodiment of the URL crawling system described above, for the distributed parallel clustering and removal of noise URL tasks the denoising module includes:
a blocking unit for performing domain-name resolution on the original web pages using map/reduce and dividing them into preliminary blocks;
a clustering unit for running single-pass clustering on each block and dividing the clustering result into multiple families using map/reduce;
a computing unit for computing, from the existing noise samples, the similarity of the multiple families of the clustering result using map/reduce and removing noise URL tasks according to the similarity values;
a storage unit for storing the denoised URL tasks using map/reduce.
In an improved embodiment of the URL crawling system described above, to eliminate the URL tasks repeated in the crawl results the deduplication module includes:
an establishing unit for adding the collected URL tasks to the access list, recording each URL task's access time, and setting its repetition count to 1;
a matching unit for comparing the denoised URL tasks against the access list and the cache list in turn; if a URL task is found in the access list or cache list, it is discarded and its access time and repetition count are updated; if it is found in neither, it is looked up among the stored URL tasks;
a duplicate-checking unit for judging whether the URL task already exists among the stored URL tasks, discarding it if so and otherwise updating the access list and the URL task set.
In the URL crawling system described above, during distributed parallel clustering the denoising module measures similarity as the longest common subsequence length divided by the mean of the two URL task lengths.
The URL crawling system of the distributed crawler engine in this embodiment is the system embodiment corresponding to the URL crawling method of the distributed crawler engine; the relevant technical details mentioned for the URL crawling method remain valid in this embodiment and, to reduce repetition, are not repeated here.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of narration is only for clarity, and those skilled in the art should take the specification as a whole, as the technical solutions in the embodiments may also be appropriately combined to form other embodiments understandable to those skilled in the art.
The detailed descriptions listed above are only specific illustrations of feasible embodiments of the invention; they are not intended to limit the protection scope of the invention, and all equivalent embodiments or changes made without departing from the technical spirit of the invention shall be included within its protection scope.
Claims (10)
1. A URL grabbing method of a distributed crawler engine, characterized by comprising the following steps:
S100: collecting URL tasks and storing them;
S200: based on a multi-task partition strategy using site hash values, assigning the set of URL tasks sharing the same domain name to the same crawl node for crawling, and collecting the crawl results;
S300: performing distributed parallel clustering on the original web pages corresponding to the crawled URL tasks, and removing noise URL tasks;
S400: eliminating duplicate URL tasks from the URL tasks remaining after noise removal;
S500: judging whether the crawl depth of the deduplicated URL tasks is less than a preset value; if so, returning to step S300; otherwise, executing step S600;
S600: merging the original web pages corresponding to the URL tasks crawled at each depth.
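Step S200's routing (all URLs of one domain to one crawl node) can be sketched as follows. This is an illustrative realisation only: the claim names a "multi-task partition strategy based on site hash values" without fixing the hash function, so the MD5-modulo assignment here is an assumption:

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse


def partition_urls(urls, num_nodes):
    """Bucket URL tasks by a hash of their domain so every URL of a
    given site lands on the same crawl node (sketch of step S200)."""
    buckets = defaultdict(list)
    for url in urls:
        domain = urlparse(url).netloc
        # Stable hash of the domain, reduced modulo the node count.
        node = int(hashlib.md5(domain.encode()).hexdigest(), 16) % num_nodes
        buckets[node].append(url)
    return buckets
```

Because the node index depends only on the domain, politeness limits and DNS caching per site stay local to one node, which is the usual motivation for this partitioning.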
2. The URL grabbing method of a distributed crawler engine according to claim 1, characterized in that: in step S300, before the distributed parallel clustering is performed, noise URL tasks are preliminarily removed using the DOM tree structure of the web page, comprising:
S301: splitting the page using HTML tags such as <td>, <p> and <div>, and removing tags that are related to rendering but unrelated to URL tasks;
S302: locating noise links using the link-character ratio: if the character ratio of a node is higher than 1/4, the links within that node are judged to be initial noise links and removed.
3. The URL grabbing method of a distributed crawler engine according to claim 1 or 2, characterized in that: in step S300, the distributed parallel clustering and removal of noise URL tasks comprise the following steps:
S311: performing domain-name resolution on the original web pages in a MapReduce manner, and carrying out a preliminary partition into blocks;
S312: applying single-pass clustering to each block, and dividing the clustering result into a plurality of clusters in a MapReduce manner;
S313: according to existing noise samples, performing similarity calculation on the plurality of clusters of the clustering result in a MapReduce manner, and removing noise URL tasks according to the similarity values;
S314: storing the denoised URL tasks in a MapReduce manner.
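Stripped of the MapReduce plumbing, the single-pass clustering of step S312 can be sketched as below. The similarity callback and the 0.6 threshold are assumptions; the claim fixes neither, and the similarity measure of claim 5 (mean common-substring ratio) is one natural choice:

```python
def single_pass_cluster(urls, similarity, threshold=0.6):
    """Single-pass clustering: each URL joins the first existing cluster
    whose representative it resembles above the threshold, otherwise it
    starts a new cluster. One scan over the data, as the name implies."""
    clusters = []  # each cluster is a list; its first element is the representative
    for url in urls:
        for cluster in clusters:
            if similarity(url, cluster[0]) >= threshold:
                cluster.append(url)
                break
        else:
            clusters.append([url])
    return clusters
```

In the MapReduce setting, each mapper would run this over its block and the reducer would merge clusters whose representatives are themselves similar.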
4. The URL grabbing method of a distributed crawler engine according to claim 1 or 2, characterized in that: in step S400, eliminating the duplicate URL tasks from the crawl results comprises:
S401: adding the collected URL tasks to an access list, recording the access time of each URL task, and setting its repetition count to 1;
S402: comparing the URL tasks remaining after noise removal against the access list and a cache list in turn; if a URL task is found in the access list or the cache list, discarding the URL task and updating its access time and repetition count; if it is found in neither list, accessing the stored URL tasks;
S403: judging whether the URL task already exists among the stored URL tasks; if so, discarding the URL task; otherwise, updating the access list and the URL task set.
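The access-list / cache-list flow of steps S401–S403 can be sketched as a small class. The field names and the in-memory dictionaries are illustrative assumptions; the patent describes the lists abstractly and does not specify their storage:

```python
import time


class UrlDeduplicator:
    """Sketch of steps S401-S403: a URL already present in the access
    list or cache list is discarded and its access time and repetition
    count are updated; an unseen URL is admitted and recorded."""

    def __init__(self):
        self.access_list = {}  # url -> {"time": last access, "count": repetitions}
        self.cache_list = {}   # recently seen URLs kept in memory

    def admit(self, url):
        """Return True if the URL task is new and should be crawled."""
        now = time.time()
        if url in self.access_list or url in self.cache_list:
            # Duplicate: discard, but refresh its bookkeeping (S402).
            entry = self.access_list.setdefault(url, {"time": now, "count": 0})
            entry["time"] = now
            entry["count"] += 1
            return False
        # New URL: record it with repetition count 1 (S401/S403).
        self.access_list[url] = {"time": now, "count": 1}
        self.cache_list[url] = now
        return True
```

A real deployment would bound the cache list (e.g. LRU) and persist the access list, since a distributed crawl cannot keep every URL in one process's memory.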
5. The URL grabbing method of a distributed crawler engine according to claim 1, 3 or 4, characterized in that: when the distributed parallel clustering is performed, the similarity is measured as the mean of the ratios that the maximum common word length occupies in the lengths of the two URL tasks.
6. A URL grabbing system of a distributed crawler engine, characterized by comprising:
an acquisition module, configured to collect URL tasks and store them;
a classification module, configured to assign, based on a multi-task partition strategy using site hash values, the set of URL tasks sharing the same domain name to the same crawl node for crawling, and to collect the crawl results;
a denoising module, configured to perform distributed parallel clustering on the original web pages corresponding to the crawled URL tasks and to remove noise URL tasks;
a deduplication module, configured to eliminate duplicate URL tasks from the URL tasks remaining after noise removal;
a judgment module, configured to judge whether the crawl depth of the deduplicated URL tasks is less than a preset value; if so, execution returns to the denoising module; otherwise, it passes to the merging module;
a merging module, configured to merge the original web pages corresponding to the URL tasks crawled at each depth.
7. The URL grabbing system of a distributed crawler engine according to claim 6, characterized in that: before the distributed parallel clustering is performed, the denoising module preliminarily removes noise URL tasks using the DOM tree structure of the web page, and comprises:
a splitting unit, configured to split the page using HTML tags such as <td>, <p> and <div>, removing tags that are related to rendering but unrelated to URL tasks;
a comparison unit, configured to locate noise links using the link-character ratio: if the character ratio of a node is higher than 1/4, the links within that node are judged to be initial noise links and removed.
8. The URL grabbing system of a distributed crawler engine according to claim 6 or 7, characterized in that: the denoising module performs distributed parallel clustering and removes noise URL tasks, and comprises:
a blocking unit, configured to perform domain-name resolution on the original web pages in a MapReduce manner and to carry out a preliminary partition into blocks;
a clustering unit, configured to apply single-pass clustering to each block and to divide the clustering result into a plurality of clusters in a MapReduce manner;
a calculation unit, configured to perform, according to existing noise samples, similarity calculation on the plurality of clusters of the clustering result in a MapReduce manner, and to remove noise URL tasks according to the similarity values;
a storage unit, configured to store the denoised URL tasks in a MapReduce manner.
9. The URL grabbing system of a distributed crawler engine according to claim 6 or 7, characterized in that: the deduplication module eliminates the duplicate URL tasks from the crawl results, and comprises:
an establishing unit, configured to add the collected URL tasks to an access list, record the access time of each URL task, and set its repetition count to 1;
a matching unit, configured to compare the URL tasks remaining after noise removal against the access list and the cache list in turn; if a URL task is found in the access list or the cache list, the URL task is discarded and its access time and repetition count are updated; if it is found in neither list, the stored URL tasks are accessed;
a duplicate-checking unit, configured to judge whether the URL task already exists among the stored URL tasks; if so, the URL task is discarded; otherwise, the access list and the URL task set are updated.
10. The URL grabbing system of a distributed crawler engine according to claim 6, 8 or 9, characterized in that: when performing the distributed parallel clustering, the denoising module measures the similarity as the mean of the ratios that the maximum common word length occupies in the lengths of the two URL tasks.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611037722.XA CN106776768B (en) | 2016-11-23 | 2016-11-23 | A kind of URL grasping means of distributed reptile engine and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611037722.XA CN106776768B (en) | 2016-11-23 | 2016-11-23 | A kind of URL grasping means of distributed reptile engine and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106776768A true CN106776768A (en) | 2017-05-31 |
CN106776768B CN106776768B (en) | 2018-02-02 |
Family
ID=58974402
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611037722.XA Active CN106776768B (en) | 2016-11-23 | 2016-11-23 | A kind of URL grasping means of distributed reptile engine and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106776768B (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107766581A (en) * | 2017-11-23 | 2018-03-06 | 安徽科创智慧知识产权服务有限公司 | The method that Data duplication record cleaning is carried out to URL |
CN107943588A (en) * | 2017-11-22 | 2018-04-20 | 用友金融信息技术股份有限公司 | Data processing method, system, computer equipment and readable storage medium storing program for executing |
CN107992534A (en) * | 2017-11-23 | 2018-05-04 | 安徽科创智慧知识产权服务有限公司 | The method that improved sort key sorts data set |
CN108804576A (en) * | 2018-05-22 | 2018-11-13 | 华中科技大学 | A kind of domain name hierarchical structure detection method based on link analysis |
CN109165334A (en) * | 2018-09-20 | 2019-01-08 | 恒安嘉新(北京)科技股份公司 | A method of establishing CDN producer primary knowledge base |
CN109740037A (en) * | 2019-01-02 | 2019-05-10 | 山东省科学院情报研究所 | The distributed online real-time processing method of multi-source, isomery fluidised form big data and system |
CN109739849A (en) * | 2019-01-02 | 2019-05-10 | 山东省科学院情报研究所 | A kind of network sensitive information of data-driven excavates and early warning platform |
CN111274467A (en) * | 2019-12-31 | 2020-06-12 | 中国电子科技集团公司第二十八研究所 | Large-scale data acquisition-oriented three-layer distributed deduplication architecture and method |
CN112597369A (en) * | 2020-12-22 | 2021-04-02 | 荆门汇易佳信息科技有限公司 | Webpage spider theme type search system based on improved cloud platform |
CN112612939A (en) * | 2020-12-18 | 2021-04-06 | 山东中创软件工程股份有限公司 | Crawler deployment method, system, device, equipment and storage medium |
CN113807087A (en) * | 2020-06-16 | 2021-12-17 | 中国电信股份有限公司 | Website domain name similarity detection method and device |
CN113821754A (en) * | 2021-09-18 | 2021-12-21 | 上海观安信息技术股份有限公司 | Sensitive data interface crawler identification method and device |
CN113965371A (en) * | 2021-10-19 | 2022-01-21 | 北京天融信网络安全技术有限公司 | Task processing method, device, terminal and storage medium in website monitoring process |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019180491A1 (en) * | 2018-03-22 | 2019-09-26 | Pratik Sharma | Uniform resource locator identification service |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110307467A1 (en) * | 2010-06-10 | 2011-12-15 | Stephen Severance | Distributed web crawler architecture |
CN103389983A (en) * | 2012-05-08 | 2013-11-13 | 阿里巴巴集团控股有限公司 | Webpage content grabbing method and device applied to network crawler system |
CN103714139A (en) * | 2013-12-20 | 2014-04-09 | 华南理工大学 | Parallel data mining method for identifying a mass of mobile client bases |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN104657399A (en) * | 2014-01-03 | 2015-05-27 | 广西科技大学 | Web crawler control method |
- 2016-11-23 CN CN201611037722.XA patent/CN106776768B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110307467A1 (en) * | 2010-06-10 | 2011-12-15 | Stephen Severance | Distributed web crawler architecture |
CN103389983A (en) * | 2012-05-08 | 2013-11-13 | 阿里巴巴集团控股有限公司 | Webpage content grabbing method and device applied to network crawler system |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN103714139A (en) * | 2013-12-20 | 2014-04-09 | 华南理工大学 | Parallel data mining method for identifying a mass of mobile client bases |
CN104657399A (en) * | 2014-01-03 | 2015-05-27 | 广西科技大学 | Web crawler control method |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107943588A (en) * | 2017-11-22 | 2018-04-20 | 用友金融信息技术股份有限公司 | Data processing method, system, computer equipment and readable storage medium storing program for executing |
CN107766581A (en) * | 2017-11-23 | 2018-03-06 | 安徽科创智慧知识产权服务有限公司 | The method that Data duplication record cleaning is carried out to URL |
CN107992534A (en) * | 2017-11-23 | 2018-05-04 | 安徽科创智慧知识产权服务有限公司 | The method that improved sort key sorts data set |
CN108804576A (en) * | 2018-05-22 | 2018-11-13 | 华中科技大学 | A kind of domain name hierarchical structure detection method based on link analysis |
CN108804576B (en) * | 2018-05-22 | 2021-08-20 | 华中科技大学 | Domain name hierarchical structure detection method based on link analysis |
CN109165334A (en) * | 2018-09-20 | 2019-01-08 | 恒安嘉新(北京)科技股份公司 | A method of establishing CDN producer primary knowledge base |
CN109165334B (en) * | 2018-09-20 | 2022-05-27 | 恒安嘉新(北京)科技股份公司 | Method for establishing CDN manufacturer basic knowledge base |
CN109740037A (en) * | 2019-01-02 | 2019-05-10 | 山东省科学院情报研究所 | The distributed online real-time processing method of multi-source, isomery fluidised form big data and system |
CN109739849A (en) * | 2019-01-02 | 2019-05-10 | 山东省科学院情报研究所 | A kind of network sensitive information of data-driven excavates and early warning platform |
CN109740037B (en) * | 2019-01-02 | 2023-11-24 | 山东省科学院情报研究所 | Multi-source heterogeneous flow state big data distributed online real-time processing method and system |
CN109739849B (en) * | 2019-01-02 | 2021-06-29 | 山东省科学院情报研究所 | Data-driven network sensitive information mining and early warning platform |
CN111274467A (en) * | 2019-12-31 | 2020-06-12 | 中国电子科技集团公司第二十八研究所 | Large-scale data acquisition-oriented three-layer distributed deduplication architecture and method |
CN113807087A (en) * | 2020-06-16 | 2021-12-17 | 中国电信股份有限公司 | Website domain name similarity detection method and device |
CN113807087B (en) * | 2020-06-16 | 2023-11-28 | 中国电信股份有限公司 | Method and device for detecting similarity of website domain names |
CN112612939A (en) * | 2020-12-18 | 2021-04-06 | 山东中创软件工程股份有限公司 | Crawler deployment method, system, device, equipment and storage medium |
CN112597369A (en) * | 2020-12-22 | 2021-04-02 | 荆门汇易佳信息科技有限公司 | Webpage spider theme type search system based on improved cloud platform |
CN113821754A (en) * | 2021-09-18 | 2021-12-21 | 上海观安信息技术股份有限公司 | Sensitive data interface crawler identification method and device |
CN113965371A (en) * | 2021-10-19 | 2022-01-21 | 北京天融信网络安全技术有限公司 | Task processing method, device, terminal and storage medium in website monitoring process |
CN113965371B (en) * | 2021-10-19 | 2023-08-29 | 北京天融信网络安全技术有限公司 | Task processing method, device, terminal and storage medium in website monitoring process |
Also Published As
Publication number | Publication date |
---|---|
CN106776768B (en) | 2018-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106776768B (en) | A kind of URL grasping means of distributed reptile engine and system | |
Cambazoglu et al. | Scalability challenges in web search engines | |
Ma et al. | Big graph search: challenges and techniques | |
CN103310012A (en) | Distributed web crawler system | |
CN106776929A (en) | A kind of method for information retrieval and device | |
CN103488680A (en) | Combinators to build a search engine | |
CN104866471B (en) | A kind of example match method based on local sensitivity Hash strategy | |
CN103678550B (en) | Mass data real-time query method based on dynamic index structure | |
CN107423535A (en) | For the methods, devices and systems for the medical conditions for determining user | |
CN108322428A (en) | A kind of abnormal access detection method and equipment | |
CN103226609A (en) | Searching method for WEB focus searching system | |
CN107122238A (en) | Efficient iterative Mechanism Design method based on Hadoop cloud Computational frame | |
CN103258017A (en) | Method and system for parallel square crossing network data collection | |
CN104573082B (en) | Space small documents distributed data storage method and system based on access log information | |
Vrbić | Data mining and cloud computing | |
CN112597369A (en) | Webpage spider theme type search system based on improved cloud platform | |
do Carmo Oliveira et al. | Set similarity joins with complex expressions on distributed platforms | |
Amalarethinam et al. | A study on performance evaluation of peer-to-peer distributed databases | |
Sundarakumar et al. | An Approach in Big Data Analytics to Improve the Velocity of Unstructured Data Using MapReduce | |
da Silva et al. | Efficient and distributed dbscan algorithm using mapreduce to detect density areas on traffic data | |
Zhong et al. | A web crawler system design based on distributed technology | |
Ren et al. | [Retracted] A Study on Information Classification and Storage in Cloud Computing Data Centers Based on Group Collaborative Intelligent Clustering | |
Maratea et al. | An heuristic approach to page recommendation in web usage mining | |
Henrique et al. | A new approach for verifying url uniqueness in web crawlers | |
Zhang et al. | Scalable Online Interval Join on Modern Multicore Processors in OpenMLDB |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||