CN106776768A - URL crawling method and system for a distributed crawler engine - Google Patents

URL crawling method and system for a distributed crawler engine

Info

Publication number
CN106776768A
CN106776768A (application CN201611037722.XA)
Authority
CN
China
Prior art keywords
url
tasks
url tasks
noise
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611037722.XA
Other languages
Chinese (zh)
Other versions
CN106776768B (en)
Inventor
王�琦
林子忠
欧伟
茅晓萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FUJIAN LIUREN NETWORK SECURITY Co Ltd
Original Assignee
FUJIAN LIUREN NETWORK SECURITY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FUJIAN LIUREN NETWORK SECURITY Co Ltd filed Critical FUJIAN LIUREN NETWORK SECURITY Co Ltd
Priority to CN201611037722.XA priority Critical patent/CN106776768B/en
Publication of CN106776768A publication Critical patent/CN106776768A/en
Application granted granted Critical
Publication of CN106776768B publication Critical patent/CN106776768B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a URL crawling method for a distributed crawler engine, comprising the following steps: S100: collect URL tasks and store them; S200: based on a multi-task partitioning strategy using website hash values, assign the set of URLs sharing a domain name to the same crawl node for crawling, and collect the crawl results; S300: perform distributed parallel clustering on the original web pages corresponding to the crawled URL tasks and remove noise URL tasks; S400: eliminate duplicate URL tasks from the URL tasks remaining after noise removal; S500: judge whether the number of crawled layers of the deduplicated URL tasks is less than a preset value; if so, return to step S300, otherwise perform step S600; S600: merge the original web pages corresponding to the URL tasks crawled at each layer. URL tasks are allocated to different crawl nodes according to their domain names, so that different crawl nodes handle URL tasks of different domain names, which reduces the task load on each crawl node.

Description

URL crawling method and system for a distributed crawler engine
Technical field
The present invention relates to the field of Internet technology, and in particular to a URL crawling method and system for a distributed crawler engine.
Background art
With the explosive growth of information on the Internet, the information a user is interested in is buried in a large amount of irrelevant information, and using a search engine to obtain information of interest has become the most convenient way for people to acquire information. As one of the basic components of a search engine, a web crawler must face the Internet directly and continuously collect information from it, providing the data source for the search engine. Whether the searched information is accurate is closely related to the web crawler. However, the Internet is enormous: the number of websites is huge and the number of web pages runs into the hundreds of billions. Data of such magnitude places high demands on the design and implementation of web crawlers, and building a distributed web crawler system is an effective solution. A web crawler is a robot program that starts downloading page documents from a specified URL address, extracts the URL addresses contained in them, and then continues to crawl from the extracted URL addresses.
A traditional distributed crawler engine is mainly master-slave: a dedicated master server maintains the queue of URLs to be crawled and is responsible for distributing URLs to the different slave servers, while the slave servers perform the actual web page crawling. In addition to maintaining the queue of URLs to be crawled and distributing URLs, the master server must also balance the load of the slave servers so that no slave server is left too idle or overloaded. Under this pattern, the master server tends to become the bottleneck of the system.
The Chinese patent application No. 201210090259.0 discloses a URL deduplication method in a distributed web crawler system. By introducing virtual crawler nodes it realizes an efficient task partitioning strategy that better adapts to dynamic changes of the actual crawl nodes in a distributed web crawler system, and on the basis of the partitioning strategy it uses a distributed URL deduplication scheme to avoid repeated crawling caused by changes of the actual crawl nodes. In that invention the scale of change when tasks are re-partitioned is small, which ensures stable continuous operation of the crawler system; the partitioning strategy adapts dynamically, so load balancing of the actual crawl nodes can be achieved. However, it cannot solve the problem of the crawler engine's URL crawling efficiency under high concurrency.
The Chinese patent application No. 201210425213.X discloses a URL deduplication system and method for a distributed web crawler. The system includes crawler collection child nodes, a central server and a database server. In the method, a crawler collection child node registers with the central server; the crawler collection child node obtains a URL from the database waiting queue and obtains new URL information from that URL; the crawler collection child node performs first-level deduplication on the newly obtained URL, and if the first-level deduplication fails, the URL is discarded; if the first-level deduplication passes, the newly obtained URL is added to the local URL digest and sent to the central server; the central server performs second-level deduplication on the newly obtained URL, and if the second-level deduplication passes, the URL is added to the global URL digest; the crawler collection child node then adds the link of that URL to the waiting queue. Through this hierarchical deduplication mechanism, the deduplication work that would otherwise be concentrated on the central node is decomposed: each crawler collection child node performs first-level deduplication, and the central server maintains a global deduplication table by means of second-level deduplication. The above method can solve neither the problem of the crawler engine's URL crawling efficiency under high concurrency nor the problem of load balancing of distributed crawler tasks.
Summary of the invention
The object of the present invention is to provide a URL crawling method and system for a distributed crawler engine that can improve the crawler engine's URL crawling efficiency under high concurrency and balance the load of distributed crawler tasks, thereby solving the problems of low efficiency and load imbalance in existing crawler engines.
To achieve the above goals, the technical solution adopted by the present invention is:
A URL crawling method for a distributed crawler engine, comprising the following steps:
S100: collect URL tasks and store them;
S200: based on a multi-task partitioning strategy using website hash values, assign the set of URLs sharing a domain name to the same crawl node for crawling, and collect the crawl results;
S300: perform distributed parallel clustering on the original web pages corresponding to the crawled URL tasks, and remove noise URL tasks;
S400: eliminate duplicate URL tasks from the URL tasks remaining after noise removal;
S500: judge whether the number of crawled layers of the deduplicated URL tasks is less than a preset value; if so, return to step S300; otherwise perform step S600;
S600: merge the original web pages corresponding to the URL tasks crawled at each layer.
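For illustration, the overall flow of steps S100 to S600 can be sketched as a single driver loop. This is a minimal sketch only; crawl_layer, denoise, dedup and merge_pages are hypothetical callables standing in for the operations described in steps S200 to S600, not functions defined by the patent:

```python
def crawl_pipeline(seed_urls, max_layers, crawl_layer, denoise, dedup, merge_pages):
    """Minimal sketch of steps S100-S600: crawl layer by layer up to a preset depth."""
    url_tasks = list(seed_urls)               # S100: collected and stored URL tasks
    pages_per_layer = []
    layer = 0
    while url_tasks and layer < max_layers:   # S500: stop once the preset layer count is reached
        pages, extracted_urls = crawl_layer(url_tasks)  # S200: domain-partitioned crawl, results collected
        pages_per_layer.append(pages)
        cleaned = denoise(extracted_urls)     # S300: distributed parallel clustering / noise removal
        url_tasks = dedup(cleaned)            # S400: eliminate duplicate URL tasks
        layer += 1
    return merge_pages(pages_per_layer)       # S600: merge the pages from every layer
```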
In step S300, before the distributed parallel clustering is performed, noise URL tasks are also preliminarily removed using the DOM tree structure of the web page, including:
S301: split the page using HTML tags such as <td>, <p> and <div>, and remove tags that are related to rendering but irrelevant to the URL tasks;
S302: locate noise links using the link-text ratio; if the link-text ratio of a node is higher than 1/4, the link at that node is judged to be an initial noise link and removed.
In step S300, the distributed parallel clustering and removal of noise URL tasks comprise the following steps:
S311: parse the domain names of the original web pages using MapReduce and perform preliminary blocking;
S312: perform single-pass clustering on each block, and divide the clustering result into multiple families using MapReduce;
S313: according to existing noise samples, compute the similarity of the families in the clustering result using MapReduce, and remove noise URL tasks according to the similarity values;
S314: store the denoised URL tasks using MapReduce.
In step S400, eliminating the duplicate URL tasks in the crawl result includes:
S401: add the collected URL tasks to the access list, record their access time, and set the repetition count to 1;
S402: compare each URL task remaining after noise removal against the access list and the cache list in turn; if the URL task is found in the access list or the cache list, discard it and update its access time and repetition count; if the URL task is not found in the access list or the cache list, look it up among the stored URL tasks;
S403: judge whether the URL task already exists among the stored URL tasks; if so, discard it; otherwise update the access list and the URL task set.
During the distributed parallel clustering, the similarity is measured as the mean of the ratios of the longest common subsequence length to the lengths of the two URL tasks.
The invention further discloses a URL crawling system for a distributed crawler engine, comprising:
an acquisition module, for collecting URL tasks and storing them;
a sorting module, for assigning, based on a multi-task partitioning strategy using website hash values, the set of URLs sharing a domain name to the same crawl node for crawling, and collecting the crawl results;
a denoising module, for performing distributed parallel clustering on the original web pages corresponding to the crawled URL tasks and removing noise URL tasks;
a deduplication module, for eliminating duplicate URL tasks from the URL tasks remaining after noise removal;
a judgment module, for judging whether the number of crawled layers of the deduplicated URL tasks is less than a preset value; if so, execution passes to the denoising module, otherwise to the merging module;
a merging module, for merging the original web pages corresponding to the URL tasks crawled at each layer.
Before performing the distributed parallel clustering, the denoising module also preliminarily removes noise URL tasks using the DOM tree structure of the web page, and includes:
a splitting unit, for splitting the page using HTML tags such as <td>, <p> and <div>, and removing tags that are related to rendering but irrelevant to the URL tasks;
a comparison unit, for locating noise links using the link-text ratio; if the link-text ratio of a node is higher than 1/4, the link at that node is judged to be an initial noise link and removed.
The denoising module performs distributed parallel clustering and removes noise URL tasks, and includes:
a blocking unit, for parsing the domain names of the original web pages using MapReduce and performing preliminary blocking;
a clustering unit, for performing single-pass clustering on each block and dividing the clustering result into multiple families using MapReduce;
a computing unit, for computing, according to existing noise samples, the similarity of the families in the clustering result using MapReduce, and removing noise URL tasks according to the similarity values;
a storage unit, for storing the denoised URL tasks using MapReduce.
The deduplication module eliminates the duplicate URL tasks in the crawl result, and includes:
a creation unit, for adding the collected URL tasks to the access list, recording their access time and setting the repetition count to 1;
a matching unit, for comparing each URL task remaining after noise removal against the access list and the cache list in turn; if the URL task is found in the access list or the cache list, discarding it and updating its access time and repetition count; if the URL task is not found in the access list or the cache list, looking it up among the stored URL tasks;
a duplicate-checking unit, for judging whether the URL task already exists among the stored URL tasks; if so, discarding it; otherwise updating the access list and the URL task set.
During the distributed parallel clustering performed by the denoising module, the similarity is measured as the mean of the ratios of the longest common subsequence length to the lengths of the two URL tasks.
Beneficial effects of the present invention are:
First, URL tasks are allocated to different crawl nodes according to their domain names, so that different crawl nodes handle URL tasks of different domain names, which reduces the task load on each crawl node;
Second, each crawl node processes the URL tasks layer by layer, removes noise URL tasks by means of distributed parallel clustering, and eliminates duplicate URL tasks. This achieves efficient URL processing, solves the problem of the crawler engine's URL crawling efficiency under high concurrency, and achieves load balancing across the nodes.
Brief description of the drawings
Fig. 1 is a flow chart of a URL crawling method for a distributed crawler engine according to the invention;
Fig. 2 is a block diagram of a URL crawling system for a distributed crawler engine according to the invention.
Reference numerals:
acquisition module - 100, sorting module - 200, denoising module - 300, deduplication module - 400, judgment module - 500, merging module - 600.
Specific embodiment
The present invention will be described in detail below with reference to the specific embodiments shown in the drawings. These embodiments do not limit the present invention; structural, methodological or functional changes made by those of ordinary skill in the art according to these embodiments all fall within the protection scope of the present invention.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences from other distributed file systems are also significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost machines. It provides high-throughput access to application data and is especially suitable for applications with very large data sets. HDFS relaxes a subset of POSIX requirements so that file system data can be accessed in a streaming fashion. HDFS was originally developed as the infrastructure of the Apache Nutch search engine project and is now part of the Apache Hadoop Core project.
Referring to Fig. 1, the present invention discloses a URL crawling method for a distributed crawler engine, which uses the HDFS file system for data storage and comprises the following steps:
S100: collect URL tasks and store them. In this step, the URL seeds are uploaded into the in folder on HDFS and the crawl layer number is set to 0; uploading the URL seeds accomplishes the collection of the URL tasks.
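As an illustration only, the seed upload into the HDFS in folder can be performed with the standard hdfs dfs commands; the local file seeds.txt and the target directory /crawler/in below are assumed names, not paths prescribed by the patent:

```python
import subprocess

def upload_seeds(local_seed_file="seeds.txt", hdfs_in_dir="/crawler/in"):
    """Upload the URL seed file into the HDFS 'in' folder (layer 0 of the crawl)."""
    # Both directory and file names here are illustrative assumptions.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_in_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_seed_file, hdfs_in_dir], check=True)

if __name__ == "__main__":
    upload_seeds()
```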
S200: based on a multi-task partitioning strategy using website hash values, assign the set of URLs sharing a domain name to the same crawl node for crawling, and collect the crawl results.
In step S200, the CrawlerDriver function is called to crawl the web pages corresponding to the URL tasks in the in folder, and the results are stored in the doc folder on HDFS. The CrawlerDriver function executes as follows:
The URL tasks to be crawled are extracted from the in folder, and the original web pages corresponding to the URL tasks are downloaded and stored under the doc folder. The original web pages are stored as key-value pairs whose key is the URL and whose value is the HTML information of the corresponding page. To improve the crawling efficiency of each node, a multi-task partitioning strategy based on website hash values is adopted, so that the set of URLs sharing a domain name is, as far as possible, assigned to the same crawl node.
The multi-task partitioning strategy based on website hash values works as follows: the hash value of the domain name part of each URL task to be crawled is computed, and the URL task set in the in folder is divided into different subsets according to these hash values; the URL tasks in each subset are handed to the crawl node on the same node to be crawled by the map tasks of the MapReduce framework, and the reduce tasks are then invoked to aggregate the results crawled on all crawl nodes onto HDFS.
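A minimal sketch of the hash-based partitioning, assuming each crawl node is identified by an integer index; the function names below are illustrative and do not correspond to the CrawlerDriver implementation:

```python
from urllib.parse import urlparse
import hashlib

def crawl_node_for(url: str, num_nodes: int) -> int:
    """Map a URL to a crawl node: URLs sharing a domain always hash to the same node."""
    domain = urlparse(url).netloc.lower()
    digest = hashlib.md5(domain.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def partition(urls, num_nodes):
    """Split the URL task set into per-node subsets, mirroring the map-side grouping."""
    subsets = {i: [] for i in range(num_nodes)}
    for url in urls:
        subsets[crawl_node_for(url, num_nodes)].append(url)
    return subsets

# Example: both example.com URLs land on the same crawl node.
parts = partition(["http://example.com/a", "http://example.com/b", "http://other.org/"], 4)
```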
S300: perform distributed parallel clustering on the original web pages corresponding to the crawled URL tasks, and remove noise URL tasks.
In step S300, the stored URL tasks are optimized by the OptimizerDriver function. Because a web page contains a large number of noise URL tasks, the stored URL tasks are clustered, and noise URL tasks are removed by combining the DOM tree of the web page with the collected noise samples.
In step S300, the distributed parallel clustering and removal of noise URL tasks comprise the following steps:
S311: parse the domain names of the original web pages using MapReduce and perform preliminary blocking;
S312: perform single-pass clustering on each block, and divide the clustering result into multiple families using MapReduce. Specifically, in this step the map tasks on the crawl nodes are called to perform single-pass clustering on each block separately, and the reduce tasks are called to collect the clustering results.
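A minimal single-pass clustering sketch over URL strings, written as plain Python rather than as a MapReduce job; the similarity callable anticipates the LCS-based measure described under step S313, and the 0.6 threshold is an assumed value, not one given in the patent:

```python
def single_pass_cluster(urls, similarity, threshold=0.6):
    """Assign each URL to the first sufficiently similar family's representative;
    otherwise open a new family (classic single-pass clustering)."""
    families = []  # each family: {"rep": representative URL, "members": [...]}
    for url in urls:
        best, best_sim = None, 0.0
        for fam in families:
            sim = similarity(url, fam["rep"])
            if sim > best_sim:
                best, best_sim = fam, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(url)
        else:
            families.append({"rep": url, "members": [url]})
    return families
```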
S313: according to existing noise samples, compute the similarity of the families in the clustering result using MapReduce, and remove noise URL tasks according to the similarity values. In this process, each family is judged by the result of the similarity computation: if a URL task in a family deviates significantly from the similarity values of the other URL tasks, it is judged to be a noise URL and removed; the reduce tasks are then called to aggregate the denoised URL tasks onto HDFS.
The similarity used in the clustering process is measured as the mean of the ratios of the longest common subsequence (Longest Common Subsequence, LCS) length to the lengths of the two URLs. The LCS of two URLs is computed by dynamic programming, using the standard LCS transfer equation:
c[i, j] = 0, if i = 0 or j = 0;
c[i, j] = c[i-1, j-1] + 1, if i, j > 0 and x_i = y_j;
c[i, j] = max(c[i-1, j], c[i, j-1]), if i, j > 0 and x_i ≠ y_j;
where c[i, j] records the length of the longest common subsequence of the character strings Xi = {x1, x2, ..., xi} and Yj = {y1, y2, ..., yj}. Here Xi and Yj are the first i and j characters of url1 and url2, the two URLs being compared.
The LCS is computed as follows:
A two-dimensional array c[i][j] records the length of the longest common subsequence of the url1 prefix Xi = {x1, x2, ..., xi} and the url2 prefix Yj = {y1, y2, ..., yj}. A double loop over the prefixes of Xi and Yj incrementally updates the value of c[i][j]; after the loop ends, the longest common subsequence length of url1 and url2 is c[m, n]. The LCS measures the structural similarity between URLs well and is suitable for computing similarities over large numbers of URLs.
The ratio average is computed and then compared with a preset threshold to judge whether noise URL tasks exist in the family. The ratio average is the mean of the LCS length taken as a fraction of each URL's length, i.e. sim(url1, url2) = (LCS(url1, url2)/len(url1) + LCS(url1, url2)/len(url2)) / 2.
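A minimal sketch of the LCS dynamic program and the ratio-average similarity it feeds, assuming the similarity is the mean of the LCS length over the two URL lengths as described above:

```python
def lcs_length(url1: str, url2: str) -> int:
    """Length of the longest common subsequence of two URL strings (classic DP, O(m*n))."""
    m, n = len(url1), len(url2)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if url1[i - 1] == url2[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[m][n]

def url_similarity(url1: str, url2: str) -> float:
    """Ratio average: mean of the LCS length divided by each URL's length."""
    if not url1 or not url2:
        return 0.0
    lcs = lcs_length(url1, url2)
    return (lcs / len(url1) + lcs / len(url2)) / 2
```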
S314: store the denoised URL tasks using MapReduce.
In step S300, before the distributed parallel clustering is performed, noise URL tasks are also preliminarily removed using the DOM tree structure of the web page. This reduces the amount of data processed during clustering, improves data processing efficiency and shortens data processing time. Using the DOM tree structure of the web page to preliminarily remove noise URL tasks includes:
S301: split the page using HTML tags such as <td>, <p> and <div>, and remove tags that are related to rendering but irrelevant to the URL tasks;
In step S301, the page is cleaned by removing rendering-related content such as CSS; the page is split using HTML tags such as <td>, <p> and <div>, and tags that are related to rendering but irrelevant to the URL tasks are removed so that the structure of the DOM tree becomes clear.
S302: locate noise links using the link-text ratio; if the link-text ratio of a node is higher than 1/4, the link at that node is judged to be an initial noise link and removed.
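A minimal sketch of the link-text-ratio heuristic using Python's standard html.parser, assuming that the ratio in question is the fraction of visible text that sits inside <a> tags within a page block; the 1/4 threshold comes from step S302, while the per-block usage is an assumption for illustration:

```python
from html.parser import HTMLParser

class LinkTextRatio(HTMLParser):
    """Accumulate visible text length and the portion of it that lies inside <a> tags."""
    def __init__(self):
        super().__init__()
        self.in_link = 0
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link > 0:
            self.in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        self.total_chars += len(text)
        if self.in_link:
            self.link_chars += len(text)

def is_noise_block(html_fragment: str, threshold: float = 0.25) -> bool:
    """Judge a page block as an initial-noise link region if its link-text ratio exceeds 1/4."""
    parser = LinkTextRatio()
    parser.feed(html_fragment)
    if parser.total_chars == 0:
        return False
    return parser.link_chars / parser.total_chars > threshold
```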
S400: eliminate duplicate URL tasks from the URL tasks remaining after noise removal.
In this step, duplicates of already-stored URL tasks are eliminated, and the remaining URL tasks are saved into the in folder to wait for the next round of crawling.
To solve the low lookup efficiency of existing URL deduplication algorithms when processing massive URL tasks, a cache-based URL deduplication strategy is preferably used. The cache class in the caching system contains two queues, an access list and a cache list; the URL class contains three fields, the URL string, the URL repetition count and the URL access time. The URL task queue in the cache list is ordered from high to low by a weight computed from the URL repetition count and the access time, where rep denotes the repetition count, t denotes the time index, t_current denotes the current access time of the URL, t_init denotes the time set when the program starts, and min and max denote the lower and upper bounds of the time index t of the URLs in the cache list.
Here, the higher the repetition count of a URL, the larger its weight; and the closer the access time of a URL is to the current time, the more active the URL is, so the weight assigned to it is also larger. A high repetition count indicates that the URL is a common link and is more likely to be encountered again in subsequent deduplication checks, so checking these URLs first during deduplication greatly increases the probability of a cache hit. The caching system can update the accessed URL tasks at any time according to changes in the URL weight values. The strategy for removing duplicate URL tasks is mainly completed by the following steps:
S401: add the collected URL tasks to the access list, record their access time, and set the repetition count to 1.
In step S401, after the crawler system starts, the caching system first completes queue initialization: the URL tasks under the in folder are added to the access list, their access time is recorded, and the repetition count is set to 1.
S402: compare each URL task remaining after noise removal against the access list and the cache list in turn; if the URL task is found in the access list or the cache list, discard it and update its access time and repetition count; if the URL task is not found in the access list or the cache list, look it up among the stored URL tasks.
S403: judge whether the URL task already exists among the stored URL tasks; if so, discard it; otherwise update the access list and the URL task set.
In step S403, it is judged whether the URL task exists in the caching system; if it does, the URL task is discarded and its access time and repetition count are updated; if it does not, the URL task is added under the in folder for the next round of crawling.
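A minimal sketch of the cache-based deduplication of steps S401 to S403. The weight used to order the cache list (repetition count scaled by a normalized recency index computed from t_current, t_init, min and max) is an assumed concrete form of the weight described above, not the patent's exact formula:

```python
import time

class UrlEntry:
    """URL class: the URL string, its repetition count and its last access time."""
    def __init__(self, url: str, now: float):
        self.url = url
        self.rep = 1
        self.access_time = now

class UrlCache:
    """Cache-based deduplication with an access list and a weight-ordered cache list."""
    def __init__(self, t_init: float):
        self.t_init = t_init
        self.access_list = {}   # url -> UrlEntry (recently seen URLs)
        self.cache_list = {}    # url -> UrlEntry (hot URLs kept ordered by weight)
        self.stored = set()     # URLs already persisted for crawling

    def _weight(self, entry: UrlEntry) -> float:
        # Assumed weight: repetition count scaled by normalized recency of the access time.
        times = [e.access_time - self.t_init for e in self.cache_list.values()] or [0.0]
        t = entry.access_time - self.t_init
        lo, hi = min(times), max(times)
        recency = (t - lo) / (hi - lo) if hi > lo else 1.0
        return entry.rep * (1.0 + recency)

    def refresh_cache_list(self, capacity: int = 1000):
        """Keep the highest-weight URLs in the cache list, ordered from high to low."""
        entries = list(self.access_list.values()) + list(self.cache_list.values())
        entries.sort(key=self._weight, reverse=True)
        self.cache_list = {e.url: e for e in entries[:capacity]}

    def offer(self, url: str) -> bool:
        """Return True if the URL is new and should be kept for the next crawl round."""
        now = time.time()
        hit = self.access_list.get(url) or self.cache_list.get(url)
        if hit:                      # S402: found in access list or cache list -> discard
            hit.rep += 1
            hit.access_time = now
            return False
        if url in self.stored:       # S403: already stored -> discard
            return False
        self.access_list[url] = UrlEntry(url, now)   # update access list and URL task set
        self.stored.add(url)
        return True
```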
S500: judge whether the number of crawled layers of the deduplicated URL tasks is less than a preset value; if so, return to step S300; otherwise perform step S600.
S600: merge the original web pages corresponding to the URL tasks crawled at each layer.
In the URL crawling method for a distributed crawler engine described in the above embodiment, URL tasks are allocated to different crawl nodes according to their domain names, so that different crawl nodes handle URL tasks of different domain names, which reduces the task load on each crawl node. Each crawl node processes the URL tasks layer by layer, removes noise URL tasks by means of distributed parallel clustering, and eliminates duplicate URL tasks. This achieves efficient URL processing, solves the problem of the crawler engine's URL crawling efficiency under high concurrency, and achieves load balancing across the nodes.
Referring to Fig. 2, an embodiment of the present invention further discloses a URL crawling system for a distributed crawler engine, comprising:
an acquisition module 100, for collecting URL tasks and storing them;
a sorting module 200, for assigning, based on a multi-task partitioning strategy using website hash values, the set of URLs sharing a domain name to the same crawl node for crawling, and collecting the crawl results;
a denoising module 300, for performing distributed parallel clustering on the original web pages corresponding to the crawled URL tasks and removing noise URL tasks;
a deduplication module 400, for eliminating duplicate URL tasks from the URL tasks remaining after noise removal;
a judgment module 500, for judging whether the number of crawled layers of the deduplicated URL tasks is less than a preset value; if so, execution passes to the denoising module, otherwise to the merging module;
a merging module 600, for merging the original web pages corresponding to the URL tasks crawled at each layer.
In the URL crawling system for a distributed crawler engine described in the above embodiment, in an improved embodiment, before performing the distributed parallel clustering the denoising module also preliminarily removes noise URL tasks using the DOM tree structure of the web page, and includes:
a splitting unit, for splitting the page using HTML tags such as <td>, <p> and <div>, and removing tags that are related to rendering but irrelevant to the URL tasks;
a comparison unit, for locating noise links using the link-text ratio; if the link-text ratio of a node is higher than 1/4, the link at that node is judged to be an initial noise link and removed.
In the URL crawling system for a distributed crawler engine described in the above embodiment, in an improved embodiment, the denoising module performs distributed parallel clustering and removes noise URL tasks, and includes:
a blocking unit, for parsing the domain names of the original web pages using MapReduce and performing preliminary blocking;
a clustering unit, for performing single-pass clustering on each block and dividing the clustering result into multiple families using MapReduce;
a computing unit, for computing, according to existing noise samples, the similarity of the families in the clustering result using MapReduce, and removing noise URL tasks according to the similarity values;
a storage unit, for storing the denoised URL tasks using MapReduce.
In the URL crawling system for a distributed crawler engine described in the above embodiment, in an improved embodiment, the deduplication module eliminates the duplicate URL tasks in the crawl result, and includes:
a creation unit, for adding the collected URL tasks to the access list, recording their access time and setting the repetition count to 1;
a matching unit, for comparing each URL task remaining after noise removal against the access list and the cache list in turn; if the URL task is found in the access list or the cache list, discarding it and updating its access time and repetition count; if the URL task is not found in the access list or the cache list, looking it up among the stored URL tasks;
a duplicate-checking unit, for judging whether the URL task already exists among the stored URL tasks; if so, discarding it; otherwise updating the access list and the URL task set.
In the URL crawling system for a distributed crawler engine described in the above embodiment, during the distributed parallel clustering performed by the denoising module, the similarity is measured as the mean of the ratios of the longest common subsequence length to the lengths of the two URL tasks.
The URL crawling system for a distributed crawler engine in the embodiment of the present invention is the system embodiment corresponding to the URL crawling method for a distributed crawler engine; the relevant technical details mentioned for the URL crawling method remain valid in the present embodiment and, to reduce repetition, are not repeated here.
It should be understood that, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted only for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may also be appropriately combined to form other embodiments that can be understood by those skilled in the art.
The detailed descriptions listed above are merely specific illustrations of feasible embodiments of the present invention and are not intended to limit the protection scope of the present invention; all equivalent implementations or changes made without departing from the technical spirit of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A URL crawling method for a distributed crawler engine, characterized by comprising the following steps:
S100: collect URL tasks and store them;
S200: based on a multi-task partitioning strategy using website hash values, assign the set of URLs sharing a domain name to the same crawl node for crawling, and collect the crawl results;
S300: perform distributed parallel clustering on the original web pages corresponding to the crawled URL tasks, and remove noise URL tasks;
S400: eliminate duplicate URL tasks from the URL tasks remaining after noise removal;
S500: judge whether the number of crawled layers of the deduplicated URL tasks is less than a preset value; if so, return to step S300; otherwise perform step S600;
S600: merge the original web pages corresponding to the URL tasks crawled at each layer.
2. The URL crawling method for a distributed crawler engine according to claim 1, characterized in that in step S300, before the distributed parallel clustering is performed, noise URL tasks are also preliminarily removed using the DOM tree structure of the web page, including:
S301: split the page using HTML tags such as <td>, <p> and <div>, and remove tags that are related to rendering but irrelevant to the URL tasks;
S302: locate noise links using the link-text ratio; if the link-text ratio of a node is higher than 1/4, the link at that node is judged to be an initial noise link and removed.
3. The URL crawling method for a distributed crawler engine according to claim 1 or 2, characterized in that in step S300, the distributed parallel clustering and removal of noise URL tasks comprise the following steps:
S311: parse the domain names of the original web pages using MapReduce and perform preliminary blocking;
S312: perform single-pass clustering on each block, and divide the clustering result into multiple families using MapReduce;
S313: according to existing noise samples, compute the similarity of the families in the clustering result using MapReduce, and remove noise URL tasks according to the similarity values;
S314: store the denoised URL tasks using MapReduce.
4. The URL crawling method for a distributed crawler engine according to claim 1 or 2, characterized in that in step S400, eliminating the duplicate URL tasks in the crawl result includes:
S401: add the collected URL tasks to the access list, record their access time, and set the repetition count to 1;
S402: compare each URL task remaining after noise removal against the access list and the cache list in turn; if the URL task is found in the access list or the cache list, discard it and update its access time and repetition count; if the URL task is not found in the access list or the cache list, look it up among the stored URL tasks;
S403: judge whether the URL task already exists among the stored URL tasks; if so, discard it; otherwise update the access list and the URL task set.
5. The URL crawling method for a distributed crawler engine according to claim 1, 3 or 4, characterized in that during the distributed parallel clustering, the similarity is measured as the mean of the ratios of the longest common subsequence length to the lengths of the two URL tasks.
6. A URL crawling system for a distributed crawler engine, characterized by comprising:
an acquisition module, for collecting URL tasks and storing them;
a sorting module, for assigning, based on a multi-task partitioning strategy using website hash values, the set of URLs sharing a domain name to the same crawl node for crawling, and collecting the crawl results;
a denoising module, for performing distributed parallel clustering on the original web pages corresponding to the crawled URL tasks and removing noise URL tasks;
a deduplication module, for eliminating duplicate URL tasks from the URL tasks remaining after noise removal;
a judgment module, for judging whether the number of crawled layers of the deduplicated URL tasks is less than a preset value; if so, execution passes to the denoising module, otherwise to the merging module;
a merging module, for merging the original web pages corresponding to the URL tasks crawled at each layer.
7. The URL crawling system for a distributed crawler engine according to claim 6, characterized in that before performing the distributed parallel clustering, the denoising module also preliminarily removes noise URL tasks using the DOM tree structure of the web page, and includes:
a splitting unit, for splitting the page using HTML tags such as <td>, <p> and <div>, and removing tags that are related to rendering but irrelevant to the URL tasks;
a comparison unit, for locating noise links using the link-text ratio; if the link-text ratio of a node is higher than 1/4, the link at that node is judged to be an initial noise link and removed.
8. The URL crawling system for a distributed crawler engine according to claim 6 or 7, characterized in that the denoising module performs distributed parallel clustering and removes noise URL tasks, and includes:
a blocking unit, for parsing the domain names of the original web pages using MapReduce and performing preliminary blocking;
a clustering unit, for performing single-pass clustering on each block and dividing the clustering result into multiple families using MapReduce;
a computing unit, for computing, according to existing noise samples, the similarity of the families in the clustering result using MapReduce, and removing noise URL tasks according to the similarity values;
a storage unit, for storing the denoised URL tasks using MapReduce.
9. The URL crawling system for a distributed crawler engine according to claim 6 or 7, characterized in that the deduplication module eliminates the duplicate URL tasks in the crawl result, and includes:
a creation unit, for adding the collected URL tasks to the access list, recording their access time and setting the repetition count to 1;
a matching unit, for comparing each URL task remaining after noise removal against the access list and the cache list in turn; if the URL task is found in the access list or the cache list, discarding it and updating its access time and repetition count; if the URL task is not found in the access list or the cache list, looking it up among the stored URL tasks;
a duplicate-checking unit, for judging whether the URL task already exists among the stored URL tasks; if so, discarding it; otherwise updating the access list and the URL task set.
10. The URL crawling system for a distributed crawler engine according to claim 6, 8 or 9, characterized in that:
during the distributed parallel clustering performed by the denoising module, the similarity is measured as the mean of the ratios of the longest common subsequence length to the lengths of the two URL tasks.
CN201611037722.XA 2016-11-23 2016-11-23 URL crawling method and system for a distributed crawler engine Active CN106776768B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611037722.XA CN106776768B (en) 2016-11-23 2016-11-23 URL crawling method and system for a distributed crawler engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611037722.XA CN106776768B (en) 2016-11-23 2016-11-23 URL crawling method and system for a distributed crawler engine

Publications (2)

Publication Number Publication Date
CN106776768A true CN106776768A (en) 2017-05-31
CN106776768B CN106776768B (en) 2018-02-02

Family

ID=58974402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611037722.XA Active CN106776768B (en) 2016-11-23 2016-11-23 URL crawling method and system for a distributed crawler engine

Country Status (1)

Country Link
CN (1) CN106776768B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766581A (en) * 2017-11-23 2018-03-06 安徽科创智慧知识产权服务有限公司 The method that Data duplication record cleaning is carried out to URL
CN107943588A (en) * 2017-11-22 2018-04-20 用友金融信息技术股份有限公司 Data processing method, system, computer equipment and readable storage medium storing program for executing
CN107992534A (en) * 2017-11-23 2018-05-04 安徽科创智慧知识产权服务有限公司 The method that improved sort key sorts data set
CN108804576A (en) * 2018-05-22 2018-11-13 华中科技大学 A kind of domain name hierarchical structure detection method based on link analysis
CN109165334A (en) * 2018-09-20 2019-01-08 恒安嘉新(北京)科技股份公司 A method of establishing CDN producer primary knowledge base
CN109740037A (en) * 2019-01-02 2019-05-10 山东省科学院情报研究所 The distributed online real-time processing method of multi-source, isomery fluidised form big data and system
CN109739849A (en) * 2019-01-02 2019-05-10 山东省科学院情报研究所 A kind of network sensitive information of data-driven excavates and early warning platform
CN111274467A (en) * 2019-12-31 2020-06-12 中国电子科技集团公司第二十八研究所 Large-scale data acquisition-oriented three-layer distributed deduplication architecture and method
CN112597369A (en) * 2020-12-22 2021-04-02 荆门汇易佳信息科技有限公司 Webpage spider theme type search system based on improved cloud platform
CN112612939A (en) * 2020-12-18 2021-04-06 山东中创软件工程股份有限公司 Crawler deployment method, system, device, equipment and storage medium
CN113807087A (en) * 2020-06-16 2021-12-17 中国电信股份有限公司 Website domain name similarity detection method and device
CN113821754A (en) * 2021-09-18 2021-12-21 上海观安信息技术股份有限公司 Sensitive data interface crawler identification method and device
CN113965371A (en) * 2021-10-19 2022-01-21 北京天融信网络安全技术有限公司 Task processing method, device, terminal and storage medium in website monitoring process

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019180491A1 (en) * 2018-03-22 2019-09-26 Pratik Sharma Uniform resource locator identification service

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110307467A1 (en) * 2010-06-10 2011-12-15 Stephen Severance Distributed web crawler architecture
CN103389983A (en) * 2012-05-08 2013-11-13 阿里巴巴集团控股有限公司 Webpage content grabbing method and device applied to network crawler system
CN103714139A (en) * 2013-12-20 2014-04-09 华南理工大学 Parallel data mining method for identifying a mass of mobile client bases
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology
CN104657399A (en) * 2014-01-03 2015-05-27 广西科技大学 Web crawler control method

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107943588A (en) * 2017-11-22 2018-04-20 用友金融信息技术股份有限公司 Data processing method, system, computer equipment and readable storage medium storing program for executing
CN107766581A (en) * 2017-11-23 2018-03-06 安徽科创智慧知识产权服务有限公司 The method that Data duplication record cleaning is carried out to URL
CN107992534A (en) * 2017-11-23 2018-05-04 安徽科创智慧知识产权服务有限公司 The method that improved sort key sorts data set
CN108804576A (en) * 2018-05-22 2018-11-13 华中科技大学 A kind of domain name hierarchical structure detection method based on link analysis
CN108804576B (en) * 2018-05-22 2021-08-20 华中科技大学 Domain name hierarchical structure detection method based on link analysis
CN109165334A (en) * 2018-09-20 2019-01-08 恒安嘉新(北京)科技股份公司 A method of establishing CDN producer primary knowledge base
CN109165334B (en) * 2018-09-20 2022-05-27 恒安嘉新(北京)科技股份公司 Method for establishing CDN manufacturer basic knowledge base
CN109740037A (en) * 2019-01-02 2019-05-10 山东省科学院情报研究所 The distributed online real-time processing method of multi-source, isomery fluidised form big data and system
CN109739849A (en) * 2019-01-02 2019-05-10 山东省科学院情报研究所 A kind of network sensitive information of data-driven excavates and early warning platform
CN109740037B (en) * 2019-01-02 2023-11-24 山东省科学院情报研究所 Multi-source heterogeneous flow state big data distributed online real-time processing method and system
CN109739849B (en) * 2019-01-02 2021-06-29 山东省科学院情报研究所 Data-driven network sensitive information mining and early warning platform
CN111274467A (en) * 2019-12-31 2020-06-12 中国电子科技集团公司第二十八研究所 Large-scale data acquisition-oriented three-layer distributed deduplication architecture and method
CN113807087A (en) * 2020-06-16 2021-12-17 中国电信股份有限公司 Website domain name similarity detection method and device
CN113807087B (en) * 2020-06-16 2023-11-28 中国电信股份有限公司 Method and device for detecting similarity of website domain names
CN112612939A (en) * 2020-12-18 2021-04-06 山东中创软件工程股份有限公司 Crawler deployment method, system, device, equipment and storage medium
CN112597369A (en) * 2020-12-22 2021-04-02 荆门汇易佳信息科技有限公司 Webpage spider theme type search system based on improved cloud platform
CN113821754A (en) * 2021-09-18 2021-12-21 上海观安信息技术股份有限公司 Sensitive data interface crawler identification method and device
CN113965371A (en) * 2021-10-19 2022-01-21 北京天融信网络安全技术有限公司 Task processing method, device, terminal and storage medium in website monitoring process
CN113965371B (en) * 2021-10-19 2023-08-29 北京天融信网络安全技术有限公司 Task processing method, device, terminal and storage medium in website monitoring process

Also Published As

Publication number Publication date
CN106776768B (en) 2018-02-02


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant