CN106776768A - URL crawling method and system for a distributed crawler engine - Google Patents
URL crawling method and system for a distributed crawler engine
- Publication number
- CN106776768A CN106776768A CN201611037722.XA CN201611037722A CN106776768A CN 106776768 A CN106776768 A CN 106776768A CN 201611037722 A CN201611037722 A CN 201611037722A CN 106776768 A CN106776768 A CN 106776768A
- Authority
- CN
- China
- Prior art keywords
- url
- tasks
- url tasks
- noise
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
Abstract
A URL crawling method for a distributed crawler engine, comprising the following steps. S100: collect URL tasks and store them. S200: based on a multitask partitioning strategy using website hash values, assign the set of URLs sharing a domain name to the same crawl node for crawling, and collect the crawl results. S300: perform distributed parallel clustering on the original web pages corresponding to the crawled URL tasks, and remove noise URL tasks. S400: eliminate duplicate URL tasks from the denoised URL tasks. S500: judge whether the number of layers crawled by the deduplicated URL tasks is below a preset value; if so, return to step S300, otherwise perform step S600. S600: merge the original web pages corresponding to the URL tasks crawled at each layer. URL tasks are distributed to different crawl nodes by domain name, so that different crawl nodes process URL tasks of different domain names, lightening the task load of each crawl node.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a URL crawling method and system for a distributed crawler engine.
Background technology
With the explosive growth of Internet information, the information a user is interested in is submerged in a mass of irrelevant information, and using a search engine has become the most convenient way for people to obtain it. As one of the basic components of a search engine, a web crawler must face the Internet directly and collect information from it without interruption, providing the search engine with its data source. Whether the retrieved information is accurate is closely related to the web crawler. But the Internet is enormous in scale: websites are numerous and web pages number in the hundreds of billions, and data of such magnitude places high demands on the design and implementation of a web crawler; building a distributed web crawler system is an effective solution. A web crawler is a robot program that starts downloading page documents from specified URL addresses, extracts the URL addresses they contain, and then continues crawling from the extracted URLs.
A traditional distributed crawler engine is mainly master-slave: a dedicated master server maintains the queue of URLs to be fetched and is responsible for distributing URLs to the different slave servers, while the slave servers perform the actual page fetching. Besides maintaining the URL queue and distributing URLs, the master server must also balance the load of each slave server, in case some slaves become too idle or overworked. Under this pattern the master server tends to become the system bottleneck.
The Chinese patent with application No. 201210090259.0 discloses a URL deduplication method for a distributed web crawler system. By introducing virtual crawler nodes it realizes an efficient multitask partitioning strategy that adapts well to the dynamic change of the actual crawl nodes in a distributed crawler system, and on the basis of that strategy it uses a distributed URL deduplication scheme to avoid the repeated crawling caused when actual crawl nodes change. When tasks are repartitioned the scale of change is small, the crawler system keeps running stably, the partitioning strategy adapts dynamically, and the load of the actual crawl nodes can be balanced; but it cannot solve the problem of the crawler engine's URL-fetching efficiency under high concurrency.
The Chinese patent with application No. 201210425213.X discloses a URL deduplication system and method for a distributed web crawler. The system includes crawler collection child nodes, a central server and a database server. In the method, a crawler collection child node registers with the central server; the child node obtains a URL from the database waiting queue and extracts new URLs from it; the child node performs first-level deduplication on each new URL and discards it if the check fails; if the check passes, the new URL is added to the local URL digest and sent to the central server; the central server performs second-level deduplication, and if that passes the URL is added to the global URL digest; the child node then adds the URL's links to the waiting queue. Through this two-level deduplication mechanism, the deduplication work that would otherwise concentrate on the central node is decomposed: each crawler collection child node performs first-level deduplication, while the central server maintains a global deduplication table through second-level deduplication. This method still cannot solve the crawler engine's URL-fetching efficiency under high concurrency, nor the load-balancing problem of distributed crawler tasks.
Summary of the invention
An object of the present invention is to propose a URL crawling method and system for a distributed crawler engine that can improve the crawler engine's URL-fetching efficiency under high concurrency and balance the load of distributed crawler tasks, solving the low efficiency and load imbalance of existing crawler engines.
To achieve these goals, the technical solution adopted in the present invention is:
A URL crawling method for a distributed crawler engine, comprising the following steps:
S100: collect URL tasks and store them;
S200: based on a multitask partitioning strategy using website hash values, assign the set of URLs sharing a domain name to the same crawl node for crawling, and collect the crawl results;
S300: perform distributed parallel clustering on the original web pages corresponding to the crawled URL tasks, and remove noise URL tasks;
S400: eliminate duplicate URL tasks from the denoised URL tasks;
S500: judge whether the number of layers crawled by the deduplicated URL tasks is below a preset value; if so, return to step S300, otherwise perform step S600;
S600: merge the original web pages corresponding to the URL tasks crawled at each layer.
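The S100-S600 loop can be sketched as a driver function. This is only an illustrative outline, not the patent's implementation; the callables fetch, denoise and dedup are hypothetical placeholders for the components the steps describe.

```python
def crawl_pipeline(seeds, max_depth, fetch, denoise, dedup):
    """Sketch of steps S100-S600: crawl layer by layer until the preset
    depth is reached or the frontier empties, then merge all layers."""
    frontier = list(seeds)                     # S100: collected URL tasks
    pages_by_layer = []
    depth = 0
    while frontier and depth < max_depth:      # S500: depth check
        pages = [fetch(u) for u in frontier]   # S200: fetch on crawl nodes
        pages_by_layer.append(pages)
        new_urls = denoise(pages)              # S300: remove noise URLs
        frontier = dedup(new_urls)             # S400: drop duplicates
        depth += 1
    # S600: merge the pages crawled at every layer.
    return [p for layer in pages_by_layer for p in layer]
```

With a denoise step that yields no new URLs, the loop terminates after one layer and returns that layer's pages.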
In step S300, before distributed parallel clustering is performed, noise URL tasks are first removed preliminarily using the DOM tree structure of the web page, including:
S301: split the page using HTML tags such as <td>, <p> and <div>, and remove tags that are relevant to rendering but irrelevant to URL tasks;
S302: locate noise links using the link-text ratio; if the text ratio of a node is higher than 1/4, judge the link where that node resides to be an initial noise link and remove it.
In step S300, the distributed parallel clustering and removal of noise URL tasks comprise the following steps:
S311: perform domain-name resolution on the original web pages using map/reduce and divide them into preliminary blocks;
S312: run single-pass clustering on each block and divide the clustering result into multiple families using map/reduce;
S313: using the existing noise samples, compute the similarity of the multiple families of the clustering result with map/reduce, and remove noise URL tasks according to the similarity values;
S314: store the denoised URL tasks using map/reduce.
In step S400, eliminating the URL tasks repeated in the crawl results includes:
S401: add the collected URL tasks to the access list, record each URL task's access time, and set its repetition count to 1;
S402: compare the denoised URL tasks against the access list and the cache list in turn; if a URL task is found in the access list or cache list, discard it and update its access time and repetition count; if it is found in neither, look it up among the stored URL tasks;
S403: judge whether the URL task already exists among the stored URL tasks; if so, discard it, otherwise update the access list and the URL task set.
During distributed parallel clustering, the similarity is measured as the longest common subsequence length divided by the mean of the two URL task lengths.
The invention also discloses a URL crawling system for a distributed crawler engine, including:
an acquisition module for collecting and storing URL tasks;
a sorting module for assigning, based on a multitask partitioning strategy using website hash values, the set of URLs sharing a domain name to the same crawl node for crawling, and collecting the crawl results;
a denoising module for performing distributed parallel clustering on the original web pages corresponding to the crawled URL tasks and removing noise URL tasks;
a deduplication module for eliminating duplicate URL tasks from the denoised URL tasks;
a judging module for judging whether the number of layers crawled by the deduplicated URL tasks is below a preset value, passing control to the denoising module if so and to the merging module otherwise;
a merging module for merging the original web pages corresponding to the URL tasks crawled at each layer.
Before distributed parallel clustering, the denoising module also preliminarily removes noise URL tasks using the DOM tree structure of the web page, and includes:
a splitting unit for splitting the page using HTML tags such as <td>, <p> and <div> and removing tags that are relevant to rendering but irrelevant to URL tasks;
a comparing unit for locating noise links using the link-text ratio; if the text ratio of a node is higher than 1/4, the link where that node resides is judged to be an initial noise link and removed.
For the distributed parallel clustering and removal of noise URL tasks, the denoising module includes:
a blocking unit for performing domain-name resolution on the original web pages using map/reduce and dividing them into preliminary blocks;
a clustering unit for running single-pass clustering on each block and dividing the clustering result into multiple families using map/reduce;
a computing unit for computing, from the existing noise samples, the similarity of the multiple families of the clustering result using map/reduce and removing noise URL tasks according to the similarity values;
a storage unit for storing the denoised URL tasks using map/reduce.
To eliminate the URL tasks repeated in the crawl results, the deduplication module includes:
an establishing unit for adding the collected URL tasks to the access list, recording each URL task's access time, and setting its repetition count to 1;
a matching unit for comparing the denoised URL tasks against the access list and the cache list in turn; if a URL task is found in the access list or cache list, it is discarded and its access time and repetition count are updated; if it is found in neither, it is looked up among the stored URL tasks;
a duplicate-checking unit for judging whether the URL task already exists among the stored URL tasks, discarding it if so and otherwise updating the access list and the URL task set.
During distributed parallel clustering, the denoising module measures similarity as the longest common subsequence length divided by the mean of the two URL task lengths.
The beneficial effects of the present invention are:
First, URL tasks are distributed to different crawl nodes by domain name, so that different crawl nodes process URL tasks of different domain names, lightening the task load of each crawl node.
Second, each crawl node processes its URL tasks layer by layer, removes noise URL tasks by distributed parallel clustering, and eliminates duplicate URL tasks. This achieves efficient URL processing, solves the crawler engine's URL-fetching efficiency problem under high concurrency, and balances the load of the nodes.
Brief description of the drawings
Fig. 1 is a flow chart of a URL crawling method for a distributed crawler engine according to the invention;
Fig. 2 is a block diagram of a URL crawling system for a distributed crawler engine according to the invention.
Reference numerals: acquisition module-100, sorting module-200, denoising module-300, deduplication module-400, judging module-500, merging module-600.
Specific embodiment
The present invention will now be described in detail with reference to the specific embodiments shown in the drawings. These embodiments do not limit the invention, however, and structural, methodological or functional transformations made by one of ordinary skill in the art according to these embodiments are all contained within the protection scope of the invention.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are also apparent: HDFS is highly fault-tolerant and is designed to be deployed on low-cost machines, and it provides high-throughput access to application data, making it especially suitable for applications with very large data sets. HDFS relaxes a subset of the POSIX requirements to enable streaming access to file system data. HDFS was originally developed as the infrastructure of the Apache Nutch search engine project and is now part of the Apache Hadoop Core project.
Referring to Fig. 1, the invention discloses a URL crawling method for a distributed crawler engine that stores its data in the HDFS file system and comprises the following steps:
S100: collect URL tasks and store them. In this process, the URL seeds are uploaded into the "in" folder on HDFS and the crawled layer count is set to 0; uploading the URL seeds realizes the collection of URL tasks.
S200: based on a multitask partitioning strategy using website hash values, assign the set of URLs sharing a domain name to the same crawl node for crawling, and collect the crawl results.
In step S200, a CrawlerDriver function is called to fetch the web pages corresponding to the URL tasks in the "in" folder, and the results are stored in the "doc" folder on HDFS. The CrawlerDriver function executes as follows: the URL tasks to be fetched are extracted from the "in" folder, and the corresponding original web pages are downloaded and stored under the "doc" folder. An original web page is stored as a key-value pair whose key is the URL and whose value is the corresponding HTML of the page.
To improve the crawl efficiency of each node, the multitask partitioning strategy based on website hash values assigns the set of URLs sharing a domain name to the same crawl node as far as possible. The strategy is: compute the hash value of the domain-name part of each URL task to be fetched, then divide the URL task set in the "in" folder into different subsets according to those hash values; the URL tasks in each subset are all crawled on the same crawl node according to the map tasks of map/reduce, and reduce tasks are then called to aggregate the results crawled on all crawl nodes onto HDFS.
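The domain-hash partition can be sketched as follows. This is a minimal Python illustration: the function name partition_by_domain is made up here, and MD5 is an assumption, since the patent does not name the hash function.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse

def partition_by_domain(urls, num_nodes):
    """Group URL tasks into per-node subsets so that every URL sharing a
    domain name lands on the same crawl node (sketch of the hash partition)."""
    buckets = defaultdict(list)
    for url in urls:
        domain = urlparse(url).netloc
        # A stable hash of the domain only, so the same domain always
        # maps to the same node regardless of path or query string.
        h = int(hashlib.md5(domain.encode("utf-8")).hexdigest(), 16)
        buckets[h % num_nodes].append(url)
    return dict(buckets)
```

In a real map/reduce job each bucket would become the input of one map task, with a reduce step collecting the fetched pages back onto HDFS.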
S300: perform distributed parallel clustering on the original web pages corresponding to the crawled URL tasks, and remove noise URL tasks.
In step S300, the stored URL tasks can be optimized with an OptimizerDriver function. Because web pages contain a large number of noise URL tasks, the stored URL tasks are clustered, and the noise URL tasks are removed using noise samples collected from the DOM trees of the pages.
In step S300, the distributed parallel clustering and removal of noise URL tasks comprise the following steps:
S311: perform domain-name resolution on the original web pages using map/reduce and divide them into preliminary blocks;
S312: run single-pass clustering on each block and divide the clustering result into multiple families using map/reduce. Specifically, in this step the map tasks on the crawl nodes are called to run single-pass clustering on each block, and reduce tasks are called to collect the clustering results;
S313: using the existing noise samples, compute the similarity of the multiple families of the clustering result with map/reduce, and remove noise URL tasks according to the similarity values. In this process each family is examined for noise URLs according to the computed similarities: a URL task in a family whose similarity value deviates markedly from those of the other URL tasks is judged to be a noise URL and removed, after which reduce tasks aggregate the denoised URL tasks onto HDFS.
The similarity used in the clustering is measured as the longest common subsequence (LCS) length divided by the mean of the two URL lengths. The LCS of two URLs is solved with a dynamic programming algorithm whose transfer equation is:
c[i, j] = 0, if i = 0 or j = 0;
c[i, j] = c[i-1, j-1] + 1, if i, j > 0 and xi = yj;
c[i, j] = max(c[i, j-1], c[i-1, j]), if i, j > 0 and xi ≠ yj;
where c[i, j] records the length of the longest common subsequence of the strings Xi = {x1, x2, ..., xi} and Yj = {y1, y2, ..., yj}, and Xi and Yj are the first i and j characters of the two URLs url1 and url2 being compared.
The LCS is computed as follows: a two-dimensional array c[i][j] records the LCS length of the url1 prefix Xi and the url2 prefix Yj; a double loop continuously updates c[i][j] as the prefixes grow, and when the loop ends the LCS length of url1 and url2 is c[m, n]. The LCS measures the structural similarity between URLs well and is suitable for large-scale URL similarity computation.
The ratio average is computed and compared with a preset threshold to judge whether a family contains noise URL tasks; the ratio average of two URLs is the LCS length divided by the mean of the two URL lengths.
S314: store the denoised URL tasks using map/reduce.
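The LCS dynamic program described above translates directly into code; url_similarity below sketches the LCS-over-mean-length measure, with the function names chosen here for illustration.

```python
def lcs_length(a, b):
    """Longest common subsequence length via the standard DP recurrence:
    c[i][j] = c[i-1][j-1] + 1 when characters match,
    else max(c[i-1][j], c[i][j-1])."""
    m, n = len(a), len(b)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[m][n]

def url_similarity(url1, url2):
    """LCS length divided by the mean of the two URL lengths."""
    return lcs_length(url1, url2) / ((len(url1) + len(url2)) / 2)
```

Two URLs that differ only in their last path segment score close to 1, which is why the measure captures structural similarity rather than exact equality.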
In step S300, before distributed parallel clustering is performed, noise URL tasks are first removed preliminarily using the DOM tree structure of the web page. This reduces the amount of data processed during clustering, improves data-processing efficiency, and shortens processing time. The preliminary removal using the DOM tree structure of the web page includes:
S301: split the page using HTML tags such as <td>, <p> and <div>, and remove tags that are relevant to rendering but irrelevant to URL tasks. In this step the page is cleaned: rendering-related content such as CSS is removed and the page is split by HTML tags such as <td>, <p> and <div>, removing tags relevant to rendering but irrelevant to URL tasks so that the structure of the DOM tree is clear;
S302: locate noise links using the link-text ratio; if the text ratio of a node is higher than 1/4, the link where that node resides is judged to be an initial noise link and removed.
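A sketch of the S302 test using Python's html.parser. Interpreting "text ratio above 1/4" as the share of a block's text that sits inside <a> tags is an assumption; the patent's wording is ambiguous on what the ratio compares.

```python
from html.parser import HTMLParser

class LinkTextRatio(HTMLParser):
    """Accumulate total text length and anchor-text length for a block."""
    def __init__(self):
        super().__init__()
        self.in_link = 0
        self.total = 0
        self.linked = 0
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
    def handle_data(self, data):
        text = data.strip()
        self.total += len(text)
        if self.in_link:
            self.linked += len(text)

def is_noise_block(html_block, threshold=0.25):
    """Flag a block whose text is dominated by link text (or is all markup)."""
    p = LinkTextRatio()
    p.feed(html_block)
    if p.total == 0:
        return True
    return p.linked / p.total > threshold
```

Navigation bars score near 1.0 on this ratio while article paragraphs score near 0, which is the contrast the preliminary filter exploits.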
S400: eliminate the URL tasks repeated in the denoised URL tasks.
In this step, the stored duplicate URL tasks are eliminated, and the remainder are saved into the "in" folder to wait for the next round of crawling. To solve the low lookup efficiency of existing URL deduplication algorithms on massive URL tasks, a cache-based URL deduplication strategy is preferred. The cache class of the caching system contains two queues, the access list and the cache list, and the URL class contains three fields: the URL string, the URL repetition count, and the URL access time. The URL task queue in the cache list is ordered from high to low by a weight computed from the URL's repetition count and access time. In the weight computation, rep denotes the repetition count, t denotes the time index, t_current denotes the URL's current access time, t_init denotes the initial time set when the program starts, and min and max denote the lower and upper bounds of the t index of the URLs in the cache list.
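The weight formula itself appears only as an image in the original and is not reproduced here. The sketch below assumes a hypothetical form, repetition count scaled by min-max-normalized recency, chosen only to match the stated behaviour that more repeats and more recent access both yield a larger weight.

```python
def url_weight(rep, t_current, t_init, t_min, t_max):
    """Hypothetical cache weight: NOT the patent's formula.
    rep: repetition count; t_current - t_init: time index of this URL;
    t_min, t_max: bounds of the time index across the cache list."""
    t = t_current - t_init
    # Min-max normalize the time index so recency lies in [0, 1].
    if t_max == t_min:
        recency = 1.0
    else:
        recency = (t - t_min) / (t_max - t_min)
    # More repeats and fresher access both increase the weight.
    return rep * (1.0 + recency)
```

Sorting the cache list by this weight puts the most frequently repeated and most recently active URLs first, so duplicate checks hit early in the scan.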
Here, the higher a URL's repetition count, the larger its weight; and the closer a URL's access time is to the current time, the more active the URL, so the weight assigned to it is also larger. A high repetition count means the URL is a common link that is more likely to be met in subsequent duplicate checks, so checking URLs in weight order greatly raises the probability of a hit. The caching system can update the accessed URL tasks at any time according to changes in the URL weights, and the strategy for removing repeated URL tasks is mainly completed by the following steps:
S401: add the collected URL tasks to the access list, record each URL task's access time, and set its repetition count to 1. In step S401, after the crawler system starts, the caching system first initializes its queues: the URL tasks under the "in" folder are added to the access list, their access times are recorded, and their repetition counts are set to 1;
S402: compare the denoised URL tasks against the access list and the cache list in turn; if a URL task is found in the access list or cache list, discard it and update its access time and repetition count; if it is found in neither, look it up among the stored URL tasks;
S403: judge whether the URL task already exists among the stored URL tasks. In step S403 the caching system is checked for the URL task: if it exists, the URL task is discarded and its access time and repetition count are updated; if it does not exist, the URL task is added under the "in" folder for the next round of crawling.
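The S401-S403 flow can be sketched as a single pass over the incoming URL tasks. Collapsing the access list and cache list into one dictionary and representing the stored tasks as a set are simplifying assumptions made for this illustration.

```python
def deduplicate(urls, access_list, stored):
    """Sketch of S401-S403: drop URLs already seen in the access list
    (bumping their repeat count) or already stored; keep the rest."""
    kept = []
    for url in urls:
        if url in access_list:
            access_list[url] += 1   # S402: repeated hit, update the count
            continue
        if url in stored:           # S403: already fetched in an earlier round
            continue
        access_list[url] = 1        # S401: first sighting, count starts at 1
        stored.add(url)
        kept.append(url)            # goes back to the "in" folder for crawling
    return kept
```

The kept URLs are what the next crawl round would pick up, while the counts in the access list feed the weight ordering described above.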
S500: judge whether the number of layers crawled by the deduplicated URL tasks is below a preset value; if so, return to step S300, otherwise perform step S600.
S600: merge the original web pages corresponding to the URL tasks crawled at each layer.
In the URL crawling method of the distributed crawler engine described in the above embodiment, URL tasks are distributed to different crawl nodes by domain name, so that different crawl nodes process URL tasks of different domain names, lightening the task load of each crawl node. Each crawl node processes its URL tasks layer by layer, removes noise URL tasks by distributed parallel clustering, and eliminates duplicate URL tasks. This achieves efficient URL processing, solves the crawler engine's URL-fetching efficiency problem under high concurrency, and balances the load of the nodes.
Referring to Fig. 2, an embodiment of the present invention also discloses a URL crawling system for a distributed crawler engine, including:
an acquisition module 100 for collecting and storing URL tasks;
a sorting module 200 for assigning, based on a multitask partitioning strategy using website hash values, the set of URLs sharing a domain name to the same crawl node for crawling, and collecting the crawl results;
a denoising module 300 for performing distributed parallel clustering on the original web pages corresponding to the crawled URL tasks and removing noise URL tasks;
a deduplication module 400 for eliminating duplicate URL tasks from the denoised URL tasks;
a judging module 500 for judging whether the number of layers crawled by the deduplicated URL tasks is below a preset value, passing control to the denoising module if so and to the merging module otherwise;
a merging module 600 for merging the original web pages corresponding to the URL tasks crawled at each layer.
In an improved embodiment of the URL crawling system described above, before distributed parallel clustering the denoising module also preliminarily removes noise URL tasks using the DOM tree structure of the web page, and includes:
a splitting unit for splitting the page using HTML tags such as <td>, <p> and <div> and removing tags that are relevant to rendering but irrelevant to URL tasks;
a comparing unit for locating noise links using the link-text ratio; if the text ratio of a node is higher than 1/4, the link where that node resides is judged to be an initial noise link and removed.
In another improved embodiment of the URL crawling system described above, for the distributed parallel clustering and removal of noise URL tasks the denoising module includes:
a blocking unit for performing domain-name resolution on the original web pages using map/reduce and dividing them into preliminary blocks;
a clustering unit for running single-pass clustering on each block and dividing the clustering result into multiple families using map/reduce;
a computing unit for computing, from the existing noise samples, the similarity of the multiple families of the clustering result using map/reduce and removing noise URL tasks according to the similarity values;
a storage unit for storing the denoised URL tasks using map/reduce.
In an improved embodiment of the URL crawling system described above, to eliminate the URL tasks repeated in the crawl results the deduplication module includes:
an establishing unit for adding the collected URL tasks to the access list, recording each URL task's access time, and setting its repetition count to 1;
a matching unit for comparing the denoised URL tasks against the access list and the cache list in turn; if a URL task is found in the access list or cache list, it is discarded and its access time and repetition count are updated; if it is found in neither, it is looked up among the stored URL tasks;
a duplicate-checking unit for judging whether the URL task already exists among the stored URL tasks, discarding it if so and otherwise updating the access list and the URL task set.
In the URL crawling system described above, during distributed parallel clustering the denoising module measures similarity as the longest common subsequence length divided by the mean of the two URL task lengths.
The URL crawling system of the distributed crawler engine in this embodiment is the system embodiment corresponding to the URL crawling method of the distributed crawler engine; the relevant technical details mentioned for the URL crawling method remain valid in this embodiment and, to reduce repetition, are not repeated here.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of narration is only for clarity, and those skilled in the art should take the specification as a whole, as the technical solutions in the embodiments may also be appropriately combined to form other embodiments understandable to those skilled in the art.
The detailed descriptions listed above are only specific illustrations of feasible embodiments of the invention; they are not intended to limit the protection scope of the invention, and all equivalent embodiments or changes made without departing from the technical spirit of the invention shall be included within its protection scope.
Claims (10)
1. A URL grabbing method of a distributed crawler engine, characterized by comprising the following steps:
S100: collecting URL tasks and storing them;
S200: based on a multi-task partition strategy using site hash values, assigning the set of URL tasks sharing the same domain name to the same crawl node for crawling, and collecting the crawl results;
S300: performing distributed parallel clustering on the original web pages corresponding to the crawled URL tasks, and removing noise URL tasks;
S400: eliminating duplicate URL tasks from the URL tasks remaining after noise removal;
S500: judging whether the crawl depth of the deduplicated URL tasks is less than a preset value; if so, returning to step S300; otherwise, executing step S600;
S600: merging the original web pages corresponding to the URL tasks crawled at each depth.
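Step S200's routing (all URLs of one domain to one crawl node) can be sketched as follows. This is an illustrative realisation only: the claim names a "multi-task partition strategy based on site hash values" without fixing the hash function, so the MD5-modulo assignment here is an assumption:

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse


def partition_urls(urls, num_nodes):
    """Bucket URL tasks by a hash of their domain so every URL of a
    given site lands on the same crawl node (sketch of step S200)."""
    buckets = defaultdict(list)
    for url in urls:
        domain = urlparse(url).netloc
        # Stable hash of the domain, reduced modulo the node count.
        node = int(hashlib.md5(domain.encode()).hexdigest(), 16) % num_nodes
        buckets[node].append(url)
    return buckets
```

Because the node index depends only on the domain, politeness limits and DNS caching per site stay local to one node, which is the usual motivation for this partitioning.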
2. The URL grabbing method of a distributed crawler engine according to claim 1, characterized in that: in step S300, before the distributed parallel clustering is performed, noise URL tasks are preliminarily removed using the DOM tree structure of the web page, comprising:
S301: splitting the page using HTML tags such as <td>, <p> and <div>, and removing tags that are related to rendering but unrelated to URL tasks;
S302: locating noise links using the link-character ratio: if the character ratio of a node is higher than 1/4, the links within that node are judged to be initial noise links and removed.
3. The URL grabbing method of a distributed crawler engine according to claim 1 or 2, characterized in that: in step S300, the distributed parallel clustering and removal of noise URL tasks comprise the following steps:
S311: performing domain-name resolution on the original web pages in a MapReduce manner, and carrying out a preliminary partition into blocks;
S312: applying single-pass clustering to each block, and dividing the clustering result into a plurality of clusters in a MapReduce manner;
S313: according to existing noise samples, performing similarity calculation on the plurality of clusters of the clustering result in a MapReduce manner, and removing noise URL tasks according to the similarity values;
S314: storing the denoised URL tasks in a MapReduce manner.
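Stripped of the MapReduce plumbing, the single-pass clustering of step S312 can be sketched as below. The similarity callback and the 0.6 threshold are assumptions; the claim fixes neither, and the similarity measure of claim 5 (mean common-substring ratio) is one natural choice:

```python
def single_pass_cluster(urls, similarity, threshold=0.6):
    """Single-pass clustering: each URL joins the first existing cluster
    whose representative it resembles above the threshold, otherwise it
    starts a new cluster. One scan over the data, as the name implies."""
    clusters = []  # each cluster is a list; its first element is the representative
    for url in urls:
        for cluster in clusters:
            if similarity(url, cluster[0]) >= threshold:
                cluster.append(url)
                break
        else:
            clusters.append([url])
    return clusters
```

In the MapReduce setting, each mapper would run this over its block and the reducer would merge clusters whose representatives are themselves similar.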
4. The URL grabbing method of a distributed crawler engine according to claim 1 or 2, characterized in that: in step S400, eliminating the duplicate URL tasks from the crawl results comprises:
S401: adding the collected URL tasks to an access list, recording the access time of each URL task, and setting its repetition count to 1;
S402: comparing the URL tasks remaining after noise removal against the access list and a cache list in turn; if a URL task is found in the access list or the cache list, discarding the URL task and updating its access time and repetition count; if it is found in neither list, accessing the stored URL tasks;
S403: judging whether the URL task already exists among the stored URL tasks; if so, discarding the URL task; otherwise, updating the access list and the URL task set.
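The access-list / cache-list flow of steps S401–S403 can be sketched as a small class. The field names and the in-memory dictionaries are illustrative assumptions; the patent describes the lists abstractly and does not specify their storage:

```python
import time


class UrlDeduplicator:
    """Sketch of steps S401-S403: a URL already present in the access
    list or cache list is discarded and its access time and repetition
    count are updated; an unseen URL is admitted and recorded."""

    def __init__(self):
        self.access_list = {}  # url -> {"time": last access, "count": repetitions}
        self.cache_list = {}   # recently seen URLs kept in memory

    def admit(self, url):
        """Return True if the URL task is new and should be crawled."""
        now = time.time()
        if url in self.access_list or url in self.cache_list:
            # Duplicate: discard, but refresh its bookkeeping (S402).
            entry = self.access_list.setdefault(url, {"time": now, "count": 0})
            entry["time"] = now
            entry["count"] += 1
            return False
        # New URL: record it with repetition count 1 (S401/S403).
        self.access_list[url] = {"time": now, "count": 1}
        self.cache_list[url] = now
        return True
```

A real deployment would bound the cache list (e.g. LRU) and persist the access list, since a distributed crawl cannot keep every URL in one process's memory.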
5. The URL grabbing method of a distributed crawler engine according to claim 1, 3 or 4, characterized in that: when the distributed parallel clustering is performed, the similarity is measured as the mean of the ratios that the maximum common word length occupies in the lengths of the two URL tasks.
6. A URL grabbing system of a distributed crawler engine, characterized by comprising:
an acquisition module, configured to collect URL tasks and store them;
a classification module, configured to assign, based on a multi-task partition strategy using site hash values, the set of URL tasks sharing the same domain name to the same crawl node for crawling, and to collect the crawl results;
a denoising module, configured to perform distributed parallel clustering on the original web pages corresponding to the crawled URL tasks and to remove noise URL tasks;
a deduplication module, configured to eliminate duplicate URL tasks from the URL tasks remaining after noise removal;
a judgment module, configured to judge whether the crawl depth of the deduplicated URL tasks is less than a preset value; if so, execution returns to the denoising module; otherwise, it passes to the merging module;
a merging module, configured to merge the original web pages corresponding to the URL tasks crawled at each depth.
7. The URL grabbing system of a distributed crawler engine according to claim 6, characterized in that: before the distributed parallel clustering is performed, the denoising module preliminarily removes noise URL tasks using the DOM tree structure of the web page, and comprises:
a splitting unit, configured to split the page using HTML tags such as <td>, <p> and <div>, removing tags that are related to rendering but unrelated to URL tasks;
a comparison unit, configured to locate noise links using the link-character ratio: if the character ratio of a node is higher than 1/4, the links within that node are judged to be initial noise links and removed.
8. The URL grabbing system of a distributed crawler engine according to claim 6 or 7, characterized in that: the denoising module performs distributed parallel clustering and removes noise URL tasks, and comprises:
a blocking unit, configured to perform domain-name resolution on the original web pages in a MapReduce manner and to carry out a preliminary partition into blocks;
a clustering unit, configured to apply single-pass clustering to each block and to divide the clustering result into a plurality of clusters in a MapReduce manner;
a calculation unit, configured to perform, according to existing noise samples, similarity calculation on the plurality of clusters of the clustering result in a MapReduce manner, and to remove noise URL tasks according to the similarity values;
a storage unit, configured to store the denoised URL tasks in a MapReduce manner.
9. The URL grabbing system of a distributed crawler engine according to claim 6 or 7, characterized in that: the deduplication module eliminates the duplicate URL tasks from the crawl results, and comprises:
an establishing unit, configured to add the collected URL tasks to an access list, record the access time of each URL task, and set its repetition count to 1;
a matching unit, configured to compare the URL tasks remaining after noise removal against the access list and the cache list in turn; if a URL task is found in the access list or the cache list, the URL task is discarded and its access time and repetition count are updated; if it is found in neither list, the stored URL tasks are accessed;
a duplicate-checking unit, configured to judge whether the URL task already exists among the stored URL tasks; if so, the URL task is discarded; otherwise, the access list and the URL task set are updated.
10. The URL grabbing system of a distributed crawler engine according to claim 6, 8 or 9, characterized in that: when performing the distributed parallel clustering, the denoising module measures the similarity as the mean of the ratios that the maximum common word length occupies in the lengths of the two URL tasks.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611037722.XA CN106776768B (en) | 2016-11-23 | 2016-11-23 | A kind of URL grasping means of distributed reptile engine and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611037722.XA CN106776768B (en) | 2016-11-23 | 2016-11-23 | A kind of URL grasping means of distributed reptile engine and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106776768A true CN106776768A (en) | 2017-05-31 |
CN106776768B CN106776768B (en) | 2018-02-02 |
Family
ID=58974402
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611037722.XA Active CN106776768B (en) | 2016-11-23 | 2016-11-23 | A kind of URL grasping means of distributed reptile engine and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106776768B (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107766581A (en) * | 2017-11-23 | 2018-03-06 | 安徽科创智慧知识产权服务有限公司 | The method that Data duplication record cleaning is carried out to URL |
CN107943588A (en) * | 2017-11-22 | 2018-04-20 | 用友金融信息技术股份有限公司 | Data processing method, system, computer equipment and readable storage medium storing program for executing |
CN107992534A (en) * | 2017-11-23 | 2018-05-04 | 安徽科创智慧知识产权服务有限公司 | The method that improved sort key sorts data set |
CN108804576A (en) * | 2018-05-22 | 2018-11-13 | 华中科技大学 | A kind of domain name hierarchical structure detection method based on link analysis |
CN109165334A (en) * | 2018-09-20 | 2019-01-08 | 恒安嘉新(北京)科技股份公司 | A method of establishing CDN producer primary knowledge base |
CN109740037A (en) * | 2019-01-02 | 2019-05-10 | 山东省科学院情报研究所 | The distributed online real-time processing method of multi-source, isomery fluidised form big data and system |
CN109739849A (en) * | 2019-01-02 | 2019-05-10 | 山东省科学院情报研究所 | A kind of network sensitive information of data-driven excavates and early warning platform |
CN111274467A (en) * | 2019-12-31 | 2020-06-12 | 中国电子科技集团公司第二十八研究所 | Large-scale data acquisition-oriented three-layer distributed deduplication architecture and method |
CN112597369A (en) * | 2020-12-22 | 2021-04-02 | 荆门汇易佳信息科技有限公司 | Webpage spider theme type search system based on improved cloud platform |
CN112612939A (en) * | 2020-12-18 | 2021-04-06 | 山东中创软件工程股份有限公司 | Crawler deployment method, system, device, equipment and storage medium |
CN113807087A (en) * | 2020-06-16 | 2021-12-17 | 中国电信股份有限公司 | Website domain name similarity detection method and device |
CN113821754A (en) * | 2021-09-18 | 2021-12-21 | 上海观安信息技术股份有限公司 | Sensitive data interface crawler identification method and device |
CN113965371A (en) * | 2021-10-19 | 2022-01-21 | 北京天融信网络安全技术有限公司 | Task processing method, device, terminal and storage medium in website monitoring process |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019180491A1 (en) * | 2018-03-22 | 2019-09-26 | Pratik Sharma | Uniform resource locator identification service |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110307467A1 (en) * | 2010-06-10 | 2011-12-15 | Stephen Severance | Distributed web crawler architecture |
CN103389983A (en) * | 2012-05-08 | 2013-11-13 | 阿里巴巴集团控股有限公司 | Webpage content grabbing method and device applied to network crawler system |
CN103714139A (en) * | 2013-12-20 | 2014-04-09 | 华南理工大学 | Parallel data mining method for identifying a mass of mobile client bases |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN104657399A (en) * | 2014-01-03 | 2015-05-27 | 广西科技大学 | Web crawler control method |
- 2016-11-23 CN CN201611037722.XA patent/CN106776768B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110307467A1 (en) * | 2010-06-10 | 2011-12-15 | Stephen Severance | Distributed web crawler architecture |
CN103389983A (en) * | 2012-05-08 | 2013-11-13 | 阿里巴巴集团控股有限公司 | Webpage content grabbing method and device applied to network crawler system |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN103714139A (en) * | 2013-12-20 | 2014-04-09 | 华南理工大学 | Parallel data mining method for identifying a mass of mobile client bases |
CN104657399A (en) * | 2014-01-03 | 2015-05-27 | 广西科技大学 | Web crawler control method |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107943588A (en) * | 2017-11-22 | 2018-04-20 | 用友金融信息技术股份有限公司 | Data processing method, system, computer equipment and readable storage medium storing program for executing |
CN107766581A (en) * | 2017-11-23 | 2018-03-06 | 安徽科创智慧知识产权服务有限公司 | The method that Data duplication record cleaning is carried out to URL |
CN107992534A (en) * | 2017-11-23 | 2018-05-04 | 安徽科创智慧知识产权服务有限公司 | The method that improved sort key sorts data set |
CN108804576A (en) * | 2018-05-22 | 2018-11-13 | 华中科技大学 | A kind of domain name hierarchical structure detection method based on link analysis |
CN108804576B (en) * | 2018-05-22 | 2021-08-20 | 华中科技大学 | Domain name hierarchical structure detection method based on link analysis |
CN109165334A (en) * | 2018-09-20 | 2019-01-08 | 恒安嘉新(北京)科技股份公司 | A method of establishing CDN producer primary knowledge base |
CN109165334B (en) * | 2018-09-20 | 2022-05-27 | 恒安嘉新(北京)科技股份公司 | Method for establishing CDN manufacturer basic knowledge base |
CN109740037A (en) * | 2019-01-02 | 2019-05-10 | 山东省科学院情报研究所 | The distributed online real-time processing method of multi-source, isomery fluidised form big data and system |
CN109739849A (en) * | 2019-01-02 | 2019-05-10 | 山东省科学院情报研究所 | A kind of network sensitive information of data-driven excavates and early warning platform |
CN109740037B (en) * | 2019-01-02 | 2023-11-24 | 山东省科学院情报研究所 | Multi-source heterogeneous flow state big data distributed online real-time processing method and system |
CN109739849B (en) * | 2019-01-02 | 2021-06-29 | 山东省科学院情报研究所 | Data-driven network sensitive information mining and early warning platform |
CN111274467A (en) * | 2019-12-31 | 2020-06-12 | 中国电子科技集团公司第二十八研究所 | Large-scale data acquisition-oriented three-layer distributed deduplication architecture and method |
CN113807087A (en) * | 2020-06-16 | 2021-12-17 | 中国电信股份有限公司 | Website domain name similarity detection method and device |
CN113807087B (en) * | 2020-06-16 | 2023-11-28 | 中国电信股份有限公司 | Method and device for detecting similarity of website domain names |
CN112612939A (en) * | 2020-12-18 | 2021-04-06 | 山东中创软件工程股份有限公司 | Crawler deployment method, system, device, equipment and storage medium |
CN112597369A (en) * | 2020-12-22 | 2021-04-02 | 荆门汇易佳信息科技有限公司 | Webpage spider theme type search system based on improved cloud platform |
CN113821754A (en) * | 2021-09-18 | 2021-12-21 | 上海观安信息技术股份有限公司 | Sensitive data interface crawler identification method and device |
CN113965371A (en) * | 2021-10-19 | 2022-01-21 | 北京天融信网络安全技术有限公司 | Task processing method, device, terminal and storage medium in website monitoring process |
CN113965371B (en) * | 2021-10-19 | 2023-08-29 | 北京天融信网络安全技术有限公司 | Task processing method, device, terminal and storage medium in website monitoring process |
Also Published As
Publication number | Publication date |
---|---|
CN106776768B (en) | 2018-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106776768B (en) | A kind of URL grasping means of distributed reptile engine and system | |
Cambazoglu et al. | Scalability challenges in web search engines | |
Ma et al. | Big graph search: challenges and techniques | |
CN103310012A (en) | Distributed web crawler system | |
CN106776929A (en) | A kind of method for information retrieval and device | |
CN103488680A (en) | Combinators to build a search engine | |
CN104866471B (en) | A kind of example match method based on local sensitivity Hash strategy | |
CN103678550B (en) | Mass data real-time query method based on dynamic index structure | |
CN107423535A (en) | For the methods, devices and systems for the medical conditions for determining user | |
CN108322428A (en) | A kind of abnormal access detection method and equipment | |
CN103226609A (en) | Searching method for WEB focus searching system | |
CN107122238A (en) | Efficient iterative Mechanism Design method based on Hadoop cloud Computational frame | |
CN103258017A (en) | Method and system for parallel square crossing network data collection | |
CN104573082B (en) | Space small documents distributed data storage method and system based on access log information | |
Vrbić | Data mining and cloud computing | |
CN112597369A (en) | Webpage spider theme type search system based on improved cloud platform | |
do Carmo Oliveira et al. | Set similarity joins with complex expressions on distributed platforms | |
Amalarethinam et al. | A study on performance evaluation of peer-to-peer distributed databases | |
Sundarakumar et al. | An Approach in Big Data Analytics to Improve the Velocity of Unstructured Data Using MapReduce | |
da Silva et al. | Efficient and distributed dbscan algorithm using mapreduce to detect density areas on traffic data | |
Zhong et al. | A web crawler system design based on distributed technology | |
Ren et al. | [Retracted] A Study on Information Classification and Storage in Cloud Computing Data Centers Based on Group Collaborative Intelligent Clustering | |
Maratea et al. | An heuristic approach to page recommendation in web usage mining | |
Henrique et al. | A new approach for verifying url uniqueness in web crawlers | |
Zhang et al. | Scalable Online Interval Join on Modern Multicore Processors in OpenMLDB |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||