CN103310012B - A kind of distributed network crawler system - Google Patents
- Publication number: CN103310012B
- Application number: CN201310274951.3A
- Authority: CN (China)
- Prior art keywords: url, child node, theme, page, server
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
A distributed web crawler system for the field of network information collection, comprising a management portal, a central node server, and distributed child node servers. The management portal is the web interface that the crawler system provides to administrators; through it they can view the logs of the central node server and the distributed child node servers, add topics, update the URL seeds of a topic, configure a topic's crawl frequency parameters, and control the state of the crawler. The central node server and the distributed child node servers form the main body of the system and handle topic operations, the training of data extractors, page analysis, and the storage of target pages. The invention lets a single crawler accommodate the crawling of different topics and improves the speed and quality of page crawling, which previously could not meet user requirements.
Description
Technical field
The present invention relates to a distributed web crawler system and belongs to the field of network information collection.
Background technology
The rapid development of the network has brought explosive growth in the volume of web information. Traditional general-purpose search engines, as tools for retrieving internet information, have become increasingly important, but limitations such as low network coverage and a high miss rate mean they cannot give users the most comprehensive information. To overcome these shortcomings, topic-specific search engines emerged; their goal is to give users the most accurate results in their field of interest while consuming limited bandwidth and hardware resources.
The topic crawler is the foundation of a topic-specific search engine, and the speed and quality with which it crawls pages are important indicators of search engine quality. It is a system that automatically downloads pages within a restricted domain, screening and fetching pages according to a best-first priority order and their relevance to the topic. Unlike a general crawler, a topic crawler does not pursue high coverage but selectively fetches topic-related pages, with the advantages of low resource usage, convenient index database updates, and accurate cached pages.
However, the prior art can neither judge the relevance of a page to a topic nor accommodate the crawling of different topics within one crawler system, so the speed and quality of page crawling fail to meet user requirements.
Summary of the invention
Technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a distributed web crawler system in which a single crawler accommodates the crawling of different topics, improving the speed and quality of page crawling, which previously could not meet user requirements.
Technical solution of the invention: a distributed web crawler system comprising a management portal, a central node server, and distributed child node servers. The management portal is the web interface that the crawler system provides to administrators; through it they can view the logs of the central node server and the distributed child node servers, add topics, update the URL seeds of a topic, configure a topic's crawl frequency parameters, and control the state of the crawler. The central node server and the distributed child node servers form the main body of the system and handle topic operations, the training of data extractors, page analysis, and the storage of target pages.
(1) The central node server comprises a URL controller, an extractor module, and a topic control module.
The topic control module receives the data sent by the management portal through the management interface, including topic description data, add and delete operation data, and data controlling topic crawl frequency. It carries out the topic operations: describing, adding, and deleting topics; controlling topic crawl frequency; and editing each topic's seed queue. It sends the topic seed queue to the extractor module and the URL controller.
After receiving the topic seed queue, the extractor module first uses a basic analyzer to classify the pages that the seed queue's URLs point to into Deep Web pages and data-intensive pages, then extracts from and analyzes each of the two page types. The analysis yields the data extractor that matches each type. The module then records each URL together with its data extractor and sends the records to the URL controller.
The URL controller receives the seed queue sent by the topic control module and the URL-to-extractor records sent by the extractor module, and merges the two: each URL is paired with its data extractor, and a URL with no matching extractor is paired with the general extractor. It queues all URLs and uses the task partitioning method to send tasks to the distributed crawler child nodes.
(2) Each distributed child node server comprises a child node URL controller, a data extractor, a search controller, and a page fetcher.
The child node URL controller receives the seed URLs and the corresponding data extractor information sent by the central node server. On receiving a URL it first checks for duplicates, then places the URLs that have not already been collected into the queue, and sends the queued URLs and their data extractor information to the data extractor and the page fetcher.
The data extractor analyzes Deep Web pages from the child node URL queue and extracts the form URL in the page to build a new URL, which is equivalent to the object obtained after the form is submitted, and passes it to the page fetcher. After receiving a page from the search controller, it uses the extractor that corresponds to the page's URL to extract the content and the URLs in the page, then sends those URLs to the URL database to await collection.
The page fetcher receives the URLs sent by the child node URL controller and the data extractor, crawls the pages, and hands the fetched pages to the search controller.
The search controller analyzes each collected page: a page that meets the requirements is saved to the page library; otherwise the page is passed to the data extractor.
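The task partitioning method uses weighted least-connection scheduling and is implemented as follows:
(1) Compute $\alpha_i$, the ratio of the difference between the seed's PR and the minimum PR in the URL queue to the difference between the maximum and minimum PR:
$$\alpha_i = \frac{PR_i - \mathrm{Low}(PR)}{\mathrm{Top}(PR) - \mathrm{Low}(PR)}$$
where PR is the page rank, $i = 0, 1, \dots, n-1$, and $\mathrm{Low}(PR)$ and $\mathrm{Top}(PR)$ are the minimum and maximum PR in the queue, respectively.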
And minima;
(2) calculating the weight of search depth, the target pages degree of depth of the context guide is 3, the power of search depth
Ghost image rings factor-betaiFor itself degree of depth LiInverse:
βi=1/Li
(3) Compute the crawl frequency factor with the Sigmoid function, which is smooth and strictly monotonic, with values in the range (0.5, 1). The normalized crawl frequency $x_i$ is computed as:
$$x_i = \frac{F_i - \mathrm{Low}}{\mathrm{Top} - \mathrm{Low}}$$
where $F_i$ is the crawl frequency of the seed, and Low and Top are the minimum and maximum frequency in the queue, respectively.
The crawl frequency influence factor $\gamma_i$ is computed as:
$$\gamma_i = \frac{1}{1 + e^{-a x_i}}$$
where $a > 1$ is the weighting factor applied to the linearly normalized result; its purpose is to amplify the result of the first step.
(4) Judging from the Sigmoid function curve, $a$ is set to 2.5 in the system of the invention. The priority weight of a seed is therefore the arithmetic mean of the three influence factors:
$$Q_i = \frac{\alpha_i + \beta_i + \gamma_i}{3}$$
(5) Seeds are then sorted in descending order of $Q_i$. The URL queue in a child node inherits the URL weight algorithm of the central node. Because an extractor-guided crawler only crawls within the site defined by its seed, the crawl frequency and site importance factors in Q stay constant and only the search depth factor changes, so the weight is computed as:
$$Q = Q_{prev} - \frac{\beta_{prev}}{3} + \frac{\beta}{3}$$
where $Q_{prev}$ is the weight passed down from the parent URL, $\beta_{prev}$ is the search depth factor of the parent URL, and $\beta$ is the search depth factor of the target URL. Child node queues hold many URLs, so a bisection sort is used, trading space for efficiency. Theoretical analysis and practical testing show that URL weights are distributed evenly and smoothly between 0 and 1. This avoids the situation in which the sharp decay of one factor lets a single factor dominate, while taking into account the three main factors of target object, crawl strategy, and search depth, so that priority differences are expressed at fine granularity. One special case the algorithm handles is a request passed in from a searcher: its priority is the highest, Q is set to 1, and the weight does not decay as it is passed on.
(6) Each child node has a weight that represents its processing capacity. The default weight is 1, and the system administrator can adjust server weights dynamically. When scheduling a new connection, weighted least-connection scheduling tries to keep each server's number of established connections proportional to its weight. The algorithm proceeds as follows:
Suppose there is a set of servers $S = \{S_0, S_1, \dots, S_{n-1}\}$, where $W(S_i)$ is the weight of server $S_i$ and $C(S_i)$ is its current number of connections. The total number of current connections over all servers is $CSUM = \sum C(S_i)$ $(i = 0, 1, \dots, n-1)$.
The current new connection request (the seed) is sent to server $S_m$ if and only if $S_m$ satisfies:
$$\frac{C(S_m)/CSUM}{W(S_m)} = \min_{i}\left\{\frac{C(S_i)/CSUM}{W(S_i)}\right\}, \quad i = 0, 1, \dots, n-1$$
where $W(S_i)$ is not 0. Child node logs are fed back to the central node at regular intervals, and the child server connection count $C(S_i)$ is obtained by reading the logs. The method compares the ratio of each child node's connection count to its prior weight, finds the least-loaded child node, and assigns it the new crawl task.
The advantages of the invention over the prior art are:
(1) The invention provides a search engine for multiple topics within a domain, comprising a series of topic search subsystems (for example air tickets and hotels) that share one crawler, so a single crawler accommodates the crawling of different topics. The invention also provides a new task partitioning algorithm for task distribution in distributed crawlers. Existing literature on this architecture describes it only in outline and does not solve the problems that can arise when multiple topics coexist, such as URL distribution and algorithm compatibility; the invention solves these problems.
(2) The architecture of the invention uses a multi-topic strategy based on classification labeling, which solves the problem of adaptive multi-topic compatibility within one crawler system. Its two-level weighted task partitioning algorithm solves goal-oriented, load-balanced URL distribution and improves system scalability.
(3) The improved URL storage strategy proposed by the invention supports efficient URL lookup, insertion, and duplicate detection. The topic search system developed on this basis gives users rich topic-specific input interfaces and returns accurate structured content, and its crawler uses an architecture based on data extractors. Existing literature on this architecture describes it only in outline and does not solve the problems that can arise when multiple topics coexist, such as URL distribution and algorithm compatibility.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall architecture of the distributed crawler system of the invention;
Fig. 2 is an architecture diagram of the central node server in the invention;
Fig. 3 is an architecture diagram of the distributed child node server in the invention.
Detailed description of the invention
The system uses a distributed architecture based on data extractors, consisting of one central master node and distributed crawler servers that work together; Fig. 1 shows the overall architecture.
As shown in Fig. 1, the invention consists mainly of the following modules:
1. Management portal
The management portal is the web interface that the crawler system provides to administrators. Through it they can view the logs of the central node and the child servers, add topics, update the URL seeds of a topic, configure parameters such as a topic's crawl frequency, and control the state of the crawler. The central node and the distributed crawlers form the main body of the system and handle topic operations, the training of data extractors, page analysis, and the storage of target pages.
2. Central node server
The crawler's central master node is the control hub and consists mainly of the URL controller, the extractor module, and the topic control module, as shown in Fig. 2. The functions of the three modules are described below:
(1) Topic control module
The topic control module carries out topic operations: describing, adding, and deleting topics; controlling topic crawl frequency; and editing each topic's seed queue. The seed queue is made up of authority pages for the topic, that is, the most representative pages of the topic, which can serve as starting points for a series of target information. For a hotel search topic crawler, for example, the authority page is a room-booking site's query page containing a form, or the start page of its hotel listing. A general search engine is first used to search for the topic description words, giving an extended page set for the topic; because this set is small, the seed queue of authority pages is then obtained by manual screening.
(2) Extractor module
Using a content-based page analysis algorithm and starting from the URL seeds, the module trains a data extractor for the authority site that each seed represents. Seeds that meet the requirements of the previous module fall into two classes: Deep Web pages and data-intensive pages. A basic classifier based on stored features can tell the two apart. For Deep Web pages, a domain-specific, instance-based query probing method, improved with a tourism-domain dictionary, is used to match the most complete set of interface inputs. For the structural features of data-intensive pages, a strategy of page blocking and directory discovery is used to extract the URLs of lower-level pages. Through this process a suitable data extractor (parser path and search depth) is found for each URL seed, and during crawling on the child nodes this model guides the parsing of pages from the target site that the seed represents.
(3) URL controller
The URL controller is mainly responsible for ordering the URL queue in the central node and for partitioning tasks according to the load reported by each child node. Because a two-level URL allocation strategy is used, the central node server stores only seed URLs. The sorting algorithm sets priority from the topic crawl frequency and the weight of the site that the seed represents, and computes the concurrency required per unit time. Task partitioning uses weighted least-connection scheduling.
The central node server operates as follows:
(1) The topic control module receives the data sent by the management portal through the management interface (topic description data, add and delete operation data, and data controlling topic crawl frequency). It carries out the topic operations: describing, adding, and deleting topics; controlling topic crawl frequency; and editing each topic's seed queue. It then sends the topic seed queue to the extractor module and the URL controller. (Note: a seed queue is a URL queue, that is, a group of URLs, but its URLs are special: a seed queue URL is generally the home page URL of a target site to be collected, or the home page URL of a section to be collected.)
(2) After receiving the topic seed queue, the extractor module first uses the basic analyzer to classify the pages that the seed URLs point to into Deep Web pages and data-intensive pages (classification first requires visiting the page at each URL and collecting its content, after which the classification is performed). It then extracts from and analyzes each of the two page types (see the description of the extractor module above for details). The analysis yields the data extractor that matches each type. The module records each URL together with its data extractor and sends the records to the URL controller.
(3) The URL controller receives the seed queue sent by the topic control module and the URL-to-extractor records sent by the extractor module, and merges the two: each URL is paired with its data extractor, and a URL with no matching extractor is paired with the general extractor. It queues all URLs and uses the task partitioning method to send tasks to the distributed crawler child nodes.
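The merge performed by the URL controller in step (3) can be illustrated with a minimal Python sketch. The function name `build_task_queue`, the constant `GENERAL_EXTRACTOR`, and the example URLs and extractor names are hypothetical; the text only states that each URL is paired with its extractor and that unmatched URLs fall back to the general extractor.

```python
# Hypothetical identifier for the fallback ("general") extractor.
GENERAL_EXTRACTOR = "general"


def build_task_queue(seed_queue, extractor_records):
    """Pair every seed URL with its extractor, defaulting to the general one.

    seed_queue: list of seed URLs from the topic control module.
    extractor_records: dict mapping URL -> extractor name, as sent by
    the extractor module.
    """
    tasks = []
    for url in seed_queue:
        extractor = extractor_records.get(url, GENERAL_EXTRACTOR)
        tasks.append((url, extractor))
    return tasks


# Example data (illustrative only).
seeds = [
    "http://hotel.example/search",
    "http://tickets.example/list",
    "http://other.example/",
]
records = {
    "http://hotel.example/search": "deep_web_hotel",
    "http://tickets.example/list": "data_intensive_tickets",
}
tasks = build_task_queue(seeds, records)
```

The third seed has no record, so it is paired with the general extractor; the resulting list is what would then be ordered and partitioned among the child nodes.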
3. Distributed child node server
As shown in Fig. 3, the distributed child node servers are the components that actually perform the crawl. Each consists mainly of a child node URL controller, a data extractor, a search controller, and a page fetcher.
A distributed crawler child node operates as follows:
(1) The child node URL controller receives the seed URLs and the corresponding data extractor information sent by the central node server. On receiving a URL it first checks for duplicates (there are many URL duplicate-checking methods for internet crawlers; as this is not the focus here, none is described in detail, and any general URL duplicate-checking method can be applied), then places the URLs that have not already been collected into the queue. It sends the queued URLs and their data extractor information to the data extractor module and the page fetcher module.
(2) The data extractor analyzes Deep Web pages from the child node URL queue and extracts the form URL in the page to build a new URL, which is equivalent to the object obtained after the form is submitted, and passes it to the page fetcher.
(3) The page fetcher module receives the URLs sent in the previous two steps and then crawls the pages.
(4) The pages fetched by the page fetcher are handed to the search controller.
(5) The search controller receives the collected pages and analyzes them: a page that meets the requirements is saved to the page library; otherwise the page is passed to the data extractor.
(6) After receiving a page from the search controller, the data extractor uses the extractor that corresponds to the page's URL to extract the content and the URLs in the page, then sends those URLs to the URL database to await collection.
Each module is described in detail below.
(1) Child node URL controller
The child node URL controller receives the seed URLs distributed by the central node and the URLs extracted from pages, and stores them in the URL database. Storage uses a Trie data structure, which supports duplicate detection and fast insertion of newly added URLs. The URL database applies an update policy with the site as the unit, which ensures that content updates are not blocked by duplicate detection. Valid URLs that pass detection are ordered by the two-level URL weighted-transfer sorting algorithm: the weight passed from the parent page is combined with the depth in the search strategy to set priority, and the URLs are passed to the page fetcher.
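To illustrate how a Trie supports duplicate detection and insertion in one pass, here is a minimal Python sketch. It is a character-level trie built from nested dictionaries; the class name `UrlTrie` and the method `insert_if_new` are hypothetical, and the patent does not specify the node layout or how URLs are split into keys.

```python
# Sentinel key marking "a URL ends here". The empty string can never
# collide with a single-character edge label.
_END = ""


class UrlTrie:
    """Character-level trie for URL duplicate detection (illustrative)."""

    def __init__(self):
        self._root = {}

    def insert_if_new(self, url):
        """Insert url; return True if it was new, False if already present."""
        node = self._root
        for ch in url:
            # Walk down, creating child nodes as needed.
            node = node.setdefault(ch, {})
        if node.get(_END):
            return False  # duplicate: this exact URL was stored before
        node[_END] = True
        return True


trie = UrlTrie()
first = trie.insert_if_new("http://a.example/list?page=1")
again = trie.insert_if_new("http://a.example/list?page=1")
prefix = trie.insert_if_new("http://a.example/list")
```

Lookup and insertion both cost time proportional to the URL length, and URLs from the same site share their common prefix, which fits a database that is updated site by site.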
(2) Data extractor
Deep Web objects from the URL queue are processed with the query probing algorithm trained on the central node: pattern matching on the concrete parameter inputs forms a new URL, equivalent to the object obtained after the form is submitted, which is passed to the page fetcher. The module's other input is pages that the search controller has judged; the URL search strategy guarantees that these are data-intensive pages. Using the trained page-block discovery algorithm, the module extracts the two kinds of URLs of interest, page-turning links and links to lower-level data pages, and sends them to the URL database.
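The idea of forming a new URL that is "equivalent to the object after form submission" can be sketched in Python for the simple case of a form submitted with GET. This is an assumption: the patent does not say which submission method or parameter encoding is used, and the function name `build_query_url`, the form action, and the parameter names are hypothetical.

```python
from urllib.parse import urlencode, urljoin


def build_query_url(page_url, form_action, params):
    """Build the URL a GET form submission would request (illustrative).

    page_url: URL of the Deep Web page that contains the form.
    form_action: the form's action attribute, possibly relative.
    params: dict of matched input names to probe values.
    """
    base = urljoin(page_url, form_action)
    # Sort parameters so the same probe always yields the same URL,
    # which keeps later duplicate detection reliable.
    query = urlencode(sorted(params.items()))
    separator = "&" if "?" in base else "?"
    return base + separator + query


new_url = build_query_url(
    "http://hotel.example/search.html",
    "/query",
    {"city": "Beijing", "nights": 2},
)
```

The resulting URL can be handed to the page fetcher like any other queued URL.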
(3) Page fetcher
This is a multi-threaded parallel module responsible for collecting pages over the HTTP protocol. The basic steps are:
A. Extract the target site address and port number from the page URL, and open a network connection to that address and port.
B. Assemble the HTTP request header from the page URL and send it to the target site. If no response is received within a set time, stop crawling this page and abandon it; otherwise continue to the next step.
C. Analyze the response. If the returned status code is 2xx, the correct page has been returned; go to the next step. If the status code is 301 or 302, the page has been redirected; extract the new target URL from the response header and return to the previous step. If any other status code is returned, the page connection has failed; stop crawling this page and abandon it.
D. Extract page information such as date, length, and page type from the response header.
E. Read the content of the page. For longer pages, read in blocks and then splice them together to ensure the integrity of the page content.
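The fetch steps above can be sketched with Python's standard `http.client` module. This is a single-threaded illustration under stated assumptions: the function names, the 10-second timeout, the redirect limit, the 8192-byte block size, the `User-Agent` value, and the HTTPS handling are not taken from the patent.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def parse_target(url):
    """Step A: split a URL into scheme, host, port, and request path."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.scheme, parts.hostname, port, path


def next_action(status):
    """Step C: decide what to do with a response status code."""
    if 200 <= status < 300:
        return "read"
    if status in (301, 302):
        return "redirect"
    return "abandon"


def fetch(url, timeout=10.0, max_redirects=5, block_size=8192):
    """Return (info, body) for a page, or None if the page is abandoned."""
    for _ in range(max_redirects + 1):
        scheme, host, port, path = parse_target(url)
        conn_cls = (http.client.HTTPSConnection if scheme == "https"
                    else http.client.HTTPConnection)
        conn = conn_cls(host, port, timeout=timeout)
        try:
            # Step B: send the request; a timeout raises OSError below.
            conn.request("GET", path, headers={"User-Agent": "sketch-crawler"})
            resp = conn.getresponse()
            action = next_action(resp.status)
            if action == "redirect":
                location = resp.getheader("Location")
                if not location:
                    return None
                url = urljoin(url, location)
                continue
            if action == "abandon":
                return None
            # Step D: collect page information from the response header.
            info = {
                "date": resp.getheader("Date"),
                "length": resp.getheader("Content-Length"),
                "type": resp.getheader("Content-Type"),
            }
            # Step E: read in blocks, then splice the blocks together.
            blocks = []
            while True:
                block = resp.read(block_size)
                if not block:
                    break
                blocks.append(block)
            return info, b"".join(blocks)
        except (OSError, http.client.HTTPException):
            return None
        finally:
            conn.close()
    return None
```

In a multi-threaded fetcher, `fetch` would be called from worker threads that take URLs from the priority queue.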
(4) Search controller
The search strategy of the invention is a best-first search strategy improved for the specific application. Analysis of Deep Web pages and directory-block pages shows that most target objects are body-text pages and that the crawl depth does not exceed 3 levels. The module judges the fetched page content according to the search strategy: body-text pages that meet the search depth are stored in the page library to await structuring by the index module; otherwise the page is passed to the corresponding data extractor for page analysis and URL extraction.
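The routing decision made by the search controller can be sketched in Python as follows. How a page is recognized as a body-text page is not specified in the text, so the `is_body_text` flag, the dictionary layout of a page, and the function name `route_page` are hypothetical.

```python
# The text gives 3 as the depth of target pages.
TARGET_DEPTH = 3


def route_page(page, page_store, extractor_queue):
    """Store a qualifying body-text page, else hand it to the extractor.

    page: dict with at least "url", "depth", and "is_body_text" keys
    (an assumed representation).
    """
    if page["is_body_text"] and page["depth"] <= TARGET_DEPTH:
        page_store.append(page)
        return "stored"
    extractor_queue.append(page)
    return "extract"


store, to_extract = [], []
r1 = route_page({"url": "http://h.example/hotel/42", "depth": 3,
                 "is_body_text": True}, store, to_extract)
r2 = route_page({"url": "http://h.example/list?p=2", "depth": 2,
                 "is_body_text": False}, store, to_extract)
```

Pages routed to the extractor yield further URLs (page-turning links and lower-level data pages), which re-enter the URL database.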
The task partitioning algorithm is as follows:
In a distributed crawler system, balanced allocation of crawl tasks is one of the key issues affecting system performance and resource allocation. Current distributed crawler systems mostly use centralized task partitioning or partitioning based on two-level hash mapping. Both strategies only solve the problem of even distribution and take no account of URL priority or child node load. The task partitioning strategy of a topic crawler should take into account both URL queue ordering and balanced scheduling based on child node load. For this system architecture, the task partitioning algorithm comprises a two-level URL weighted-transfer sorting algorithm and a hash-based weighted least-connection URL scheduling method.
A two-level weighted-transfer sorting algorithm is designed for the URL queues in the central node and the child nodes. At the central node level, the queue consists mainly of URL seeds for different topics, and the seed attributes that affect crawl quality are site importance, crawl frequency, and search depth. A seed is the authority page of its site for the topic, so its PageRank can be mapped to the site's influence in the topic domain. PageRank is evaluated with the network-topology-based PageRank algorithm as the standard. The PR values that Google gives for pages are integers with a theoretical range of 0 to 10, but statistics show that most pages have a PR below 7. For uniform normalization, the site importance factor is therefore computed with a linear function: $\alpha_i$ is the ratio of the difference between the seed's PR and the minimum PR in the URL queue to the difference between the maximum and minimum PR:
$$\alpha_i = \frac{PR_i - \mathrm{Low}(PR)}{\mathrm{Top}(PR) - \mathrm{Low}(PR)}$$
Search depth is the level assigned to a page in the best-first strategy. There are 3 levels in total: a seed with a Hidden Web form has depth 1, a data-intensive page with a directory-block structure has depth 2, and a content-oriented target page has depth 3. The search depth weight factor $\beta_i$ is the reciprocal of the page's own depth $L_i$:
$$\beta_i = 1/L_i$$
Crawl frequency is the influence factor corresponding to the update interval that the administrator sets according to front-end search demand and the update strategy: the shorter the update interval, the higher the crawl frequency and the higher the seed's priority. Crawl frequency is divided into seed frequency and topic frequency. Setting the topic frequency according to the nature of the topic is required; if the seed frequency is not set, it inherits the topic frequency. Sample crawl intervals vary widely and jump sharply. The ticket transfer module, for example, demands timeliness, so its topic interval is 15 min, whereas hotel reservation prices change little, so its interval is measured in days. Under a purely linear normalization, the frequency factor would show one value dominating followed by rapid decay, and this factor would become the deciding factor in the ordering, which is clearly unreasonable. After study and comparison, the result is first obtained with a linear normalization function, then weighted, and finally passed through the Sigmoid function for uniform treatment. The Sigmoid function is smooth and strictly monotonic, with values in the range (0.5, 1). The calculation is:
$$x_i = \frac{F_i - \mathrm{Low}}{\mathrm{Top} - \mathrm{Low}}$$
$$\gamma_i = \frac{1}{1 + e^{-a x_i}}$$
where $F_i$ is the crawl frequency of the seed; Low and Top are the minimum and maximum frequency in the queue, respectively; $x_i$ is the normalized crawl frequency; and $\gamma_i$ is the crawl frequency influence factor. The value $a$ is greater than 1 and is the weighting factor applied to the linearly normalized result; its purpose is to amplify the result of the first step. Judging from the Sigmoid function curve, $a$ is set to 2.5 in the system. The priority weight of a seed is therefore the arithmetic mean of the three influence factors:
$$Q_i = \frac{\alpha_i + \beta_i + \gamma_i}{3}$$
Seeds are then sorted in descending order of $Q_i$. Because the number of seeds in the central node queue is limited, insertion sort is used, which saves memory while taking about the same time as other sorting algorithms. The URL queue in a child node inherits the URL weight algorithm of the central node. Analysis shows that an extractor-guided crawler only crawls within the site defined by its seed, so the crawl frequency and site importance factors in Q stay constant and only the search depth factor changes. The weight is computed as:
$$Q = Q_{prev} - \frac{\beta_{prev}}{3} + \frac{\beta}{3}$$
where $Q_{prev}$ is the weight passed down from the parent URL, $\beta_{prev}$ is the search depth factor of the parent URL, and $\beta$ is the search depth factor of the target URL. Child node queues hold many URLs, so a bisection sort is used, trading space for efficiency. Theoretical analysis and practical testing show that URL weights are distributed evenly and smoothly between 0 and 1. This avoids the situation in which the sharp decay of one factor lets a single factor dominate, while taking into account the three main factors of target object, crawl strategy, and search depth, so that priority differences are expressed at fine granularity. One special case the algorithm handles is a request passed in from a searcher: its priority is the highest, Q is set to 1, and the weight does not decay as it is passed on.
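The two-level weighting can be illustrated with a short Python sketch. It assumes the formulas the text describes in words: linear normalization for site importance and crawl frequency, a Sigmoid with weighting factor a = 2.5, the arithmetic mean of the three factors, and a child weight that differs from the parent's only in the depth term. The function names and sample values are hypothetical.

```python
import math

A = 2.5  # weighting factor given in the text


def linear_norm(value, low, top):
    """Map value from [low, top] to [0, 1]; 0.0 if the range is empty."""
    if top == low:
        return 0.0
    return (value - low) / (top - low)


def seed_priority(pr, depth, freq, pr_range, freq_range, a=A):
    """Priority Q of a seed at the central node.

    pr_range and freq_range are (minimum, maximum) over the seed queue.
    """
    alpha = linear_norm(pr, *pr_range)       # site importance factor
    beta = 1.0 / depth                       # search depth factor
    x = linear_norm(freq, *freq_range)       # normalized crawl frequency
    gamma = 1.0 / (1.0 + math.exp(-a * x))   # Sigmoid-smoothed frequency
    return (alpha + beta + gamma) / 3.0


def child_priority(q_prev, depth_prev, depth):
    """Priority of a URL found on a page, derived from its parent's Q.

    Only the depth term changes, so one third of the old depth factor is
    swapped for one third of the new one.
    """
    return q_prev + (1.0 / depth - 1.0 / depth_prev) / 3.0


q_seed = seed_priority(pr=7, depth=1, freq=4,
                       pr_range=(3, 7), freq_range=(0, 4))
q_child = child_priority(0.9, depth_prev=1, depth=2)
```

A request passed in directly from a searcher would bypass these functions and simply be given Q = 1, as the text states.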
The central node queue is partitioned with the weighted least-connection scheduling (Weighted Least-Connection Scheduling) algorithm. Each child node has a weight that represents its processing capacity. The default weight is 1, and the system administrator can adjust the weights of the child node servers dynamically. When scheduling a new connection, weighted least-connection scheduling tries to keep each server's number of established connections proportional to its weight.
Tasks are assigned as follows. Suppose there is a set of child node servers $S = \{S_0, S_1, \dots, S_{n-1}\}$, where $W(S_i)$ is the weight of server $S_i$ and $C(S_i)$ is the current number of connections of child node server $S_i$. The total number of current connections over all child node servers is $CSUM = \sum C(S_i)$ $(i = 0, 1, \dots, n-1)$.
The current new connection request (the seed) is sent to child node server $S_m$ if and only if $S_m$ satisfies:
$$\frac{C(S_m)/CSUM}{W(S_m)} = \min_{i}\left\{\frac{C(S_i)/CSUM}{W(S_i)}\right\}, \quad i = 0, 1, \dots, n-1$$
where $W(S_i)$ is not 0. Child node logs are fed back to the central node at regular intervals, and the child server connection count $C(S_i)$ is obtained by reading the logs. The invention compares the ratio of each child node's connection count to its prior weight, finds the least-loaded child node, and assigns it the new crawl task.
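The selection rule can be sketched in a few lines of Python. The function name `pick_child_node` and the sample data are hypothetical. Dividing every connection count by the same total CSUM does not change which node has the smallest ratio, so the sketch compares C(Si)/W(Si) directly.

```python
def pick_child_node(connections, weights):
    """Return the child node with the lowest connections-to-weight ratio.

    connections: dict node -> current connection count C(Si), e.g. as
    read from the logs the child nodes feed back.
    weights: dict node -> weight W(Si); nodes with weight 0 are skipped.
    """
    best_node = None
    best_ratio = None
    for node, weight in weights.items():
        if weight == 0:
            continue  # a zero weight takes the node out of scheduling
        ratio = connections.get(node, 0) / weight
        if best_ratio is None or ratio < best_ratio:
            best_node, best_ratio = node, ratio
    return best_node


connections = {"s0": 10, "s1": 12, "s2": 3}
weights = {"s0": 1, "s1": 3, "s2": 0}
chosen = pick_child_node(connections, weights)
```

Here `s1` carries the most connections but also has three times the default weight, so its ratio (4) is lower than that of `s0` (10) and it receives the next crawl task; `s2` is excluded because its weight is 0.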
Parts of the invention not described in detail belong to techniques well known in the art.
The above is a further detailed description of the invention in connection with specific preferred embodiments, and the specific embodiments of the invention should not be taken as limited to it. A person of ordinary skill in the art to which the invention belongs can make simple deductions or substitutions without departing from the inventive concept, and all of these should be regarded as falling within the scope of patent protection determined by the submitted claims.
Claims (1)
1. a distributed network crawler system, it is characterised in that including: management door, Centroid server, distributed child node server;Management door is the web interface that manager is provided by crawler system, Centroid server and the daily record of distributed child node server can be checked, interpolation theme is set, update the URL seed of certain theme, the crawl frequency parameter of configuration theme, controls the state of reptile;Centroid server and distributed child node server are the main body of system as reptile, complete theme operation, the study of data pick-up device, page analysis and the storage of target pages;
(1) the central node server comprises a URL controller, an extractor module, and a topic control module;
the topic control module receives, through the management interface, the data sent by the management portal, including topic description data, add and delete operation data, and data controlling the topic crawl frequency; it carries out topic-related operations, including describing, adding, and deleting topics; it controls the topic crawl frequency; and it edits the seed queue of each topic and sends the topic seed queue to the extractor module and the URL controller;
the extractor module, after receiving the topic seed queue, first uses a basic analyzer to classify the web pages represented by the URLs in the seed queue into Deep Web pages and data-intensive pages, then processes the two kinds of pages separately to find the data extractor corresponding to each type, then records the correspondence between each URL and its data extractor, and sends the records to the URL controller;
the URL controller receives the seed queue sent by the topic control module and the records of URLs and corresponding data extractors sent by the extractor module, and integrates the two sets of data: each URL is mapped to its corresponding data extractor on the distributed child node servers, and a URL with no corresponding data extractor is mapped to the general extractor; all URLs are queued, and the tasks are sent to the distributed child node servers by the task partitioning method;
(2) each distributed child node server comprises a child node URL controller, a data extractor, a search controller, and a web page fetcher;
the child node URL controller receives the seed URLs and the corresponding data extractor information sent by the central node server; after receiving a URL it first checks the URL for duplicates, then places URLs that have not already been collected into a queue, and sends the URLs in the queue and the corresponding data extractor information to the data extractor and the web page fetcher;
the data extractor performs page analysis on Deep Web pages from the child node URL queue and extracts the URL forms in the page to construct new URLs, which are equivalent to the targets reached after form submission, and passes them to the web page fetcher; after receiving a page sent by the search controller, it uses the extractor corresponding to the page URL to extract the content and the URLs in the page, then places those URLs in the URL repository to await collection;
the web page fetcher receives the URLs sent by the child node URL controller and the data extractor, fetches the web pages, and provides the fetched pages to the search controller;
the search controller analyzes the collected pages it receives; pages that meet the requirements are saved to the page repository, and otherwise the page is passed to the data extractor;
the task partitioning method is implemented as follows:
(21) calculate the ratio αi of the difference between the seed's PR and the minimum PR in the URL queue to the difference between the maximum PR and the minimum PR:
$$\alpha_i = \frac{PR_i - \mathrm{low}(PR)}{\mathrm{top}(PR) - \mathrm{low}(PR)}$$
where PR is the web page rank, i = 0, 1, ..., n-1, n is the number of child nodes, and top(PR) and low(PR) are the maximum and minimum PR respectively;
(22) calculate the weight of the search depth; the target page depth of the context guide is 3, and the search depth influence factor βi is the reciprocal of the seed's own depth Li:
$$\beta_i = 1/L_i$$
(23) calculate the crawl frequency factor using the Sigmoid function; the crawl frequency xi is calculated as follows:
where Fi is the crawl frequency of the seed, and Top(F) and Low(F) are the maximum and minimum frequency in the queue respectively;
the crawl frequency influence factor γi is calculated as:
where a is greater than 1 and is the weighting factor applied to the linearly smoothed result, the aim being to amplify the result of the first step;
(24) judging from the Sigmoid function curve, the priority weight of the seed is the arithmetic mean Qi of the 3 influence factors:
$$Q_i = \frac{\alpha_i + \beta_i + \gamma_i}{3}$$
(25) then sort in descending order by Qi; when the two factors of crawl frequency and website importance are constant, Q varies only with the search depth factor and is calculated as follows:
where Qprev is the weight passed down from the parent URL; βprev is the search depth factor of the parent URL; β is the search depth factor of the target URL;
(26) the new crawl task is assigned as follows: assume a group of child node servers S = {S0, S1, ..., Sn-1}, where W(Si) represents the weight of child node server Si, C(Si) represents the current number of connections of child node server Si, and the total number of current connections across all child node servers is CSUM = Σ C(Si), i = 0, 1, ..., n-1;
a new connection request is sent to child node server Sm, and the seed is forwarded to it, if and only if child node server Sm satisfies the following condition:
$$\frac{C(S_m)/CSUM}{W(S_m)} = \min_{i}\left\{\frac{C(S_i)/CSUM}{W(S_i)}\right\}, \quad i = 0, 1, \ldots, n-1$$
where W(Si) is not 0; the logs of the child nodes are fed back to the central node server at regular intervals, and the connection count C(Si) of each child node server is obtained by reading these logs; the ratio of each distributed child node server's connection count to its a priori weight is compared, the child node with the minimum load is selected, and the new crawl task is assigned to it.
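For illustration, steps (21), (22), (24), and the descending sort of step (25) might be sketched in Python as below. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Seed` class, the function name `priority_weights`, and the example URLs are hypothetical; the crawl frequency factor γi is taken as a precomputed input because its formula is not reproduced in the text above; and αi is set to 0 when every seed has the same PR, a choice the text does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Seed:
    url: str
    pr: float     # web page rank of the seed
    depth: int    # search depth L_i, assumed >= 1
    gamma: float  # crawl frequency factor, assumed precomputed by the caller


def priority_weights(seeds: List[Seed]) -> List[Tuple[str, float]]:
    """Return (url, Q_i) pairs sorted by Q_i in descending order."""
    if not seeds:
        return []
    low = min(s.pr for s in seeds)
    top = max(s.pr for s in seeds)
    span = top - low
    scored = []
    for s in seeds:
        # Step (21): min-max ratio of the seed's PR within the queue.
        # Assumption: alpha is 0 when all PR values are equal.
        alpha = (s.pr - low) / span if span else 0.0
        # Step (22): reciprocal of the search depth.
        beta = 1.0 / s.depth
        # Step (24): arithmetic mean of the three influence factors.
        q = (alpha + beta + s.gamma) / 3.0
        scored.append((s.url, q))
    # Step (25): sort in descending order of Q_i.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


if __name__ == "__main__":
    queue = [
        Seed("http://a.example", pr=1.0, depth=1, gamma=0.5),
        Seed("http://b.example", pr=3.0, depth=2, gamma=0.5),
    ]
    for url, q in priority_weights(queue):
        print(url, round(q, 3))
```

Keeping γi as an input means the sketch stays valid whichever Sigmoid parameterization is used for the crawl frequency factor.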
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310274951.3A CN103310012B (en) | 2013-07-02 | 2013-07-02 | A kind of distributed network crawler system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310274951.3A CN103310012B (en) | 2013-07-02 | 2013-07-02 | A kind of distributed network crawler system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103310012A (en) | 2013-09-18 |
CN103310012B (en) | 2016-09-28 |
Family
ID=49135230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310274951.3A Active CN103310012B (en) | 2013-07-02 | 2013-07-02 | A kind of distributed network crawler system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103310012B (en) |
Families Citing this family (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103559219B (en) * | 2013-10-18 | 2016-12-07 | 北京京东尚科信息技术有限公司 | Distributed network crawler capturing method for scheduling task, dispatching terminal equipment and crawl node |
CN103605670B (en) * | 2013-10-29 | 2017-03-29 | 北京奇虎科技有限公司 | A kind of method and apparatus for determining the crawl frequency of network resource point |
CN104778164B (en) * | 2014-01-09 | 2018-01-30 | 中国银联股份有限公司 | Detection repeats URL method and device |
CN103997524A (en) * | 2014-05-21 | 2014-08-20 | 浪潮电子信息产业股份有限公司 | Distributed type modularized web crawler with high availability and extendibility |
CN104199893B (en) * | 2014-08-25 | 2018-01-30 | 成都华栖云科技有限公司 | A kind of system and method for quickly issuing full media content |
CN105656707B (en) * | 2014-11-18 | 2019-03-26 | 阿里巴巴集团控股有限公司 | A kind of method and system of test network crawler |
CN104572901B (en) * | 2014-12-25 | 2018-12-18 | 小米科技有限责任公司 | The method for down loading and device of web data |
CN104699757B (en) * | 2015-01-15 | 2018-03-13 | 南京邮电大学 | Distributed network information acquisition method under cloud environment |
CN105989151B (en) * | 2015-03-02 | 2019-09-06 | 阿里巴巴集团控股有限公司 | Webpage capture method and device |
CN104866555A (en) * | 2015-05-15 | 2015-08-26 | 浪潮软件集团有限公司 | Automatic acquisition method based on web crawler |
CN106294393A (en) * | 2015-05-20 | 2017-01-04 | 天脉聚源(北京)科技有限公司 | A kind of method and system of web search |
CN105045838A (en) * | 2015-07-01 | 2015-11-11 | 华东师范大学 | Network crawler system based on distributed storage system |
CN106570011B (en) * | 2015-10-09 | 2021-01-26 | 北京京东尚科信息技术有限公司 | Distributed crawler URL seed distribution method, scheduling node and capturing node |
CN105577684B (en) * | 2016-01-25 | 2018-09-28 | 北京京东尚科信息技术有限公司 | Method, server-side, client and the system of anti-crawler capturing |
CN107180050A (en) * | 2016-03-11 | 2017-09-19 | 精硕科技(北京)股份有限公司 | A kind of data grabber system and method |
CN105843965B (en) * | 2016-04-20 | 2019-06-04 | 广东精点数据科技股份有限公司 | A kind of Deep Web Crawler form filling method and apparatus based on URL subject classification |
CN106294822A (en) * | 2016-08-17 | 2017-01-04 | 国网上海市电力公司 | A kind of electric power data visualization system |
CN106572026B (en) * | 2016-10-28 | 2020-04-10 | 上海斐讯数据通信技术有限公司 | SDN-based load balancing method, device and system |
CN106776934B (en) * | 2016-11-30 | 2021-03-26 | 努比亚技术有限公司 | Mobile terminal and implementation method of web crawler |
CN106844475A (en) * | 2016-12-23 | 2017-06-13 | 北京奇虎科技有限公司 | It is determined that the method and device of hiding URL |
CN106874487B (en) * | 2017-02-21 | 2020-08-18 | 国信优易数据有限公司 | Distributed crawler management system and method thereof |
CN106803167A (en) * | 2017-02-28 | 2017-06-06 | 深圳海带宝网络科技股份有限公司 | A kind of cross-border electric business whole world goods clear customs system |
CN107066530A (en) * | 2017-03-01 | 2017-08-18 | 苏州朗动网络科技有限公司 | A kind of data refresh system and method for refreshing data |
CN106934027A (en) * | 2017-03-14 | 2017-07-07 | 深圳市博信诺达经贸咨询有限公司 | Distributed reptile realization method and system |
CN107092826B (en) * | 2017-03-24 | 2020-02-21 | 北京国舜科技股份有限公司 | Webpage content safety real-time monitoring method |
CN107241319B (en) * | 2017-05-26 | 2020-06-02 | 山东省科学院情报研究所 | Distributed network crawler system based on VPN and scheduling method |
CN107423382A (en) * | 2017-07-13 | 2017-12-01 | 中国物品编码中心 | network crawling method and device |
CN107562956A (en) * | 2017-09-30 | 2018-01-09 | 麦格创科技(深圳)有限公司 | Distributed reptile method for allocating tasks and system |
CN110309389A (en) * | 2018-03-14 | 2019-10-08 | 北京嘀嘀无限科技发展有限公司 | Cloud computing system |
CN109101521A (en) * | 2018-06-12 | 2018-12-28 | 江苏开拓信息与系统有限公司 | The automatic extraction system of data based on big data |
CN108959524A (en) * | 2018-06-28 | 2018-12-07 | 中译语通科技股份有限公司 | A kind of method, system and information data processing terminal identifying data crawler |
CN111092921B (en) * | 2018-10-24 | 2022-05-10 | 北大方正集团有限公司 | Data acquisition method, device and storage medium |
CN109548752B (en) * | 2018-11-16 | 2021-09-07 | 深圳市鑫稻田农业技术科技有限公司 | Forestry-based multifunctional storage device used during night scorpion catching |
CN109902220B (en) * | 2019-02-27 | 2023-11-24 | 腾讯科技(深圳)有限公司 | Webpage information acquisition method, device and computer readable storage medium |
CN111382332B (en) * | 2019-04-02 | 2021-12-17 | 江苏省地震局 | Earthquake disaster information processing method and system |
CN110245280B (en) * | 2019-05-06 | 2021-03-02 | 北京三快在线科技有限公司 | Method and device for identifying web crawler, storage medium and electronic equipment |
CN110532453B (en) * | 2019-08-12 | 2022-07-22 | 北京智游网安科技有限公司 | Method for adjusting crawler updating frequency, storage medium and crawler server |
CN111488507B (en) * | 2020-04-09 | 2023-05-23 | 西安影视数据评估中心有限公司 | Optimization method of network proxy |
CN111428115A (en) * | 2020-04-16 | 2020-07-17 | 行吟信息科技(上海)有限公司 | Webpage information processing method and device |
CN112035721A (en) * | 2020-07-22 | 2020-12-04 | 大箴(杭州)科技有限公司 | Crawler cluster monitoring method and device, storage medium and computer equipment |
CN114095207A (en) * | 2021-10-26 | 2022-02-25 | 北京连星科技有限公司 | IPv6 website detection method based on distributed scheduling |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8346763B2 (en) * | 2007-03-30 | 2013-01-01 | Microsoft Corporation | Ranking method using hyperlinks in blogs |
CN101561814B (en) * | 2009-05-08 | 2012-05-09 | 华中科技大学 | Topic crawler system based on social labels |
CN101630327A (en) * | 2009-08-14 | 2010-01-20 | 昆明理工大学 | Design method of theme network crawler system |
CN102646129B (en) * | 2012-03-09 | 2013-12-04 | 武汉大学 | Topic-relative distributed web crawler system |
Also Published As
Publication number | Publication date |
---|---|
CN103310012A (en) | 2013-09-18 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | C14 | Grant of patent or utility model | |
| | GR01 | Patent grant | |
| | TR01 | Transfer of patent right | |
| | TR01 | Transfer of patent right | Effective date of registration: 20220624. Address after: Room 2101, block D, Zhizhen building, No. 7, Zhichun Road, Haidian District, Beijing 100191. Patentee after: HUIKE EDUCATION TECHNOLOGY GROUP Co.,Ltd. Address before: 100191 No. 37, Haidian District, Beijing, Xueyuan Road. Patentee before: BEIHANG University |