CN103310012B - A distributed network crawler system - Google Patents

A distributed network crawler system

Info

Publication number: CN103310012B
Application number: CN201310274951.3A
Authority: CN (China)
Prior art keywords: url, child node, theme, page, server
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Other languages: Chinese (zh)
Other versions: CN103310012A (en)
Inventors: 王宝会, 于雷, 王丽华, 王新河, 尹科
Current Assignee: Huike Education Technology Group Co ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Beihang University

Application filed by Beihang University
Priority to CN201310274951.3A
Publication of CN103310012A
Application granted
Publication of CN103310012B

Abstract

A distributed network crawler system, applicable to the field of network information collection, comprising a management portal, a central node server, and distributed child node servers. The management portal is the web interface that the crawler system provides to administrators; through it an administrator can view the logs of the central node server and the distributed child node servers, add topics, update the URL seeds of a topic, configure a topic's crawl frequency parameters, and control the state of the crawler. The central node server and the distributed child node servers form the main body of the system and carry out topic operations, training of the data extractors, page analysis, and storage of target pages. The invention allows one crawler to accommodate crawls of different topics and addresses the problem that the speed and quality of web page crawling cannot meet user requirements.

Description

A distributed network crawler system
Technical field
The present invention relates to a distributed network crawler system and belongs to the field of network information collection.
Background
The rapid development of the Internet has brought explosive growth in the volume of web information. Traditional general-purpose search engines, as tools for retrieving Internet information, have become increasingly important, but their inherent limitations, such as low network coverage and a high miss rate, mean that they cannot provide users with fully comprehensive information. To overcome these shortcomings, topic search engines emerged; their goal is to give users the most accurate results in the field they care about while consuming limited bandwidth and hardware resources.
The topic crawler is the foundation of a topic search engine, and the speed and quality with which it crawls web pages are important indicators of search engine quality. It is a system that automatically downloads web pages within a restricted domain, screening and fetching pages according to a priority order and their relevance to the topic. Unlike a general crawler, a topic crawler does not pursue high coverage but selectively fetches topic-related pages, which gives it low resource usage, convenient index database updates, and accurately cached pages.
However, the existing technology can neither judge the relevance of a page to a topic nor accommodate crawls of different topics in a single crawler system, so the speed and quality of web page crawling cannot meet user requirements.
Summary of the invention
Technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a distributed network crawler system in which one crawler accommodates crawls of different topics, addressing the problem that the speed and quality of web page crawling cannot meet user requirements.
Technical solution of the invention: a distributed network crawler system comprising a management portal, a central node server, and distributed child node servers. The management portal is the web interface that the crawler system provides to administrators; through it an administrator can view the logs of the central node server and the distributed child node servers, add topics, update the URL seeds of a topic, configure a topic's crawl frequency parameters, and control the state of the crawler. The central node server and the distributed child node servers form the main body of the system and carry out topic operations, training of the data extractors, page analysis, and storage of target pages.
(1) The central node server comprises a URL controller, an extractor module, and a topic control module.
Topic control module: receives, through the management interface, the data sent by the management portal, including topic description data, add and delete operation data, and data that controls topic crawl frequency. It carries out topic operations (describing, adding, and deleting topics), controls topic crawl frequency, and edits the seed queue of each topic, which it sends to the extractor module and the URL controller.
Extractor module: after receiving a topic seed queue, it first uses the basic analyzer to classify the web pages represented by the URLs in the seed queue into Deep Web pages and data-intensive pages, and then performs extraction analysis on each of the two kinds of page. After analysis it finds the data extractor corresponding to each type, records each URL together with its corresponding data extractor, and sends the records to the URL controller.
URL controller: receives the seed queue sent by the topic control module and the URL and extractor records sent by the extractor module, and integrates the two so that each URL is paired with its corresponding data extractor; a URL with no corresponding extractor is paired with the general extractor. It then queues all URLs and, using the task splitting method, sends tasks to the distributed child nodes of the crawler.
(2) The distributed child node server comprises a child node URL controller, a data extractor, a search controller, and a web page fetcher.
Child node URL controller: receives the seed URLs and corresponding data extractor information sent by the central node server. After receiving the URLs it first checks them for duplicates, then places the URLs that have not already been collected into a queue, and sends the queued URLs and the corresponding data extractor information to the data extractor and the web page fetcher.
Data extractor: analyzes the Deep Web pages from the child node URL queue and extracts URLs from the page to form new URLs, each equivalent to the object obtained after a form is submitted, which it passes to the web page fetcher. After receiving a page sent by the search controller, it uses the data extractor corresponding to the page's URL to extract the content and the URLs in the page, then places those URLs in the URL address library to await collection.
Web page fetcher: receives the URLs sent by the child node URL controller and the data extractor and fetches the corresponding web pages; the fetched pages are supplied to the search controller.
Search controller: analyzes the pages it receives, saves the pages that meet the requirements into the page repository, and otherwise passes the page to the data extractor.
The task splitting method uses weighted least-connection scheduling and is implemented as follows:
(1) Compute the ratio $\alpha_i$ of the difference between the seed's PR and the minimum PR in the URL queue to the difference between the maximum and minimum PR:
$$\alpha_i = \frac{PR_i - \mathrm{low}(PR)}{\mathrm{top}(PR) - \mathrm{low}(PR)}$$
PR is the page rank, $i = 0, 1, \ldots, n-1$, and $\mathrm{low}(PR)$ and $\mathrm{top}(PR)$ are the minimum and maximum PR values respectively.
(2) Compute the search depth weight. Target body-text pages have depth 3, and the search depth influence factor $\beta_i$ is the reciprocal of the page's own depth $L_i$:
$$\beta_i = \frac{1}{L_i}$$
(3) Compute the crawl frequency factor using the Sigmoid function, which is smooth and strictly monotonic with a value range of (0.5, 1). The normalized crawl frequency $x_i$ is computed as:
$$x_i = \frac{F_i - \mathrm{low}(F)}{\mathrm{top}(F) - \mathrm{low}(F)}$$
where $F_i$ is the crawl frequency of the seed, and $\mathrm{low}(F)$ and $\mathrm{top}(F)$ are the minimum and maximum frequencies in the queue respectively.
The crawl frequency influence factor $\gamma_i$ is computed as:
$$\gamma_i = \frac{1}{1 + e^{-a x_i}}$$
The value $a$ is greater than 1; it is a weighting factor applied to the linearly normalized result, intended to amplify the result of the first step.
(4) Judging from the Sigmoid function curve, $a$ is set to 2.5 in the system of the invention. The priority weight of a seed is therefore the arithmetic mean of the 3 influence factors:
$$Q_i = \frac{\alpha_i + \beta_i + \gamma_i}{3}$$
(5) The seeds are then sorted in descending order of $Q_i$. The URL queue in a child node inherits the URL weighting algorithm of the central node. Because a crawler guided by an extractor crawls only within the website defined by the seed, the crawl frequency and site importance factors in Q are constant and only the search depth factor changes, so Q is computed as:
$$Q = Q_{prev} - \frac{\beta_{prev} - \beta}{3}$$
where $Q_{prev}$ is the weight passed down from the parent URL, $\beta_{prev}$ is the search depth factor of the parent URL, and $\beta$ is the search depth factor of the target URL. Because a child node queue holds many URLs, binary sorting is used, trading space for efficiency. Theoretical analysis and practical testing show that URL weights are distributed evenly and smoothly between 0 and 1. This avoids the situation in which the sharp decay of one factor makes a single factor decisive, takes into account the 3 main factors of target object, crawl strategy, and search depth, and reflects priority differences at fine granularity. One special case handled by the algorithm is a request passed in from the search front end: its priority is the highest, its Q value is set to 1, and it does not decay as it is passed on.
(6) Each child node has a weight that represents its processing capacity. The default weight is 1, and the system administrator can set server weights dynamically. When scheduling a new connection, weighted least-connection scheduling tries to keep each server's number of established connections proportional to its weight. The algorithm proceeds as follows: assume a group of servers $S = \{S_0, S_1, \ldots, S_{n-1}\}$, where $W(S_i)$ is the weight of server $S_i$ and $C(S_i)$ is its current number of connections. The total number of current connections over all servers is $\mathrm{CSUM} = \sum_{i=0}^{n-1} C(S_i)$.
The current new connection request (a seed) is sent to server $S_m$ if and only if $S_m$ satisfies the following condition:
$$\frac{C(S_m)/\mathrm{CSUM}}{W(S_m)} = \min\left\{\frac{C(S_i)/\mathrm{CSUM}}{W(S_i)}\right\} \;\Rightarrow\; \frac{C(S_m)}{W(S_m)} = \min\left\{\frac{C(S_i)}{W(S_i)}\right\}$$
where $W(S_i)$ is not 0. Each child node feeds its log back to the central node at regular intervals, and the connection count $C(S_i)$ of a child server is obtained by reading the log. The method compares the ratio of each child node's connection count to its prior weight, finds the child node with the lowest load, and assigns it the new crawl task.
Compared with the prior art, the invention has the following advantages:
(1) The invention designs a search engine for multiple topics within a domain, comprising a series of topic search subsystems (such as air tickets and hotels) that share one crawler, so that one crawler accommodates crawls of different topics. The invention also provides a new task splitting algorithm for task distribution in a distributed crawler. The existing literature on this architecture gives only summary descriptions and does not actually solve problems such as URL allocation and algorithm compatibility that may arise when multiple topics coexist; the invention solves these problems.
(2) The architecture of the invention uses a multi-topic strategy based on classification annotation, which solves the problem of adaptive compatibility of multiple topics within one crawler system. The two-level weighted task splitting algorithm solves goal-oriented, load-balanced URL allocation and improves the scalability of the system.
(3) The improved URL storage strategy proposed by the invention supports efficient URL query, insertion, and duplicate detection. The topic search system developed with this system gives users rich, topic-specific input interfaces and returns accurate structured content; its crawler uses an architecture based on data extractors. The existing literature on this architecture gives only summary descriptions and does not actually solve problems such as URL allocation and algorithm compatibility that may arise when multiple topics coexist.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall architecture of the distributed crawler system of the invention;
Fig. 2 is an architecture diagram of the central node server in the invention;
Fig. 3 is an architecture diagram of the distributed child node server in the invention.
Detailed description
The system uses a distributed architecture based on data extractors and consists of one central master node and distributed crawler servers that work together cooperatively; the overall architecture is shown in Fig. 1.
As shown in Fig. 1, the invention mainly consists of the following modules:
1. Management portal
The management portal is the web interface that the crawler system provides to administrators. Through it an administrator can view the logs of the central and child servers, add topics, update the URL seeds of a topic, configure parameters such as a topic's crawl frequency, and control the state of the crawler. The central node and the distributed crawlers form the main body of the system and carry out topic operations, training of the data extractors, page analysis, and storage of target pages.
2. Central node server
The central master node of the crawler is the control hub. It mainly comprises the URL controller, the extractor module, and the topic control module, as shown in Fig. 2. The functions of the three modules are as follows:
(1) Topic control module
The topic control module carries out topic operations (describing, adding, and deleting topics), controls topic crawl frequency, and edits the seed queue of each topic. The seed queue is made up of authority pages for the topic, that is, pages that are comparatively representative within the topic and can serve as starting positions for a series of target information. For a hotel search topic crawler, for example, the authority page is the query page containing a Form on a room-booking site, or the start page of its hotel information list. A general search engine is first used to search the descriptive text of the topic, which yields an expanded page set for the topic; because the number of pages is limited, the authority-page seed queue is then obtained by manual screening.
(2) Extractor module
Using a content-based web page analysis algorithm and starting from the URL seeds, training produces data extractors for the authority sites that the seeds represent. Seeds that meet the requirements of the previous module fall into 2 main classes: Deep Web pages and data-intensive pages. A basic classifier using memory features can distinguish the 2 kinds of page. For Deep Web pages, a domain-specific, instance-based query probing method, improved with a travel-domain dictionary, is used to match the most complete interface inputs. For the structured features of data-intensive pages, a strategy of page blocking and directory discovery is used to extract the URLs of lower-level pages. Through this process a suitable data extractor (parser path and search depth) is found for each URL seed, and during crawling on the child nodes this model guides the page parsing of the target site that the seed represents.
(3) URL controller
The URL controller is mainly responsible for ordering the URL queue in the central node and for splitting tasks according to the load feedback from each child node. Because a two-level URL allocation strategy is used, the central node server stores only seed URLs. The sorting algorithm determines priority from the topic crawl frequency and the weight of the website that the seed represents, and calculates the concurrency required per unit time. Task splitting uses weighted least-connection scheduling.
The central node server works as follows:
(1) The topic control module receives, through the management interface, the data sent by the management portal (topic description data, add and delete operation data, and data that controls topic crawl frequency). The module carries out topic operations (describing, adding, and deleting topics), controls topic crawl frequency, and edits the seed queue of each topic. It sends the topic seed queue to the extractor module and the URL controller. (Note: a seed queue is a URL queue, that is, a set of URLs, but the URLs in a seed queue are special: they are generally the home page URL of a target website to be collected, or the home page URL of a site section to be collected.)
(2) After receiving a topic seed queue, the extractor module first uses the basic analyzer to classify the web pages represented by the URLs in the seed queue (classification begins by accessing the page that the URL points to and collecting its content, after which the subsequent classification is carried out) into Deep Web pages and data-intensive pages, and then performs extraction analysis on each of the two kinds of page (see the description of the extractor module above for details). After analysis it finds the data extractor corresponding to each type, records each URL together with its corresponding data extractor, and sends the records to the URL controller.
(3) The URL controller receives the seed queue sent by the topic control module and the URL and extractor records sent by the extractor module, and integrates the two so that each URL is paired with its corresponding data extractor; a URL with no corresponding extractor is paired with the general extractor. It queues all URLs and, using the task splitting method, sends tasks to the distributed child nodes of the crawler.
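The integration step in (3) can be illustrated with a minimal sketch. It assumes the seed queue is a list of URL strings and the extractor records form a mapping from URL to extractor name; the function name `integrate` and the label `"general"` are hypothetical and do not come from the patent.

```python
# Hypothetical label for the fallback extractor; the patent only says that
# URLs without a matching extractor use "the general extractor".
GENERAL_EXTRACTOR = "general"


def integrate(seed_queue, extractor_records):
    """Pair each seed URL with its extractor, keeping the queue order.

    seed_queue: list of URL strings from the topic control module.
    extractor_records: dict mapping URL -> extractor name, as produced
    by the extractor module.
    """
    return [
        (url, extractor_records.get(url, GENERAL_EXTRACTOR))
        for url in seed_queue
    ]
```

The resulting list of (URL, extractor) pairs is what would then be ordered and split among the child nodes.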
3. Distributed child node server
As shown in Fig. 3, the distributed child node server is what actually performs the crawling. It mainly comprises the child node URL controller, the data extractor, the search controller, and the web page fetcher.
A distributed child node of the crawler works as follows:
(1) The child node URL controller receives the seed URLs and corresponding data extractor information sent by the central node server. After receiving the URLs it first checks them for duplicates (many URL deduplication methods exist for Internet crawlers; because they are not the focus here they are not described in detail, and any general-purpose URL deduplication method can be used), then places the URLs that have not already been collected into a queue. It sends the queued URLs and the corresponding data extractor information to the data extractor module and the web page fetcher module.
(2) The data extractor analyzes the Deep Web pages from the child node URL queue and extracts URLs from the page to form new URLs, each equivalent to the object obtained after a form is submitted, which it passes to the web page fetcher.
(3) The web page fetcher module receives the URLs sent in the previous two steps and then fetches the web pages.
(4) The pages fetched by the web page fetcher are supplied to the search controller.
(5) The search controller receives the collected pages and analyzes them; pages that meet the requirements are saved into the page repository, and otherwise the page is passed to the data extractor.
(6) After receiving a page sent by the search controller, the data extractor uses the extractor corresponding to the page's URL to extract the content and the URLs in the page, then places those URLs in the URL address library to await collection.
Each module is described in detail below.
(1) Child node URL controller
The child node URL controller receives the seed URLs distributed by the central node and the URLs extracted from web pages and stores them in the URL database. Storage uses a Trie data structure, which allows duplicate detection and fast insertion of newly added URLs. The URL database applies an update policy per website, which ensures that content updates are not blocked by duplicate detection. Legal URLs that pass detection are prioritized by the two-level URL weighted-transmission sorting algorithm, which combines the weight passed on by the parent page with the depth in the search strategy, and are then passed to the web page fetcher.
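A minimal sketch of Trie-based duplicate detection with insertion in one pass is shown below. The patent does not specify the node layout or the per-website update policy, so the character-level nested-dict trie and the names `UrlTrie` and `add_if_new` are assumptions made for illustration only.

```python
_END = ""  # marker key; cannot collide with single-character keys


class UrlTrie:
    """Character-level trie holding the URLs that have already been seen."""

    def __init__(self):
        self._root = {}

    def add_if_new(self, url: str) -> bool:
        """Insert url and return True if it was new, False if a duplicate.

        Lookup and insertion share one walk over the characters of the URL,
        so the cost is proportional to the URL length, not the store size.
        """
        node = self._root
        for ch in url:
            node = node.setdefault(ch, {})
        if node.get(_END):
            return False
        node[_END] = True
        return True
```

Because URLs from the same site share long prefixes, a trie also stores them more compactly than a flat list of strings would.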
(2) Data extractor
Deep Web objects from the URL queue are processed with the query probing algorithm trained on the central node. After pattern-matched input of concrete parameters, new URLs are formed, each equivalent to the object obtained after a form is submitted, and passed to the web page fetcher. The other input of this module is pages that have been judged by the search controller; the URL search strategy guarantees that these are data-intensive pages. Using the trained page-block discovery algorithm, the module extracts the 2 kinds of URL of interest, namely pagination information and lower-level data page information, and sends them to the URL database.
(3) Web page fetcher
This is a multi-threaded parallel module responsible for fetching pages over the HTTP protocol. The basic steps are:
a. Extract the address and port number of the target site from the page URL and establish a network connection to that address and port.
b. Assemble the HTTP request header from the page URL and send it to the target site. If no response message is received within a certain time, stop fetching this page and discard it; otherwise continue to the next step.
c. Analyze the response message. If the returned status code is 2xx, the correct page has been returned; go to the next step. If the status code is 301 or 302, the page has been redirected; extract the new target URL from the response header and return to the previous step. If any other status code is returned, the page connection has failed; stop fetching this page and discard it.
d. Extract page information such as the date, length, and page type from the response header.
e. Read the content of the page. For long pages, read the content in chunks and then splice the chunks together to ensure the integrity of the page content.
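Steps a and c of the fetcher can be sketched without touching the network. The helper names `host_and_port` and `next_action` are hypothetical, the default ports (80 for http, 443 for https) are an assumption, and header lookup is simplified to an exact `"Location"` key.

```python
from urllib.parse import urljoin, urlsplit


def host_and_port(url):
    """Step a: extract the target host and port from a page URL."""
    parts = urlsplit(url)
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default_port


def next_action(url, status, headers):
    """Step c: decide what to do with a response.

    Returns ("store", url) for 2xx, ("redirect", new_url) for 301/302
    with a Location header, and ("discard", url) in every other case.
    """
    if 200 <= status < 300:
        return ("store", url)
    if status in (301, 302):
        location = headers.get("Location")
        if location:
            # Location may be relative, so resolve it against the page URL.
            return ("redirect", urljoin(url, location))
    return ("discard", url)
```

A real fetcher would also bound the number of redirects it follows for one page; the source does not mention such a limit, so none is shown here.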
(4) Search controller
The search strategy of the invention is a best-first search strategy improved for the concrete application. Analysis of Deep Web pages and directory-block pages shows that most target objects are body-text pages and that the crawl depth does not exceed 3 levels. This module judges the fetched page content according to the search strategy: body-text pages that satisfy the search depth are stored in the page repository to await structuring by the index module; otherwise the page is passed to the corresponding data extractor for page analysis and URL extraction.
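The routing decision can be written as a small function. This is only one reading of the paragraph above: it assumes "satisfies the search depth" means a depth of at most 3, and that some upstream classifier has already decided whether a page is a body-text page. The names `route_page`, `MAX_DEPTH`, and the two return labels are hypothetical.

```python
MAX_DEPTH = 3  # the source states that crawl depth does not exceed 3 levels


def route_page(depth, is_text_page):
    """Return where the search controller sends a fetched page.

    Body-text pages within the depth limit go to the page repository;
    everything else goes back to the data extractor for URL extraction.
    """
    if is_text_page and depth <= MAX_DEPTH:
        return "page_repository"
    return "data_extractor"
```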
The task splitting algorithm is as follows:
In a distributed crawler system, balanced assignment of crawl tasks is one of the key issues affecting system performance and resource allocation. Most current distributed crawler systems adopt either a centralized task splitting strategy or one based on two-level hash mapping. These 2 strategies merely solve the problem of even distribution and consider neither the influence of URL priority nor the load on the child nodes. The task splitting strategy of a topic crawler should take into account both the ordering of the URL queue and balanced scheduling based on child node load. For the architecture of this system, the task splitting algorithm comprises a two-level URL weighted-transmission sorting algorithm and a hash-based weighted least-connection URL scheduling method.
A two-level weighted-transmission sorting algorithm is designed for the URL queues in the central node and the child nodes. At the central node level, the URL queue consists mainly of the URL seeds of different topics, and the seed attributes that affect crawl quality are site importance, crawl frequency, and search depth. A seed is the authority page through which a topic is represented on the corresponding website, so its PageRank can be mapped to the influence of the website in the field of that topic. Site importance is evaluated with the PageRank algorithm, which is based on network topology. The page PR values given by Google are integers with a theoretical range of 0 to 10, but statistics show that most pages have a PR value below 7. For uniform normalization, the site importance factor is therefore computed with a linear function: it is the ratio $\alpha_i$ of the difference between the seed's PR and the minimum PR in the URL queue to the difference between the maximum and minimum PR:
$$\alpha_i = \frac{PR_i - \mathrm{low}(PR)}{\mathrm{top}(PR) - \mathrm{low}(PR)}$$
Search depth is the level assigned to a page in the best-first strategy, with 3 levels in total: a seed containing a hidden Web form has depth 1, a data-intensive page with a directory-block structure has depth 2, and a target body-text page has depth 3. The search depth influence factor $\beta_i$ is the reciprocal of the page's own depth $L_i$:
$$\beta_i = \frac{1}{L_i}$$
Crawl frequency is the influence factor corresponding to the time interval that the administrator sets according to front-end search demand and the update strategy: the shorter the update interval, the higher the crawl frequency and the higher the seed priority. Crawl frequency is divided into seed frequency and topic frequency. Setting a topic frequency according to the nature of the topic is mandatory, whereas a seed frequency need not be set; if it is not, the seed inherits the topic frequency. Sampled crawl intervals differ widely and jump sharply. For example, the ticket transfer module requires high timeliness, so its topic interval is set to 15 min, whereas hotel reservation prices change little, so its interval is set in days. If the values were normalized purely with a linear function, one value of the frequency factor would dominate and the rest would decay rapidly, so that this factor alone would decide the ordering, which is clearly unreasonable. After comparison, the approach adopted is to compute the linearly normalized result first, then weight it, and finally pass it through the Sigmoid function. The Sigmoid function is smooth and strictly monotonic with a value range of (0.5, 1). The computation is:
$$x_i = \frac{F_i - \mathrm{low}(F)}{\mathrm{top}(F) - \mathrm{low}(F)}$$
where $F_i$ is the crawl frequency of the seed, $\mathrm{low}(F)$ and $\mathrm{top}(F)$ are the minimum and maximum frequencies in the queue respectively, $x_i$ is the normalized crawl frequency, and $\gamma_i$ is the crawl frequency influence factor:
$$\gamma_i = \frac{1}{1 + e^{-a x_i}}$$
The value $a$ is greater than 1; it is a weighting factor applied to the linearly normalized result, intended to amplify the result of the first step. Judging from the Sigmoid function curve, $a$ is set to 2.5 in the system. The priority weight of a seed is therefore the arithmetic mean of the 3 influence factors:
$$Q_i = \frac{\alpha_i + \beta_i + \gamma_i}{3}$$
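The seed priority $Q_i = (\alpha_i + \beta_i + \gamma_i)/3$ can be computed as in the following sketch. The function names, the tuple form of the (minimum, maximum) ranges, and the choice to return 0 when all values in a queue are equal are assumptions made here; the source does not say how a zero range is handled or which unit the frequency uses.

```python
import math


def normalize(value, low, top):
    """Linear min-max normalization to [0, 1].

    Returning 0.0 when top == low is an assumption for this sketch.
    """
    return 0.0 if top == low else (value - low) / (top - low)


def seed_priority(pr, depth, freq, pr_range, freq_range, a=2.5):
    """Priority weight Q_i of one seed.

    pr: PageRank of the seed's site; depth: search depth L_i (1 to 3);
    freq: crawl frequency F_i; pr_range and freq_range: (low, top) over
    the queue; a: Sigmoid weighting factor (2.5 in the source).
    """
    alpha = normalize(pr, *pr_range)          # site importance factor
    beta = 1.0 / depth                        # search depth factor
    x = normalize(freq, *freq_range)          # normalized crawl frequency
    gamma = 1.0 / (1.0 + math.exp(-a * x))    # crawl frequency factor
    return (alpha + beta + gamma) / 3.0
```

With $a = 2.5$ and $x_i$ in [0, 1], $\gamma_i$ stays between 0.5 and roughly 0.924, so the frequency factor cannot dominate the other two.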
The seeds are then sorted in descending order of $Q_i$. Because the number of seeds in the central node queue is limited, insertion sort is used, which saves memory at a time cost similar to that of other sorting algorithms. The URL queue in a child node inherits the URL weighting algorithm of the central node. Analysis shows that a crawler guided by an extractor crawls only within the website defined by the seed, so the crawl frequency and site importance factors in Q are constant and only the search depth factor changes. Q is computed as:
$$Q = Q_{prev} - \frac{\beta_{prev} - \beta}{3}$$
where $Q_{prev}$ is the weight passed down from the parent URL, $\beta_{prev}$ is the search depth factor of the parent URL, and $\beta$ is the search depth factor of the target URL. Because a child node queue holds many URLs, binary sorting is used, trading space for efficiency. Theoretical analysis and practical testing show that URL weights are distributed evenly and smoothly between 0 and 1. This avoids the situation in which the sharp decay of one factor makes a single factor decisive, takes into account the 3 main factors of target object, crawl strategy, and search depth, and reflects priority differences at fine granularity. One special case handled by the algorithm is a request passed in from the search front end: its priority is the highest, its Q value is set to 1, and it does not decay as it is passed on.
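The child-node update $Q = Q_{prev} - (\beta_{prev} - \beta)/3$ and a queue kept in descending order of Q can be sketched as follows. Reading "binary sorting" as binary-search insertion with Python's `bisect` module is an interpretation, not something the source states, and the names `child_priority` and `ChildQueue` are hypothetical.

```python
import bisect


def child_priority(q_prev, depth_prev, depth, from_search_front_end=False):
    """Weight of a URL discovered on a page with weight q_prev.

    Only the search depth factor (1 / depth) differs from the parent.
    Requests from the search front end keep Q = 1 without attenuation.
    """
    if from_search_front_end:
        return 1.0
    return q_prev - (1.0 / depth_prev - 1.0 / depth) / 3.0


class ChildQueue:
    """URL queue ordered by descending Q, using binary-search insertion."""

    def __init__(self):
        self._keys = []  # negated Q values, kept in ascending order
        self._urls = []

    def push(self, url, q):
        i = bisect.bisect_right(self._keys, -q)
        self._keys.insert(i, -q)
        self._urls.insert(i, url)

    def pop(self):
        """Remove and return the URL with the highest Q."""
        self._keys.pop(0)
        return self._urls.pop(0)
```

For example, a depth-2 URL found on a depth-1 seed with Q = 0.8 receives 0.8 - (1 - 0.5)/3, or about 0.633.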
The segmentation of Centroid queue have employed weighted least-connection scheduling (Weighted Least-Connection Scheduling) algorithm.Each child node represents its treatability with corresponding weights Energy.Default weights are set to 1, and system manager can dynamically arrange the weights of child node server.Weighting Least-Connection Scheduling makes the built connection number of server and its weights proportional when the new connection of scheduling as far as possible. Distribution flow of task is as follows: assume to have one group of child node server S=S0, S1 ..., Sn-1}, W (Si) Represent server S i weights, C (Si) represent child node server S i currently connect number.All child nodes Server currently connect the summation of number be CSUM=Σ C (Si) (i=0,1 ..., n-1).
Current new connection request can be sent child node server S m, and if only if child node server Sm meets following condition and retransmits seed:
C ( S m ) / C S U M ) W ( S m ) = min { C ( S i ) / C S U M W ( S i ) } ⇒ C ( S m ) W ( S m ) = min { C ( Σ t ) W ( S i ) } , ( i = 0 , 1 , ... , n - 1 )
where W(Si) is not 0. The log of each child node is fed back to the central node at regular intervals, and the connection count C(Si) of a child server is obtained by reading that log. The invention compares the ratio of each child node's connection count to its a priori weight, finds the child node with the lowest load, and assigns the new crawl task to it.
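The selection rule above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the `ChildNode` class and `pick_node` function are hypothetical names, and the connection counts are passed in directly rather than read from child node logs as the text describes.

```python
from dataclasses import dataclass


@dataclass
class ChildNode:
    name: str
    weight: float = 1.0   # W(Si); the default weight is 1
    connections: int = 0  # C(Si); in the system this comes from the node's log


def pick_node(nodes: list) -> ChildNode:
    """Return the node Sm that minimizes C(Si) / W(Si).

    Nodes with weight 0 are skipped, matching the condition W(Si) != 0.
    Dividing every ratio by CSUM does not change which node is minimal,
    so CSUM is omitted here.
    """
    candidates = [n for n in nodes if n.weight > 0]
    if not candidates:
        raise ValueError("no child node with a non-zero weight")
    return min(candidates, key=lambda n: n.connections / n.weight)


nodes = [
    ChildNode("S0", weight=1.0, connections=4),  # ratio 4.0
    ChildNode("S1", weight=2.0, connections=6),  # ratio 3.0
    ChildNode("S2", weight=0.0, connections=0),  # excluded: weight is 0
]
chosen = pick_node(nodes)
chosen.connections += 1  # the new crawl task now counts against S1
```

A node with twice the weight is allowed to carry roughly twice the connections before it stops being preferred, which is how the administrator-set weight expresses processing capacity.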
Parts of the present invention that are not described in detail belong to techniques well known in the art.
The above is a further detailed description of the present invention in combination with specific preferred embodiments, and the specific embodiments of the invention should not be regarded as limited to this description. A person of ordinary skill in the art to which the invention belongs may make simple deductions or substitutions without departing from the inventive concept, and all of these shall be regarded as falling within the scope of patent protection determined by the submitted claims.

Claims (1)

1. A distributed network crawler system, characterized in that it comprises: a management portal, a central node server, and distributed child node servers; the management portal is the web interface that the crawler system provides to the administrator, through which the logs of the central node server and the distributed child node servers can be viewed, topics can be added, the URL seeds of a topic can be updated, the crawl frequency parameter of a topic can be configured, and the state of the crawler can be controlled; the central node server and the distributed child node servers, acting as the crawler, are the main body of the system and perform topic operations, learning of data extractors, page analysis, and storage of target pages;
(1) The central node server comprises a URL controller, an extractor module, and a topic control module;
The topic control module receives, through the management interface, the data sent by the management portal, including topic description data, add and delete operation data, and data controlling the topic crawl frequency; performs the operations on topics, including describing, adding, and deleting topics; controls the topic crawl frequency; and edits the seed queue of each topic and sends the topic seed queue to the extractor module and the URL controller;
The extractor module, after receiving a topic seed queue, first uses a basic analyzer to classify the web pages represented by the URL addresses in the seed queue into Deep Web pages and data-intensive (Data-intensive) pages, then performs extraction on the two kinds of pages separately to find the data extractor corresponding to each type, then records each URL address together with its corresponding data extractor, and sends the records to the URL controller;
The URL controller receives the seed queue sent by the topic control module and the records of URL addresses and corresponding data extractors sent by the extractor module, and integrates the two sets of data; it maps each URL address to the corresponding data extractor in the distributed child node servers, maps any URL address without a corresponding data extractor to the general extractor, queues all URL addresses, and sends the tasks to the distributed child node servers by the task partitioning method;
(2) Each distributed child node server comprises a child node URL controller, a data extractor, a search controller, and a page fetcher;
The child node URL controller receives the seed URLs and the corresponding data extractor information sent by the central node server; after receiving a URL it first checks the URL address for duplicates, then places URL addresses that have not already been collected into the queue, and sends the URL addresses in the queue and the corresponding data extractor information to the data extractor and the page fetcher;
The data extractor performs page analysis on Deep Web pages from the child node URL queue and extracts URLs from the page to form new URLs, which are equivalent to the objects obtained after a form is submitted, and passes them to the page fetcher; after receiving a page sent by the search controller, it uses the extractor corresponding to the page's URL address to extract the content and the URL addresses in the page, and then puts those URL addresses into the URL address library to await collection;
The page fetcher receives the URL addresses sent by the child node URL controller and the data extractor, fetches the web pages, and supplies the fetched pages to the search controller;
The search controller analyzes the collected pages it receives, saves pages that meet the requirements into the page pool, and otherwise passes the page to the data extractor;
The task partitioning method is implemented as follows:
(21) Calculate the ratio αi of the difference between the PR of the seed and the minimum PR in the URL queue to the difference between the maximum PR and the minimum PR:
$$ \alpha_i = \frac{PR_i - low(PR)}{top(PR) - low(PR)} $$
where PR is the page rank, i = 0, 1, ..., n-1, n is the number of child nodes, and top(PR) and low(PR) are the maximum and minimum PR respectively;
(22) Calculate the weight of the search depth; the target page depth under context guidance is 3, and the search depth weight factor βi is the reciprocal of the depth Li of the URL itself:
βi = 1/Li
(23) Calculate the crawl frequency with a Sigmoid function; the crawl frequency xi is calculated as follows:
where Fi is the crawl frequency of the seed, and Top(F) and Low(F) are the maximum and minimum frequency in the queue respectively;
The crawl frequency influence factor γi is calculated as:
where the value A is greater than 1; it is a weighting factor applied to the linearly smoothed result, intended to amplify the result of the preceding step;
(24) As judged from the Sigmoid function curve, the priority weight of a seed is the arithmetic mean Qi of the 3 influence factors:
$$ Q_i = \frac{\alpha_i + \beta_i + \gamma_i}{3} $$
(25) The seeds are then sorted in descending order of Qi; the two factors of crawl frequency and website importance are constant, so Q changes only with the search depth factor and is calculated as follows:
$$ Q = Q_{prev} - \frac{\beta_{prev} - \beta}{3} $$
where Qprev is the weight passed down from the parent URL, βprev is the search depth factor of the parent URL, and β is the search depth factor of the target URL;
(26) A new crawl task is distributed as follows: assume a group of child node servers S = {S0, S1, ..., Sn-1}, where W(Si) is the weight of child node server Si and C(Si) is the current number of connections of child node server Si; the total number of current connections of all child node servers is CSUM = Σ C(Si), i = 0, 1, ..., n-1;
The current new connection request is sent to child node server Sm, and the seed is forwarded to it, if and only if child node server Sm satisfies the following condition:
$$ \frac{C(S_m)/\mathrm{CSUM}}{W(S_m)} = \min\left\{ \frac{C(S_i)/\mathrm{CSUM}}{W(S_i)} \right\} \Rightarrow \frac{C(S_m)}{W(S_m)} = \min\left\{ \frac{C(S_i)}{W(S_i)} \right\}, \quad (i = 0, 1, \ldots, n-1) $$
where W(Si) is not 0; the log of each child node is fed back to the central node server at regular intervals, and the connection count C(Si) of a child node server is obtained by reading that log; the ratio of each distributed child node server's connection count to its a priori weight is compared to find the child node with the lowest load, and the new crawl task is assigned to it.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310274951.3A CN103310012B (en) 2013-07-02 2013-07-02 A kind of distributed network crawler system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310274951.3A CN103310012B (en) 2013-07-02 2013-07-02 A kind of distributed network crawler system

Publications (2)

Publication Number Publication Date
CN103310012A CN103310012A (en) 2013-09-18
CN103310012B true CN103310012B (en) 2016-09-28

Family

ID=49135230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310274951.3A Active CN103310012B (en) 2013-07-02 2013-07-02 A kind of distributed network crawler system

Country Status (1)

Country Link
CN (1) CN103310012B (en)

Families Citing this family (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559219B (en) * 2013-10-18 2016-12-07 北京京东尚科信息技术有限公司 Distributed network crawler capturing method for scheduling task, dispatching terminal equipment and crawl node
CN103605670B (en) * 2013-10-29 2017-03-29 北京奇虎科技有限公司 A kind of method and apparatus for determining the crawl frequency of network resource point
CN104778164B (en) * 2014-01-09 2018-01-30 中国银联股份有限公司 Detection repeats URL method and device
CN103997524A (en) * 2014-05-21 2014-08-20 浪潮电子信息产业股份有限公司 Distributed type modularized web crawler with high availability and extendibility
CN104199893B (en) * 2014-08-25 2018-01-30 成都华栖云科技有限公司 A kind of system and method for quickly issuing full media content
CN105656707B (en) * 2014-11-18 2019-03-26 阿里巴巴集团控股有限公司 A kind of method and system of test network crawler
CN104572901B (en) * 2014-12-25 2018-12-18 小米科技有限责任公司 The method for down loading and device of web data
CN104699757B (en) * 2015-01-15 2018-03-13 南京邮电大学 Distributed network information acquisition method under cloud environment
CN105989151B (en) * 2015-03-02 2019-09-06 阿里巴巴集团控股有限公司 Webpage capture method and device
CN104866555A (en) * 2015-05-15 2015-08-26 浪潮软件集团有限公司 Automatic acquisition method based on web crawler
CN106294393A (en) * 2015-05-20 2017-01-04 天脉聚源(北京)科技有限公司 A kind of method and system of web search
CN105045838A (en) * 2015-07-01 2015-11-11 华东师范大学 Network crawler system based on distributed storage system
CN106570011B (en) * 2015-10-09 2021-01-26 北京京东尚科信息技术有限公司 Distributed crawler URL seed distribution method, scheduling node and capturing node
CN105577684B (en) * 2016-01-25 2018-09-28 北京京东尚科信息技术有限公司 Method, server-side, client and the system of anti-crawler capturing
CN107180050A (en) * 2016-03-11 2017-09-19 精硕科技(北京)股份有限公司 A kind of data grabber system and method
CN105843965B (en) * 2016-04-20 2019-06-04 广东精点数据科技股份有限公司 A kind of Deep Web Crawler form filling method and apparatus based on URL subject classification
CN106294822A (en) * 2016-08-17 2017-01-04 国网上海市电力公司 A kind of electric power data visualization system
CN106572026B (en) * 2016-10-28 2020-04-10 上海斐讯数据通信技术有限公司 SDN-based load balancing method, device and system
CN106776934B (en) * 2016-11-30 2021-03-26 努比亚技术有限公司 Mobile terminal and implementation method of web crawler
CN106844475A (en) * 2016-12-23 2017-06-13 北京奇虎科技有限公司 It is determined that the method and device of hiding URL
CN106874487B (en) * 2017-02-21 2020-08-18 国信优易数据有限公司 Distributed crawler management system and method thereof
CN106803167A (en) * 2017-02-28 2017-06-06 深圳海带宝网络科技股份有限公司 A kind of cross-border electric business whole world goods clear customs system
CN107066530A (en) * 2017-03-01 2017-08-18 苏州朗动网络科技有限公司 A kind of data refresh system and method for refreshing data
CN106934027A (en) * 2017-03-14 2017-07-07 深圳市博信诺达经贸咨询有限公司 Distributed reptile realization method and system
CN107092826B (en) * 2017-03-24 2020-02-21 北京国舜科技股份有限公司 Webpage content safety real-time monitoring method
CN107241319B (en) * 2017-05-26 2020-06-02 山东省科学院情报研究所 Distributed network crawler system based on VPN and scheduling method
CN107423382A (en) * 2017-07-13 2017-12-01 中国物品编码中心 network crawling method and device
CN107562956A (en) * 2017-09-30 2018-01-09 麦格创科技(深圳)有限公司 Distributed reptile method for allocating tasks and system
CN110309389A (en) * 2018-03-14 2019-10-08 北京嘀嘀无限科技发展有限公司 Cloud computing system
CN109101521A (en) * 2018-06-12 2018-12-28 江苏开拓信息与系统有限公司 The automatic extraction system of data based on big data
CN108959524A (en) * 2018-06-28 2018-12-07 中译语通科技股份有限公司 A kind of method, system and information data processing terminal identifying data crawler
CN111092921B (en) * 2018-10-24 2022-05-10 北大方正集团有限公司 Data acquisition method, device and storage medium
CN109548752B (en) * 2018-11-16 2021-09-07 深圳市鑫稻田农业技术科技有限公司 Forestry-based multifunctional storage device used during night scorpion catching
CN109902220B (en) * 2019-02-27 2023-11-24 腾讯科技(深圳)有限公司 Webpage information acquisition method, device and computer readable storage medium
CN111382332B (en) * 2019-04-02 2021-12-17 江苏省地震局 Earthquake disaster information processing method and system
CN110245280B (en) * 2019-05-06 2021-03-02 北京三快在线科技有限公司 Method and device for identifying web crawler, storage medium and electronic equipment
CN110532453B (en) * 2019-08-12 2022-07-22 北京智游网安科技有限公司 Method for adjusting crawler updating frequency, storage medium and crawler server
CN111488507B (en) * 2020-04-09 2023-05-23 西安影视数据评估中心有限公司 Optimization method of network proxy
CN111428115A (en) * 2020-04-16 2020-07-17 行吟信息科技(上海)有限公司 Webpage information processing method and device
CN112035721A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Crawler cluster monitoring method and device, storage medium and computer equipment
CN114095207A (en) * 2021-10-26 2022-02-25 北京连星科技有限公司 IPv6 website detection method based on distributed scheduling

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8346763B2 (en) * 2007-03-30 2013-01-01 Microsoft Corporation Ranking method using hyperlinks in blogs
CN101561814B (en) * 2009-05-08 2012-05-09 华中科技大学 Topic crawler system based on social labels
CN101630327A (en) * 2009-08-14 2010-01-20 昆明理工大学 Design method of theme network crawler system
CN102646129B (en) * 2012-03-09 2013-12-04 武汉大学 Topic-relative distributed web crawler system

Also Published As

Publication number Publication date
CN103310012A (en) 2013-09-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220624

Address after: Room 2101, block D, Zhizhen building, No. 7, Zhichun Road, Haidian District, Beijing 100191

Patentee after: HUIKE EDUCATION TECHNOLOGY GROUP Co.,Ltd.

Address before: 100191 No. 37, Haidian District, Beijing, Xueyuan Road

Patentee before: BEIHANG University