CN106202108A

CN106202108A - Web crawler crawl task allocation method and device, and data crawling method and device

Info

Publication number: CN106202108A
Application number: CN201510227531.9A
Authority: CN
Inventors: 刘庆; 张美德; 殷贤君; 邹启蒙
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2015-05-06
Filing date: 2015-05-06
Publication date: 2016-12-07
Anticipated expiration: 2035-05-06
Also published as: CN106202108B

Abstract

Embodiments of the present application disclose a web crawler crawl task allocation method and device, and a data crawling method and device. In the web crawler data crawling method, the history pages that have been crawled are saved in history crawl data. When a crawl task arrives, it is determined whether the history crawl data contain history pages corresponding to the URL cluster of the crawl task. If such history pages exist and are available, first target page data are extracted directly from the available history pages, so those pages do not have to be crawled again, which saves system resources. In addition, the network bandwidth utilization of all web crawlers is detected at timed intervals, and the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler are calculated and stored. The availability of each web crawler is then calculated from E(w) and D(w), and, among the available web crawlers, the web crawlers that execute the task are selected in descending order of availability probability, so that web crawler resources are allocated reasonably.

Description

Web crawler crawl task allocation method and device, and data crawling method and device

Technical field

The present invention relates to the field of network technology, and in particular to a web crawler crawl task allocation method and device, and a data crawling method and device.

Background technology

A web crawler (Computer Robot, also called a web spider or web robot) is a program that automatically captures Internet web page data according to certain rules, and is an important component of a search engine. Typically, a web crawler downloads web pages from the Internet according to a configured crawl task, then parses and filters the pages to obtain target web page data. All target web page data captured by the web crawlers are stored in the crawler system and indexed for later query and retrieval.

The development of network and information technology has led to rapid growth in the number of websites, web pages, and web page data. A crawler system needs many web crawlers to capture large amounts of web page data, and these web crawlers may be located within the same local area network (LAN) or dispersed across different geographic locations. Depending on how dispersed the web crawlers are, crawler systems fall mainly into two architectures: 1. Distributed architecture based on a LAN: all web crawlers run in the same LAN and access the external Internet through that LAN to download web pages. All network load is concentrated on the LAN egress, and because the total bandwidth of the egress is fixed, the number of web crawlers is limited by the LAN egress bandwidth; 2. Distributed architecture based on a wide area network (WAN): parallel web crawlers run in different geographic or network locations. For example, the web crawlers may be located in different machine rooms; or in China, Japan, and the United States, each responsible for downloading the web pages of its own region; or in CHINANET, CERNET, and CEINET, each responsible for downloading the web pages within its own network. Under this architecture each web crawler accesses the external Internet through its own network and is affected only by its own egress bandwidth, which spreads network traffic and reduces network load.

The WAN-based distributed architecture is currently the more widely used crawler system architecture. Under this architecture, however, the number of web crawlers is large, content crawling demands are numerous, and the web crawlers are deployed in a dispersed manner, so the egress bandwidth available to each web crawler may vary widely, and unreasonable use of network resources reduces crawling efficiency. In addition, as the number of crawl tasks grows, pages are easily crawled repeatedly. For example, task A needs to capture the data at the top of page C. After the crawler system completes task A, the next task B needs to capture the data at the bottom of page C. Because the web crawler returned and stored only the top data of page C for task A, page C has to be downloaded once more to obtain its bottom data. This wastes crawler resources and increases the pressure on the web crawler cluster.

Summary of the invention

To overcome the problems in the related art that web crawler crawl tasks use network resources unreasonably and that pages are crawled repeatedly, the present application provides a web crawler crawl task allocation method and device, and a data crawling method and device.

According to a first aspect of the embodiments of the present application, a web crawler crawl task allocation method is provided, including:

Detecting the network bandwidth utilization of the web crawlers at timed intervals, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;

When a crawl task is received, calculating the availability of each web crawler according to the mean E(w) and variance D(w) of the network bandwidth utilization;

Determining, according to the availability, the web crawlers that execute the task.

Optionally, detecting the network bandwidth utilization of the web crawlers at timed intervals, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, includes:

Collecting the network bandwidth utilization a preset number of times at timed intervals;

Calculating and storing the mean of the network bandwidth utilization $E(w)=\frac{1}{n}\sum_{i=1}^{n} w_i$ and the variance $D(w)=E\left[(w-E(w))^2\right]$, where $w_i$ is the i-th collected value and n is the preset number of times.

Optionally, calculating the availability of each web crawler according to the mean E(w) and variance D(w) of the network bandwidth utilization includes:

Extracting the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within a preset time period before the current time;

Calculating the availability probability and unavailability probability of each web crawler with a naive Bayes classification algorithm, according to the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within the preset time period.

Optionally, determining the web crawlers that execute the task according to the availability includes:

Comparing the availability probability and the unavailability probability of each web crawler;

If the availability probability of a web crawler is greater than its unavailability probability, determining that the web crawler is available; otherwise, the web crawler is unavailable;

Among the available web crawlers, selecting the web crawlers that execute the task in descending order of availability probability.

According to a second aspect of the embodiments of the present application, a web crawler data crawling method is provided, including:

When a crawl task is received, querying the history crawl data, according to the configuration information of the crawl task, for a first history page, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the history crawl data include the history pages downloaded while executing history crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;

If the first history page exists, detecting whether the first history page is available;

If the first history page is available, parsing the available first history page to obtain first target page data;

If the first history page is unavailable, downloading and parsing the corresponding pages according to the URL cluster corresponding to the unavailable first history page and the URL cluster for which no first history page exists in the history crawl data, to obtain second target page data.
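The split between pages that can be reused and pages that must be downloaded can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the name `plan_crawl`, the dictionary-based `history_store`, and the `is_usable` predicate are hypothetical and do not come from the application.

```python
def plan_crawl(url_cluster, history_store, is_usable):
    """Split a URL cluster into URLs served from history and URLs to download.

    url_cluster:   list of URLs taken from the crawl task configuration.
    history_store: hypothetical dict mapping URL -> saved full-page HTML.
    is_usable:     hypothetical predicate deciding whether a saved page
                   can still be parsed for this task.
    """
    reuse, download = [], []
    for url in url_cluster:
        page = history_store.get(url)
        if page is not None and is_usable(page):
            # Available first history page: parse it for first target page data.
            reuse.append(url)
        else:
            # Missing or unavailable: the page has to be downloaded again.
            download.append(url)
    return reuse, download
```

The URLs in `reuse` would yield the first target page data, and the URLs in `download` would yield the second target page data.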

Optionally, detecting whether the first history page is available includes:

Pre-parsing the first history page, and judging whether the page tag that holds the target page data in the crawl task exists in the first history page;

If the page tag exists in the first history page, judging whether the attributes of the page tag in the first history page are the same as the attributes of the page tag that holds the target page data in the crawl task;

If the attributes of the page tag in the first history page are the same as the attributes of the page tag that holds the target page data in the crawl task, determining that the first history page is available;

If the page tag does not exist in the first history page, or if the attributes of the page tag in the first history page differ from the attributes of the page tag that holds the target page data in the crawl task, determining that the first history page is unavailable.

Optionally, the configuration information also includes a history page validity period, and detecting whether the first history page is available includes:

Judging whether the first history page is within the history page validity period;

If the first history page is within the history page validity period, pre-parsing the first history page, and judging whether the page tag that holds the target page data in the crawl task exists in the first history page;

If the first history page is not within the history page validity period, or the page tag does not exist in the first history page, or the attributes of the page tag in the first history page differ from the attributes of the page tag that holds the target page data in the crawl task, determining that the first history page is unavailable.
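The availability check described above (validity period, tag presence, attribute match) might look like the following Python sketch. The function name `history_page_usable`, its parameters, and the decision to treat "same attributes" as equality of the tag's full attribute dictionary are assumptions made for illustration; the application does not specify them.

```python
import time
from html.parser import HTMLParser


class _TagFinder(HTMLParser):
    """Collects the attribute dictionaries of every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.found_attrs = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.found_attrs.append(dict(attrs))


def history_page_usable(html, saved_at, target_tag, target_attrs,
                        valid_seconds, now=None):
    """Return True if a saved page can be reused for the crawl task."""
    now = time.time() if now is None else now
    # Step 1: the page must still be inside its validity period.
    if now - saved_at > valid_seconds:
        return False
    # Step 2: pre-parse and look for the tag that holds the target data.
    finder = _TagFinder(target_tag)
    finder.feed(html)
    if not finder.found_attrs:
        return False
    # Step 3: some occurrence must carry the attributes the task expects
    # (assumed here to mean an identical attribute dictionary).
    return any(attrs == target_attrs for attrs in finder.found_attrs)
```

Passing `now` explicitly keeps the function deterministic when it is tested; in normal use it defaults to the current time.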

According to a third aspect of the embodiments of the present application, another web crawler data crawling method is provided, including:

If the first history page is unavailable, forming a second crawl task according to the uniform resource locator (URL) cluster corresponding to the unavailable first history page and the URL cluster for which no first history page exists in the history crawl data;

Calculating the availability of each web crawler according to the mean E(w) and variance D(w) of the network bandwidth utilization;

Determining, according to the availability of each web crawler, the web crawlers that execute the second crawl task;

Capturing the second target page data by the web crawlers that execute the second crawl task,

Wherein the first target page data and the second target page data form the target page data of the crawl task.

Collecting the network bandwidth utilization a preset number of times at timed intervals;

If the availability probability of a web crawler is greater than its unavailability probability, determining that the web crawler is available; otherwise, determining that the web crawler is unavailable;

Corresponding to the first aspect of the embodiments of the present application, according to a fourth aspect of the embodiments of the present application, a web crawler crawl task allocation device is provided, including:

A network bandwidth utilization processing unit, configured to detect the network bandwidth utilization of the web crawlers at timed intervals, and to calculate and store the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;

A crawler availability calculation unit, configured to calculate, when a crawl task is received, the availability of each web crawler according to the mean E(w) and variance D(w) of the network bandwidth utilization;

A first crawler determination unit, configured to determine, according to the availability, the web crawlers that execute the task.

Optionally, the network bandwidth utilization processing unit includes:

A network bandwidth utilization collection module, configured to collect the network bandwidth utilization a preset number of times at timed intervals;

A mean and variance calculation module, configured to calculate and store the mean of the network bandwidth utilization $E(w)=\frac{1}{n}\sum_{i=1}^{n} w_i$ and the variance $D(w)=E\left[(w-E(w))^2\right]$, where $w_i$ is the i-th collected value and n is the preset number of times.

Optionally, the crawler availability calculation unit includes:

A mean and variance extraction module, configured to extract the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within a preset time period before the current time;

A probability calculation module, configured to calculate the availability probability and unavailability probability of each web crawler with a naive Bayes classification algorithm, according to the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within the preset time period.

Optionally, the first crawler determination unit includes:

A comparison module, configured to compare the availability probability and the unavailability probability of each web crawler;

An availability determination module, configured to determine that a web crawler is available if its availability probability is greater than its unavailability probability, and otherwise to determine that the web crawler is unavailable;

A selection module, configured to select, among the available web crawlers, the web crawlers that execute the task in descending order of availability probability.

Corresponding to the second aspect of the embodiments of the present application, according to a fifth aspect of the embodiments of the present application, a web crawler data crawling device is provided, including:

A first history page query unit, configured to query the history crawl data, when a crawl task is received and according to the configuration information of the crawl task, for a first history page, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the history crawl data include the history pages downloaded while executing history crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;

A history page availability detection unit, configured to detect, if the first history page exists, whether the first history page is available;

A first parsing unit, configured to parse the available first history page to obtain first target page data if the first history page is available;

A first crawling unit, configured to download and parse the corresponding pages, if the first history page is unavailable, according to the URL cluster corresponding to the unavailable first history page and the URL cluster for which no first history page exists in the history crawl data, to obtain second target page data.

Optionally, the history page availability detection unit includes:

A first judgment module, configured to pre-parse the first history page and judge whether the page tag that holds the target page data in the crawl task exists in the first history page;

A second judgment module, configured to judge, if the page tag exists in the first history page, whether the attributes of the page tag in the first history page are the same as the attributes of the page tag that holds the target page data in the crawl task;

An available determination module, configured to determine that the first history page is available if the attributes of the page tag in the first history page are the same as the attributes of the page tag that holds the target page data in the crawl task;

An unavailable determination module, configured to determine that the first history page is unavailable if the page tag does not exist in the first history page, or if the attributes of the page tag in the first history page differ from the attributes of the page tag that holds the target page data in the crawl task.

Optionally, the configuration information also includes a history page validity period, and the history page availability detection unit includes:

A third judgment module, configured to judge whether the first history page is within the history page validity period;

A fourth judgment module, configured to pre-parse the first history page, if the first history page is within the history page validity period, and judge whether the page tag that holds the target page data in the crawl task exists in the first history page;

A fifth judgment module, configured to judge, if the page tag exists in the first history page, whether the attributes of the page tag in the first history page are the same as the attributes of the page tag that holds the target page data in the crawl task;

An unavailable determination module, configured to determine that the first history page is unavailable if the first history page is not within the history page validity period, or the page tag does not exist in the first history page, or the attributes of the page tag in the first history page differ from the attributes of the page tag that holds the target page data in the crawl task.

Corresponding to the third aspect of the embodiments of the present application, according to a sixth aspect of the embodiments of the present application, another web crawler data crawling device is provided, including:

A second crawl task generation unit, configured to form a second crawl task, if the first history page is unavailable, according to the uniform resource locator (URL) cluster corresponding to the unavailable first history page and the URL cluster for which no first history page exists in the history crawl data;

A crawler availability calculation unit, configured to calculate the availability of each web crawler according to the mean E(w) and variance D(w) of the network bandwidth utilization;

A second crawler determination unit, configured to determine, according to the availability of each web crawler, the web crawlers that execute the second crawl task;

A second crawling unit, configured to capture the second target page data through the web crawlers that execute the second crawl task.

A mean and variance calculation module, configured to calculate and store the mean of the network bandwidth utilization $E(w)=\frac{1}{n}\sum_{i=1}^{n} w_i$ and the variance $D(w)=E\left[(w-E(w))^2\right]$, where $w_i$ is the i-th collected value and n is the preset number of times.

Optionally, the history page availability detection unit includes:

Optionally, the crawler availability calculation unit includes:

Optionally, the second crawler determination unit includes:

The technical solutions provided by the embodiments of the present application can have the following beneficial effects. In the web crawler data crawling method provided by the embodiments of the present application, the crawler system saves the full content of the history pages it has captured in the history crawl data. When a crawl task arrives, it is determined whether the history crawl data contain history pages corresponding to the URL cluster of the crawl task. If corresponding history pages exist and are available, first target page data are extracted directly from the available history pages, so those pages do not have to be crawled again, which saves system resources. For the URL cluster corresponding to unavailable history pages and the URL cluster for which no history page exists in the history crawl data, a second crawl task is formed, and web crawlers capture the corresponding pages to obtain second target page data, which completes the capture of all target page data. Meanwhile, unlike the common practice of using all web crawlers or randomly selecting web crawlers to complete a crawl task, the network bandwidth utilization of all web crawlers is first detected at timed intervals, and the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler are calculated and stored. The availability of each web crawler is then calculated from E(w) and D(w), and, among the available web crawlers, the web crawlers that execute the task are selected in descending order of availability probability. Better-suited web crawlers are thereby chosen to execute the crawl task, which allocates web crawler resources reasonably and improves crawling efficiency.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present application.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the accompanying drawings needed to describe the embodiments or the prior art are briefly introduced below. Obviously, a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

Fig. 1 is a schematic flowchart of a web crawler crawl task allocation method according to an exemplary embodiment of the present application.

Fig. 2 is a schematic flowchart of a web crawler data crawling method according to an exemplary embodiment of the present application.

Fig. 3 is a schematic flowchart of another web crawler data crawling method according to an exemplary embodiment of the present application.

Fig. 4 is a block diagram of a web crawler crawl task allocation device according to an exemplary embodiment of the present application.

Fig. 5 is a block diagram of a web crawler data crawling device according to an exemplary embodiment of the present application.

Fig. 6 is a block diagram of another web crawler data crawling device according to an exemplary embodiment of the present application.

Detailed description of the embodiments

Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of devices and methods consistent with some aspects of the present application as detailed in the appended claims.

To provide a thorough understanding of the present application, numerous specific details are set out in the following detailed description, but those skilled in the art should understand that the present application can be practiced without these specific details. In other embodiments, well-known methods, processes, components, and circuits are not described in detail, so as not to obscure the embodiments unnecessarily.

Fig. 1 is a schematic flowchart of a web crawler crawl task allocation method according to an exemplary embodiment of the present application. As shown in Fig. 1, the method includes:

S101, detecting the network bandwidth utilization of the web crawlers at timed intervals, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization.

Wherein, the network bandwidth usage of each web crawler is detected at timed intervals. The timing may be a fixed periodic interval, for example one detection every 20 min, or detection may be performed at several fixed time points chosen according to how the crawler system runs, for example at the 10th, 25th, 30th, 40th, and 60th minute of every hour of each day. At each detection, the network bandwidth utilization is collected a preset number of times in order to calculate the mean E(w) and variance D(w) of the network bandwidth utilization. Any conventional method of collecting network bandwidth utilization can be used. It should be noted that the preset number of times should be set so that the time needed to collect that many values is less than or equal to the detection interval. If it is pre-

The calculated mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler are stored in the crawler system, and each mean E(w) and variance D(w) is associated with its detection time.
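As an illustration of step S101, the following Python sketch computes E(w) and D(w) for one detection round and stores them with the detection time. It is a minimal sketch under stated assumptions: the names `bandwidth_stats`, `record_round`, and `stats_store` are hypothetical, and the population variance is used on the assumption that D(w) is the average squared deviation over the n collected values.

```python
from statistics import fmean, pvariance


def bandwidth_stats(samples):
    """Return (E(w), D(w)) for one detection round.

    samples: the n utilization values collected in the round, each in [0, 1].
    """
    if not samples:
        raise ValueError("at least one sample is required")
    mean = fmean(samples)
    # Population variance: the mean squared deviation from E(w),
    # i.e. D(w) = E[(w - E(w))^2] taken over the n collected values.
    variance = pvariance(samples, mu=mean)
    return mean, variance


# Hypothetical in-memory store: crawler id -> list of (detected_at, E(w), D(w)).
stats_store = {}


def record_round(crawler_id, detected_at, samples):
    """Compute the statistics for one round and keep them with their time."""
    mean, variance = bandwidth_stats(samples)
    stats_store.setdefault(crawler_id, []).append((detected_at, mean, variance))
    return mean, variance
```

Keeping the detection time next to each pair makes it possible to extract only the records that fall inside a preset time period before a crawl task arrives.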

S102, when a crawl task is received, calculating the availability of each web crawler according to the mean E(w) and variance D(w) of the network bandwidth utilization.

Wherein, calculating the availability of each web crawler according to the mean E(w) and variance D(w) of the network bandwidth utilization includes:

a1, extracting the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within a preset time period before the current time;

a2, calculating the availability probability and unavailability probability of each web crawler with a naive Bayes classification algorithm, according to the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within the preset time period.

After the crawler system receives a crawl task, it first takes the current time (that is, the time the crawl task is received) as the starting point and extracts, from the stored mean and variance data, the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within a preset time period before the current time, for example the data of each web crawler within the 2 h before the current time. A naive Bayes classification algorithm is then used to calculate the availability probability and unavailability probability of each web crawler, as follows:

Let the availability of a web crawler be denoted by a ∈ {0, 1}, where 0 means that the network bandwidth utilization of the web crawler allows it to receive a new task and 1 means that it does not. Let the network attribute of the web crawler be x = {b, c}, where b is the mean E(w) of the network bandwidth utilization and c is the variance D(w) of the network bandwidth utilization, so that 0 ≤ b ≤ 1 and 0 ≤ c ≤ 1. The availability probability of the web crawler is P(a=0 | x) and the unavailability probability is P(a=1 | x). Segmentation thresholds are determined for b and c for the naive Bayes classification algorithm. These thresholds are determined in advance through repeated experiments based on the overall network usage of the crawler system, and are stored in the crawler system. For example, if the segmentation thresholds of b are 0.3 and 0.8 and the segmentation thresholds of c are 0.2 and 0.6, the naive Bayes classification algorithm is used to calculate the following conditional probabilities:

$P(0<b\le 0.3 \mid a=0)$, $P(0.3<b\le 0.8 \mid a=0)$, $P(0.8<b\le 1 \mid a=0)$,

$P(0<c\le 0.2 \mid a=0)$, $P(0.2<c\le 0.6 \mid a=0)$, $P(0.6<c\le 1 \mid a=0)$,

$P(0<b\le 0.3 \mid a=1)$, $P(0.3<b\le 0.8 \mid a=1)$, $P(0.8<b\le 1 \mid a=1)$,

$P(0<c\le 0.2 \mid a=1)$, $P(0.2<c\le 0.6 \mid a=1)$, $P(0.6<c\le 1 \mid a=1)$.

Assuming that the current network attribute of a certain web crawler is x = {b=0.35, c=0.15}, then:

$P(a=0 \mid x) \propto P(a=0)\,P(0.3<b\le 0.8 \mid a=0)\,P(0<c\le 0.2 \mid a=0)$,

$P(a=1 \mid x) \propto P(a=1)\,P(0.3<b\le 0.8 \mid a=1)\,P(0<c\le 0.2 \mid a=1)$,

where the common factor 1/P(x) is omitted because it does not affect the comparison of the two values.

P(a=0 | x) and P(a=1 | x) can thus be calculated. If P(a=0 | x) ≥ P(a=1 | x), the network bandwidth of the current web crawler has not reached saturation and the web crawler can receive a new crawl task, that is, it is available; otherwise it is unavailable. The availability probability and unavailability probability of a web crawler can also be obtained by methods other than calculating them from the mean E(w) and variance D(w) of the network bandwidth utilization. For example, a detection device may ping (an end-to-end connectivity check commonly used for availability inspection) the device running each web crawler at a preset time interval, record the number of successful pings and failed pings for each web crawler within a certain period, and calculate the corresponding ping success probability and ping failure probability. The ping success probability serves as the availability probability and the ping failure probability as the unavailability probability. If the availability probability is greater than the unavailability probability, or the availability probability is greater than a preset threshold, the web crawler is considered available.
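The naive Bayes calculation above can be sketched in Python as follows. This is an illustrative sketch, not the application's implementation: it assumes that labelled history records `(b, c, a)` are available for estimating the probabilities (the application does not say how they are estimated), it reuses the example thresholds 0.3/0.8 and 0.2/0.6, and it adds Laplace smoothing (`alpha`) so that an empty interval does not produce a zero probability. All function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

B_EDGES = (0.3, 0.8)  # example segmentation thresholds for b = E(w)
C_EDGES = (0.2, 0.6)  # example segmentation thresholds for c = D(w)


def bin_index(value, edges):
    # Intervals are right-closed: 0.3 falls in (0, 0.3], 0.35 in (0.3, 0.8].
    return bisect_left(edges, value)


def train(records, alpha=1.0):
    """Estimate priors and per-interval conditionals from (b, c, a) records.

    a = 0 means the crawler was available, a = 1 means it was not.
    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    class_counts, b_counts, c_counts = Counter(), Counter(), Counter()
    for b, c, a in records:
        class_counts[a] += 1
        b_counts[(a, bin_index(b, B_EDGES))] += 1
        c_counts[(a, bin_index(c, C_EDGES))] += 1
    total = sum(class_counts.values())
    nb, nc = len(B_EDGES) + 1, len(C_EDGES) + 1
    model = {}
    for a in (0, 1):
        prior = (class_counts[a] + alpha) / (total + 2 * alpha)
        p_b = [(b_counts[(a, k)] + alpha) / (class_counts[a] + alpha * nb)
               for k in range(nb)]
        p_c = [(c_counts[(a, k)] + alpha) / (class_counts[a] + alpha * nc)
               for k in range(nc)]
        model[a] = (prior, p_b, p_c)
    return model


def scores(model, b, c):
    """Return the unnormalized values of P(a=0|x) and P(a=1|x)."""
    result = []
    for a in (0, 1):
        prior, p_b, p_c = model[a]
        result.append(prior * p_b[bin_index(b, B_EDGES)]
                      * p_c[bin_index(c, C_EDGES)])
    return tuple(result)


def is_available(model, b, c):
    s0, s1 = scores(model, b, c)
    return s0 >= s1  # available when P(a=0|x) >= P(a=1|x)
```

Because both scores share the same normalizing factor, comparing the unnormalized products gives the same decision as comparing the posterior probabilities.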

S103, determining, according to the availability, the web crawlers that execute the task.

Wherein, step S103 includes:

Wherein, after the above availability probability and unavailability probability have been calculated for all web crawlers and the available web crawlers have been determined, the available web crawlers are sorted in descending order of the availability probability P(a=0 | x), and web crawlers are then selected in that order and assigned the crawl task. The greater the availability probability, the lower the network bandwidth saturation of the web crawler. The number of web crawlers needed can be calculated from the network bandwidth traffic required by the crawl task, the network bandwidth saturation of the available web crawlers, and the availability probability ordering. If the available web crawlers are not sufficient to complete the crawl task, the crawler system can wait until other web crawlers become idle and then use the idle web crawlers to complete the remaining workload.
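The selection step can be sketched in Python as follows. The record fields (`id`, `p_available`, `p_unavailable`, `spare_bandwidth`) and the idea of counting down a required bandwidth figure are assumptions made for illustration; the application only states that the number of crawlers is derived from the required traffic, the saturation of the available crawlers, and the probability ordering.

```python
def select_crawlers(candidates, required_bandwidth):
    """Pick crawlers in descending order of availability probability.

    candidates: list of dicts with the hypothetical keys 'id',
                'p_available', 'p_unavailable' and 'spare_bandwidth'.
    Returns (chosen_ids, unmet_bandwidth); a positive unmet_bandwidth means
    the system has to wait for other crawlers to become idle.
    """
    # Keep only crawlers whose availability probability is the larger one.
    usable = [c for c in candidates if c["p_available"] > c["p_unavailable"]]
    # Highest availability probability first (lowest bandwidth saturation).
    usable.sort(key=lambda c: c["p_available"], reverse=True)

    chosen, remaining = [], required_bandwidth
    for crawler in usable:
        if remaining <= 0:
            break
        chosen.append(crawler["id"])
        remaining -= crawler["spare_bandwidth"]
    return chosen, max(remaining, 0)
```

Returning the unmet bandwidth lets the caller decide whether to queue the remaining workload until more crawlers are idle.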

Fig. 2 is a schematic flowchart of a web crawler data crawling method according to an exemplary embodiment of the present application. As shown in Fig. 2, the method includes:

Step S201, when a crawl task is received, querying the history crawl data, according to the configuration information of the crawl task, for a first history page, where the configuration information includes the uniform resource locator (URL) cluster of the crawl task, the history crawl data include the history pages downloaded while executing history crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task.

The crawler system saves the full content of every page that the web crawlers fetch while executing crawl tasks, i.e. the history pages, in the historical crawl data. The historical crawl data may contain only history pages, or may also contain the historical target page data that was extracted. When the crawler system receives a crawl task, it queries the historical crawl data for first history pages, i.e. history pages corresponding to the configuration information of the crawl task. For example, suppose a user wants to crawl the search-type pages corresponding to the URL http://s.1688.com/selloffer/offer_search.htm?keywords=${key}&n=y. The user uploads this URL to the crawler system together with three keywords (key): mp3, 手机 (mobile phone) and 电脑 (computer). That is, the data in the following three pages is to be crawled:

http://s.1688.com/selloffer/offer_search.htm?keywords=mp3&n=y,

http://s.1688.com/selloffer/offer_search.htm?keywords=手机&n=y and

http://s.1688.com/selloffer/offer_search.htm?keywords=电脑&n=y. As shown above, the URLs formed by replacing the ${key} part of the URL with the three keywords make up the URL (uniform resource locator) cluster of the crawl task, which is included in the configuration information of the crawl task. After receiving the crawl task, the crawler system searches the historical crawl data, according to the URL cluster in the configuration information, for the history pages corresponding to the three URLs in the cluster.
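The construction of the URL cluster and the lookup in the historical crawl data may be sketched as follows. This is a minimal sketch under stated assumptions: `history_store` is a hypothetical mapping from URL to saved page content, and percent-encoding the keywords as UTF-8 is an assumption, since the description does not say how non-ASCII keywords are encoded.

```python
from urllib.parse import quote

# Template from the example above; "${key}" marks the keyword slot.
TEMPLATE = "http://s.1688.com/selloffer/offer_search.htm?keywords=${key}&n=y"


def expand_url_cluster(template, keywords):
    """Replace the ${key} slot with each keyword to form the URL cluster."""
    # Assumption: keywords are percent-encoded as UTF-8.
    return [template.replace("${key}", quote(k)) for k in keywords]


def find_history_pages(url_cluster, history_store):
    """Split the cluster into URLs with and without a stored history page."""
    found = {u: history_store[u] for u in url_cluster if u in history_store}
    missing = [u for u in url_cluster if u not in history_store]
    return found, missing


cluster = expand_url_cluster(TEMPLATE, ["mp3", "手机", "电脑"])
# Hypothetical store holding a saved copy of the first URL only.
found, missing = find_history_pages(cluster, {cluster[0]: "<html>...</html>"})
print(len(found), len(missing))
```

URLs returned in `found` are candidates for reuse and still have to pass the availability check of step S202; URLs in `missing` have no history page and must be crawled.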

Step S202: if the first history page exists, detect whether the first history page is available.

The historical crawl data may contain first history pages for every URL in the URL cluster of the crawl task, for only part of the URL cluster, or for none of it. The HTML (HyperText Markup Language) code of a crawled page generally keeps changing, and there is no way to know at what time it changed or how. For example, suppose the goal of the crawl task is to extract the title of page A. When page A was crawled 3 days ago, its title data was under the "a" tag; by the time today's crawl task is generated, the title data of page A has moved under the "b" tag. History page A does exist in the historical crawl data, but it may contain no "b" tag at all, or it may contain a "b" tag under which the title data is clearly not located, so looking for the title data under the "b" tag in the 3-day-old page A would produce an error. Therefore, when first history pages exist in the historical crawl data, availability detection is performed on them.

In a first possible embodiment, detecting whether the first history page is available includes:

b1. Pre-parse the first history page and judge whether the page tag that holds the target page data in the crawl task exists in the first history page.

Here, the first history page is pre-parsed to obtain the tags it contains, and the page tag that holds the target page data in the crawl task, for example the page tag that holds the title data, is compared with the page tags in the first history page to judge whether that page tag exists in the first history page.

b2. If the page tag exists in the first history page, judge whether the attributes of the page tag in the first history page are the same as the attributes of the page tag that holds the target page data in the crawl task.

Here, if the page tag that holds the target page data in the crawl task exists in the first history page, the attributes of that tag in the first history page are further compared with the attributes of the page tag that holds the target page data in the crawl task, in order to judge whether the data under that tag has changed.

b3. If the attributes of the page tag in the first history page are the same as the attributes of the page tag that holds the target page data in the crawl task, determine that the first history page is available.

b4. If the page tag does not exist in the first history page, or if the attributes of the page tag in the first history page differ from the attributes of the page tag that holds the target page data in the crawl task, determine that the first history page is unavailable.

If the attributes of the page tag in the first history page are the same as the attributes of the page tag that holds the target page data in the crawl task, the data under that tag has not changed between the first history page and the page to be crawled, so the first history page can be used to obtain the target page data; that is, the first history page is available. If the page tag does not exist in the first history page, the first history page is unavailable. If the page tag exists in the first history page but its attributes differ from the attributes of the page tag that holds the target page data in the crawl task, the data under that page tag in the first history page has changed, and the first history page is unavailable.
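Steps b1 to b4 may be sketched with the HTML parser of the Python standard library. This is an illustrative sketch, not the applicant's implementation: representing the target as a tag name plus an attribute dictionary, and treating the page as available when any occurrence of the tag has matching attributes, are assumptions introduced here.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Pre-parse a page and record every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag_name, {attribute: value})

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def history_page_available(html, target_tag, target_attrs):
    collector = TagCollector()
    collector.feed(html)
    # b1: does the target tag exist in the history page?
    candidates = [attrs for tag, attrs in collector.tags if tag == target_tag]
    if not candidates:
        return False  # b4: the tag is missing
    # b2/b3: available only if the tag's attributes match those in the task;
    # otherwise b4 applies because the attributes differ.
    return any(attrs == target_attrs for attrs in candidates)


old_page = '<html><body><a class="title">Page A</a></body></html>'
print(history_page_available(old_page, "a", {"class": "title"}))
print(history_page_available(old_page, "b", {"class": "title"}))
```

In the page A example, the stored copy passes the check while the task still targets the "a" tag, and fails it once the task targets the "b" tag, because no such tag exists in the stored copy.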

Step S202 performs availability detection on each first history page. The result may be that all first history pages are available, that only some of them are available, or that none of them is available.

Step S203: if the first history page is available, parse the available first history page to obtain first target page data.

Step S204: if the first history page is unavailable, download and parse the corresponding pages according to the URL cluster corresponding to the unavailable first history pages and the URL cluster for which no first history page exists in the historical crawl data, to obtain second target page data.

The available first history pages are used directly to extract the first target page data. In other words, if the historical crawl data contains an available first history page corresponding to part of the URL cluster of the crawl task, the pages for that part of the URL cluster need not be crawled again, which avoids repeated crawling of pages. For the URLs in the crawl task for which no first history page exists in the historical crawl data, and for the URLs whose first history page exists but is unavailable, the corresponding pages are downloaded over the network and parsed, that is, those pages are crawled, to obtain the second target page data. The first target page data and the second target page data together form the target page data of the crawl task.
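The division of work between steps S203 and S204 may be sketched as follows. The function and parameter names are hypothetical: `history_store` is assumed to map each URL to its saved page, and `is_available`, `parse` and `download` are caller-supplied callables standing in for the availability check, the page parser and the network download.

```python
def crawl_with_history(url_cluster, history_store, is_available, parse, download):
    """Reuse available history pages; download everything else."""
    first_data, second_data = {}, {}
    for url in url_cluster:
        page = history_store.get(url)
        if page is not None and is_available(page):
            first_data[url] = parse(page)   # S203: parsed without re-download
        else:
            fresh = download(url)           # S204: page missing or unavailable
            history_store[url] = fresh      # saved as a history page for later tasks
            second_data[url] = parse(fresh)
    # The first and second target page data together form the task's result.
    return {**first_data, **second_data}, list(second_data)


store = {"u1": "ok:1", "u2": "stale:2"}
data, refetched = crawl_with_history(
    ["u1", "u2", "u3"],
    store,
    is_available=lambda page: page.startswith("ok:"),
    parse=lambda page: page.split(":", 1)[1],
    download=lambda url: "ok:new-" + url,
)
print(data, refetched)
```

In this toy run only u2 and u3 are downloaded; u1 is served entirely from the historical crawl data.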

Because web pages change frequently, a history page that was crawled too long ago may differ completely from the current page. In addition, to reduce the volume of stored historical crawl data and save database space, a history page validity period is added to the configuration information of the crawl task, and only first history pages in the historical crawl data that fall within the history page validity period may be used to extract first target page data. Correspondingly, in a second possible embodiment, step S202 includes:

c1. Judge whether the first history page is within the history page validity period;

c2. If the first history page is within the history page validity period, pre-parse the first history page and judge whether the page tag that holds the target page data in the crawl task exists in the first history page;

c3. If the page tag exists in the first history page, judge whether the attributes of the page tag in the first history page are the same as the attributes of the page tag that holds the target page data in the crawl task;

c4. If the attributes of the page tag in the first history page are the same as the attributes of the page tag that holds the target page data in the crawl task, determine that the first history page is available;

c5. If the first history page is not within the history page validity period, or the page tag does not exist in the first history page, or the attributes of the page tag in the first history page differ from the attributes of the page tag that holds the target page data in the crawl task, determine that the first history page is unavailable.

Compared with the first possible embodiment of step S202, this embodiment first judges whether the first history page is within the history page validity period. If it is not, the first history page is determined to be unavailable; only if it is within the validity period are steps c2 to c4 performed. For example, if the history page validity period is 3 days and the first history page was not crawled within the last 3 days, the first history page is unavailable.
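The validity-period gate of steps c1 and c5 may be sketched as follows. The sketch assumes that a crawl timestamp is stored with each history page and that the tag-and-attribute check of steps c2 to c4 is passed in as a zero-argument callable named `tag_check`; both are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def within_validity_period(crawled_at, validity_period, now):
    """c1: was the history page crawled within the validity period?"""
    return now - crawled_at <= validity_period


def available_with_expiry(crawled_at, validity_period, now, tag_check):
    # c5: an expired page is unavailable; the tag check is not attempted.
    if not within_validity_period(crawled_at, validity_period, now):
        return False
    # c2 to c4: fall through to the tag-and-attribute comparison.
    return tag_check()


now = datetime(2015, 5, 6, 12, 0)
three_days = timedelta(days=3)
print(available_with_expiry(datetime(2015, 5, 4), three_days, now, lambda: True))
print(available_with_expiry(datetime(2015, 5, 1), three_days, now, lambda: True))
```

Checking the age first means that an expired page is rejected without being parsed, which matches the ordering of steps c1 and c2.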

Fig. 3 is a schematic flowchart of a web crawler data crawling method according to an exemplary embodiment of the present application. As shown in Fig. 3, the method includes:

Step S301: periodically detect the network bandwidth utilization of the web crawlers, and calculate and store the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;

Step S302: when a crawl task is received, query the historical crawl data, according to the configuration information of the crawl task, to determine whether a first history page exists, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the historical crawl data includes the history pages downloaded while executing earlier crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;

Step S303: if the first history page exists, detect whether the first history page is available;

Step S304: if the first history page is available, parse the available first history page to obtain first target page data;

Step S305: if the first history page is unavailable, form a second crawl task from the URL cluster corresponding to the unavailable first history pages and the URL cluster for which no first history page exists in the historical crawl data;

Step S306: calculate the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

Step S307: determine, according to the availability of each web crawler, the web crawlers that execute the second crawl task;

Step S308: the web crawlers executing the second crawl task crawl the second target page data, where the first target page data and the second target page data form the target page data of the crawl task.

Step S301 includes:

D1. Periodically collect a preset number of network bandwidth utilization samples;

D2. Calculate and store the network bandwidth utilization mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and variance $D(w) = E\{[w - E(w)]^2\}$, where n is the preset number of samples and $w_i$ is the i-th collected sample.
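Step D2 may be sketched as follows. The sketch uses the population form of the variance, dividing by n, on the assumption that D(w) = E{[w - E(w)]^2} is evaluated over the n collected samples; the function name and the sample values are hypothetical.

```python
def bandwidth_stats(samples):
    """Return (E(w), D(w)) for n sampled bandwidth utilization values."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    mean = sum(samples) / n
    # Population variance: mean squared deviation from E(w).
    variance = sum((w - mean) ** 2 for w in samples) / n
    return mean, variance


# Three hypothetical utilization samples for one web crawler.
mean, variance = bandwidth_stats([0.25, 0.5, 0.75])
print(mean, variance)
```

The stored pair (E(w), D(w)) for each web crawler is what step S306 later uses to calculate that crawler's availability.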

The network bandwidth usage of each web crawler is detected periodically. The detection may take place at a fixed periodic interval, for example once every 20 min, or at several fixed time points chosen according to the operating conditions of the crawler system, for example at 10 min, 25 min, 30 min, 40 min and 60 min past each hour of every day. At each detection, a preset number of network bandwidth utilization samples is collected in order to calculate the network bandwidth utilization mean E(w) and variance D(w); any conventional method of collecting network bandwidth utilization may be used. It should be noted that the preset number should be set so that the time needed to collect that number of samples is less than or equal to the detection interval. If

In a first possible embodiment, step S303 includes:

The configuration information may also include a history page validity period; correspondingly, in a second possible embodiment, step S303 includes:

After the available first history pages have been determined in step S303, they are parsed directly to obtain the first target page data, which avoids repeated crawling of pages and saves system resources.

For the unavailable first history pages, the corresponding URL cluster in the crawl task, together with the URL cluster in the crawl task for which no first history page can be found in the historical crawl data, is used to form the second crawl task. That is, the second crawl task takes these URLs as its target URL cluster, and subsequently only the second target page data corresponding to these URLs is crawled.
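The formation of the second crawl task may be sketched as follows. Representing a crawl task as a dictionary with a `url_cluster` entry is an assumption made for illustration; the description does not specify a data layout for the configuration information.

```python
def build_second_task(task, unavailable_urls, missing_urls):
    """Form the second crawl task from the URLs that still need crawling."""
    second = dict(task)  # shallow copy: other configuration stays unchanged
    second["url_cluster"] = list(unavailable_urls) + list(missing_urls)
    return second


task = {"url_cluster": ["u1", "u2", "u3"], "target_tag": "a"}
second_task = build_second_task(task, unavailable_urls=["u2"], missing_urls=["u3"])
print(second_task["url_cluster"])
```

The original task is left untouched, so the first target page data can still be extracted from the available history pages while the second crawl task is handed to the selected web crawlers.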

Step S306, calculating the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w), includes:

Correspondingly, step S307, determining the web crawlers that execute the task according to the availability, includes:

After the web crawlers that execute the second crawl task have been determined in step S307, these web crawlers are used to crawl the second target page data. The first target page data and the second target page data form the target page data of the crawl task.

In the web crawler data crawling method provided by the embodiments of the present application, the crawler system saves the full content of the crawled history pages in the historical crawl data. When a crawl task arrives, the system judges whether the historical crawl data contains history pages corresponding to the URL cluster of the crawl task. If such history pages exist and are available, the first target page data is extracted directly from the available history pages without crawling those pages again, which saves system resources. The URL cluster corresponding to unavailable history pages, and the URL cluster for which no history page exists in the historical crawl data, are taken as the target URL cluster; the web crawlers crawl the corresponding pages to obtain the second target page data, which completes the crawling of all target page data. Meanwhile, unlike the common practice of using all web crawlers or randomly selected web crawlers to complete a crawl task, the method first periodically detects the network bandwidth utilization of all web crawlers, calculates and stores the network bandwidth utilization mean E(w) and variance D(w) of each web crawler, and then calculates the availability of each web crawler from E(w) and D(w), that is, identifies the web crawlers whose network bandwidth is not saturated. Among the available web crawlers, those that execute the task are selected in descending order of availability probability, so that better web crawlers are chosen to execute the crawl task, web crawler resources are allocated reasonably, and crawling efficiency is improved.

From the description of the method embodiments above, those skilled in the art can clearly understand that the present application may be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes a number of instructions for causing a smart device to perform all or part of the steps of the methods described in the embodiments of the present application. The storage medium includes any medium that can store data and program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Fig. 4 is a block diagram of a web crawler crawl task allocation device according to an exemplary embodiment of the present application. As shown in Fig. 4, the device includes:

a network bandwidth utilization processing unit U401, configured to periodically detect the network bandwidth utilization of the web crawlers, and to calculate and store the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;

a crawler availability calculation unit U402, configured to calculate, when a crawl task is received, the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

a first crawler determination unit U403, configured to determine the web crawlers that execute the task according to the availability.

The network bandwidth utilization processing unit includes:

The crawler availability calculation unit includes:

The first crawler determination unit includes:

Fig. 5 is a block diagram of a web crawler data crawling device according to an exemplary embodiment of the present application. As shown in Fig. 5, the device includes:

a first history page query unit U501, configured to query, when a crawl task is received, the historical crawl data according to the configuration information of the crawl task to determine whether a first history page exists, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the historical crawl data includes the history pages downloaded while executing earlier crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;

a history page availability detection unit U502, configured to detect, if the first history page exists, whether the first history page is available;

a first parsing unit U503, configured to parse, if the first history page is available, the available first history page to obtain first target page data;

a first crawling unit U504, configured to download and parse, if the first history page is unavailable, the corresponding pages according to the URL cluster corresponding to the unavailable first history pages and the URL cluster for which no first history page exists in the historical crawl data, to obtain second target page data.

In a first possible embodiment, the history page availability detection unit includes:

The configuration information may also include a history page validity period; in a second possible embodiment, the history page availability detection unit includes:

Fig. 6 is a block diagram of another web crawler data crawling device according to an exemplary embodiment of the present application. As shown in Fig. 6, the device includes:

a network bandwidth utilization processing unit U601, configured to periodically detect the network bandwidth utilization of the web crawlers, and to calculate and store the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;

a first history page query unit U602, configured to query, when a crawl task is received, the historical crawl data according to the configuration information of the crawl task to determine whether a first history page exists, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the historical crawl data includes the history pages downloaded while executing earlier crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;

a history page availability detection unit U603, configured to detect, if the first history page exists, whether the first history page is available;

a first parsing unit U604, configured to parse, if the first history page is available, the available first history page to obtain first target page data;

a second crawl task generation unit U605, configured to form, if the first history page is unavailable, a second crawl task from the URL cluster corresponding to the unavailable first history pages and the URL cluster for which no first history page exists in the historical crawl data;

a crawler availability calculation unit U606, configured to calculate the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

a second crawler determination unit U607, configured to determine, according to the availability of each web crawler, the web crawlers that execute the second crawl task;

a second crawling unit U608, configured to have the web crawlers executing the second crawl task crawl the second target page data, where the first target page data and the second target page data form the target page data of the crawl task.

The crawler availability calculation unit includes:

The second crawler determination unit includes:

For convenience of description, the device above is described in terms of units divided by function. Of course, when the present application is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.

The embodiments in this specification are described progressively; identical or similar parts of the embodiments may be understood by reference to one another, and each embodiment focuses on its differences from the others. In particular, the device or system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, see the description of the method embodiments. The device and system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

The above are only specific embodiments of the present application, provided so that those skilled in the art can understand or implement the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A web crawler crawl task allocation method, characterized by comprising:

2. The web crawler crawl task allocation method according to claim 1, characterized in that periodically detecting the network bandwidth utilization of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, comprises:

periodically collecting a preset number of network bandwidth utilization samples;

3. The web crawler crawl task allocation method according to claim 1, characterized in that calculating the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w) comprises:

4. The web crawler crawl task allocation method according to claim 3, characterized in that determining the web crawlers that execute the task according to the availability comprises:

5. A web crawler data crawling method, characterized by comprising:

if the first history page is unavailable, downloading and parsing the corresponding pages according to the URL cluster corresponding to the unavailable first history pages and the URL cluster for which no first history page exists in the historical crawl data, to obtain second target page data.

6. The web crawler data crawling method according to claim 5, characterized in that detecting whether the first history page is available comprises:

if the attributes of the page tag in the first history page are the same as the attributes of the page tag that holds the target page data in the crawl task, determining that the first history page is available;

7. The web crawler data crawling method according to claim 5, characterized in that the configuration information further includes a history page validity period, and detecting whether the first history page is available comprises:

8. A web crawler data crawling method, characterized by comprising:

if the first history page is unavailable, forming a second crawl task from the URL cluster corresponding to the unavailable first history pages and the URL cluster for which no first history page exists in the historical crawl data;

9. The web crawler data crawling method according to claim 8, characterized in that periodically detecting the network bandwidth utilization of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, comprises:

periodically collecting a preset number of network bandwidth utilization samples;

10. The web crawler data crawling method according to claim 8, characterized in that detecting whether the first history page is available comprises:

11. The web crawler data crawling method according to claim 8, characterized in that the configuration information further includes a history page validity period, and detecting whether the first history page is available comprises:

12. The web crawler data crawling method according to claim 8, characterized in that calculating the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w) comprises:

13. The web crawler data crawling method according to claim 12, characterized in that determining the web crawlers that execute the task according to the availability comprises:

14. A web crawler crawl task allocation device, characterized by comprising:

a crawler availability calculation unit, configured to calculate, when a crawl task is received, the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

15. The web crawler crawl task allocation device according to claim 14, characterized in that the network bandwidth utilization processing unit comprises:

16. The web crawler crawl task allocation device according to claim 14, characterized in that the crawler availability calculation unit comprises:

17. The web crawler crawl task allocation device according to claim 16, characterized in that the first crawler determination unit comprises:

18. A web crawler data crawling device, characterized by comprising:

19. The web crawler data crawling device according to claim 18, characterized in that the history page availability detection unit comprises:

20. The web crawler data crawling device according to claim 18, characterized in that the configuration information further includes a history page validity period, and the history page availability detection unit comprises:

21. A web crawler data crawling device, characterized by comprising:

22. The web crawler data crawling device according to claim 21, characterized in that the network bandwidth utilization processing unit comprises:

23. The web crawler data crawling device according to claim 21, characterized in that the history page availability detection unit comprises:

24. The web crawler data crawling device according to claim 21, characterized in that the configuration information further includes a history page validity period, and the history page availability detection unit comprises:

25. The web crawler data crawling device according to claim 21, characterized in that the crawler availability calculation unit comprises:

26. The web crawler data crawling device according to claim 25, characterized in that the second crawler determination unit comprises: