CN106202108B

CN106202108B - Web crawler crawl task allocation method and device, and data crawling method and device

Info

Publication number: CN106202108B
Application number: CN201510227531.9A
Authority: CN
Inventors: 刘庆; 张美德; 殷贤君; 邹启蒙
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2015-05-06
Filing date: 2015-05-06
Publication date: 2019-09-06
Anticipated expiration: 2035-05-06
Also published as: CN106202108A

Abstract

The embodiments of the present application disclose a web crawler crawl task allocation method and device, and a data crawling method and device. In the web crawler data crawling method, the historical pages that have already been crawled are saved in historical crawl data. When a crawl task arrives, it is determined whether the historical crawl data contains historical pages corresponding to the URL cluster of the crawl task. If such historical pages exist and are available, the first target page data is extracted directly from the available historical pages, so that these pages need not be crawled again, which saves system resources. In addition, the network bandwidth utilization of all web crawlers is detected periodically, and the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler are calculated and stored. The availability of each web crawler is then calculated from E(w) and D(w), and the web crawlers that execute the task are selected from the available web crawlers in descending order of availability probability, so that web crawler resources are allocated reasonably.

Description

Web crawler crawl task allocation method and device, and data crawling method and device

Technical field

The present invention relates to the field of network technology, and in particular to a web crawler crawl task allocation method and device, and a data crawling method and device.

Background technique

A web crawler (Computer Robot, also known as a web spider or web robot) is a program that automatically crawls Internet web page data according to certain rules, and is an important component of a search engine. Typically, a web crawler downloads web pages from the Internet according to a configured crawl task, then parses and filters the pages to obtain the target web page data. All target web page data crawled by the web crawlers is stored in the crawler system and indexed for later query and retrieval.

The development of network and information technology has led to rapid growth in the number of websites, web pages and web data. A crawler system needs a large number of web crawlers to crawl large volumes of web data. These web crawlers may be deployed within the same local area network or dispersed across different geographical locations. Depending on how widely the web crawlers are dispersed, crawler systems fall into two main architectures: 1. A distributed architecture based on a local area network, in which all web crawlers run in the same local area network and access the external Internet and download web pages through that local area network. All network load is concentrated on the egress of the local area network, and because the total bandwidth limit of the network egress is fixed, the number of web crawlers is limited by the egress bandwidth of the local area network. 2. A distributed architecture based on a wide area network, in which parallel web crawlers run at different geographical or network locations. For example, the web crawlers may be located in different machine rooms; or located in China, Japan and the United States, each responsible for downloading the web pages of its own region; or located in CHINANET, CERNET and CEINET, each responsible for downloading the web pages in its own network. Under this architecture, each web crawler accesses the external Internet through its own network and is affected only by its own egress bandwidth, which spreads network traffic and reduces network load.

The distributed architecture based on a wide area network is currently the more widely used crawler system architecture. Under this architecture, however, the number of web crawlers is large, content crawling demands are numerous, and the web crawlers are deployed in a dispersed manner, so the egress bandwidth available to each web crawler may vary widely, and unreasonable use of network resources reduces crawling efficiency. In addition, as the number of crawl tasks grows, pages tend to be crawled repeatedly. For example, task A needs to crawl the data at the top of page C. After the crawler system completes that task, the next task to arrive, task B, needs to crawl the data at the bottom of page C. Because the web crawler returned and stored only the top data of page C for task A, page C must be downloaded again to obtain its bottom data. This wastes crawler resources and increases the load on the web crawler cluster.

Summary of the invention

To overcome the problems in the related art of unreasonable use of network resources by web crawler crawl tasks and of repeated page crawling, the present application provides a web crawler crawl task allocation method and device, and a data crawling method and device.

According to a first aspect of the embodiments of the present application, a web crawler crawl task allocation method is provided, comprising:

periodically detecting the network bandwidth utilization of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;

when a crawl task is received, calculating the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

determining, according to the availability, the web crawlers that execute the task.

Optionally, periodically detecting the network bandwidth utilization of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, comprises:

periodically collecting the network bandwidth utilization a preset number of times;

calculating and storing the network bandwidth utilization mean $E(w)=\frac{1}{n}\sum_{i=1}^{n} w_i$ and variance $D(w)=E\{[w-E(w)]^2\}$, where n is the preset number of times and $w_i$ is the i-th collected value.

Optionally, calculating the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w) comprises:

extracting the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within a preset time period before the current time;

calculating the availability probability and unavailability probability of each web crawler with a naive Bayes classification algorithm, according to the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within the preset time period.

Optionally, determining, according to the availability, the web crawlers that execute the task comprises:

comparing the availability probability and unavailability probability of each web crawler;

if the availability probability of a web crawler is greater than its unavailability probability, determining that the web crawler is available, and otherwise determining that the web crawler is unavailable;

selecting, from the available web crawlers, the web crawlers that execute the task in descending order of availability probability.

According to a second aspect of the embodiments of the present application, a web crawler data crawling method is provided, comprising:

when a crawl task is received, querying the historical crawl data, according to the configuration information of the crawl task, for whether a first historical page exists, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the historical crawl data includes the historical pages downloaded when historical crawl tasks were executed, and the first historical page is a historical page corresponding to the configuration information of the new crawl task;

if the first historical page exists, detecting whether the first historical page is available;

if the first historical page is available, parsing the available first historical page to obtain first target page data;

if the first historical page is unavailable, downloading and parsing the corresponding pages according to the URL cluster corresponding to the unavailable first historical page and the URL cluster for which no first historical page exists in the historical crawl data, to obtain second target page data.

Optionally, detecting whether the first historical page is available comprises:

pre-parsing the first historical page and determining whether the page tag in which the target page data of the crawl task resides exists in the first historical page;

if the page tag exists in the first historical page, determining whether the attributes of the page tag in the first historical page are identical to the attributes of the page tag in which the target page data of the crawl task resides;

if the attributes of the page tag in the first historical page are identical to the attributes of the page tag in which the target page data of the crawl task resides, determining that the first historical page is available;

if the page tag does not exist in the first historical page, or if the attributes of the page tag in the first historical page are not identical to the attributes of the page tag in which the target page data of the crawl task resides, determining that the first historical page is unavailable.

Optionally, the configuration information further includes a historical page validity period, and detecting whether the first historical page is available comprises:

determining whether the first historical page is within the historical page validity period;

if the first historical page is within the historical page validity period, pre-parsing the first historical page and determining whether the page tag in which the target page data of the crawl task resides exists in the first historical page;

if the first historical page is not within the historical page validity period, or the page tag does not exist in the first historical page, or the attributes of the page tag in the first historical page are not identical to the attributes of the page tag in which the target page data of the crawl task resides, determining that the first historical page is unavailable.

According to a third aspect of the embodiments of the present application, another web crawler data crawling method is provided, comprising:

if the first historical page is unavailable, forming a second crawl task from the URL cluster corresponding to the unavailable first historical page and the URL cluster for which no first historical page exists in the historical crawl data;

calculating the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

determining, according to the availability of each web crawler, the web crawlers that execute the second crawl task;

crawling the second target page data with the web crawlers that execute the second crawl task,

where the first target page data and the second target page data together form the target page data of the crawl task.

periodically collecting the network bandwidth utilization a preset number of times;

Optionally, detecting whether the first historical page is available comprises:

determining whether the first historical page is within the historical page validity period;

if the availability probability of a web crawler is greater than its unavailability probability, determining that the web crawler is available, and otherwise determining that the web crawler is unavailable;

Corresponding to the first aspect of the embodiments of the present application, according to a fourth aspect of the embodiments of the present application, a web crawler crawl task allocation device is provided, comprising:

a network bandwidth utilization processing unit, configured to periodically detect the network bandwidth utilization of the web crawlers, and to calculate and store the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;

a crawler availability calculation unit, configured to calculate, when a crawl task is received, the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

a first crawler determination unit, configured to determine, according to the availability, the web crawlers that execute the task.

Optionally, the network bandwidth utilization processing unit comprises:

a network bandwidth utilization collection module, configured to periodically collect the network bandwidth utilization a preset number of times;

a mean and variance calculation module, configured to calculate and store the network bandwidth utilization mean $E(w)=\frac{1}{n}\sum_{i=1}^{n} w_i$ and variance $D(w)=E\{[w-E(w)]^2\}$, where n is the preset number of times and $w_i$ is the i-th collected value.

Optionally, the crawler availability calculation unit comprises:

a mean and variance extraction module, configured to extract the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within a preset time period before the current time;

a probability calculation module, configured to calculate the availability probability and unavailability probability of each web crawler with a naive Bayes classification algorithm, according to the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within the preset time period.

Optionally, the first crawler determination unit comprises:

a comparison module, configured to compare the availability probability and unavailability probability of each web crawler;

an availability determination module, configured to determine that a web crawler is available if its availability probability is greater than its unavailability probability, and otherwise to determine that the web crawler is unavailable;

a selection module, configured to select, from the available web crawlers, the web crawlers that execute the task in descending order of availability probability.

Corresponding to the second aspect of the embodiments of the present application, according to a fifth aspect of the embodiments of the present application, a web crawler data crawling device is provided, comprising:

a first historical page query unit, configured to query the historical crawl data, when a crawl task is received and according to the configuration information of the crawl task, for whether a first historical page exists, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the historical crawl data includes the historical pages downloaded when historical crawl tasks were executed, and the first historical page is a historical page corresponding to the configuration information of the new crawl task;

a historical page availability detection unit, configured to detect, if the first historical page exists, whether the first historical page is available;

a first parsing unit, configured to parse the available first historical page to obtain first target page data if the first historical page is available;

a first crawling unit, configured to download and parse the corresponding pages, if the first historical page is unavailable, according to the URL cluster corresponding to the unavailable first historical page and the URL cluster for which no first historical page exists in the historical crawl data, to obtain second target page data.

Optionally, the historical page availability detection unit comprises:

a first judgment module, configured to pre-parse the first historical page and determine whether the page tag in which the target page data of the crawl task resides exists in the first historical page;

a second judgment module, configured to determine, if the page tag exists in the first historical page, whether the attributes of the page tag in the first historical page are identical to the attributes of the page tag in which the target page data of the crawl task resides;

an available determination module, configured to determine that the first historical page is available if the attributes of the page tag in the first historical page are identical to the attributes of the page tag in which the target page data of the crawl task resides;

an unavailable determination module, configured to determine that the first historical page is unavailable if the page tag does not exist in the first historical page, or if the attributes of the page tag in the first historical page are not identical to the attributes of the page tag in which the target page data of the crawl task resides.

Optionally, the configuration information further includes a historical page validity period, and the historical page availability detection unit comprises:

a third judgment module, configured to determine whether the first historical page is within the historical page validity period;

a fourth judgment module, configured to pre-parse the first historical page, if the first historical page is within the historical page validity period, and determine whether the page tag in which the target page data of the crawl task resides exists in the first historical page;

a fifth judgment module, configured to determine, if the page tag exists in the first historical page, whether the attributes of the page tag in the first historical page are identical to the attributes of the page tag in which the target page data of the crawl task resides;

an unavailable determination module, configured to determine that the first historical page is unavailable if the first historical page is not within the historical page validity period, or the page tag does not exist in the first historical page, or the attributes of the page tag in the first historical page are not identical to the attributes of the page tag in which the target page data of the crawl task resides.

Corresponding to the third aspect of the embodiments of the present application, according to a sixth aspect of the embodiments of the present application, another web crawler data crawling device is provided, comprising:

a second crawl task generation unit, configured to form a second crawl task, if the first historical page is unavailable, from the URL cluster corresponding to the unavailable first historical page and the URL cluster for which no first historical page exists in the historical crawl data;

a crawler availability calculation unit, configured to calculate the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

a second crawler determination unit, configured to determine, according to the availability of each web crawler, the web crawlers that execute the second crawl task;

a second crawling unit, configured to crawl the second target page data with the web crawlers that execute the second crawl task.

Optionally, the network bandwidth utilization processing unit comprises:

a mean and variance calculation module, configured to calculate and store the network bandwidth utilization mean $E(w)=\frac{1}{n}\sum_{i=1}^{n} w_i$ and variance $D(w)=E\{[w-E(w)]^2\}$, where n is the preset number of times and $w_i$ is the i-th collected value.

Optionally, the historical page availability detection unit comprises:

Optionally, the crawler availability calculation unit comprises:

Optionally, the second crawler determination unit comprises:

The technical solutions provided by the embodiments of the present application can have the following beneficial effects. In the web crawler data crawling method provided by the embodiments of the present application, the crawler system saves the full pages of the historical pages it has crawled in the historical crawl data. When a crawl task arrives, it is determined whether the historical crawl data contains historical pages corresponding to the URL cluster of the crawl task. If corresponding historical pages exist and are available, the first target page data is extracted directly from the available historical pages without crawling those pages again, which saves system resources. The URL cluster corresponding to unavailable historical pages and the URL cluster for which no corresponding historical page exists in the historical crawl data form a second crawl task; the web crawlers crawl the corresponding pages and obtain the second target page data, which completes the crawling of all target page data. Furthermore, unlike the common practice of completing a crawl task with all web crawlers or with randomly selected web crawlers, the network bandwidth utilization of all web crawlers is first detected periodically, and the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler are calculated and stored. The availability of each web crawler is then calculated from E(w) and D(w), and the web crawlers that execute the task are selected from the available web crawlers in descending order of availability probability. Better-suited web crawlers are thus selected to execute the crawl task, so that web crawler resources are allocated reasonably and crawling efficiency is improved.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

Fig. 1 is a flowchart of a web crawler crawl task allocation method according to an exemplary embodiment of the present application.

Fig. 2 is a flowchart of a web crawler data crawling method according to an exemplary embodiment of the present application.

Fig. 3 is a flowchart of another web crawler data crawling method according to an exemplary embodiment of the present application.

Fig. 4 is a block diagram of a web crawler crawl task allocation device according to an exemplary embodiment of the present application.

Fig. 5 is a block diagram of a web crawler data crawling device according to an exemplary embodiment of the present application.

Fig. 6 is a block diagram of another web crawler data crawling device according to an exemplary embodiment of the present application.

Detailed description of the embodiments

Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of devices and methods consistent with some aspects of the present application, as detailed in the appended claims.

Numerous specific details are set forth in the following detailed description to provide a thorough understanding of the present application, but those skilled in the art should understand that the present application can be practiced without these specific details. In other embodiments, well-known methods, procedures, components and circuits are not described in detail, so as not to unnecessarily obscure the embodiments.

Fig. 1 is a flowchart of a web crawler crawl task allocation method according to an exemplary embodiment of the present application. As shown in Fig. 1, the method comprises:

S101: periodically detecting the network bandwidth utilization of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization.

Here, the network bandwidth usage of each web crawler is detected periodically. The detection may take place at fixed intervals, for example once every 20 min, or at several fixed time points chosen according to the operating conditions of the crawler system, for example at the 10th, 25th, 30th, 40th and 60th minute of every hour of each day. At each detection, the network bandwidth utilization is collected a preset number of times in order to calculate the network bandwidth utilization mean E(w) and variance D(w). The network bandwidth utilization can be collected with any conventional collection method. Note that the preset number of times should be set so that the time needed to collect that many samples is less than or equal to the detection interval. If the preset number of times is n and the collected values are $w_1, \dots, w_n$, then $E(w)=\frac{1}{n}\sum_{i=1}^{n} w_i$ and $D(w)=E\{[w-E(w)]^2\}=\frac{1}{n}\sum_{i=1}^{n}[w_i-E(w)]^2$.

The calculated network bandwidth utilization mean E(w) and variance D(w) of each web crawler are stored in the crawler system, and each network bandwidth utilization mean E(w) and variance D(w) is associated with its detection time.
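Step S101 can be illustrated with a minimal Python sketch. The function names `bandwidth_stats` and `record`, the in-memory `stats_store` dictionary, and the use of the population variance are assumptions made for illustration; the source does not specify a storage layout or an implementation.

```python
from statistics import fmean, pvariance

# Hypothetical in-memory store: crawler id -> list of (detection time, E(w), D(w)).
stats_store = {}

def bandwidth_stats(samples):
    """Return (E(w), D(w)) for the n utilization samples of one detection."""
    if not samples:
        raise ValueError("at least one utilization sample is required")
    mean = fmean(samples)                   # E(w) = (1/n) * sum of w_i
    variance = pvariance(samples, mu=mean)  # D(w) = (1/n) * sum of (w_i - E(w))^2
    return mean, variance

def record(crawler_id, detected_at, samples):
    """Store the statistics of one detection together with its detection time."""
    mean, variance = bandwidth_stats(samples)
    stats_store.setdefault(crawler_id, []).append((detected_at, mean, variance))
```

Keeping the detection time next to each pair makes it straightforward to later extract only the records that fall within a preset time period before a crawl task arrives.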

S102: when a crawl task is received, calculating the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w).

Here, calculating the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w) comprises:

a1. Extracting the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within a preset time period before the current time;

a2. Calculating the availability probability and unavailability probability of each web crawler with a naive Bayes classification algorithm, according to the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within the preset time period.

After the crawler system receives a crawl task, it first extracts, from the stored network bandwidth utilization mean and variance data, the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within a preset time period counted back from the current time (the time at which the crawl task is received), for example the mean and variance data of each web crawler within the 2 h before the current time. It then calculates the availability probability and unavailability probability of each web crawler with a naive Bayes classification algorithm, as follows:

Let the availability of a web crawler be denoted by a, which takes the value 0 or 1, where 0 indicates that the network bandwidth utilization of the web crawler allows it to accept a new task and 1 indicates that it does not. Let the network attributes of the web crawler be x = {b, c}, where b denotes the network bandwidth utilization mean E(w) and c denotes the network bandwidth utilization variance D(w), with 0 ≤ b ≤ 1 and 0 ≤ c ≤ 1. The availability probability of the web crawler is P(a=0 | x) and its unavailability probability is P(a=1 | x). The segmentation thresholds of b and c for the naive Bayes classification algorithm are determined in advance through repeated experiments based on the overall network usage of the crawler system, and are stored in the crawler system. For example, if the segmentation thresholds of b are 0.3 and 0.8 and those of c are 0.2 and 0.6, the naive Bayes classification algorithm calculates the following conditional probabilities:

P(0 < b ≤ 0.3 | a=0), P(0.3 < b ≤ 0.8 | a=0), P(0.8 < b ≤ 1 | a=0),

P(0 < c ≤ 0.2 | a=0), P(0.2 < c ≤ 0.6 | a=0), P(0.6 < c ≤ 1 | a=0),

P(0 < b ≤ 0.3 | a=1), P(0.3 < b ≤ 0.8 | a=1), P(0.8 < b ≤ 1 | a=1),

P(0 < c ≤ 0.2 | a=1), P(0.2 < c ≤ 0.6 | a=1), P(0.6 < c ≤ 1 | a=1).

Assuming that the current network attributes of a given web crawler are x = {b=0.35, c=0.15}, then:

P(a=0 | x) = P(a=0) P(0.3 < b ≤ 0.8 | a=0) P(0 < c ≤ 0.2 | a=0),

P(a=1 | x) = P(a=1) P(0.3 < b ≤ 0.8 | a=1) P(0 < c ≤ 0.2 | a=1).

P(a=0 | x) and P(a=1 | x) can thus be calculated. If P(a=0 | x) ≥ P(a=1 | x), the network bandwidth of the current web crawler has not reached saturation and the web crawler can accept a new crawl task, that is, it is available; otherwise it is unavailable. Besides the calculation based on the network bandwidth utilization mean E(w) and variance D(w), other methods can be used to calculate the availability probability and unavailability probability of a web crawler. For example, a detection device may ping (an end-to-end connectivity check commonly used for availability checks) the device on which each web crawler runs at preset intervals, record the number of successful and failed pings of each web crawler within a certain period, and calculate the corresponding ping-success probability and ping-failure probability. The ping-success probability is taken as the availability probability and the ping-failure probability as the unavailability probability. If the availability probability is greater than the unavailability probability, or the availability probability is greater than a preset threshold, the web crawler is considered available.
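The segmented naive Bayes calculation described above can be sketched in Python as follows. Several points are assumptions not stated in the source: that labelled historical records `(b, c, a)` are available for estimating the probabilities, that Laplace smoothing is applied so that empty segments do not yield zero probabilities, and all function names. The returned scores are the unnormalized products P(a) P(b segment | a) P(c segment | a), which is sufficient for comparing the two classes.

```python
from bisect import bisect_left
from collections import Counter

B_CUTS = (0.3, 0.8)  # segmentation thresholds for b = E(w), taken from the example
C_CUTS = (0.2, 0.6)  # segmentation thresholds for c = D(w), taken from the example

def segment(value, cuts):
    """Index of the right-closed segment, e.g. (0, 0.3], (0.3, 0.8], (0.8, 1]."""
    return bisect_left(cuts, value)

def train(records):
    """records: non-empty iterable of (b, c, a) with a = 0 (available) or 1."""
    class_counts, b_counts, c_counts = Counter(), Counter(), Counter()
    for b, c, a in records:
        class_counts[a] += 1
        b_counts[(segment(b, B_CUTS), a)] += 1
        c_counts[(segment(c, C_CUTS), a)] += 1
    return class_counts, b_counts, c_counts

def score(model, b, c, a):
    """Unnormalized P(a | x) = P(a) * P(b segment | a) * P(c segment | a)."""
    class_counts, b_counts, c_counts = model
    n_a = class_counts[a]
    prior = n_a / sum(class_counts.values())
    # Laplace smoothing over the three segments (an added assumption).
    p_b = (b_counts[(segment(b, B_CUTS), a)] + 1) / (n_a + len(B_CUTS) + 1)
    p_c = (c_counts[(segment(c, C_CUTS), a)] + 1) / (n_a + len(C_CUTS) + 1)
    return prior * p_b * p_c

def is_available(model, b, c):
    """Return (available?, availability score, unavailability score)."""
    p0, p1 = score(model, b, c, 0), score(model, b, c, 1)
    return p0 >= p1, p0, p1
```

For the example attributes b = 0.35 and c = 0.15, `segment` selects the segments (0.3, 0.8] and (0, 0.2], matching the two products written out above.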

S103: determining, according to the availability, the web crawlers that execute the task.

Step S103 proceeds as follows:

After the availability probability and unavailability probability have been calculated for all web crawlers and the available web crawlers have been determined, the available web crawlers are sorted in descending order of availability probability P(a=0 | x), and web crawlers are selected in that order and assigned the crawl task. A larger availability probability indicates a lower network bandwidth saturation of the web crawler. The number of web crawlers needed can be calculated from the network bandwidth traffic that the crawl task requires, the network bandwidth saturation of the available web crawlers, and the availability probability ranking. If the available web crawlers are not sufficient to complete the crawl task, the crawler system can wait until other web crawlers become idle and then use the idle web crawlers to complete the remaining workload.
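The selection in step S103 can be sketched as a filter followed by a descending sort. The function name `select_crawlers`, the `needed` parameter (the number of web crawlers the task requires, assumed here to be computed elsewhere) and the returned shortfall are illustrative assumptions; the sketch uses the ≥ comparison given in the detailed description.

```python
def select_crawlers(scores, needed):
    """Pick up to `needed` available crawlers, most available first.

    scores: dict mapping crawler id -> (availability probability,
            unavailability probability).
    Returns (chosen crawler ids, number of crawlers still missing).
    """
    available = [
        (p_avail, crawler_id)
        for crawler_id, (p_avail, p_unavail) in scores.items()
        if p_avail >= p_unavail  # crawler is considered available
    ]
    # Descending order of availability probability; ties keep insertion order.
    available.sort(key=lambda item: item[0], reverse=True)
    chosen = [crawler_id for _, crawler_id in available[:needed]]
    shortfall = max(0, needed - len(chosen))
    return chosen, shortfall
```

A non-zero shortfall corresponds to the case in which the crawler system has to wait for other web crawlers to become idle before the remaining workload can be assigned.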

Fig. 2 is a flowchart of a web crawler data crawling method according to an exemplary embodiment of the present application. As shown in Fig. 2, the method comprises:

Step S201: when a crawl task is received, querying the historical crawl data, according to the configuration information of the crawl task, for whether a first historical page exists, where the configuration information includes the uniform resource locator (URL) cluster of the crawl task, the historical crawl data includes the historical pages downloaded when historical crawl tasks were executed, and the first historical page is a historical page corresponding to the configuration information of the new crawl task.

Here, the crawler system stores in the historical crawl data the full pages of the pages crawled when web crawlers execute crawl tasks, that is, the historical pages that have been crawled. The historical crawl data may include only historical pages, or may also include the historical target page data that was crawled. When the crawler system receives a crawl task, it queries whether a first historical page exists in the historical crawl data, where the first historical page is a historical page corresponding to the configuration information of the crawl task. For example, a user wants to crawl the search-type pages corresponding to the URL http://s.1688.com/selloffer/offer_search.htm?keywords=${key}&n=y, and uploads this URL to the crawler system together with three keywords (key): mp3, 手机 (mobile phone) and 电脑 (computer). The task is therefore to crawl the data in the following three pages:

http://s.1688.com/selloffer/offer_search.htm?keywords=mp3&n=y,

http://s.1688.com/selloffer/offer_search.htm?keywords=手机&n=y and

http://s.1688.com/selloffer/offer_search.htm?keywords=电脑&n=y.

The three URLs formed by replacing the ${key} part of the URL above with the three keywords constitute the URL (uniform resource locator) cluster of the crawl task, which is included in the configuration information of the crawl task. After receiving the crawl task, the crawler system searches the historical crawl data, according to the URL cluster in the configuration information, for historical pages corresponding to the three URLs in the URL cluster.
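The expansion of the URL template into a URL cluster and the lookup in the historical crawl data can be sketched as follows. Representing the historical crawl data as a dictionary keyed by URL, and the function names, are assumptions made for illustration; the keywords are substituted verbatim, without URL encoding, to mirror the URLs shown above.

```python
from string import Template

URL_TEMPLATE = "http://s.1688.com/selloffer/offer_search.htm?keywords=${key}&n=y"

def build_url_cluster(template, keywords):
    """Replace the ${key} placeholder with each keyword to form the URL cluster."""
    return [Template(template).substitute(key=keyword) for keyword in keywords]

def split_by_history(url_cluster, history_pages):
    """Separate URLs that have a stored historical page from those that do not.

    history_pages: hypothetical mapping of URL -> stored full-page HTML.
    """
    found = {url: history_pages[url] for url in url_cluster if url in history_pages}
    missing = [url for url in url_cluster if url not in history_pages]
    return found, missing

cluster = build_url_cluster(URL_TEMPLATE, ["mp3", "手机", "电脑"])
```

The `missing` list corresponds to the part of the URL cluster for which no first historical page exists; the pages in `found` still have to pass the availability check of step S202.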

Step S202: if the first history page exists, detect whether the first history page is available.

The history crawl data may contain first history pages corresponding to all URLs in the URL cluster of the crawl task, may contain first history pages corresponding to only some of those URLs, or may contain no first history page corresponding to the URL cluster at all. The HTML (HyperText Markup Language) code of a crawled page usually changes continually, and there is no way of knowing when or how it changed. For example, suppose the goal of the crawl task is to extract the title of page A. When page A was grabbed 3 days ago, its title data was under the a tag; by the time today's crawl task is generated, the title data of page A has been moved under the b tag. Although history page A exists in the history crawl data, that history page may contain no b tag at all, or may contain a b tag under which the title data clearly does not sit. Looking for the b tag in the 3-day-old page A in order to extract the title data would therefore produce an error. For this reason, when first history pages exist in the history crawl data, availability detection is performed on them.

In a first possible embodiment, detecting whether the first history page is available comprises:

b1: pre-parse the first history page, and judge whether the page tag in which the target page data of the crawl task is located exists in the first history page.

The first history page is pre-parsed to obtain the tags it contains. The page tag in which the target page data of the crawl task is located, for example the page tag holding the title data, is compared with the page tags in the first history page to judge whether that page tag exists in the first history page.

b2: if the page tag exists in the first history page, judge whether the attributes of the page tag in the first history page are identical to the attributes of the page tag in which the target page data of the crawl task is located.

That is, once the page tag is found in the first history page, its attributes in the first history page are further compared with the attributes of the page tag specified in the crawl task, to judge whether the data under the same page tag has changed.

b3: if the attributes of the page tag in the first history page are identical to the attributes of the page tag in which the target page data of the crawl task is located, determine that the first history page is available.

b4: if the page tag does not exist in the first history page, or if the attributes of the page tag in the first history page are not identical to the attributes of the page tag in which the target page data of the crawl task is located, determine that the first history page is unavailable.

If the attributes are identical, the data under that page tag has not changed between the first history page and the page to be grabbed, so the first history page can be used to obtain the target page data; that is, the first history page is available. If the page tag does not exist in the first history page, the first history page is unavailable. If the page tag exists in the first history page but its attributes differ from those of the page tag in which the target page data of the crawl task is located, the data under that page tag has changed, and the first history page is unavailable.
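Steps b1 to b4 can be sketched, under stated assumptions, with the Python standard-library HTML parser. The names `TagCollector` and `is_history_page_available` are hypothetical, and treating "identical attributes" as an exact match of the tag's attribute dictionary is one possible reading, not something the application fixes.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Pre-parse a page and record the attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = {}  # tag name (lower case) -> list of attribute dicts

    def handle_starttag(self, tag, attrs):
        self.tags.setdefault(tag, []).append(dict(attrs))


def is_history_page_available(html, target_tag, target_attrs):
    """Return True if the history page still carries the target tag
    with the attributes named in the crawl task (steps b1 to b4)."""
    collector = TagCollector()
    collector.feed(html)
    # b1 / b4: the page tag must exist in the history page.
    if target_tag not in collector.tags:
        return False
    # b2 / b3: at least one occurrence must have identical attributes.
    return any(found == target_attrs for found in collector.tags[target_tag])


old_page = '<html><body><a class="title">Page A</a></body></html>'
print(is_history_page_available(old_page, "a", {"class": "title"}))
print(is_history_page_available(old_page, "b", {"class": "title"}))
```

In the page A example, a task that now expects the title under the b tag would find no b tag in the 3-day-old copy, so that history page would be reported as unavailable and re-downloaded.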

Step S202 performs availability detection on each first history page. The result may be that all first history pages are available, that only some of them are available, or that none are available.

Step S203: if the first history page is available, parse the available first history page to obtain first target page data.

Step S204: if the first history page is unavailable, download and parse the corresponding pages according to the URL cluster corresponding to the unavailable first history pages and the URL cluster for which no first history page exists in the history crawl data, to obtain second target page data.

An available first history page is used directly to extract the first target page data. That is, if the history crawl data contains first history pages that correspond to the URL cluster of the crawl task and are available, the pages corresponding to that part of the URL cluster need not be grabbed again, which avoids grabbing the same pages repeatedly. For the URLs of the crawl task that have no corresponding first history page in the history crawl data, and for the URLs whose corresponding first history page exists but is unavailable, the corresponding pages are downloaded over the network and parsed, that is, grabbed, to obtain the second target page data. The first target page data and the second target page data together form the target page data of the crawl task.
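The division of a task's URL cluster between steps S203 and S204 can be illustrated with a minimal sketch. It assumes the history crawl data is a mapping from URL to stored page content and that availability is decided by a caller-supplied predicate; both are assumptions of this example, as is the name `partition_url_cluster`.

```python
def partition_url_cluster(url_cluster, history, is_available):
    """Split URLs into those served from history and those to download.

    history: dict mapping URL -> stored history page (hypothetical layout).
    is_available: callable deciding whether a stored page is still usable.
    """
    from_history, to_download = [], []
    for url in url_cluster:
        page = history.get(url)
        if page is not None and is_available(page):
            from_history.append(url)   # S203: parse the stored page
        else:
            to_download.append(url)    # S204: missing or unavailable
    return from_history, to_download


history = {"u1": "<a>fresh</a>", "u2": "<a>stale</a>"}
reuse, fetch = partition_url_cluster(
    ["u1", "u2", "u3"], history, lambda page: "stale" not in page
)
print(reuse, fetch)
```

Here `u2` has a stored page that fails the check and `u3` has no stored page, so both end up in the set to be downloaded.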

Because web pages usually change very frequently, a history page fetched too long ago may be completely different from the current page. In addition, to reduce the amount of history crawl data stored and save database space, a history page effective time is added to the configuration information of the crawl task. Only a first history page in the history crawl data that is within the history page effective time may be used to extract the first target page data. Accordingly, in a second possible embodiment, step S202 comprises:

c1: judge whether the first history page is within the history page effective time;

c2: if the first history page is within the history page effective time, pre-parse the first history page, and judge whether the page tag in which the target page data of the crawl task is located exists in the first history page;

c3: if the page tag exists in the first history page, judge whether the attributes of the page tag in the first history page are identical to the attributes of the page tag in which the target page data of the crawl task is located;

c4: if the attributes of the page tag in the first history page are identical to the attributes of the page tag in which the target page data of the crawl task is located, determine that the first history page is available;

c5: if the first history page is not within the history page effective time, or if the page tag does not exist in the first history page, or if the attributes of the page tag in the first history page are not identical to the attributes of the page tag in which the target page data of the crawl task is located, determine that the first history page is unavailable.

Compared with the first possible embodiment of step S202, this embodiment first judges whether the first history page is within the history page effective time. If it is not, the first history page is determined to be unavailable; if it is, steps c2 to c4 are executed. For example, if the history page effective time is 3 days and the first history page was not fetched within the last 3 days, the first history page is unavailable.
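The effective-time test of step c1 amounts to comparing the age of the stored page with the configured limit. The sketch below assumes each history page is stored with the time it was fetched and treats a page exactly at the limit as still valid; the name `within_effective_time` and both choices are assumptions of this example.

```python
from datetime import datetime, timedelta


def within_effective_time(fetched_at, effective_time, now=None):
    """Step c1: is the history page recent enough to be considered at all?"""
    now = now or datetime.now()
    return now - fetched_at <= effective_time


now = datetime(2015, 5, 6, 12, 0)
limit = timedelta(days=3)
print(within_effective_time(datetime(2015, 5, 4, 12, 0), limit, now))
print(within_effective_time(datetime(2015, 5, 1, 12, 0), limit, now))
```

Only pages that pass this test go on to the tag and attribute comparison of steps c2 to c4, so the cheaper timestamp check filters out old pages before any parsing is done.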

Fig. 3 is a flow diagram of a web crawler data grabbing method according to an exemplary embodiment of the present application. As shown in Fig. 3, the method comprises:

Step S301: periodically detect the network bandwidth utilization of the web crawlers, and calculate and store the network bandwidth utilization mean E(w) and variance D(w) of each web crawler, where w is the network bandwidth utilization;

Step S302: when a crawl task is received, query the history crawl data, according to the configuration information of the crawl task, for whether a first history page exists, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the history crawl data includes the history pages downloaded while executing history crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;

Step S303: if the first history page exists, detect whether the first history page is available;

Step S304: if the first history page is available, parse the available first history page to obtain first target page data;

Step S305: if the first history page is unavailable, form a second crawl task from the URL cluster corresponding to the unavailable first history pages and the URL cluster for which no first history page exists in the history crawl data;

Step S306: calculate the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

Step S307: determine, according to the availability of each web crawler, the web crawlers that execute the second crawl task;

Step S308: the web crawlers that execute the second crawl task grab second target page data, where the first target page data and the second target page data form the target page data of the crawl task.

Step S301 comprises:

D1: periodically collect the network bandwidth utilization a preset number of times;

D2: calculate and store the network bandwidth utilization mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and variance $D(w) = E\{[w - E(w)]^2\}$, where n is the preset number of times and $w_i$ is the i-th collected value.

The network bandwidth utilization of each web crawler is detected periodically. The detection may take place at a fixed interval, for example once every 20 min, or at several fixed time points chosen according to the operating conditions of the crawler system, for example at the 10th, 25th, 30th, 40th and 60th minute of each hour every day. At each detection, the network bandwidth utilization is collected the preset number of times in order to calculate the mean E(w) and variance D(w). Any conventional method of collecting network bandwidth utilization may be used. It should be noted that the preset number of times must be set so that the time needed to collect that many samples is less than or equal to the detection interval. If the preset number of times is n, then

$$E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i, \qquad D(w) = \frac{1}{n}\sum_{i=1}^{n} \left[w_i - E(w)\right]^2 .$$
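A minimal sketch of steps D1 and D2 follows, assuming the n collected values are already available as fractions between 0 and 1 and that the variance is the population variance (division by n), which matches D(w) = E{[w - E(w)]²}. The function names and the helper that checks the sampling constraint are hypothetical.

```python
def bandwidth_stats(samples):
    """Return (E(w), D(w)) for one crawler from n collected samples."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    mean = sum(samples) / n
    variance = sum((w - mean) ** 2 for w in samples) / n
    return mean, variance


def sampling_fits_interval(n, seconds_per_sample, detection_interval_s):
    """Check that collecting n samples takes no longer than the interval."""
    return n * seconds_per_sample <= detection_interval_s


mean, variance = bandwidth_stats([0.2, 0.4, 0.6])
print(round(mean, 4), round(variance, 4))
print(sampling_fits_interval(10, 60, 20 * 60))
```

With a 20 min detection interval and one sample per minute, ten samples fit comfortably, whereas thirty would overrun the interval and overlap the next detection.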

In a first possible embodiment, step S303 comprises:

The configuration information may further include a history page effective time. Accordingly, in a second possible embodiment, step S303 comprises:

Judge whether the first history page is within the history page effective time;

After the available first history pages are determined in step S303, they are parsed directly to obtain the first target page data, which avoids grabbing the same pages again and saves system resources.

For the unavailable first history pages, their corresponding URLs in the crawl task, together with the URLs of the crawl task for which no first history page is found in the history crawl data, form the second crawl task. In other words, the second crawl task takes these URLs as its target URL cluster, and subsequently only the second target page data corresponding to these URLs is grabbed.

Step S306, calculating the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w), comprises:

Accordingly, step S307, determining the web crawlers that execute the task according to the availability, comprises:

After the web crawlers that execute the second crawl task are determined in step S307, those web crawlers grab the second target page data. The first target page data and the second target page data form the target page data of the crawl task.
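The application names a naive Bayes classification algorithm for turning E(w) and D(w) into an availability probability and an unavailability probability, but this passage does not give its details. The following is therefore only a sketch under explicit assumptions: a Gaussian naive Bayes model, one (E(w), D(w)) pair per crawler as features, and a small labelled training set that is invented purely for illustration. All function names are hypothetical.

```python
import math

LABELS = ("available", "unavailable")


def gaussian_pdf(x, mu, var):
    var = max(var, 1e-9)  # guard against zero variance
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_model(training):
    """training: list of ((E, D), label). Returns per-label prior and
    per-feature (mean, variance) estimates."""
    model = {}
    for label in LABELS:
        rows = [feats for feats, lab in training if lab == label]
        params = []
        for j in range(2):
            col = [r[j] for r in rows]
            mu = sum(col) / len(col)
            params.append((mu, sum((c - mu) ** 2 for c in col) / len(col)))
        model[label] = (len(rows) / len(training), params)
    return model


def posterior(model, feats):
    """Normalised probability of each label for one crawler."""
    scores = {}
    for label, (prior, params) in model.items():
        score = prior
        for x, (mu, var) in zip(feats, params):
            score *= gaussian_pdf(x, mu, var)
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def select_crawlers(model, crawler_stats, needed):
    """Keep crawlers whose availability probability exceeds their
    unavailability probability, then pick by descending probability."""
    ranked = []
    for name, feats in crawler_stats.items():
        p = posterior(model, feats)
        if p["available"] > p["unavailable"]:
            ranked.append((p["available"], name))
    ranked.sort(reverse=True)
    return [name for _, name in ranked[:needed]]


# Invented training data: low utilization labelled available, high unavailable.
training = [((0.2, 0.01), "available"), ((0.5, 0.03), "available"),
            ((0.6, 0.02), "unavailable"), ((0.9, 0.04), "unavailable")]
model = fit_model(training)
stats = {"c1": (0.30, 0.02), "c2": (0.85, 0.03), "c3": (0.50, 0.02)}
print(select_crawlers(model, stats, 2))
```

In this toy data the heavily loaded crawler `c2` is classified as unavailable and is never selected, while the remaining crawlers are offered in order of how confidently they are judged to have spare bandwidth.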

In the web crawler data grabbing method provided by the embodiments of the present application, the crawler system stores the full content of the grabbed history pages in the history crawl data. When a crawl task arrives, the system judges whether the history crawl data contains history pages corresponding to the URL cluster of the crawl task. If such history pages exist and are available, the first target page data is extracted directly from them without grabbing those pages again, which saves system resources. The URLs corresponding to unavailable history pages, and the URLs that have no corresponding history page in the history crawl data, are taken as the target URL cluster; web crawlers grab the corresponding pages to obtain the second target page data, which completes the grabbing of all target page data. Meanwhile, unlike the usual practice of using all web crawlers or randomly selected web crawlers to complete a crawl task, the method first periodically detects the network bandwidth utilization of all web crawlers, calculates and stores the mean E(w) and variance D(w) for each web crawler, and then calculates the availability of each web crawler from E(w) and D(w), an available web crawler being one whose network bandwidth is not saturated. Among the available web crawlers, those that execute the task are selected in descending order of availability probability, so that better-placed web crawlers execute the crawl task, web crawler resources are allocated reasonably, and crawling efficiency improves.

From the description of the method embodiments above, those skilled in the art will clearly understand that the present application can be implemented by software together with a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred implementation. On this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, can be embodied as a software product stored in a storage medium and including instructions that cause a smart device to execute all or some of the steps of the methods of the embodiments of the present application. The storage medium includes any medium capable of storing data and program code, such as read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.

Fig. 4 is a block diagram of a web crawler crawl task allocation device according to an exemplary embodiment of the present application. As shown in Fig. 4, the device comprises:

a network bandwidth utilization processing unit U401, configured to periodically detect the network bandwidth utilization of web crawlers, and to calculate and store the network bandwidth utilization mean E(w) and variance D(w) of each web crawler, where w is the network bandwidth utilization;

a crawler availability calculation unit U402, configured to calculate, when a crawl task is received, the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

a first crawler determination unit U403, configured to determine the web crawlers that execute the task according to the availability.

The network bandwidth utilization processing unit comprises:

The crawler availability calculation unit comprises:

The first crawler determination unit comprises:

Fig. 5 is a block diagram of a web crawler data grabbing device according to an exemplary embodiment of the present application. As shown in Fig. 5, the device comprises:

a first history page query unit U501, configured to query, when a crawl task is received, the history crawl data according to the configuration information of the crawl task for whether a first history page exists, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the history crawl data includes the history pages downloaded while executing history crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;

a history page availability detection unit U502, configured to detect, if the first history page exists, whether the first history page is available;

a first parsing unit U503, configured to parse, if the first history page is available, the available first history page to obtain first target page data;

a first grabbing unit U504, configured to download and parse, if the first history page is unavailable, the corresponding pages according to the URL cluster corresponding to the unavailable first history pages and the URL cluster for which no first history page exists in the history crawl data, to obtain second target page data.

In a first possible embodiment, the history page availability detection unit comprises:

The configuration information further includes a history page effective time, and in a second possible embodiment the history page availability detection unit comprises:

Fig. 6 is a block diagram of another web crawler data grabbing device according to an exemplary embodiment of the present application. As shown in Fig. 6, the device comprises:

a network bandwidth utilization processing unit U601, configured to periodically detect the network bandwidth utilization of web crawlers, and to calculate and store the network bandwidth utilization mean E(w) and variance D(w) of each web crawler, where w is the network bandwidth utilization;

a first history page query unit U602, configured to query, when a crawl task is received, the history crawl data according to the configuration information of the crawl task for whether a first history page exists, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the history crawl data includes the history pages downloaded while executing history crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;

a history page availability detection unit U603, configured to detect, if the first history page exists, whether the first history page is available;

a first parsing unit U604, configured to parse, if the first history page is available, the available first history page to obtain first target page data;

a second crawl task generation unit U605, configured to form, if the first history page is unavailable, a second crawl task from the URL cluster corresponding to the unavailable first history pages and the URL cluster for which no first history page exists in the history crawl data;

a crawler availability calculation unit U606, configured to calculate the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

a second crawler determination unit U607, configured to determine, according to the availability of each web crawler, the web crawlers that execute the second crawl task;

a second grabbing unit U608, configured to have the web crawlers that execute the second crawl task grab second target page data, where the first target page data and the second target page data form the target page data of the crawl task.

The network bandwidth utilization processing unit comprises:

The crawler availability calculation unit comprises:

The second crawler determination unit comprises:

For convenience of description, the device above is described as divided by function into various units. Of course, when implementing the present application, the functions of the units may be realized in one or more pieces of software and/or hardware.

The embodiments in this specification are described in a progressive manner. Identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device and system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts of the method embodiments may be consulted. The device and system embodiments described above are merely schematic. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article or device. In the absence of further restrictions, an element introduced by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.

The above are only specific embodiments of the present application, provided so that those skilled in the art can understand or implement the application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be realized in other embodiments without departing from the spirit or scope of the application. Therefore, the application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A web crawler crawl task allocation method, characterized by comprising:

periodically detecting the network bandwidth utilization of web crawlers, and calculating and storing the network bandwidth utilization mean E(w) and variance D(w) of each web crawler, where w is the network bandwidth utilization;

when a crawl task is received, calculating the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

determining the web crawlers that execute the task according to the availability.

2. The web crawler crawl task allocation method according to claim 1, characterized in that the periodically detecting the network bandwidth utilization of web crawlers, and calculating and storing the network bandwidth utilization mean E(w) and variance D(w) of each web crawler, comprises:

periodically collecting the network bandwidth utilization a preset number of times;

calculating and storing the network bandwidth utilization mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and variance $D(w) = E\{[w - E(w)]^2\}$, where n is the preset number of times and $w_i$ is the i-th collected value.

3. The web crawler crawl task allocation method according to claim 1, characterized in that the calculating the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w) comprises:

extracting the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within a predetermined time period before the current time;

calculating, using a naive Bayes classification algorithm, the availability probability and the unavailability probability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within the predetermined time period.

4. The web crawler crawl task allocation method according to claim 3, characterized in that the determining the web crawlers that execute the task according to the availability comprises:

if the availability probability of a web crawler is greater than its unavailability probability, determining that the web crawler is available; otherwise, determining that the web crawler is unavailable;

5. A web crawler data grabbing method, characterized by comprising:

when a crawl task is received, querying the history crawl data, according to the configuration information of the crawl task, for whether a first history page exists, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the history crawl data includes the history pages downloaded while executing history crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;

if the first history page is unavailable, downloading and parsing the corresponding pages according to the URL cluster corresponding to the unavailable first history pages and the URL cluster for which no first history page exists in the history crawl data, to obtain second target page data.

6. The web crawler data grabbing method according to claim 5, characterized in that the detecting whether the first history page is available comprises:

pre-parsing the first history page, and judging whether the page tag in which the target page data of the crawl task is located exists in the first history page;

if the page tag exists in the first history page, judging whether the attributes of the page tag in the first history page are identical to the attributes of the page tag in which the target page data of the crawl task is located;

if the attributes of the page tag in the first history page are identical to the attributes of the page tag in which the target page data of the crawl task is located, determining that the first history page is available;

if the page tag does not exist in the first history page, or if the attributes of the page tag in the first history page are not identical to the attributes of the page tag in which the target page data of the crawl task is located, determining that the first history page is unavailable.

7. The web crawler data grabbing method according to claim 5, characterized in that the configuration information further includes a history page effective time, and the detecting whether the first history page is available comprises:

judging whether the first history page is within the history page effective time;

if the first history page is within the history page effective time, pre-parsing the first history page, and judging whether the page tag in which the target page data of the crawl task is located exists in the first history page;

if the first history page is not within the history page effective time, or if the page tag does not exist in the first history page, or if the attributes of the page tag in the first history page are not identical to the attributes of the page tag in which the target page data of the crawl task is located, determining that the first history page is unavailable.

8. A web crawler data grabbing method, characterized by comprising:

if the first history page is unavailable, forming a second crawl task from the URL cluster corresponding to the unavailable first history pages and the URL cluster for which no first history page exists in the history crawl data;

wherein the first target page data and the second target page data form the target page data of the crawl task.

9. The web crawler data grabbing method according to claim 8, characterized in that the periodically detecting the network bandwidth utilization of web crawlers, and calculating and storing the network bandwidth utilization mean E(w) and variance D(w) of each web crawler, comprises:

periodically collecting the network bandwidth utilization a preset number of times;

10. The web crawler data grabbing method according to claim 8, characterized in that the detecting whether the first history page is available comprises:

11. The web crawler data grabbing method according to claim 8, characterized in that the configuration information further includes a history page effective time, and the detecting whether the first history page is available comprises:

judging whether the first history page is within the history page effective time;

12. The web crawler data grabbing method according to claim 8, characterized in that the calculating the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w) comprises:

13. The web crawler data grabbing method according to claim 12, characterized in that the determining the web crawlers that execute the task according to the availability comprises:

if the availability probability of a web crawler is greater than its unavailability probability, determining that the web crawler is available; otherwise, determining that the web crawler is unavailable;

14. A web crawler crawl task allocation device, characterized by comprising:

a network bandwidth utilization processing unit, configured to periodically detect the network bandwidth utilization of web crawlers, and to calculate and store the network bandwidth utilization mean E(w) and variance D(w) of each web crawler, where w is the network bandwidth utilization;

a crawler availability calculation unit, configured to calculate, when a crawl task is received, the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);

15. The web crawler crawl task allocation device according to claim 14, characterized in that the network bandwidth utilization processing unit comprises:

a mean and variance calculation module, configured to calculate and store the network bandwidth utilization mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and variance $D(w) = E\{[w - E(w)]^2\}$, where n is the preset number of times and $w_i$ is the i-th collected value.

16. The web crawler crawl task allocation device according to claim 14, characterized in that the crawler availability calculation unit comprises:

a mean and variance extraction module, configured to extract the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within a predetermined time period before the current time;

a probability calculation module, configured to calculate, using a naive Bayes classification algorithm, the availability probability and the unavailability probability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within the predetermined time period.

17. The web crawler crawl task allocation device according to claim 16, characterized in that the first crawler determination unit comprises:

an availability determination module, configured to determine that a web crawler is available if its availability probability is greater than its unavailability probability, and otherwise to determine that the web crawler is unavailable;

a selection module, configured to select, among the available web crawlers, the web crawlers that execute the task in descending order of availability probability.

18. A web crawler data crawling device, characterized by comprising:

a first history page query unit, configured to, upon receiving a crawl task, query the history crawl data according to the configuration information of the crawl task to determine whether a first history page exists, wherein the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the history crawl data includes the history pages downloaded when executing history crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;

a history page availability detection unit, configured to detect, if the first history page exists, whether the first history page is available;

a first parsing unit, configured to parse, if the first history page is available, the available first history page to obtain first target page data;

a first crawling unit, configured to, if the first history page is unavailable, download and parse the corresponding pages according to the URL cluster corresponding to the unavailable first history page and the URL cluster for which no first history page exists in the history crawl data, to obtain second target page data.
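The cooperation of the four units can be sketched as a single loop over the URL cluster. The in-memory dictionary standing in for the history crawl data, and the injected `is_usable`, `parse` and `download` callables, are assumptions made so the example is self-contained; the claim does not prescribe any of them.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def crawl_with_history(
    urls: Iterable[str],
    history: Dict[str, str],
    is_usable: Callable[[str], bool],
    parse: Callable[[str], str],
    download: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Return (first_target_data, second_target_data) for a crawl task.

    Usable history pages are parsed directly. URLs whose history page is
    unusable or missing are downloaded again and then parsed.
    """
    first_target: List[str] = []
    second_target: List[str] = []
    for url in urls:
        page = history.get(url)
        if page is not None and is_usable(page):
            first_target.append(parse(page))
        else:
            fresh = download(url)
            history[url] = fresh  # keep the new page for later tasks
            second_target.append(parse(fresh))
    return first_target, second_target


# Toy stand-ins for the real availability check, parser and downloader.
history = {"u1": "<p>ok</p>", "u2": "stale"}
downloaded: List[str] = []


def fake_download(url: str) -> str:
    downloaded.append(url)
    return "fresh:" + url


first, second = crawl_with_history(
    ["u1", "u2", "u3"], history, lambda page: page != "stale", str.upper, fake_download
)
```

Only `u2` (unusable history page) and `u3` (no history page) trigger a download, which is the saving in system resources that the scheme aims at.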

19. The web crawler data crawling device as claimed in claim 18, characterized in that the history page availability detection unit comprises:

a first judgment module, configured to pre-parse the first history page and judge whether the page tag in which the target page data of the crawl task is located exists in the first history page;

a second judgment module, configured to judge, if the page tag exists in the first history page, whether the attributes of the page tag in the first history page are the same as the attributes of the page tag in which the target page data of the crawl task is located;

an available determining module, configured to determine that the first history page is available if the attributes of the page tag in the first history page are the same as the attributes of the page tag in which the target page data of the crawl task is located;

an unavailable determining module, configured to determine that the first history page is unavailable if the page tag does not exist in the first history page, or if the attributes of the page tag in the first history page are not the same as the attributes of the page tag in which the target page data of the crawl task is located.
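The tag and attribute test described above can be sketched with Python's standard `html.parser` module. The sketch assumes the crawl task names one tag and an exact set of attributes, and it reads "the same attributes" as equality of the full attribute dictionary; both are interpretations, and `history_page_usable` is a hypothetical name.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class _TagCollector(HTMLParser):
    """Collects the attributes of every occurrence of one tag name."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        self.tag = tag
        self.found: List[Dict[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == self.tag:
            self.found.append(dict(attrs))


def history_page_usable(html: str, tag: str, expected_attrs: Dict[str, str]) -> bool:
    """True if the page contains `tag` with exactly `expected_attrs`.

    The page is unusable if the tag is absent, or if no occurrence of the
    tag carries the expected attributes. Attribute names in
    `expected_attrs` are expected in lower case.
    """
    collector = _TagCollector(tag.lower())
    collector.feed(html)
    collector.close()
    return any(attrs == expected_attrs for attrs in collector.found)


page = '<html><body><div id="price" class="main">9.99</div></body></html>'
usable = history_page_usable(page, "div", {"id": "price", "class": "main"})
```

A looser reading, in which the expected attributes only need to be a subset of those on the page, would replace the equality test with a per-key comparison.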

20. The web crawler data crawling device as claimed in claim 18, characterized in that the configuration information further includes a history page effective time, and the history page availability detection unit comprises:

a fourth judgment module, configured to pre-parse the first history page, if the first history page is within the history page effective time, and judge whether the page tag in which the target page data of the crawl task is located exists in the first history page;

a fifth judgment module, configured to judge, if the page tag exists in the first history page, whether the attributes of the page tag in the first history page are the same as the attributes of the page tag in which the target page data of the crawl task is located;

an unavailable determining module, configured to determine that the first history page is unavailable if the first history page is not within the history page effective time, or if the page tag does not exist in the first history page, or if the attributes of the page tag in the first history page are not the same as the attributes of the page tag in which the target page data of the crawl task is located.
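The ordering in this claim, with the effective-time test first and the page pre-parsed only if it passes, can be sketched as follows. It assumes each history page carries a download timestamp and that the effective time is expressed as a duration; the function name and the injected `tag_check` callable are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Callable


def usable_within_effective_time(
    fetched_at: datetime,
    effective_time: timedelta,
    now: datetime,
    tag_check: Callable[[], bool],
) -> bool:
    """Apply the effective-time test before the tag and attribute test.

    `tag_check` stands for the pre-parse step (tag present, attributes
    identical). It is only invoked when the page is still within its
    effective time, so expired pages are rejected without being parsed.
    """
    if now - fetched_at > effective_time:
        return False
    return tag_check()


calls = []


def check() -> bool:
    calls.append("parsed")
    return True


fresh_enough = usable_within_effective_time(
    datetime(2015, 5, 1), timedelta(days=7), datetime(2015, 5, 6), check
)
expired = usable_within_effective_time(
    datetime(2015, 5, 1), timedelta(days=3), datetime(2015, 5, 6), check
)
```

Putting the cheap timestamp comparison first means the comparatively expensive pre-parse runs only for pages that could still be used.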

21. A web crawler data crawling device, characterized by comprising:

a second crawl task generation unit, configured to, if the first history page is unavailable, form a second crawl task from the URL cluster corresponding to the unavailable first history page and the URL cluster for which no first history page exists in the history crawl data;

a crawler availability calculation unit, configured to calculate the availability of each web crawler according to the network bandwidth utilization rate mean E(w) and variance D(w);

a second crawler determination unit, configured to determine, according to the availability of each web crawler, the web crawler that executes the second crawl task;

22. The web crawler data crawling device as claimed in claim 21, characterized in that the network bandwidth utilization rate processing unit comprises:

23. The web crawler data crawling device as claimed in claim 21, characterized in that the history page availability detection unit comprises:

24. The web crawler data crawling device as claimed in claim 21, characterized in that the configuration information further includes a history page effective time, and the history page availability detection unit comprises:

25. The web crawler data crawling device as claimed in claim 21, characterized in that the crawler availability calculation unit comprises:

26. The web crawler data crawling device as claimed in claim 25, characterized in that the second crawler determination unit comprises: