CN106202108A - Web crawler crawl task allocation method and device, and data crawling method and device
- Publication number
- CN106202108A CN201510227531.9A
- Authority
- CN
- China
- Prior art keywords
- page
- history
- history page
- web crawlers
- tag
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The embodiments of the present application disclose a web crawler crawl task allocation method and device and a web crawler data crawling method and device. In the data crawling method, pages that have already been crawled are saved as history pages in the history crawl data. When a crawl task arrives, the method determines whether the history crawl data contains history pages corresponding to the URL cluster of the crawl task. If such history pages exist and are available, the first target page data is extracted directly from the available history pages, so those pages need not be crawled again, which saves system resources. In addition, the network bandwidth utilization rate of every web crawler is detected periodically, and the mean E(w) and variance D(w) of each web crawler's network bandwidth utilization rate are calculated and stored. The availability of each web crawler is then calculated from E(w) and D(w), and among the available web crawlers the ones that execute the task are selected in descending order of availability probability, so that web crawler resources are allocated reasonably.
Description
Technical field
The present invention relates to the field of network technology, and in particular to a web crawler crawl task allocation method and device and a web crawler data crawling method and device.
Background technology
A web crawler (Computer Robot, also called a web spider or web robot) is a program that automatically crawls Internet web page data according to certain rules, and is an important component of a search engine. Typically, a web crawler downloads web pages from the Internet according to a configured crawl task, then parses and filters the pages to obtain the target web page data. All target web page data obtained by the web crawlers is stored in the crawler system and indexed for later query and retrieval.
With the development of network and information technology, the number of websites, web pages, and web data has grown rapidly, and a crawler system needs many web crawlers to crawl large volumes of web page data. These web crawlers may all sit in the same local area network (LAN), or they may be spread across different geographic locations. Depending on how dispersed the web crawlers are, crawler systems fall into two main architectures:
1. LAN-based distributed architecture: all web crawlers run in the same LAN and access the external Internet through that LAN to download web pages. The entire network load is concentrated on the LAN egress, and because the total egress bandwidth has a fixed upper limit, the number of web crawlers is limited by the LAN egress bandwidth.
2. Wide area network (WAN) based distributed architecture: parallel web crawlers run in different geographic or network locations. For example, the web crawlers may be located in different machine rooms; or in China, Japan, and the United States, each responsible for downloading the web pages of its own region; or in CHINANET, CERNET, and CEINET, each responsible for downloading the web pages within its own network. Under this architecture each web crawler accesses the external Internet through its own network and is affected only by its own egress bandwidth, which spreads network traffic and relieves network load.
The WAN-based distributed architecture is currently the more widely used crawler system architecture. Under this architecture, however, the number of web crawlers is large, content crawling demands are numerous, and the web crawlers are deployed in a dispersed way, so the egress bandwidth of each web crawler may differ widely, and unreasonable use of network resources reduces crawling efficiency. In addition, as the number of crawl tasks grows, pages tend to be crawled repeatedly. For example, task A needs to crawl the data at the top of page C. After the crawler system completes task A, the next task to arrive, task B, needs to crawl the data at the bottom of page C. Because the web crawler returned and stored only the top data of page C in task A, page C has to be downloaded once more to obtain its bottom data. This wastes crawler resources and increases the pressure on the web crawler cluster.
Summary of the invention
To overcome the problems in the related art that web crawler crawl tasks use network resources unreasonably and that pages are crawled repeatedly, the present application provides a web crawler crawl task allocation method and device and a web crawler data crawling method and device.
According to a first aspect of the embodiments of the present application, a web crawler crawl task allocation method is provided, including:
periodically detecting the network bandwidth utilization rate of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization rate of each web crawler, where w is the network bandwidth utilization rate;
when a crawl task is received, calculating the availability of each web crawler according to the network bandwidth utilization rate mean E(w) and variance D(w);
determining, according to the availability, the web crawler that executes the task.
Optionally, periodically detecting the network bandwidth utilization rate of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization rate of each web crawler, includes:
periodically collecting the network bandwidth utilization rate a preset number of times;
calculating and storing the network bandwidth utilization rate mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and variance $D(w) = E\{[w - E(w)]^2\}$, where n is the preset number of times and $w_i$ is the i-th collected value.
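The calculation in this optional step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bandwidth_stats` and the sample readings are hypothetical, and utilization is assumed to be expressed as a fraction between 0 and 1.

```python
from statistics import fmean, pvariance


def bandwidth_stats(samples):
    """Return (E(w), D(w)) for one crawler's bandwidth-utilization samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    mean = fmean(samples)                # E(w) = (1/n) * sum of w_i
    variance = pvariance(samples, mean)  # D(w) = E[(w - E(w))^2], population variance
    return mean, variance


# Hypothetical readings collected n = 4 times for one crawler.
samples = [0.2, 0.4, 0.6, 0.8]
mean, variance = bandwidth_stats(samples)
```

The population variance is used here because D(w) is defined as an expectation over the n collected values rather than as an unbiased sample estimate.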
Optionally, calculating the availability of each web crawler according to the network bandwidth utilization rate mean E(w) and variance D(w) includes:
extracting the network bandwidth utilization rate mean E(w) and variance D(w) of each web crawler within a preset time period before the current time;
using a naive Bayes classification algorithm to calculate the availability probability and unavailability probability of each web crawler according to the network bandwidth utilization rate mean E(w) and variance D(w) of each web crawler within the preset time period.
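The text names naive Bayes classification but does not specify the likelihood model or the training data. The following is a minimal sketch that assumes a Gaussian likelihood for each feature and a hypothetical set of labelled (E(w), D(w)) records; the function names, labels, and numbers are illustrative assumptions, not taken from the source.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit(records):
    """Estimate a class prior and per-feature Gaussian parameters.

    records: list of ((mean_utilization, variance_utilization), label) pairs.
    """
    model = {}
    for label in {lbl for _, lbl in records}:
        rows = [features for features, lbl in records if lbl == label]
        params = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            # A small constant keeps the variance strictly positive.
            var = sum((v - mu) ** 2 for v in column) / len(column) + 1e-9
            params.append((mu, var))
        model[label] = (len(rows) / len(records), params)
    return model


def predict_proba(model, features):
    """Return the normalized posterior probability of each class."""
    scores = {}
    for label, (prior, params) in model.items():
        score = prior
        for x, (mu, var) in zip(features, params):
            score *= gaussian_pdf(x, mu, var)  # naive independence assumption
        scores[label] = score
    total = sum(scores.values())
    if total == 0.0:  # every likelihood underflowed: fall back to the priors
        return {label: prior for label, (prior, _) in model.items()}
    return {label: score / total for label, score in scores.items()}


# Hypothetical labelled history: (E(w), D(w)) for a time window, and whether
# the crawler was judged available in that window.
history = [
    ((0.20, 0.010), "available"),
    ((0.30, 0.020), "available"),
    ((0.25, 0.015), "available"),
    ((0.85, 0.050), "unavailable"),
    ((0.90, 0.040), "unavailable"),
    ((0.80, 0.060), "unavailable"),
]
model = fit(history)
probs = predict_proba(model, (0.28, 0.012))
```

Here `probs["available"]` and `probs["unavailable"]` play the roles of the availability probability and unavailability probability of one web crawler.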
Optionally, determining the web crawler that executes the task according to the availability includes:
comparing the availability probability and unavailability probability of each web crawler;
if the availability probability of a web crawler is greater than its unavailability probability, determining that the web crawler is available; otherwise, determining that the web crawler is unavailable;
among the available web crawlers, selecting the web crawlers that execute the task in descending order of availability probability.
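The comparison and ordering steps above can be sketched as follows. The crawler identifiers, the probability values, and the `needed` parameter (how many web crawlers the task requires) are assumptions made for illustration; the source does not say how many crawlers are selected.

```python
def select_crawlers(probabilities, needed):
    """Pick crawlers for a task.

    probabilities: {crawler_id: (p_available, p_unavailable)}
    needed: how many crawlers the task should use (hypothetical parameter).
    """
    # A crawler counts as available only if P(available) > P(unavailable).
    available = [
        (crawler_id, p_available)
        for crawler_id, (p_available, p_unavailable) in probabilities.items()
        if p_available > p_unavailable
    ]
    # Descending order of availability probability.
    available.sort(key=lambda item: item[1], reverse=True)
    return [crawler_id for crawler_id, _ in available[:needed]]


probabilities = {
    "crawler-a": (0.9, 0.1),
    "crawler-b": (0.4, 0.6),  # filtered out: more likely unavailable
    "crawler-c": (0.7, 0.3),
}
chosen = select_crawlers(probabilities, needed=2)
```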
According to a second aspect of the embodiments of the present application, a web crawler data crawling method is provided, including:
when a crawl task is received, querying the history crawl data, according to the configuration information of the crawl task, for a first history page, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the history crawl data includes the history pages downloaded while executing history crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;
if the first history page exists, detecting whether the first history page is available;
if the first history page is available, parsing the available first history page to obtain first target page data;
if the first history page is unavailable, downloading and parsing the corresponding pages according to the URL cluster corresponding to the unavailable first history page and the URL cluster for which no first history page exists in the history crawl data, to obtain second target page data.
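The branching in this method amounts to splitting the URL cluster of the crawl task into URLs that can be served from history and URLs that must be downloaded. A minimal sketch, assuming the history crawl data is a dictionary keyed by URL and that `is_available` is a caller-supplied check (both are hypothetical representations):

```python
def split_url_cluster(url_cluster, history, is_available):
    """Split a task's URLs into (reusable, to_download).

    url_cluster: iterable of URLs from the task's configuration information.
    history: dict mapping URL -> stored page content (assumed layout).
    is_available: callable deciding whether a stored page can be reused.
    """
    reusable, to_download = [], []
    for url in url_cluster:
        page = history.get(url)
        if page is not None and is_available(page):
            reusable.append(url)     # parse the stored page directly
        else:
            to_download.append(url)  # missing or unavailable: download again
    return reusable, to_download


history = {
    "http://example.com/a": "<div class='price'>1</div>",
    "http://example.com/b": "<p>layout changed</p>",
}
cluster = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
reusable, to_download = split_url_cluster(
    cluster, history, is_available=lambda page: "price" in page
)
```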
Optionally, detecting whether the first history page is available includes:
pre-parsing the first history page, and judging whether the page tag in which the target page data of the crawl task is located exists in the first history page;
if the page tag exists in the first history page, judging whether the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located;
if the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located, determining that the first history page is available;
if the page tag does not exist in the first history page, or if the attribute of the page tag in the first history page differs from the attribute of the page tag in which the target page data of the crawl task is located, determining that the first history page is unavailable.
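One way to picture the tag and attribute check is with Python's standard `html.parser` module. This sketch assumes that the pages are HTML and that "identical attribute" means the tag's full attribute set equals the one named in the crawl task; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Collect the attributes of every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.found_attrs = []  # one attribute dict per matching start tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.found_attrs.append(dict(attrs))


def history_page_available(html, tag, expected_attrs):
    """Pre-parse a stored page and apply the two checks in order."""
    finder = TagFinder(tag)
    finder.feed(html)
    if not finder.found_attrs:
        return False  # the page tag does not exist in the history page
    # Available only if some occurrence carries exactly the expected attributes.
    return any(attrs == expected_attrs for attrs in finder.found_attrs)


page = '<html><body><div class="price" id="p1">42</div></body></html>'
```

With this page, a task targeting `div` with `{"class": "price", "id": "p1"}` would find the history page available, while a task targeting a `span`, or a `div` with different attributes, would not.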
Optionally, the configuration information further includes a history page validity time, and detecting whether the first history page is available includes:
judging whether the first history page is within the history page validity time;
if the first history page is within the history page validity time, pre-parsing the first history page, and judging whether the page tag in which the target page data of the crawl task is located exists in the first history page;
if the page tag exists in the first history page, judging whether the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located;
if the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located, determining that the first history page is available;
if the first history page is not within the history page validity time, or the page tag does not exist in the first history page, or the attribute of the page tag in the first history page differs from the attribute of the page tag in which the target page data of the crawl task is located, determining that the first history page is unavailable.
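The validity-time variant only adds one check in front of the tag check. The sketch below assumes the validity time is a maximum age measured from the moment the page was downloaded, which the source does not state explicitly; `tag_check` stands in for whatever tag and attribute test is used, and all names are hypothetical.

```python
from datetime import datetime, timedelta


def history_page_usable(downloaded_at, valid_for, now, tag_check):
    """Apply the validity-time check first, then the tag/attribute check.

    downloaded_at: when the history page was stored (assumed to be recorded).
    valid_for: history page validity time, interpreted here as a maximum age.
    tag_check: zero-argument callable running the tag and attribute test.
    """
    if now - downloaded_at > valid_for:
        return False  # expired: no need to pre-parse the page at all
    return tag_check()


downloaded_at = datetime(2015, 5, 1, 12, 0)
now = datetime(2015, 5, 2, 12, 0)  # the page is one day old
fresh = history_page_usable(downloaded_at, timedelta(days=2), now, lambda: True)
stale = history_page_usable(downloaded_at, timedelta(hours=12), now, lambda: True)
```

Checking the age first means an expired page is rejected without the cost of pre-parsing it.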
According to a third aspect of the embodiments of the present application, another web crawler data crawling method is provided, including:
periodically detecting the network bandwidth utilization rate of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization rate of each web crawler, where w is the network bandwidth utilization rate;
when a crawl task is received, querying the history crawl data, according to the configuration information of the crawl task, for a first history page, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the history crawl data includes the history pages downloaded while executing history crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;
if the first history page exists, detecting whether the first history page is available;
if the first history page is available, parsing the available first history page to obtain first target page data;
if the first history page is unavailable, forming a second crawl task from the URL cluster corresponding to the unavailable first history page and the URL cluster for which no first history page exists in the history crawl data;
calculating the availability of each web crawler according to the network bandwidth utilization rate mean E(w) and variance D(w);
determining, according to the availability of each web crawler, the web crawler that executes the second crawl task;
crawling, by the web crawler that executes the second crawl task, second target page data,
where the first target page data and the second target page data together form the target page data of the crawl task.
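The overall flow of this third aspect can be sketched as a single function with every collaborator passed in as a callable. All names here (`run_crawl_task`, `page_ok`, `pick_crawler`, `fetch`) are hypothetical, and storing each newly downloaded page back into the history is an assumption based on the statement elsewhere in the text that crawled pages are saved as history pages.

```python
def run_crawl_task(url_cluster, history, page_ok, parse, pick_crawler, fetch):
    """Serve what history can, crawl the rest, and merge both results.

    history: dict URL -> stored page (assumed layout).
    page_ok: availability check for a stored page.
    parse: extracts target page data from a page.
    pick_crawler: returns the crawler chosen by availability.
    fetch: downloads one URL with the chosen crawler.
    """
    first_data, second_task = {}, []
    for url in url_cluster:
        page = history.get(url)
        if page is not None and page_ok(page):
            first_data[url] = parse(page)  # first target page data
        else:
            second_task.append(url)        # goes into the second crawl task

    second_data = {}
    if second_task:
        crawler = pick_crawler()           # a crawler is selected only when needed
        for url in second_task:
            page = fetch(crawler, url)
            history[url] = page            # keep the full page for later tasks
            second_data[url] = parse(page) # second target page data

    return {**first_data, **second_data}


history = {"u1": "<p>old</p>"}
result = run_crawl_task(
    ["u1", "u2"],
    history,
    page_ok=lambda page: True,
    parse=str.upper,
    pick_crawler=lambda: "c1",
    fetch=lambda crawler, url: f"<p>{url} via {crawler}</p>",
)
```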
Optionally, periodically detecting the network bandwidth utilization rate of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization rate of each web crawler, includes:
periodically collecting the network bandwidth utilization rate a preset number of times;
calculating and storing the network bandwidth utilization rate mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and variance $D(w) = E\{[w - E(w)]^2\}$, where n is the preset number of times and $w_i$ is the i-th collected value.
Optionally, detecting whether the first history page is available includes:
pre-parsing the first history page, and judging whether the page tag in which the target page data of the crawl task is located exists in the first history page;
if the page tag exists in the first history page, judging whether the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located;
if the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located, determining that the first history page is available;
if the page tag does not exist in the first history page, or if the attribute of the page tag in the first history page differs from the attribute of the page tag in which the target page data of the crawl task is located, determining that the first history page is unavailable.
Optionally, the configuration information further includes a history page validity time, and detecting whether the first history page is available includes:
judging whether the first history page is within the history page validity time;
if the first history page is within the history page validity time, pre-parsing the first history page, and judging whether the page tag in which the target page data of the crawl task is located exists in the first history page;
if the page tag exists in the first history page, judging whether the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located;
if the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located, determining that the first history page is available;
if the first history page is not within the history page validity time, or the page tag does not exist in the first history page, or the attribute of the page tag in the first history page differs from the attribute of the page tag in which the target page data of the crawl task is located, determining that the first history page is unavailable.
Optionally, calculating the availability of each web crawler according to the network bandwidth utilization rate mean E(w) and variance D(w) includes:
extracting the network bandwidth utilization rate mean E(w) and variance D(w) of each web crawler within a preset time period before the current time;
using a naive Bayes classification algorithm to calculate the availability probability and unavailability probability of each web crawler according to the network bandwidth utilization rate mean E(w) and variance D(w) of each web crawler within the preset time period.
Optionally, determining the web crawler that executes the task according to the availability includes:
comparing the availability probability and unavailability probability of each web crawler;
if the availability probability of a web crawler is greater than its unavailability probability, determining that the web crawler is available; otherwise, determining that the web crawler is unavailable;
among the available web crawlers, selecting the web crawlers that execute the task in descending order of availability probability.
Corresponding to the first aspect of the embodiments of the present application, according to a fourth aspect of the embodiments of the present application, a web crawler crawl task allocation device is provided, including:
a network bandwidth utilization rate processing unit, configured to periodically detect the network bandwidth utilization rate of the web crawlers, and to calculate and store the mean E(w) and variance D(w) of the network bandwidth utilization rate of each web crawler, where w is the network bandwidth utilization rate;
a crawler availability calculation unit, configured to calculate, when a crawl task is received, the availability of each web crawler according to the network bandwidth utilization rate mean E(w) and variance D(w);
a first crawler determination unit, configured to determine, according to the availability, the web crawler that executes the task.
Optionally, the network bandwidth utilization rate processing unit includes:
a network bandwidth utilization rate collection module, configured to periodically collect the network bandwidth utilization rate a preset number of times;
a mean and variance calculation module, configured to calculate and store the network bandwidth utilization rate mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and variance $D(w) = E\{[w - E(w)]^2\}$, where n is the preset number of times and $w_i$ is the i-th collected value.
Optionally, the crawler availability calculation unit includes:
a mean and variance extraction module, configured to extract the network bandwidth utilization rate mean E(w) and variance D(w) of each web crawler within a preset time period before the current time;
a probability calculation module, configured to use a naive Bayes classification algorithm to calculate the availability probability and unavailability probability of each web crawler according to the network bandwidth utilization rate mean E(w) and variance D(w) of each web crawler within the preset time period.
Optionally, the first crawler determination unit includes:
a comparison module, configured to compare the availability probability and unavailability probability of each web crawler;
an availability determination module, configured to determine that a web crawler is available if its availability probability is greater than its unavailability probability, and otherwise to determine that the web crawler is unavailable;
a selection module, configured to select, among the available web crawlers, the web crawlers that execute the task in descending order of availability probability.
Corresponding to the second aspect of the embodiments of the present application, according to a fifth aspect of the embodiments of the present application, a web crawler data crawling device is provided, including:
a first history page query unit, configured to query, when a crawl task is received, the history crawl data for a first history page according to the configuration information of the crawl task, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the history crawl data includes the history pages downloaded while executing history crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;
a history page availability detection unit, configured to detect, if the first history page exists, whether the first history page is available;
a first parsing unit, configured to parse, if the first history page is available, the available first history page to obtain first target page data;
a first crawling unit, configured to download and parse, if the first history page is unavailable, the corresponding pages according to the URL cluster corresponding to the unavailable first history page and the URL cluster for which no first history page exists in the history crawl data, to obtain second target page data.
Optionally, the history page availability detection unit includes:
a first judgment module, configured to pre-parse the first history page and judge whether the page tag in which the target page data of the crawl task is located exists in the first history page;
a second judgment module, configured to judge, if the page tag exists in the first history page, whether the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located;
an available determination module, configured to determine that the first history page is available if the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located;
an unavailable determination module, configured to determine that the first history page is unavailable if the page tag does not exist in the first history page, or if the attribute of the page tag in the first history page differs from the attribute of the page tag in which the target page data of the crawl task is located.
Optionally, the configuration information further includes a history page validity time, and the history page availability detection unit includes:
a third judgment module, configured to judge whether the first history page is within the history page validity time;
a fourth judgment module, configured to pre-parse the first history page, if the first history page is within the history page validity time, and judge whether the page tag in which the target page data of the crawl task is located exists in the first history page;
a fifth judgment module, configured to judge, if the page tag exists in the first history page, whether the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located;
an available determination module, configured to determine that the first history page is available if the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located;
an unavailable determination module, configured to determine that the first history page is unavailable if the first history page is not within the history page validity time, or the page tag does not exist in the first history page, or the attribute of the page tag in the first history page differs from the attribute of the page tag in which the target page data of the crawl task is located.
Corresponding to the third aspect of the embodiments of the present application, according to a sixth aspect of the embodiments of the present application, another web crawler data crawling device is provided, including:
a network bandwidth utilization rate processing unit, configured to periodically detect the network bandwidth utilization rate of the web crawlers, and to calculate and store the mean E(w) and variance D(w) of the network bandwidth utilization rate of each web crawler, where w is the network bandwidth utilization rate;
a first history page query unit, configured to query, when a crawl task is received, the history crawl data for a first history page according to the configuration information of the crawl task, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the history crawl data includes the history pages downloaded while executing history crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;
a history page availability detection unit, configured to detect, if the first history page exists, whether the first history page is available;
a first parsing unit, configured to parse, if the first history page is available, the available first history page to obtain first target page data;
a second crawl task generation unit, configured to form, if the first history page is unavailable, a second crawl task from the URL cluster corresponding to the unavailable first history page and the URL cluster for which no first history page exists in the history crawl data;
a crawler availability calculation unit, configured to calculate the availability of each web crawler according to the network bandwidth utilization rate mean E(w) and variance D(w);
a second crawler determination unit, configured to determine, according to the availability of each web crawler, the web crawler that executes the second crawl task;
a second crawling unit, configured to cause the web crawler that executes the second crawl task to crawl second target page data,
where the first target page data and the second target page data together form the target page data of the crawl task.
Optionally, the network bandwidth utilization rate processing unit includes:
a network bandwidth utilization rate collection module, configured to periodically collect the network bandwidth utilization rate a preset number of times;
a mean and variance calculation module, configured to calculate and store the network bandwidth utilization rate mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and variance $D(w) = E\{[w - E(w)]^2\}$, where n is the preset number of times and $w_i$ is the i-th collected value.
Optionally, the history page availability detection unit includes:
a first judgment module, configured to pre-parse the first history page and judge whether the page tag in which the target page data of the crawl task is located exists in the first history page;
a second judgment module, configured to judge, if the page tag exists in the first history page, whether the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located;
an available determination module, configured to determine that the first history page is available if the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located;
an unavailable determination module, configured to determine that the first history page is unavailable if the page tag does not exist in the first history page, or if the attribute of the page tag in the first history page differs from the attribute of the page tag in which the target page data of the crawl task is located.
Optionally, the configuration information further includes a history page validity time, and the history page availability detection unit includes:
a third judgment module, configured to judge whether the first history page is within the history page validity time;
a fourth judgment module, configured to pre-parse the first history page, if the first history page is within the history page validity time, and judge whether the page tag in which the target page data of the crawl task is located exists in the first history page;
a fifth judgment module, configured to judge, if the page tag exists in the first history page, whether the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located;
an available determination module, configured to determine that the first history page is available if the attribute of the page tag in the first history page is identical to the attribute of the page tag in which the target page data of the crawl task is located;
an unavailable determination module, configured to determine that the first history page is unavailable if the first history page is not within the history page validity time, or the page tag does not exist in the first history page, or the attribute of the page tag in the first history page differs from the attribute of the page tag in which the target page data of the crawl task is located.
Optionally, the crawler availability calculation unit includes:
a mean and variance extraction module, configured to extract the network bandwidth utilization rate mean E(w) and variance D(w) of each web crawler within a preset time period before the current time;
a probability calculation module, configured to use a naive Bayes classification algorithm to calculate the availability probability and unavailability probability of each web crawler according to the network bandwidth utilization rate mean E(w) and variance D(w) of each web crawler within the preset time period.
Optionally, the second crawler determination unit includes:
a comparison module, configured to compare the availability probability and unavailability probability of each web crawler;
an availability determination module, configured to determine that a web crawler is available if its availability probability is greater than its unavailability probability, and otherwise to determine that the web crawler is unavailable;
a selection module, configured to select, among the available web crawlers, the web crawlers that execute the task in descending order of availability probability.
The technical solutions provided by the embodiments of the present application can have the following beneficial effects. In the web crawler data crawling method provided by the embodiments of the present application, the crawler system saves each crawled page in full as a history page in the history crawl data. When a crawl task arrives, the method determines whether the history crawl data contains history pages corresponding to the URL cluster of the crawl task. If corresponding history pages exist and some of them are available, the first target page data is extracted directly from the available history pages, so those pages need not be crawled again, which saves system resources. The URL cluster corresponding to the unavailable history pages, together with the URL cluster for which no history page exists in the history crawl data, forms a second crawl task; web crawlers crawl the corresponding pages and obtain the second target page data, which completes the crawling of all target page data. Meanwhile, unlike the common practice of using all web crawlers or randomly selecting web crawlers to complete a crawl task, the method first periodically detects the network bandwidth utilization rate of all web crawlers, calculates and stores the mean E(w) and variance D(w) of the network bandwidth utilization rate of each web crawler, then calculates the availability of each web crawler according to E(w) and D(w), and selects, among the available web crawlers, the web crawlers that execute the task in descending order of availability probability. Better web crawlers are thus chosen to execute the crawl task, which allocates web crawler resources reasonably and improves crawling efficiency.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present application.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a web crawler crawl task allocation method according to an exemplary embodiment of the present application.
Fig. 2 is a schematic flowchart of a web crawler data crawling method according to an exemplary embodiment of the present application.
Fig. 3 is a schematic flowchart of another web crawler data crawling method according to an exemplary embodiment of the present application.
Fig. 4 is a block diagram of a web crawler crawl task allocation apparatus according to an exemplary embodiment of the present application.
Fig. 5 is a block diagram of a web crawler data crawling apparatus according to an exemplary embodiment of the present application.
Fig. 6 is a block diagram of another web crawler data crawling apparatus according to an exemplary embodiment of the present application.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are only examples of apparatuses and methods consistent with some aspects of the present application as detailed in the appended claims.
For a thorough understanding of the present application, numerous specific details are set out in the following detailed description, but those skilled in the art should understand that the present application can be practiced without these details. In other embodiments, well-known methods, processes, components and circuits are not described in detail, so as not to obscure the embodiments unnecessarily.
Fig. 1 is a schematic flowchart of a web crawler crawl task allocation method according to an exemplary embodiment of the present application. As shown in Fig. 1, the method includes:
S101, periodically detecting the network bandwidth utilization of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization.
The network bandwidth usage of each web crawler is detected on a schedule. The schedule may be a fixed periodic interval, for example one detection every 20 min, or a set of fixed time points chosen according to how the crawler system runs, for example detection at the 10th, 25th, 30th, 40th and 60th minute of every hour each day. In each detection, the network bandwidth utilization is sampled a preset number of times in order to calculate the mean E(w) and variance D(w). Any conventional method of collecting network bandwidth utilization may be used. Note that the preset number of samples should be set so that the time needed to collect them is less than or equal to the detection interval. The calculated mean E(w) and variance D(w) of each web crawler are stored in the crawler system, and each pair of E(w) and D(w) is associated with its detection time.
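The statistics computed in step S101 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample values and the per-sample period are hypothetical and do not come from the application, and the variance is taken as the population variance of the samples in one detection round.

```python
from statistics import fmean, pvariance


def summarize_utilization(samples):
    """Return (E(w), D(w)) for the samples collected in one detection round."""
    if not samples:
        raise ValueError("at least one sample is required")
    mean = fmean(samples)
    # Population variance: the average squared deviation from the mean.
    variance = pvariance(samples, mu=mean)
    return mean, variance


def sampling_fits_interval(n_samples, sample_period_s, detection_interval_s):
    """Check that collecting n_samples takes no longer than the detection interval."""
    return n_samples * sample_period_s <= detection_interval_s


# Hypothetical utilization samples (fractions of the link capacity).
samples = [0.25, 0.75, 0.5, 0.5]
e_w, d_w = summarize_utilization(samples)
```

In a real system each (E(w), D(w)) pair would be stored together with the detection time, so that step S102 can later select the records that fall within the chosen time window.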
S102, when a crawl task is received, calculating the availability of each web crawler according to the mean E(w) and variance D(w) of the network bandwidth utilization.
Calculating the availability of each web crawler according to the mean E(w) and variance D(w) of the network bandwidth utilization includes:
a1, extracting the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within a preset time period before the current time;
a2, calculating the availability probability and unavailability probability of each web crawler with a naive Bayes classification algorithm, according to the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within the preset time period.
After the crawler system receives a crawl task, it first takes the current time (i.e. the time at which the crawl task was received) as the reference point and extracts, from the stored mean and variance data, the mean E(w) and variance D(w) of each web crawler within a preset time period before the current time, for example the data from the 2 h before the current time. It then calculates the availability probability and unavailability probability of each web crawler with the naive Bayes classification algorithm, as follows:
Let the availability of a web crawler be represented by a ∈ {0, 1}, where 0 means that the network bandwidth utilization of the web crawler allows it to accept a new task and 1 means that it does not. Let the network attributes of the web crawler be X = {b, c}, where b is the mean E(w) of the network bandwidth utilization and c is the variance D(w) of the network bandwidth utilization, with 0 ≤ b ≤ 1 and 0 ≤ c ≤ 1. The availability probability of the web crawler is P(a=0 | x) and the unavailability probability is P(a=1 | x). Segmentation thresholds for b and c are determined for the naive Bayes classification algorithm. These thresholds are determined in advance through repeated experiments, according to the overall network usage of the crawler system, and are stored in the crawler system. For example, if the segmentation thresholds of b are 0.3 and 0.8 and those of c are 0.2 and 0.6, the naive Bayes classification algorithm calculates the following conditional probabilities:
P(0 < b ≤ 0.3 | a=0), P(0.3 < b ≤ 0.8 | a=0), P(0.8 < b ≤ 1 | a=0),
P(0 < c ≤ 0.2 | a=0), P(0.2 < c ≤ 0.6 | a=0), P(0.6 < c ≤ 1 | a=0),
P(0 < b ≤ 0.3 | a=1), P(0.3 < b ≤ 0.8 | a=1), P(0.8 < b ≤ 1 | a=1),
P(0 < c ≤ 0.2 | a=1), P(0.2 < c ≤ 0.6 | a=1), P(0.6 < c ≤ 1 | a=1).
Suppose the current network attributes of a web crawler are X = {b=0.35, c=0.15}. Then, omitting the common normalizing factor P(x), which does not affect the comparison:
P(a=0 | x) ∝ P(a=0) P(0.3 < b ≤ 0.8 | a=0) P(0 < c ≤ 0.2 | a=0),
P(a=1 | x) ∝ P(a=1) P(0.3 < b ≤ 0.8 | a=1) P(0 < c ≤ 0.2 | a=1).
From these, P(a=0 | x) and P(a=1 | x) can be calculated. If P(a=0 | x) ≥ P(a=1 | x), the network bandwidth of the web crawler has not reached saturation and the web crawler can accept a new crawl task, i.e. it is available; otherwise it is unavailable. Instead of using the mean E(w) and variance D(w) of the network bandwidth utilization, the availability probability and unavailability probability of a web crawler can also be obtained by other methods. For example, a detection device can ping (an end-to-end connectivity test commonly used for availability checks) the device running each web crawler at a preset time interval, record the numbers of successful and failed pings for each web crawler over a period of time, and calculate the corresponding ping success probability and ping failure probability. The ping success probability serves as the availability probability and the ping failure probability as the unavailability probability. If the availability probability is not less than the unavailability probability, or the availability probability is greater than a preset threshold, the web crawler is considered available.
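The naive Bayes calculation described above can be sketched as follows. The segmentation thresholds are the example values from the text; everything else is an assumption made for illustration: the application does not say how the labelled records are obtained, so the `records` list is hypothetical, and the add-one (Laplace) smoothing is an addition that keeps an empty segment from producing a zero probability.

```python
from bisect import bisect_left
from collections import Counter

B_EDGES = (0.3, 0.8)  # segmentation thresholds for b = E(w), from the example
C_EDGES = (0.2, 0.6)  # segmentation thresholds for c = D(w), from the example
N_BINS = 3


def bin_index(value, edges):
    """Map a value to its segment: (0, e1] -> 0, (e1, e2] -> 1, (e2, 1] -> 2."""
    return bisect_left(edges, value)


def train(records):
    """records: iterable of (b, c, a), where a is 0 (available) or 1 (unavailable)."""
    class_counts, b_counts, c_counts = Counter(), Counter(), Counter()
    for b, c, a in records:
        class_counts[a] += 1
        b_counts[(a, bin_index(b, B_EDGES))] += 1
        c_counts[(a, bin_index(c, C_EDGES))] += 1
    return class_counts, b_counts, c_counts


def posterior(model, b, c):
    """Return (P(a=0 | x), P(a=1 | x)) for x = {b, c}."""
    class_counts, b_counts, c_counts = model
    total = sum(class_counts.values())
    scores = {}
    for a in (0, 1):
        prior = class_counts[a] / total
        # Add-one smoothing (an assumption, not part of the source text).
        p_b = (b_counts[(a, bin_index(b, B_EDGES))] + 1) / (class_counts[a] + N_BINS)
        p_c = (c_counts[(a, bin_index(c, C_EDGES))] + 1) / (class_counts[a] + N_BINS)
        scores[a] = prior * p_b * p_c
    norm = scores[0] + scores[1]
    return scores[0] / norm, scores[1] / norm


# Hypothetical labelled records: (mean, variance, availability label).
records = [
    (0.20, 0.10, 0), (0.35, 0.15, 0), (0.50, 0.10, 0),
    (0.90, 0.70, 1), (0.85, 0.50, 1), (0.40, 0.65, 1),
]
model = train(records)
p_avail, p_unavail = posterior(model, 0.35, 0.15)
```

Normalizing the two scores makes them sum to 1, but the decision only depends on which score is larger, so the comparison is unchanged if the normalization is skipped.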
S103, determining the web crawlers that execute the task according to the availability.
Step S103 includes:
comparing the availability probability and the unavailability probability of each web crawler;
if the availability probability of a web crawler is greater than its unavailability probability, determining that the web crawler is available; otherwise, determining that the web crawler is unavailable;
for the available web crawlers, selecting the web crawlers that execute the task in descending order of availability probability.
After the availability and unavailability probabilities have been calculated for all web crawlers and the available web crawlers have been determined, the available web crawlers are sorted in descending order of availability probability P(a=0 | x) and are assigned crawl tasks in that order. A larger availability probability indicates a lower degree of network bandwidth saturation for the web crawler. The number of web crawlers needed can be calculated from the network bandwidth required by the crawl task, the bandwidth saturation of the available web crawlers, and the availability probability ranking. If the available web crawlers are not enough to complete the crawl task, the crawler system can wait for other web crawlers to become idle and use them to complete the remaining workload.
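The selection in step S103 might look like the sketch below. The dictionary layout and the `spare_bandwidth` field are hypothetical; the application only states that the number of web crawlers needed can be derived from the required bandwidth, the saturation of the available crawlers and the probability ranking, without giving a formula.

```python
def select_crawlers(crawlers, required_bandwidth):
    """Pick available crawlers in descending order of availability probability.

    Returns the chosen crawler names and the bandwidth still uncovered, which
    would have to wait for other crawlers to become idle.
    """
    available = [c for c in crawlers if c["p_available"] > c["p_unavailable"]]
    available.sort(key=lambda c: c["p_available"], reverse=True)

    chosen, covered = [], 0.0
    for crawler in available:
        if covered >= required_bandwidth:
            break
        chosen.append(crawler["name"])
        covered += crawler["spare_bandwidth"]
    remaining = max(0.0, required_bandwidth - covered)
    return chosen, remaining


# Hypothetical crawler records.
crawlers = [
    {"name": "A", "p_available": 0.9, "p_unavailable": 0.1, "spare_bandwidth": 10.0},
    {"name": "B", "p_available": 0.4, "p_unavailable": 0.6, "spare_bandwidth": 50.0},
    {"name": "C", "p_available": 0.7, "p_unavailable": 0.3, "spare_bandwidth": 20.0},
    {"name": "D", "p_available": 0.6, "p_unavailable": 0.4, "spare_bandwidth": 30.0},
]
chosen, remaining = select_crawlers(crawlers, 25.0)
```

Crawler B is skipped even though it has the most spare bandwidth, because its availability probability does not exceed its unavailability probability.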
Fig. 2 is a schematic flowchart of a web crawler data crawling method according to an exemplary embodiment of the present application. As shown in Fig. 2, the method includes:
Step S201, when a crawl task is received, querying the historical crawl data, according to the configuration information of the crawl task, for whether a first history page exists. The configuration information includes the uniform resource locator (URL) cluster of the crawl task, the historical crawl data includes the history pages downloaded when historical crawl tasks were executed, and the first history page is a history page corresponding to the configuration information of the crawl task.
The crawler system saves the full content of each page crawled by a web crawler while executing a crawl task, i.e. the crawled history page, in the historical crawl data. The historical crawl data may contain only history pages, or may also contain previously crawled target page data. When the crawler system receives a crawl task, it queries the historical crawl data for a first history page, that is, a history page corresponding to the configuration information of the crawl task. For example, suppose a user wants to crawl the search-type pages matching the URL http://s.1688.com/selloffer/offer_search.htm?keywords=${key}&n=y. The user uploads this URL to the crawler system together with three keywords (key): mp3, 手机 (mobile phone) and 电脑 (computer). The data to be crawled is therefore in the following three pages:
http://s.1688.com/selloffer/offer_search.htm?keywords=mp3&n=y,
http://s.1688.com/selloffer/offer_search.htm?keywords=手机&n=y and
http://s.1688.com/selloffer/offer_search.htm?keywords=电脑&n=y.
As shown above, replacing the ${key} part of the URL with each of the three keywords yields the URL cluster of the crawl task, which is included in the configuration information of the crawl task. After receiving the crawl task, the crawler system searches the historical crawl data, according to the URL cluster in the configuration information, for the history pages corresponding to the three URLs in the cluster.
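As an illustration of how a URL cluster could be built from a URL template and a keyword list, and then looked up in the historical crawl data, consider the following sketch. The function names and the dictionary used as a history store are hypothetical, and percent-encoding the keywords with UTF-8 is an assumption; the application does not specify how the keywords are encoded.

```python
from urllib.parse import quote

PLACEHOLDER = "${key}"


def build_url_cluster(template, keywords):
    """Substitute each keyword for the placeholder to form the URL cluster."""
    return [template.replace(PLACEHOLDER, quote(keyword)) for keyword in keywords]


def lookup_history(cluster, history):
    """Map every URL in the cluster to its stored history page, or None."""
    return {url: history.get(url) for url in cluster}


template = "http://s.1688.com/selloffer/offer_search.htm?keywords=${key}&n=y"
cluster = build_url_cluster(template, ["mp3", "手机", "电脑"])

# Hypothetical history store: URL -> saved page HTML.
history = {cluster[0]: "<html><title>mp3</title></html>"}
found = lookup_history(cluster, history)
```

Here only the first URL has a history page, so the other two would have no first history page and would be handled by a fresh crawl.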
Step S202, if the first history page exists, detecting whether the first history page is available.
The historical crawl data may contain first history pages for all URLs in the URL cluster of the crawl task, for only some of them, or for none of them. The HTML (HyperText Markup Language) code of a crawled page usually keeps changing, and it is impossible to know at what time it changed or how. For example, suppose the goal of a crawl task is to extract the title of page A. In the copy of page A crawled 3 days ago, the title data was under tag a, but by the time today's crawl task is generated, the title data of page A has moved under tag b. The historical crawl data does contain history page A, but that history page may have no tag b, or may have a tag b that clearly does not hold the title data, so looking up tag b in the 3-day-old page A to extract the title data would produce an error. Therefore, when first history pages exist in the historical crawl data, their availability is checked.
In a first possible implementation, detecting whether the first history page is available includes:
b1, pre-parsing the first history page, and judging whether the page tag that holds the target page data in the crawl task exists in the first history page.
The first history page is pre-parsed to obtain the tags it contains. The page tag that holds the target page data in the crawl task, for example the page tag that holds the title data, is compared with the page tags in the first history page to judge whether that page tag exists in the first history page.
b2, if the page tag exists in the first history page, judging whether the attributes of the page tag in the first history page are identical to the attributes of the page tag that holds the target page data in the crawl task.
If the page tag that holds the target page data in the crawl task exists in the first history page, the attributes of this page tag in the first history page are further compared with the attributes of the page tag in the crawl task, to judge whether the data under this page tag has changed.
b3, if the attributes of the page tag in the first history page are identical to the attributes of the page tag that holds the target page data in the crawl task, determining that the first history page is available.
b4, if the page tag does not exist in the first history page, or if the attributes of the page tag in the first history page differ from the attributes of the page tag that holds the target page data in the crawl task, determining that the first history page is unavailable.
If the attributes of the page tag in the first history page are identical to those of the page tag that holds the target page data in the crawl task, the data under that page tag is unchanged between the first history page and the page to be crawled, so the first history page can be used to obtain the target page data, i.e. the first history page is available. If the page tag does not exist in the first history page, the first history page is unavailable. If the page tag exists in the first history page but its attributes differ from those of the page tag that holds the target page data in the crawl task, the data under that page tag has changed, and the first history page is unavailable.
Step S202 performs the availability check on each first history page. The result may be that all first history pages are available, that only some of them are available, or that none of them is available.
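The availability check of steps b1 to b4 can be sketched with Python's standard `html.parser` module. This is an illustration only: the function and class names are hypothetical, and treating "identical attributes" as exact equality of the attribute name/value pairs is an assumption, since the application does not define the comparison precisely.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Pre-parse a page and record every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag name, {attribute: value})

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def history_page_available(html, target_tag, target_attrs):
    """Return True if the history page still holds the target tag unchanged."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()

    # b1: does the tag that holds the target data exist in the history page?
    candidates = [attrs for tag, attrs in collector.tags if tag == target_tag]
    if not candidates:
        return False  # b4: tag missing, page unavailable

    # b2/b3: available only if some occurrence has identical attributes.
    return any(attrs == target_attrs for attrs in candidates)


# Hypothetical history page in which the title sits in <h1 class="title">.
old_page = '<html><body><h1 class="title">Item</h1></body></html>'
```

For this page, a task that expects the title in `<h1 class="title">` finds the page available, while a task that expects `<h2 class="title">` or `<h1 class="headline">` does not.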
Step S203, if the first history page is available, parsing the available first history page to obtain the first target page data.
Step S204, if the first history page is unavailable, downloading and parsing the corresponding pages according to the URLs corresponding to the unavailable first history pages and the URLs for which no first history page exists in the historical crawl data, to obtain the second target page data.
The available first history pages are used directly to obtain the first target page data. In other words, if the historical crawl data contains first history pages that correspond to the URL cluster of the crawl task and are available, the pages for those URLs need not be crawled again, which avoids repeated crawling of pages. For the URLs in the crawl task that have no corresponding first history page in the historical crawl data, and for the URLs whose first history pages exist in the historical crawl data but are unavailable, the corresponding pages are downloaded over the network and parsed, i.e. those pages are crawled, to obtain the second target page data. The first target page data and the second target page data together form the target page data of the crawl task.
Because web pages change frequently, a history page that is too old may already differ completely from the current page. In addition, to reduce the amount of historical crawl data stored and save database space, a history page validity period is added to the configuration information of the crawl task. Only first history pages in the historical crawl data that fall within the history page validity period can be used to obtain the first target page data. Correspondingly, in a second possible implementation, step S202 includes:
c1, judging whether the first history page is within the history page validity period;
c2, if the first history page is within the history page validity period, pre-parsing the first history page and judging whether the page tag that holds the target page data in the crawl task exists in the first history page;
c3, if the page tag exists in the first history page, judging whether the attributes of the page tag in the first history page are identical to the attributes of the page tag that holds the target page data in the crawl task;
c4, if the attributes of the page tag in the first history page are identical to the attributes of the page tag that holds the target page data in the crawl task, determining that the first history page is available;
c5, if the first history page is not within the history page validity period, or the page tag does not exist in the first history page, or the attributes of the page tag in the first history page differ from the attributes of the page tag that holds the target page data in the crawl task, determining that the first history page is unavailable.
Compared with the first possible implementation of step S202, this implementation first judges whether the first history page is within the history page validity period. If it is not, the first history page is determined to be unavailable; only if it is within the validity period are steps c2 to c4 performed. For example, if the history page validity period is 3 days and the first history page was not crawled within the last 3 days, the first history page is unavailable.
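The validity-period gate in steps c1 to c5 can be sketched as below. The names are hypothetical, and the tag and attribute comparison is passed in as a callable so that the sketch stays self-contained; any check with the behaviour of steps c2 to c4 could be supplied.

```python
from datetime import datetime, timedelta


def within_validity(crawled_at, now, validity):
    """c1: is the history page no older than the validity period?"""
    return timedelta(0) <= now - crawled_at <= validity


def available_with_validity(crawled_at, now, validity, tag_check):
    """Apply the age check first; run the tag check (c2 to c4) only if it passes."""
    if not within_validity(crawled_at, now, validity):
        return False  # c5: page too old, no need to parse it
    return tag_check()


now = datetime(2015, 5, 6, 12, 0)
validity = timedelta(days=3)
fresh = available_with_validity(now - timedelta(days=2), now, validity, lambda: True)
stale = available_with_validity(now - timedelta(days=4), now, validity, lambda: True)
```

Checking the age first means an expired page is rejected without the cost of pre-parsing it.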
Fig. 3 is a schematic flowchart of another web crawler data crawling method according to an exemplary embodiment of the present application. As shown in Fig. 3, the method includes:
Step S301, periodically detecting the network bandwidth utilization of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;
Step S302, when a crawl task is received, querying the historical crawl data, according to the configuration information of the crawl task, for whether a first history page exists, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the historical crawl data includes the history pages downloaded when historical crawl tasks were executed, and the first history page is a history page corresponding to the configuration information of the new crawl task;
Step S303, if the first history page exists, detecting whether the first history page is available;
Step S304, if the first history page is available, parsing the available first history page to obtain the first target page data;
Step S305, if the first history page is unavailable, forming a second crawl task from the URLs corresponding to the unavailable first history pages and the URLs for which no first history page exists in the historical crawl data;
Step S306, calculating the availability of each web crawler according to the mean E(w) and variance D(w) of the network bandwidth utilization;
Step S307, determining, according to the availability of each web crawler, the web crawlers that execute the second crawl task;
Step S308, crawling the second target page data with the web crawlers that execute the second crawl task, where the first target page data and the second target page data form the target page data of the crawl task.
Step S301 includes:
d1, periodically sampling the network bandwidth utilization a preset number of times;
d2, calculating and storing the mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and the variance $D(w) = E\{[w - E(w)]^2\}$ of the network bandwidth utilization, where n is the preset number of samples and $w_i$ is the i-th sample.
The network bandwidth usage of each web crawler is detected on a schedule. The schedule may be a fixed periodic interval, for example one detection every 20 min, or a set of fixed time points chosen according to how the crawler system runs, for example detection at the 10th, 25th, 30th, 40th and 60th minute of every hour each day. In each detection, the network bandwidth utilization is sampled a preset number of times in order to calculate the mean E(w) and variance D(w). Any conventional method of collecting network bandwidth utilization may be used. Note that the preset number of samples should be set so that the time needed to collect them is less than or equal to the detection interval. The calculated mean E(w) and variance D(w) of each web crawler are stored in the crawler system, and each pair of E(w) and D(w) is associated with its detection time.
In a first possible implementation, step S303 includes:
pre-parsing the first history page, and judging whether the page tag that holds the target page data in the crawl task exists in the first history page;
if the page tag exists in the first history page, judging whether the attributes of the page tag in the first history page are identical to the attributes of the page tag that holds the target page data in the crawl task;
if the attributes of the page tag in the first history page are identical to the attributes of the page tag that holds the target page data in the crawl task, determining that the first history page is available;
if the page tag does not exist in the first history page, or if the attributes of the page tag in the first history page differ from the attributes of the page tag that holds the target page data in the crawl task, determining that the first history page is unavailable.
The configuration information may also include a history page validity period. Correspondingly, in a second possible implementation, step S303 includes:
judging whether the first history page is within the history page validity period;
if the first history page is within the history page validity period, pre-parsing the first history page and judging whether the page tag that holds the target page data in the crawl task exists in the first history page;
if the page tag exists in the first history page, judging whether the attributes of the page tag in the first history page are identical to the attributes of the page tag that holds the target page data in the crawl task;
if the attributes of the page tag in the first history page are identical to the attributes of the page tag that holds the target page data in the crawl task, determining that the first history page is available;
if the first history page is not within the history page validity period, or the page tag does not exist in the first history page, or the attributes of the page tag in the first history page differ from the attributes of the page tag that holds the target page data in the crawl task, determining that the first history page is unavailable.
After the available first history pages have been determined in step S303, they are parsed directly to obtain the first target page data, which avoids repeated crawling of pages and saves system resources.
The URLs in the crawl task that correspond to unavailable first history pages, together with the URLs in the crawl task for which no first history page can be found in the historical crawl data, form the second crawl task. In other words, the second crawl task takes these URLs as its target URL cluster, and only the second target page data corresponding to these URLs is crawled subsequently.
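The split between reusable history pages and the second crawl task can be sketched as a simple partition. The function name, the dictionary used as a history store and the `is_available` callable are hypothetical; the callable stands in for whichever availability check of step S303 is used.

```python
def split_task(cluster, history, is_available):
    """Partition a URL cluster into reusable URLs and the second crawl task.

    cluster: list of URLs in the crawl task
    history: mapping from URL to the stored history page (missing if never crawled)
    is_available: callable that applies the availability check to a history page
    """
    reusable, second_task = [], []
    for url in cluster:
        page = history.get(url)
        if page is not None and is_available(page):
            reusable.append(url)  # parse the stored page directly
        else:
            second_task.append(url)  # missing or unavailable: crawl again
    return reusable, second_task


# Hypothetical data: u1 is reusable, u2 is stale, u3 was never crawled.
cluster = ["u1", "u2", "u3"]
history = {"u1": "fresh page", "u2": "stale page"}
reusable, second_task = split_task(cluster, history, lambda page: page.startswith("fresh"))
```

Only the URLs in `second_task` are handed to the web crawlers selected in step S307, so the crawl cost scales with the pages that actually need refreshing.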
In step S306, calculating the availability of each web crawler according to the mean E(w) and variance D(w) of the network bandwidth utilization includes:
extracting the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within a preset time period before the current time;
calculating the availability probability and unavailability probability of each web crawler with a naive Bayes classification algorithm, according to the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within the preset time period.
Correspondingly, in step S307, determining the web crawlers that execute the task according to the availability includes:
comparing the availability probability and the unavailability probability of each web crawler;
if the availability probability of a web crawler is greater than its unavailability probability, determining that the web crawler is available; otherwise, determining that the web crawler is unavailable;
for the available web crawlers, selecting the web crawlers that execute the task in descending order of availability probability.
After the web crawlers that execute the second crawl task have been determined in step S307, these web crawlers are used to crawl the second target page data. The first target page data and the second target page data form the target page data of the crawl task.
In the web crawler data crawling method provided by the embodiments of the present application, the crawler system saves the full content of every crawled page as a history page in the historical crawl data. When a crawl task arrives, the system determines whether the historical crawl data contains history pages corresponding to the URL cluster of the crawl task. If such history pages exist and some of them are available, the first target page data is extracted directly from the available history pages, so those pages need not be crawled again, which saves system resources. The URLs whose history pages are unavailable, together with the URLs for which the historical crawl data contains no history page, are taken as the target URL cluster, and web crawlers crawl the corresponding pages to obtain the second target page data, which completes the crawl of all target page data. In addition, unlike the common practice of using all web crawlers or selecting web crawlers at random to complete a crawl task, the method periodically detects the network bandwidth utilization of all web crawlers, calculates and stores the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, and then calculates the availability of each web crawler from E(w) and D(w) to identify the available web crawlers, i.e. those whose network bandwidth is not saturated. The web crawlers that execute the task are selected from the available web crawlers in descending order of availability probability, so that better-suited web crawlers perform the crawl task, web crawler resources are allocated reasonably, and crawl efficiency improves.
From the description of the method embodiments above, those skilled in the art can clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product stored in a storage medium, including several instructions that cause a smart device to perform all or some of the steps of the methods described in the embodiments of the present application. The storage medium includes any medium that can store data and program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
Fig. 4 is the block diagram that a kind of web crawlers shown in the application one exemplary embodiment captures task allocation apparatus.Such as figure
Shown in 4, described device includes:
Network bandwidth utilization rate processing unit U401, for regularly detecting the network bandwidth utilization rate of web crawlers, calculates
Network bandwidth utilization rate average E (w) of each web crawlers and variance D (w) also store, and wherein, w is that the network bandwidth makes
By rate;
Reptile availability calculations unit U402, for when receiving crawl task, according to described network bandwidth utilization rate
Average E (w) and variance D (w) calculate the availability of each web crawlers;
First reptile determines unit U403, for determining the web crawlers of execution task according to described availability.
Wherein, described network bandwidth utilization rate processing unit, including:
Network bandwidth utilization rate acquisition module, for timing acquiring preset times network bandwidth utilization rate;
Mean variance computing module, is used for calculating network bandwidth utilization rate averageWith D (w)=E [w-E (w)]2
And store, wherein n is preset times.
The crawler availability calculation unit includes:
a mean and variance extraction module, configured to extract the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within a preset time period before the current time;
a probability calculation module, configured to calculate the availability probability and the unavailability probability of each web crawler with a naive Bayes classification algorithm, based on the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within the preset time period.
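One way the probability calculation module could work is sketched below. This passage does not specify the likelihood model or the training data, so the sketch assumes a Gaussian naive Bayes classifier trained on hypothetical labeled records of the form ((E(w), D(w)), label). The names `fit_gaussian_nb` and `posterior` are illustrative only.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(records):
    """Fit per-class priors and per-feature Gaussian parameters.

    records: list of ((e_w, d_w), label), label in {"available", "unavailable"}.
    """
    by_label = defaultdict(list)
    for features, label in records:
        by_label[label].append(features)
    model = {}
    for label, rows in by_label.items():
        stats = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            var = sum((x - mu) ** 2 for x in column) / len(column)
            stats.append((mu, max(var, 1e-9)))  # floor avoids division by zero
        model[label] = (len(rows) / len(records), stats)
    return model


def posterior(model, features):
    """Return normalized class probabilities for one (e_w, d_w) pair."""
    log_scores = {}
    for label, (prior, stats) in model.items():
        log_p = math.log(prior)
        # Naive Bayes: features are treated as conditionally independent.
        for x, (mu, var) in zip(features, stats):
            log_p += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        log_scores[label] = log_p
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}
```

Working in log space and subtracting the largest score before exponentiating keeps the normalization numerically stable when one class is far more likely than the other.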
The first crawler determining unit includes:
a comparison module, configured to compare the availability probability and the unavailability probability of each web crawler;
an availability determining module, configured to determine that a web crawler is available if its availability probability is greater than its unavailability probability, and otherwise to determine that the web crawler is unavailable;
a selection module, configured to select, from the available web crawlers, the web crawler that executes the task in descending order of availability probability.
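The comparison, availability determination, and selection steps can be sketched as a single function. The function name `select_crawlers`, the dictionary layout, and the `needed` parameter (how many crawlers a task requires) are assumptions made for illustration.

```python
def select_crawlers(probabilities, needed):
    """Pick crawlers for a task.

    probabilities: {crawler_id: (p_available, p_unavailable)} (hypothetical layout).
    needed: number of crawlers the task requires (an assumed parameter).
    """
    # A crawler is available only if its availability probability is greater.
    available = [
        (p_avail, crawler_id)
        for crawler_id, (p_avail, p_unavail) in probabilities.items()
        if p_avail > p_unavail
    ]
    # Choose in descending order of availability probability.
    available.sort(key=lambda item: item[0], reverse=True)
    return [crawler_id for _, crawler_id in available[:needed]]


if __name__ == "__main__":
    print(select_crawlers({"a": (0.9, 0.1), "b": (0.4, 0.6), "c": (0.7, 0.3)}, 2))
```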
Fig. 5 is a block diagram of a web crawler data crawling device according to an exemplary embodiment of this application. As shown in Fig. 5, the device includes:
a first historical page query unit U501, configured to query, when a crawl task is received, whether a first historical page exists in the historical crawl data according to the configuration information of the crawl task, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the historical crawl data includes the historical pages downloaded by executing historical crawl tasks, and the first historical page is a historical page corresponding to the configuration information of the new crawl task;
a historical page availability detection unit U502, configured to detect, if the first historical page exists, whether the first historical page is available;
a first parsing unit U503, configured to parse, if the first historical page is available, the available first historical page to obtain first target page data;
a first crawling unit U504, configured to download and parse, if the first historical page is unavailable, the pages corresponding to the URL cluster of the unavailable first historical page and to the URL cluster for which no first historical page exists in the historical crawl data, to obtain second target page data.
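The division of work among units U501 to U504 can be sketched as a partition of the task's URL cluster. This is an illustrative sketch only: the historical crawl data is modeled as a plain dictionary from URL to page content, and `plan_crawl` and `is_available` are hypothetical names.

```python
def plan_crawl(url_cluster, history, is_available):
    """Split a task's URLs into cached pages to parse and URLs to download.

    url_cluster: iterable of URLs from the task's configuration information.
    history: {url: page_html} of previously downloaded pages (assumed layout).
    is_available: callable deciding whether a cached page can be reused.
    """
    reuse, download = {}, []
    for url in url_cluster:
        page = history.get(url)
        if page is not None and is_available(page):
            reuse[url] = page      # parsed directly, yields first target page data
        else:
            download.append(url)   # missing or unusable, must be crawled again
    return reuse, download
```

Only the URLs in `download` consume crawler bandwidth, which is where the saving in system resources comes from.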
In a first possible implementation, the historical page availability detection unit includes:
a first judging module, configured to pre-parse the first historical page and judge whether the page tag that contains the target page data of the crawl task exists in the first historical page;
a second judging module, configured to judge, if the page tag exists in the first historical page, whether the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
an availability determining module, configured to determine that the first historical page is available if the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
an unavailability determining module, configured to determine that the first historical page is unavailable if the page tag does not exist in the first historical page, or if the attributes of the page tag in the first historical page differ from the attributes of the page tag that contains the target page data in the crawl task.
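The two checks in this implementation (the tag exists, and its attributes match) can be sketched with Python's standard `html.parser` module. The sketch assumes that "identical attributes" means the attribute name/value pairs are exactly equal. That interpretation, and the name `page_is_available`, are assumptions rather than something this passage states.

```python
from html.parser import HTMLParser


class _TagFinder(HTMLParser):
    """Collects the attributes of every start tag with a given name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()  # HTMLParser reports tag names in lowercase
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.found.append(dict(attrs))


def page_is_available(html, tag, expected_attrs):
    """Return True if `tag` exists in `html` with exactly `expected_attrs`."""
    finder = _TagFinder(tag)
    finder.feed(html)
    if not finder.found:
        return False  # the page tag does not exist in the historical page
    # Assumed rule: at least one occurrence carries identical attributes.
    return any(attrs == expected_attrs for attrs in finder.found)
```

For example, a task that extracts data from `<div id="price" class="val">` would reuse a cached page only if that exact tag is still present in it.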
In a second possible implementation, the configuration information further includes a historical page validity period, and the historical page availability detection unit includes:
a third judging module, configured to judge whether the first historical page is within the historical page validity period;
a fourth judging module, configured to pre-parse the first historical page if it is within the historical page validity period, and to judge whether the page tag that contains the target page data of the crawl task exists in the first historical page;
a fifth judging module, configured to judge, if the page tag exists in the first historical page, whether the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
an availability determining module, configured to determine that the first historical page is available if the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
an unavailability determining module, configured to determine that the first historical page is unavailable if the first historical page is not within the historical page validity period, or if the page tag does not exist in the first historical page, or if the attributes of the page tag in the first historical page differ from the attributes of the page tag that contains the target page data in the crawl task.
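The ordering of checks in this second implementation can be sketched as follows: the validity period is tested first, and the page is parsed only if it has not expired. The sketch assumes the validity period is a maximum page age measured from the download time. The parameter names and the `tag_check` callable are hypothetical.

```python
from datetime import datetime, timedelta


def page_is_usable(downloaded_at, validity, now, tag_check):
    """Return True if a cached page is within its validity period and passes the tag check.

    downloaded_at: datetime when the historical page was downloaded (assumed metadata).
    validity: timedelta taken from the task's configuration information.
    tag_check: zero-argument callable performing the tag and attribute comparison.
    """
    if now - downloaded_at > validity:
        return False  # expired: skip parsing entirely
    return bool(tag_check())


if __name__ == "__main__":
    t0 = datetime(2015, 5, 1, 12, 0, 0)
    print(page_is_usable(t0, timedelta(hours=1), t0 + timedelta(minutes=30), lambda: True))
```

Testing the validity period first means an expired page is rejected without the cost of pre-parsing it.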
Fig. 6 is a block diagram of another web crawler data crawling device according to an exemplary embodiment of this application. As shown in Fig. 6, the device includes:
a network bandwidth utilization processing unit U601, configured to periodically measure the network bandwidth utilization of the web crawlers, and to calculate and store the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;
a first historical page query unit U602, configured to query, when a crawl task is received, whether a first historical page exists in the historical crawl data according to the configuration information of the crawl task, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the historical crawl data includes the historical pages downloaded by executing historical crawl tasks, and the first historical page is a historical page corresponding to the configuration information of the new crawl task;
a historical page availability detection unit U603, configured to detect, if the first historical page exists, whether the first historical page is available;
a first parsing unit U604, configured to parse, if the first historical page is available, the available first historical page to obtain first target page data;
a second crawl task generating unit U605, configured to form, if the first historical page is unavailable, a second crawl task from the URL cluster of the unavailable first historical page and the URL cluster for which no first historical page exists in the historical crawl data;
a crawler availability calculation unit U606, configured to calculate the availability of each web crawler from the mean E(w) and variance D(w) of the network bandwidth utilization;
a second crawler determining unit U607, configured to determine, according to the availability of each web crawler, the web crawler that executes the second crawl task;
a second crawling unit U608, configured to have the web crawler that executes the second crawl task crawl second target page data, where the first target page data and the second target page data together form the target page data of the crawl task.
The network bandwidth utilization processing unit includes:
a network bandwidth utilization acquisition module, configured to periodically sample the network bandwidth utilization a preset number of times;
a mean and variance calculation module, configured to calculate and store the mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and the variance $D(w) = E\{[w - E(w)]^2\}$ of the network bandwidth utilization, where n is the preset number of samples and $w_i$ is the i-th sample.
In a first possible implementation, the historical page availability detection unit includes:
a first judging module, configured to pre-parse the first historical page and judge whether the page tag that contains the target page data of the crawl task exists in the first historical page;
a second judging module, configured to judge, if the page tag exists in the first historical page, whether the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
an availability determining module, configured to determine that the first historical page is available if the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
an unavailability determining module, configured to determine that the first historical page is unavailable if the page tag does not exist in the first historical page, or if the attributes of the page tag in the first historical page differ from the attributes of the page tag that contains the target page data in the crawl task.
In a second possible implementation, the configuration information further includes a historical page validity period, and the historical page availability detection unit includes:
a third judging module, configured to judge whether the first historical page is within the historical page validity period;
a fourth judging module, configured to pre-parse the first historical page if it is within the historical page validity period, and to judge whether the page tag that contains the target page data of the crawl task exists in the first historical page;
a fifth judging module, configured to judge, if the page tag exists in the first historical page, whether the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
an availability determining module, configured to determine that the first historical page is available if the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
an unavailability determining module, configured to determine that the first historical page is unavailable if the first historical page is not within the historical page validity period, or if the page tag does not exist in the first historical page, or if the attributes of the page tag in the first historical page differ from the attributes of the page tag that contains the target page data in the crawl task.
The crawler availability calculation unit includes:
a mean and variance extraction module, configured to extract the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within a preset time period before the current time;
a probability calculation module, configured to calculate the availability probability and the unavailability probability of each web crawler with a naive Bayes classification algorithm, based on the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within the preset time period.
The second crawler determining unit includes:
a comparison module, configured to compare the availability probability and the unavailability probability of each web crawler;
an availability determining module, configured to determine that a web crawler is available if its availability probability is greater than its unavailability probability, and otherwise to determine that the web crawler is unavailable;
a selection module, configured to select, from the available web crawlers, the web crawler that executes the task in descending order of availability probability.
For convenience of description, the above devices are described in terms of units divided by function. Of course, when this application is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.
The embodiments in this specification are described progressively; identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. In particular, the device and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments. The device and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
The above are only specific implementations of this application, provided so that those skilled in the art can understand or implement this application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (26)
1. A web crawler crawl task allocation method, characterized by comprising:
periodically measuring the network bandwidth utilization of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;
when a crawl task is received, calculating the availability of each web crawler from the mean E(w) and variance D(w) of the network bandwidth utilization;
determining, according to the availability, the web crawler that executes the task.
2. The web crawler crawl task allocation method according to claim 1, characterized in that periodically measuring the network bandwidth utilization of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, comprises:
periodically sampling the network bandwidth utilization a preset number of times;
calculating and storing the mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and the variance $D(w) = E\{[w - E(w)]^2\}$ of the network bandwidth utilization, where n is the preset number of samples and $w_i$ is the i-th sample.
3. The web crawler crawl task allocation method according to claim 1, characterized in that calculating the availability of each web crawler from the mean E(w) and variance D(w) of the network bandwidth utilization comprises:
extracting the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within a preset time period before the current time;
calculating the availability probability and the unavailability probability of each web crawler with a naive Bayes classification algorithm, based on the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within the preset time period.
4. The web crawler crawl task allocation method according to claim 3, characterized in that determining, according to the availability, the web crawler that executes the task comprises:
comparing the availability probability and the unavailability probability of each web crawler;
if the availability probability of a web crawler is greater than its unavailability probability, determining that the web crawler is available, and otherwise determining that the web crawler is unavailable;
selecting, from the available web crawlers, the web crawler that executes the task in descending order of availability probability.
5. A web crawler data crawling method, characterized by comprising:
when a crawl task is received, querying whether a first historical page exists in the historical crawl data according to the configuration information of the crawl task, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the historical crawl data includes the historical pages downloaded by executing historical crawl tasks, and the first historical page is a historical page corresponding to the configuration information of the new crawl task;
if the first historical page exists, detecting whether the first historical page is available;
if the first historical page is available, parsing the available first historical page to obtain first target page data;
if the first historical page is unavailable, downloading and parsing the pages corresponding to the URL cluster of the unavailable first historical page and to the URL cluster for which no first historical page exists in the historical crawl data, to obtain second target page data.
6. The web crawler data crawling method according to claim 5, characterized in that detecting whether the first historical page is available comprises:
pre-parsing the first historical page, and judging whether the page tag that contains the target page data of the crawl task exists in the first historical page;
if the page tag exists in the first historical page, judging whether the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
if the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task, determining that the first historical page is available;
if the page tag does not exist in the first historical page, or if the attributes of the page tag in the first historical page differ from the attributes of the page tag that contains the target page data in the crawl task, determining that the first historical page is unavailable.
7. The web crawler data crawling method according to claim 5, characterized in that the configuration information further includes a historical page validity period, and detecting whether the first historical page is available comprises:
judging whether the first historical page is within the historical page validity period;
if the first historical page is within the historical page validity period, pre-parsing the first historical page, and judging whether the page tag that contains the target page data of the crawl task exists in the first historical page;
if the page tag exists in the first historical page, judging whether the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
if the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task, determining that the first historical page is available;
if the first historical page is not within the historical page validity period, or if the page tag does not exist in the first historical page, or if the attributes of the page tag in the first historical page differ from the attributes of the page tag that contains the target page data in the crawl task, determining that the first historical page is unavailable.
8. A web crawler data crawling method, characterized by comprising:
periodically measuring the network bandwidth utilization of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;
when a crawl task is received, querying whether a first historical page exists in the historical crawl data according to the configuration information of the crawl task, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the historical crawl data includes the historical pages downloaded by executing historical crawl tasks, and the first historical page is a historical page corresponding to the configuration information of the new crawl task;
if the first historical page exists, detecting whether the first historical page is available;
if the first historical page is available, parsing the available first historical page to obtain first target page data;
if the first historical page is unavailable, forming a second crawl task from the URL cluster of the unavailable first historical page and the URL cluster for which no first historical page exists in the historical crawl data;
calculating the availability of each web crawler from the mean E(w) and variance D(w) of the network bandwidth utilization;
determining, according to the availability of each web crawler, the web crawler that executes the second crawl task;
crawling second target page data with the web crawler that executes the second crawl task,
where the first target page data and the second target page data together form the target page data of the crawl task.
9. The web crawler data crawling method according to claim 8, characterized in that periodically measuring the network bandwidth utilization of the web crawlers, and calculating and storing the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, comprises:
periodically sampling the network bandwidth utilization a preset number of times;
calculating and storing the mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and the variance $D(w) = E\{[w - E(w)]^2\}$ of the network bandwidth utilization, where n is the preset number of samples and $w_i$ is the i-th sample.
10. The web crawler data crawling method according to claim 8, characterized in that detecting whether the first historical page is available comprises:
pre-parsing the first historical page, and judging whether the page tag that contains the target page data of the crawl task exists in the first historical page;
if the page tag exists in the first historical page, judging whether the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
if the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task, determining that the first historical page is available;
if the page tag does not exist in the first historical page, or if the attributes of the page tag in the first historical page differ from the attributes of the page tag that contains the target page data in the crawl task, determining that the first historical page is unavailable.
11. The web crawler data crawling method according to claim 8, characterized in that the configuration information further includes a historical page validity period, and detecting whether the first historical page is available comprises:
judging whether the first historical page is within the historical page validity period;
if the first historical page is within the historical page validity period, pre-parsing the first historical page, and judging whether the page tag that contains the target page data of the crawl task exists in the first historical page;
if the page tag exists in the first historical page, judging whether the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
if the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task, determining that the first historical page is available;
if the first historical page is not within the historical page validity period, or if the page tag does not exist in the first historical page, or if the attributes of the page tag in the first historical page differ from the attributes of the page tag that contains the target page data in the crawl task, determining that the first historical page is unavailable.
12. The web crawler data crawling method according to claim 8, characterized in that calculating the availability of each web crawler from the mean E(w) and variance D(w) of the network bandwidth utilization comprises:
extracting the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within a preset time period before the current time;
calculating the availability probability and the unavailability probability of each web crawler with a naive Bayes classification algorithm, based on the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within the preset time period.
13. The web crawler data crawling method according to claim 12, characterized in that determining, according to the availability, the web crawler that executes the task comprises:
comparing the availability probability and the unavailability probability of each web crawler;
if the availability probability of a web crawler is greater than its unavailability probability, determining that the web crawler is available, and otherwise determining that the web crawler is unavailable;
selecting, from the available web crawlers, the web crawler that executes the task in descending order of availability probability.
14. A web crawler crawl task allocation device, characterized by comprising:
a network bandwidth utilization processing unit, configured to periodically measure the network bandwidth utilization of the web crawlers, and to calculate and store the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;
a crawler availability calculation unit, configured to calculate, when a crawl task is received, the availability of each web crawler from the mean E(w) and variance D(w) of the network bandwidth utilization;
a first crawler determining unit, configured to determine, according to the availability, the web crawler that executes the task.
15. The web crawler crawl task allocation device according to claim 14, characterized in that the network bandwidth utilization processing unit comprises:
a network bandwidth utilization acquisition module, configured to periodically sample the network bandwidth utilization a preset number of times;
a mean and variance calculation module, configured to calculate and store the mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and the variance $D(w) = E\{[w - E(w)]^2\}$ of the network bandwidth utilization, where n is the preset number of samples and $w_i$ is the i-th sample.
16. The web crawler crawl task allocation device according to claim 14, characterized in that the crawler availability calculation unit comprises:
a mean and variance extraction module, configured to extract the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within a preset time period before the current time;
a probability calculation module, configured to calculate the availability probability and the unavailability probability of each web crawler with a naive Bayes classification algorithm, based on the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler within the preset time period.
17. The web crawler crawl task allocation device according to claim 16, characterized in that the first crawler determining unit comprises:
a comparison module, configured to compare the availability probability and the unavailability probability of each web crawler;
an availability determining module, configured to determine that a web crawler is available if its availability probability is greater than its unavailability probability, and otherwise to determine that the web crawler is unavailable;
a selection module, configured to select, from the available web crawlers, the web crawler that executes the task in descending order of availability probability.
18. A web crawler data crawling device, characterized by comprising:
a first historical page query unit, configured to query, when a crawl task is received, whether a first historical page exists in the historical crawl data according to the configuration information of the crawl task, where the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the historical crawl data includes the historical pages downloaded by executing historical crawl tasks, and the first historical page is a historical page corresponding to the configuration information of the new crawl task;
a historical page availability detection unit, configured to detect, if the first historical page exists, whether the first historical page is available;
a first parsing unit, configured to parse, if the first historical page is available, the available first historical page to obtain first target page data;
a first crawling unit, configured to download and parse, if the first historical page is unavailable, the pages corresponding to the URL cluster of the unavailable first historical page and to the URL cluster for which no first historical page exists in the historical crawl data, to obtain second target page data.
19. The web crawler data crawling device according to claim 18, characterized in that the historical page availability detection unit comprises:
a first judging module, configured to pre-parse the first historical page and judge whether the page tag that contains the target page data of the crawl task exists in the first historical page;
a second judging module, configured to judge, if the page tag exists in the first historical page, whether the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
an availability determining module, configured to determine that the first historical page is available if the attributes of the page tag in the first historical page are identical to the attributes of the page tag that contains the target page data in the crawl task;
an unavailability determining module, configured to determine that the first historical page is unavailable if the page tag does not exist in the first historical page, or if the attributes of the page tag in the first historical page differ from the attributes of the page tag that contains the target page data in the crawl task.
20. The web crawler data capture device as claimed in claim 18, characterized in that the configuration information further includes a history page effective time, and the history page availability detection unit comprises:
a third judging module, configured to judge whether the first history page is within the history page effective time;
a fourth judging module, configured to, if the first history page is within the history page effective time, pre-parse the first history page and judge whether the page tag containing the target page data in the crawl task exists in the first history page;
a fifth judging module, configured to judge, if the page tag exists in the first history page, whether the attribute of the page tag in the first history page is identical to the attribute of the page tag containing the target page data in the crawl task;
an availability determining module, configured to determine that the first history page is available if the attribute of the page tag in the first history page is identical to the attribute of the page tag containing the target page data in the crawl task;
an unavailability determining module, configured to determine that the first history page is unavailable if the first history page is not within the history page effective time, or if the page tag does not exist in the first history page, or if the attribute of the page tag in the first history page differs from the attribute of the page tag containing the target page data in the crawl task.
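The ordering in this claim (effective time first, pre-parsing only afterwards) can be sketched as follows. This is an illustration under assumptions: the download timestamp of the history page is known, the effective time is a duration, and the tag check is passed in as a callable. The names `within_effective_time`, `check_history_page` and `tag_check` are hypothetical.

```python
from datetime import datetime, timedelta


def within_effective_time(downloaded_at, effective_time, now):
    """True if the page was downloaded no longer ago than the effective time."""
    return now - downloaded_at <= effective_time


def check_history_page(downloaded_at, effective_time, now, tag_check):
    """Apply the time check first; run the tag/attribute check only if it passes.

    tag_check is a zero-argument callable returning True when the page tag
    exists with identical attributes (assumed to be supplied by the caller).
    """
    if not within_effective_time(downloaded_at, effective_time, now):
        return False  # expired pages are unavailable without being parsed
    return bool(tag_check())
```

Passing the tag check as a callable keeps the comparatively expensive pre-parse from running on pages that have already expired.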
21. A web crawler data capture device, characterized by comprising:
a network bandwidth utilization processing unit, configured to periodically detect the network bandwidth utilization of the web crawlers, and to calculate and store the mean E(w) and variance D(w) of the network bandwidth utilization of each web crawler, where w is the network bandwidth utilization;
a first history page query unit, configured to, upon receiving a crawl task, query the historical crawl data, according to the configuration information of the crawl task, for whether a first history page exists, wherein the configuration information includes the uniform resource locator (URL) cluster of the new crawl task, the historical crawl data includes the history pages downloaded when executing historical crawl tasks, and the first history page is a history page corresponding to the configuration information of the new crawl task;
a history page availability detection unit, configured to detect whether the first history page is available if the first history page exists;
a first parsing unit, configured to, if the first history page is available, parse the available first history page to obtain first target page data;
a second crawl task generation unit, configured to form a second crawl task from the URL cluster corresponding to the unavailable first history page, if the first history page is unavailable, and from the URL cluster for which no first history page exists in the historical crawl data;
a crawler availability calculation unit, configured to calculate the availability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w);
a second crawler determination unit, configured to determine, according to the availability of each web crawler, the web crawler that executes the second crawl task;
a second capture unit, configured to have the web crawler executing the second crawl task capture second target page data, wherein the first target page data and the second target page data together form the target page data of the crawl task.
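How a crawl task might be split into reusable history pages and a second crawl task can be sketched as below. The sketch assumes the URL cluster is a plain list of URLs, the historical crawl data is a dictionary from URL to stored page, and availability is decided by a caller-supplied function; `plan_crawl`, `history` and `is_available` are hypothetical names, not taken from the patent.

```python
def plan_crawl(task_urls, history, is_available):
    """Split a crawl task into reusable URLs and URLs for the second crawl task.

    task_urls:    list of URLs in the new crawl task (stands in for the URL cluster)
    history:      dict mapping URL -> previously downloaded page
    is_available: callable(page) -> bool, the history page availability check
    """
    reusable = []      # parsed directly to obtain the first target page data
    second_task = []   # must be downloaded again by a web crawler
    for url in task_urls:
        page = history.get(url)
        if page is not None and is_available(page):
            reusable.append(url)
        else:
            # Either no history page exists or the stored one is unavailable.
            second_task.append(url)
    return reusable, second_task
```

Only the URLs in `second_task` consume crawler bandwidth, which is the resource saving the abstract describes.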
22. The web crawler data capture device as claimed in claim 21, characterized in that the network bandwidth utilization processing unit comprises:
a network bandwidth utilization acquisition module, configured to periodically collect the network bandwidth utilization a preset number of times;
a mean and variance calculation module, configured to calculate and store the network bandwidth utilization mean $E(w) = \frac{1}{n}\sum_{i=1}^{n} w_i$ and variance $D(w) = E\{[w - E(w)]^2\}$, where n is the preset number of times and $w_i$ is the i-th collected value.
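The mean and variance over n collected samples can be computed as in the following sketch. It assumes the samples are utilization values between 0 and 1 and uses the population variance (division by n), matching D(w) = E{[w − E(w)]^2}; the function name `bandwidth_stats` is hypothetical.

```python
def bandwidth_stats(samples):
    """Return (E(w), D(w)) for a list of bandwidth utilization samples."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    mean = sum(samples) / n
    # Population variance: the average squared deviation from the mean.
    variance = sum((w - mean) ** 2 for w in samples) / n
    return mean, variance
```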
23. The web crawler data capture device as claimed in claim 21, characterized in that the history page availability detection unit comprises:
a first judging module, configured to pre-parse the first history page and judge whether the page tag containing the target page data in the crawl task exists in the first history page;
a second judging module, configured to judge, if the page tag exists in the first history page, whether the attribute of the page tag in the first history page is identical to the attribute of the page tag containing the target page data in the crawl task;
an availability determining module, configured to determine that the first history page is available if the attribute of the page tag in the first history page is identical to the attribute of the page tag containing the target page data in the crawl task;
an unavailability determining module, configured to determine that the first history page is unavailable if the page tag does not exist in the first history page, or if the attribute of the page tag in the first history page differs from the attribute of the page tag containing the target page data in the crawl task.
24. The web crawler data capture device as claimed in claim 21, characterized in that the configuration information further includes a history page effective time, and the history page availability detection unit comprises:
a third judging module, configured to judge whether the first history page is within the history page effective time;
a fourth judging module, configured to, if the first history page is within the history page effective time, pre-parse the first history page and judge whether the page tag containing the target page data in the crawl task exists in the first history page;
a fifth judging module, configured to judge, if the page tag exists in the first history page, whether the attribute of the page tag in the first history page is identical to the attribute of the page tag containing the target page data in the crawl task;
an availability determining module, configured to determine that the first history page is available if the attribute of the page tag in the first history page is identical to the attribute of the page tag containing the target page data in the crawl task;
an unavailability determining module, configured to determine that the first history page is unavailable if the first history page is not within the history page effective time, or if the page tag does not exist in the first history page, or if the attribute of the page tag in the first history page differs from the attribute of the page tag containing the target page data in the crawl task.
25. The web crawler data capture device as claimed in claim 21, characterized in that the crawler availability calculation unit comprises:
a mean and variance extraction module, configured to extract the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within a predetermined time period before the current time;
a probability calculation module, configured to calculate, using a naive Bayes classification algorithm, the availability probability and the unavailability probability of each web crawler according to the network bandwidth utilization mean E(w) and variance D(w) of each web crawler within the predetermined time period.
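The claim names naive Bayes classification but does not specify the likelihood model or the training data in this section. The sketch below is therefore only one possible instance: a Gaussian naive Bayes over the two features E(w) and D(w), trained on labelled past observations. The labelled history, the variance floor `var_floor`, and the function names `fit` and `availability_probabilities` are all assumptions made for illustration.

```python
import math


def _gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def fit(samples, labels, var_floor=1e-6):
    """Fit per-class priors and per-feature Gaussians.

    samples: list of (E_w, D_w) tuples; labels: list of bools (True = available).
    """
    model = {}
    for label in (True, False):
        rows = [s for s, l in zip(samples, labels) if l == label]
        if not rows:
            raise ValueError("both classes need at least one training sample")
        prior = len(rows) / len(samples)
        stats = []
        for j in range(2):
            col = [r[j] for r in rows]
            m = sum(col) / len(col)
            # The floor avoids a zero variance when all values coincide.
            v = sum((x - m) ** 2 for x in col) / len(col) + var_floor
            stats.append((m, v))
        model[label] = (prior, stats)
    return model


def availability_probabilities(model, e_w, d_w):
    """Return (P(available | E, D), P(unavailable | E, D))."""
    log_scores = {}
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (m, v) in zip((e_w, d_w), stats):
            score += _gaussian_log_pdf(x, m, v)
        log_scores[label] = score
    # Normalise in log space for numerical stability.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(weights.values())
    return weights[True] / total, weights[False] / total
```

The two returned probabilities sum to one, so comparing them is equivalent to checking whether the availability probability exceeds 0.5.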
26. The web crawler data capture device as claimed in claim 25, characterized in that the second crawler determination unit comprises:
a comparison module, configured to compare the availability probability and the unavailability probability of each web crawler;
an availability determining module, configured to determine that a web crawler is available if its availability probability is greater than its unavailability probability, and otherwise to determine that the web crawler is unavailable;
a selection module, configured to select, from the available web crawlers, the web crawlers that execute the task in descending order of availability probability.
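The comparison and selection steps of this claim can be sketched as follows, assuming the probabilities have already been computed for each crawler. The function name `select_crawlers`, the crawler identifiers and the `count` parameter are hypothetical.

```python
def select_crawlers(probabilities, count=1):
    """Pick up to `count` crawlers, most likely available first.

    probabilities: dict mapping crawler id -> (p_available, p_unavailable)
    """
    # A crawler is available only if its availability probability is larger.
    available = [
        (crawler_id, p_avail)
        for crawler_id, (p_avail, p_unavail) in probabilities.items()
        if p_avail > p_unavail
    ]
    # Descending order of availability probability.
    available.sort(key=lambda item: item[1], reverse=True)
    return [crawler_id for crawler_id, _ in available[:count]]
```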
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510227531.9A CN106202108B (en) | 2015-05-06 | 2015-05-06 | Web crawlers grabs method for allocating tasks and device and data grab method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106202108A (en) | 2016-12-07 |
CN106202108B (en) | 2019-09-06 |
Family
ID=57459100
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510227531.9A Active CN106202108B (en) | 2015-05-06 | 2015-05-06 | Web crawlers grabs method for allocating tasks and device and data grab method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106202108B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107800684A (en) * | 2017-09-20 | 2018-03-13 | 贵州白山云科技有限公司 | A kind of low frequency reptile recognition methods and device |
CN108345615A (en) * | 2017-01-23 | 2018-07-31 | 阿里巴巴集团控股有限公司 | A kind of dispensing of page link and launch method of adjustment and system |
CN108920683A (en) * | 2018-07-12 | 2018-11-30 | 郑州云海信息技术有限公司 | A kind of method, apparatus and storage medium of cloud computing platform downloading external resource |
CN109471966A (en) * | 2018-10-30 | 2019-03-15 | 中译语通科技股份有限公司 | A kind of method and system of automatic acquisition target data source |
CN110333980A (en) * | 2019-05-24 | 2019-10-15 | 深圳壹账通智能科技有限公司 | The test method and device of network crawler system, storage medium, electronic equipment |
CN111092921A (en) * | 2018-10-24 | 2020-05-01 | 北大方正集团有限公司 | Data acquisition method, device and storage medium |
CN111444407A (en) * | 2020-03-26 | 2020-07-24 | 桂林理工大学 | Automatic extraction method and system for page list information of web crawler |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107580052B (en) * | 2017-09-07 | 2020-04-10 | 翼果(深圳)科技有限公司 | Self-evolution network self-adaptive crawler method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102880698A (en) * | 2012-09-21 | 2013-01-16 | 新浪网技术(中国)有限公司 | Method and device for determining caught website |
CN102945268A (en) * | 2012-10-25 | 2013-02-27 | 北京腾逸科技发展有限公司 | Method and system for excavating comments on characteristics of product |
CN103514301A (en) * | 2013-10-24 | 2014-01-15 | 深圳市同洲电子股份有限公司 | Method and system for scheduling tasks of distributed network crawlers |
CN103530392A (en) * | 2013-10-22 | 2014-01-22 | 北京奇虎科技有限公司 | Method and device for determining capture flows |
CN103530390A (en) * | 2013-10-22 | 2014-01-22 | 北京奇虎科技有限公司 | Webpage crawling method and device |
CN103559219A (en) * | 2013-10-18 | 2014-02-05 | 北京京东尚科信息技术有限公司 | Distributed web crawler capture task dispatching method, dispatching-side device and capture nodes |
CN103559083A (en) * | 2013-10-11 | 2014-02-05 | 北京奇虎科技有限公司 | Web crawl task scheduling method and task scheduler |
CN103605764A (en) * | 2013-11-26 | 2014-02-26 | Tcl集团股份有限公司 | Web crawler system and web crawler multitask executing and scheduling method |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108345615A (en) * | 2017-01-23 | 2018-07-31 | 阿里巴巴集团控股有限公司 | A kind of dispensing of page link and launch method of adjustment and system |
CN107800684A (en) * | 2017-09-20 | 2018-03-13 | 贵州白山云科技有限公司 | A kind of low frequency reptile recognition methods and device |
CN107800684B (en) * | 2017-09-20 | 2018-09-18 | 贵州白山云科技有限公司 | A kind of low frequency reptile recognition methods and device |
CN108920683A (en) * | 2018-07-12 | 2018-11-30 | 郑州云海信息技术有限公司 | A kind of method, apparatus and storage medium of cloud computing platform downloading external resource |
CN111092921A (en) * | 2018-10-24 | 2020-05-01 | 北大方正集团有限公司 | Data acquisition method, device and storage medium |
CN109471966A (en) * | 2018-10-30 | 2019-03-15 | 中译语通科技股份有限公司 | A kind of method and system of automatic acquisition target data source |
CN109471966B (en) * | 2018-10-30 | 2022-07-15 | 中译语通科技股份有限公司 | Method and system for automatically acquiring target data source |
CN110333980A (en) * | 2019-05-24 | 2019-10-15 | 深圳壹账通智能科技有限公司 | The test method and device of network crawler system, storage medium, electronic equipment |
WO2020238131A1 (en) * | 2019-05-24 | 2020-12-03 | 深圳壹账通智能科技有限公司 | Web crawler system testing method and apparatus, storage medium, and electronic device |
CN111444407A (en) * | 2020-03-26 | 2020-07-24 | 桂林理工大学 | Automatic extraction method and system for page list information of web crawler |
CN111444407B (en) * | 2020-03-26 | 2023-05-16 | 桂林理工大学 | Automatic extraction method and system for page list information of web crawlers |
Also Published As
Publication number | Publication date |
---|---|
CN106202108B (en) | 2019-09-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||