CN103544278A

CN103544278A - Method and equipment for identifying website capturing flow quota

Info

Publication number: CN103544278A
Application number: CN201310500682.8A
Authority: CN
Inventors: 魏少俊
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2013-10-22
Filing date: 2013-10-22
Publication date: 2014-01-29
Anticipated expiration: 2033-10-22
Also published as: CN103544278B

Abstract

The invention discloses a method and equipment for determining a website capture flow quota. The method includes: acquiring visited data of a target website to be captured; determining a bearable capture flow of the target website according to the visited data; acquiring a webpage quality distribution of webpages in the target website; determining a task flow for capturing the target website according to the webpage quality distribution; and determining a flow quota for capturing webpages on the target website according to the bearable capture flow and the task flow. With this method, the flow quota for capturing webpages on a target website can be allocated sensibly when a search engine crawler program captures the site's webpages, conflict between the crawler program and the captured website is reduced, and the capture behavior of the crawler program is reasonably balanced against the update requirements of the search engine.

Description

Method and apparatus for determining a website capture flow quota

Technical field

The present invention relates to the technical field of search engines, and in particular to a method and apparatus for determining a website capture flow quota.

Background technology

A search engine is a kind of internet information platform. It collects a large amount of webpage information from the internet, processes it, and builds an information database and an index database; a user enters a query term at the entry point the search engine provides and obtains the search results returned for that term. As search engine technology has developed and matured, the services it provides have become increasingly complete, and search engines have become a common and convenient tool for finding information on the vast internet.

To download webpages from the internet, analyze their data, and build an index, a search engine typically uses a utility program that captures webpages, commonly called a "crawler program" or "spider". Because new webpages are constantly produced on the internet and existing webpages are constantly updated, the crawler program must work continuously so that the search engine has access to up-to-date web data. To provide better search results, a search engine's crawler program generally aims to include new webpages and updated existing webpages as quickly as possible. However, webpage resources reside on website hosts across the network, and capturing them inevitably consumes the service resources of those hosts, such as hardware and software processing resources and bandwidth. If the capture task exceeds what a website host can tolerate, normal access by the website's users is affected and the crawler program's capture behavior becomes unfriendly to the website; in serious cases it can cause website responses to time out or even crash the website's server. Moreover, to protect their stability, websites usually monitor crawler access and restrict, or even block, crawler programs that behave in an unfriendly way. Once a crawler program is restricted or blocked, the search engine's capture efficiency drops, and it may be unable to update or download the website's webpage resources at all, which ultimately harms the search service.

Meanwhile, in the prior art, the flow or frequency at which a crawler program may capture a website is generally set manually. Although this reduces conflict between the search engine's crawler program and the captured website, web data is not updated as fully as it could be, so the capture behavior of the crawler program and the need to update website data are not reasonably balanced.

Summary of the invention

In view of the above problems, the present invention is proposed to provide equipment for determining a website capture flow quota, and a corresponding method, that overcome the above problems or at least partially solve them.

According to one aspect of the present invention, a method for determining a website capture flow quota is provided, comprising:

Obtaining visited data of a target website to be captured;

Determining a bearable capture flow of the target website according to the visited data;

Obtaining a webpage quality distribution of webpages in the target website;

Determining a task flow for capturing the target website according to the webpage quality distribution of the webpages in the target website;

Determining, according to the bearable capture flow of the target website and the task flow for capturing the target website, a flow quota for capturing webpages on the target website.

Optionally, obtaining the visited data of the target website to be captured comprises:

Determining the visited data of the target website according to the search engine's access statistics for the target website.

Optionally, determining the bearable capture flow of the target website according to the visited data comprises:

Determining a total bearable access volume of the target website according to the visited data;

Determining the bearable capture flow of the target website according to the total bearable access volume and a preset capture pressure coefficient.

Optionally, determining the total bearable access volume of the target website according to the visited data comprises:

Determining the total bearable access volume of the target website according to the search engine's access statistics for the target website, the market share of the search engine, the volume of direct user visits, and the redundant flow of the website.

Optionally, obtaining the webpage quality distribution of webpages in the target website comprises:

Determining a score for each webpage according to the PageRank of the webpage in the target website and/or the link depth of the webpage;

Normalizing the scores of a plurality of webpages in the target website to obtain the quality distribution corresponding to each webpage.

Optionally, obtaining the webpage quality distribution of webpages in the target website comprises:

Obtaining the webpage quality distribution of all webpages in the target website;

Determining the task flow for capturing the target website according to the webpage quality distribution of the webpages in the target website comprises:

Obtaining the sum of the webpage quality distributions of all webpages in the target website, and determining the task flow for capturing the target website according to that sum.

Optionally, the method further comprises:

Obtaining one or more task scale factors;

Determining the task flow for capturing the target website according to the sum of the webpage quality distributions of all webpages in the target website comprises:

Determining the task flow for capturing the target website according to the product of the sum of the webpage quality distributions and the one or more task scale factors.

Optionally, obtaining the one or more task scale factors comprises:

Obtaining the ratio of the number of webpages to be captured in the target website to the total number of webpages in the target website;

And/or,

Obtaining the ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website.

Optionally, obtaining the ratio of the number of webpages to be captured in the target website to the total number of webpages in the target website comprises:

Obtaining the ratio of the number of webpages found in the capture history to have been updated, and/or the number of webpages newly produced in the target website, to the total number of webpages in the target website.

Optionally, obtaining the ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website comprises:

Obtaining and comparing, from the capture history of the target website, the information fingerprints of the captured webpages;

Obtaining, according to the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints, as the ratio of the number of non-duplicate webpages to the total number of webpages in the target website.

Optionally, the method further comprises:

Determining a unit time coefficient according to the total task time for capturing the target website;

Determining the task flow for capturing the target website according to the product of the sum of the webpage quality distributions, the one or more task scale factors, and the unit time coefficient.

Optionally, the method further comprises:

When the task flow is greater than the bearable capture flow and the difference between the two is greater than a preset threshold, adjusting the task flow by adjusting the task scale factor and/or the unit time coefficient, until the task flow is less than or equal to the bearable capture flow or the difference between the two is less than the preset threshold.

Optionally, determining the flow quota for capturing webpages on the target website according to the bearable capture flow of the target website and the task flow for capturing the target website comprises:

When the task flow is greater than the bearable capture flow and the difference between the two is less than the preset threshold, determining the task flow as the flow quota for capturing webpages on the target website.

According to another aspect of the present invention, equipment for determining a website capture flow quota is provided, comprising:

A website visited-data acquiring unit, adapted to obtain visited data of a target website to be captured;

A website bearing-capacity determining unit, adapted to determine a bearable capture flow of the target website according to the visited data;

A webpage quality distribution acquiring unit, adapted to obtain a webpage quality distribution of webpages in the target website;

A task flow acquiring unit, adapted to determine a task flow for capturing the target website according to the webpage quality distribution of the webpages in the target website;

A flow quota determining unit, adapted to determine, according to the bearable capture flow of the target website and the task flow for capturing the target website, a flow quota for capturing webpages on the target website.

Optionally, the website visited-data acquiring unit is adapted to:

Optionally, the website bearing-capacity determining unit comprises:

An access volume determining subunit, adapted to determine a total bearable access volume of the target website according to the visited data;

The website bearing-capacity determining unit is adapted to determine the bearable capture flow of the target website according to the total bearable access volume and a preset capture pressure coefficient.

Optionally, the access volume determining subunit is adapted to:

Optionally, the webpage quality distribution acquiring unit is adapted to:

Optionally, the webpage quality distribution acquiring unit comprises:

A webpage quality distribution acquiring subunit, adapted to obtain the webpage quality distribution of all webpages in the target website;

The task flow acquiring unit comprises:

A task flow acquiring subunit, adapted to obtain the sum of the webpage quality distributions of all webpages in the target website, and to determine the task flow for capturing the target website according to that sum.

Optionally, the equipment further comprises:

A task scale factor acquiring unit, adapted to obtain one or more task scale factors;

The task flow acquiring subunit is adapted to:

Optionally, the task scale factor acquiring unit is adapted to:

And/or,

Optionally, the task scale factor acquiring unit is adapted to:

Optionally, the equipment further comprises:

A unit time coefficient acquiring unit, adapted to determine a unit time coefficient according to the total task time for capturing the target website;

The task flow acquiring subunit is adapted to:

Optionally, the equipment further comprises:

A task flow adjusting unit, adapted to adjust the task flow, when the task flow is greater than the bearable capture flow and the difference between the two is greater than a preset threshold, by adjusting the task scale factor and/or the unit time coefficient, until the task flow is less than or equal to the bearable capture flow or the difference between the two is less than the preset threshold.

Optionally, the flow quota determining unit is adapted to:

The method for determining a website capture flow quota according to the present invention can determine, from the visited data of a target website to be captured, the bearable capture flow that the target website can withstand when a search engine crawler program captures it; can determine, from the webpage quality distribution of webpages in the target website, the task flow for capturing the target website; and can then determine, from the bearable capture flow and the task flow, the flow quota for capturing webpages on the target website. This solves the problem of a crawler program consuming excessive website resources through unrestricted capturing. Website data is captured effectively within the capture pressure the website allows, which reduces conflict between the search engine's crawler program and the captured website, and the capture behavior of the crawler program is reasonably balanced against the search engine's update requirements.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:

Fig. 1 shows a flowchart of a webpage capture method according to an embodiment of the present invention;

Fig. 2 shows a flowchart of a method for determining a website capture flow quota according to an embodiment of the present invention;

Fig. 3 shows a flowchart of a method for determining a capture flow according to an embodiment of the present invention;

Fig. 4 shows a flowchart of a method for determining a capture flow quota for a sub-channel of a website according to an embodiment of the present invention;

Fig. 5 shows a schematic diagram of webpage capture equipment according to an embodiment of the present invention;

Fig. 6 shows a schematic diagram of equipment for determining a website capture flow quota according to an embodiment of the present invention;

Fig. 7 shows a schematic diagram of equipment for determining a capture flow according to an embodiment of the present invention;

Fig. 8 shows a schematic diagram of equipment for determining a capture flow quota for a sub-channel of a website according to an embodiment of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.

For ease of description, the parameters and their meanings are first defined as shown in Table 1:

Table 1

Embodiment 1

Referring to Fig. 1, a flowchart of the webpage capture method provided by this embodiment of the present invention, the method may comprise the following steps:

S110: obtaining a dynamic flow quota value for capturing webpages on a target website;

While a crawler program captures webpages in a target website, the crawler program's capture flow or frequency on that website usually needs to be limited to some degree, so that unrestricted capturing of the same website does not disrupt normal access to it. The dynamic flow quota value is one such limit on the crawler program's capture flow on the target website. It can be understood as the limit, applied while the crawler program carries out a capture task, on the flow captured from one website per unit time; for example, a dynamic flow quota value of 3,000,000/day. In step S110, the dynamic flow quota value for capturing webpages on the target website is obtained.

The dynamic flow quota value for capturing webpages on the target website can be obtained as follows:

First, the visited data of the target website is obtained, and the bearable capture flow of the target website is determined according to the visited data. The webpage quality distribution of webpages in the target website is obtained, and the task flow for capturing the target website is determined according to it. The dynamic flow quota value for capturing webpages on the target website is then determined according to the bearable capture flow of the target website and the task flow for capturing the target website.

The visited data of the target website can be determined according to the search engine's access statistics for the target website. When determining the bearable capture flow according to the visited data, the total bearable access volume of the target website can first be determined from the visited data, and the bearable capture flow then determined from the total bearable access volume and a preset capture pressure coefficient. Specifically, the total bearable access volume can be determined jointly from the access statistics the search engine has collected for the target website, the market share of the search engine, the volume of direct user visits, and the redundant flow of the website; multiplying it by the preset capture pressure coefficient gives the bearable capture flow of the target website.

The webpage quality distribution q_i of a webpage in the target website can be understood as a score of that webpage's quality. It can be determined from the PageRank of the webpage and/or the link depth of the webpage: a score is first determined for each webpage according to its PageRank and/or link depth, and the scores of a plurality of webpages in the target website are then normalized to obtain the quality distribution corresponding to each webpage. Normalization maps the quality scores of the webpages in the target website into the interval (0, 1].
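The scoring and normalization described above can be sketched as follows. This is only an illustration: the text does not give a scoring formula, so `page_score` (PageRank divided by one plus link depth) and the max-based scaling in `normalize_scores` are assumptions, chosen so that positive scores land in (0, 1].

```python
from typing import List


def page_score(pagerank: float, link_depth: int) -> float:
    """Hypothetical score: higher PageRank and shallower link depth score higher."""
    if pagerank <= 0 or link_depth < 0:
        raise ValueError("pagerank must be positive and link_depth non-negative")
    return pagerank / (1 + link_depth)


def normalize_scores(scores: List[float]) -> List[float]:
    """Divide every positive score by the largest one, giving values in (0, 1]."""
    if not scores or min(scores) <= 0:
        raise ValueError("scores must be a non-empty list of positive numbers")
    top = max(scores)
    return [s / top for s in scores]


# Three pages of one site, given as (PageRank, link depth); values are made up.
pages = [(8.0, 0), (4.0, 1), (2.0, 3)]
quality = normalize_scores([page_score(pr, depth) for pr, depth in pages])
```

Dividing by the maximum is one simple way to keep the best page at exactly 1 while no page reaches 0; other normalizations would serve equally well as long as the result stays in (0, 1].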

For the search engine, the webpage quality distributions of all webpages in the target website can be obtained and summed, and the task flow for capturing the target website determined according to that sum. Specifically, one or more task scale factors can be obtained, such as the ratio of the number of webpages to be captured in the target website to the total number of webpages in the target website, and/or the ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website. The task flow for capturing the target website is then determined according to the product of the sum of the webpage quality distributions and the one or more task scale factors.

The ratio of the number of webpages to be captured to the total number of webpages in the target website can be obtained as the ratio of the number of webpages found in the capture history to have been updated, and/or the number of webpages newly produced in the target website, to the total number of webpages in the target website. The ratio of the number of non-duplicate webpages to the total number of webpages in the target website can be obtained by collecting and comparing, from the capture history of the target website, the information fingerprints of the captured webpages; according to the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints is taken as the ratio of the number of non-duplicate webpages to the total number of webpages in the target website.
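The fingerprint-based ratio can be sketched as below. The text does not say how an information fingerprint is computed, so using a SHA-256 digest of the page body is an assumption made for illustration, as are the names `fingerprint` and `non_duplicate_ratio`.

```python
import hashlib
from typing import Iterable


def fingerprint(content: bytes) -> str:
    """Hypothetical information fingerprint: SHA-256 digest of the page body."""
    return hashlib.sha256(content).hexdigest()


def non_duplicate_ratio(pages: Iterable[bytes]) -> float:
    """Distinct fingerprints divided by all fingerprints in the capture history."""
    prints = [fingerprint(page) for page in pages]
    if not prints:
        raise ValueError("capture history is empty")
    return len(set(prints)) / len(prints)


# Four captured pages, one of which repeats an earlier page byte for byte.
history = [b"<html>a</html>", b"<html>b</html>", b"<html>a</html>", b"<html>c</html>"]
ratio = non_duplicate_ratio(history)
```

An exact digest only detects byte-identical pages; a production crawler would more likely use a near-duplicate fingerprint, which is outside what the text specifies.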

In addition, a unit time coefficient can be determined according to the total task time for capturing the target website. The task flow for capturing the target website can then be determined according to the product of the sum of the webpage quality distributions, the one or more task scale factors, and the unit time coefficient.
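A minimal sketch of this product follows. It assumes the task flow equals the product itself, and that the unit time coefficient is the reciprocal of the total task time in days; the text only says the task flow is determined "according to" the product and does not define the coefficient, so both choices, and the function names, are assumptions.

```python
import math
from typing import Sequence


def unit_time_coefficient(total_task_days: float) -> float:
    """Hypothetical coefficient: spread the whole task evenly over its duration."""
    if total_task_days <= 0:
        raise ValueError("total_task_days must be positive")
    return 1.0 / total_task_days


def task_flow(quality: Sequence[float],
              scale_factors: Sequence[float],
              unit_time_coeff: float = 1.0) -> float:
    """Sum of page quality values times every scale factor times the time coefficient."""
    return sum(quality) * math.prod(scale_factors) * unit_time_coeff


# Quality values in (0, 1], two scale factors (share of pages to capture,
# share of non-duplicate pages), and a task spread over two days.
flow_per_day = task_flow([1.0, 0.5, 0.5], [0.5, 0.5], unit_time_coefficient(2))
```

With no scale factors supplied, `math.prod` returns 1, so the function degrades to the plain sum of quality values.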

When the task flow for capturing the target website is weighed against the bearable capture flow of the website, and the task flow is greater than the bearable capture flow by more than a preset threshold, the task flow can be adjusted by adjusting the task scale factor and/or the unit time coefficient, until the task flow is less than or equal to the bearable capture flow or the difference between the two is less than the preset threshold. This achieves dynamic adjustment of the dynamic flow quota value. When the task flow is greater than the bearable capture flow and the difference between the two is less than the preset threshold, the task flow can be determined as the dynamic flow quota value for capturing webpages on the target website.
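The adjustment loop can be sketched as follows. The text does not say how the task scale factor is adjusted, so multiplying it by a fixed `step` of 0.9 per round is an assumption, as are the function name and the `max_rounds` guard.

```python
def adjust_task_flow(base_flow: float, scale: float, bearable: float,
                     threshold: float, step: float = 0.9,
                     max_rounds: int = 1000) -> float:
    """Shrink the scale factor until the task flow fits the site's bearable flow.

    base_flow is the task flow before the scale factor is applied. The loop
    stops once the task flow is at most `bearable`, or exceeds it by no more
    than `threshold`.
    """
    if not 0 < step < 1:
        raise ValueError("step must be between 0 and 1")
    flow = base_flow * scale
    for _ in range(max_rounds):
        if not (flow > bearable and flow - bearable > threshold):
            return flow
        scale *= step  # hypothetical adjustment rule: cut the factor by 10%
        flow = base_flow * scale
    raise RuntimeError("task flow did not converge within max_rounds")


quota = adjust_task_flow(base_flow=1000.0, scale=1.0, bearable=500.0, threshold=50.0)
```

In this run the result ends slightly above the bearable flow but within the threshold, which is the case the text accepts as the flow quota.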

In addition, in another method, the dynamic flow quota value for capturing webpages on the target website can be obtained from the task flow for capturing the target website alone. In that case, the webpage quality distribution of webpages in the target website is obtained first; the task flow for capturing the target website is determined according to that distribution; and the dynamic flow quota value for capturing webpages on the target website is determined according to the task flow.

For a more specific description of how each step of Embodiment 1 is implemented, refer to the content on determining a website capture flow quota in steps S210 to S240 of Embodiment 2, which can also be cross-referenced with the corresponding parts of the method for determining a capture flow in Embodiment 3.

S120: capturing the webpages on the target website according to the dynamic flow quota value.

Once the dynamic flow quota value for capturing webpages on the target website has been determined, capturing on the target website is carried out with the flow limited to that value. Of course, if the capture demand is far greater than the bearable capture flow of the website, the capture demand can be simplified, or the data to be captured can be screened more strictly, before capturing.

The webpage capture method provided by Embodiment 1 of the present invention has been described in detail above. With this method, the dynamic flow quota value for capturing webpages on the target website is obtained and the webpages on the target website are captured according to it, so that website data is captured effectively within the capture pressure the website allows, reducing conflict between the search engine's crawler program and the captured website.

Embodiment 2

Referring to Fig. 2, a flowchart of the method for determining a website capture flow quota provided by Embodiment 2 of the present invention, the method may comprise the following steps:

S210: obtaining visited data of a target website to be captured;

First, the visited data of the target website to be captured is obtained. This visited data can be the website's click volume for one day, such as parameter C in Table 1. Once the visited data has been obtained, the access bearing capacity of the target website to be captured can be inferred from it.

The visited data of the target website can be obtained from several sources, for example from published website ranking data. In addition, users usually browse webpages with browser software, so the webpages that browser users visit can be counted and, together with the browser's current market share, used to determine the access bearing capacity of the website. For example, if browser statistics show 1,500,000 daily visits to a website and the browser's current market share is 15%, the total daily visits to the website can be determined to be 10,000,000, and the access bearing capacity of the website is at least 10,000,000 visits.

The visited data of the target website can also be determined according to the search engine's access statistics for the target website. This is because users often reach webpages through a search engine, following the search results it provides; the search engine can count the webpages accessed in this way and so count the clicks through which websites are reached via the search engine. The visited data of the target website is then determined from the access statistics the search engine has gathered for it. Specifically, the volume of visits to the target website made through the search engine, divided by the market share of the search engine, can be taken as the visited data of the website. For example, if the statistics show 1,500,000 daily visits to a website via search engine redirect and the search engine's current market share is 15%, the total daily visits to the website can be determined to be 10,000,000, and the access bearing capacity of the website is at least 10,000,000 visits.
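The arithmetic in this example can be written out as a short sketch. The function name `estimate_daily_visits` is hypothetical; the figures are the ones given in the text.

```python
def estimate_daily_visits(observed_visits: float, market_share: float) -> float:
    """Scale visits seen through one channel up by that channel's market share."""
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return observed_visits / market_share


# 1,500,000 daily visits via a search engine with a 15% market share.
total_daily_visits = estimate_daily_visits(1_500_000, 0.15)  # about 10,000,000
```

The same division applies to the browser-statistics example, with the browser's market share in place of the search engine's.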

In addition, several methods or sources can be combined to obtain more accurate visited data for the target website. For example, combining the two methods above, that is, combining client browser statistics with search engine statistics, yields both the visits that reach the target website via search engine redirect and those that do not; together they give more accurate visited data for the target website. It should be noted that a website's visited data is generally expressed as the number of visits the website receives per unit time. The description above uses daily visits, but other time units, such as visits per hour, can be used depending on the application; the present invention places no restriction on this.

S220: determining the bearable capture flow of the target website according to the visited data;

Once the visited data of the target website has been obtained, the bearable capture flow of the target website can be determined from it. The bearable capture flow of a website can be understood as the crawler capture flow the website can withstand per unit time. The unit time likewise depends on the application; the method is described below with one day as the unit time.

In practice, the volume of visits to the website per unit time could be used directly as the website's bearable capture flow. However, a website's service is usually intended mainly for browsing by users, and using the obtained per-unit-time visit volume directly as the bearable capture flow may exceed the upper limit the website can bear for crawler capturing. The visited data of the target website is therefore multiplied by a preset capture pressure coefficient to obtain the bearable capture flow of the target website. The preset capture pressure coefficient can be a percentage coefficient whose value range is (0, 1). For example, if the daily visits to a website via search engine redirect are 1,500,000 and the preset capture pressure coefficient is 30%, the bearable capture flow finally determined for the target website is 450,000 per day.
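The example above reduces to a single multiplication, sketched here with a range check on the coefficient. The function name `bearable_capture_flow` is hypothetical; the figures are those in the text.

```python
def bearable_capture_flow(visits_per_day: float, pressure_coeff: float) -> float:
    """Visited data multiplied by a preset capture pressure coefficient in (0, 1)."""
    if not 0 < pressure_coeff < 1:
        raise ValueError("pressure_coeff must lie strictly between 0 and 1")
    return visits_per_day * pressure_coeff


# 1,500,000 daily visits via search engine redirect, coefficient of 30%.
daily_capture_limit = bearable_capture_flow(1_500_000, 0.30)  # about 450,000 per day
```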

The preset crawl pressure coefficient can be set flexibly according to the source of the visited data. In the example above, the visited data is the daily number of visits by jumping from the search engine, which is in fact only a part of the total number of visits the website can bear; the preset crawl pressure coefficient can therefore be set to a relatively high value. If visited data that is more accurate and closer to the total number of visits the website can actually bear is available, the preset crawl pressure coefficient can be set to a relatively low value.

In another implementation, the bearable access total of the target website can first be determined according to its visited data; the crawl-bearing flow of the target website is then determined according to the bearable access total and the preset crawl pressure coefficient. To obtain a bearable access total that is reasonably close to the actual situation, a sensible approach is to obtain the visited data of the target website from as many sources as possible. For example, the volume of visits in which users access the website directly can be obtained from browser statistics, and the volume of visits in which users reach the website by jumping from search engine results can be obtained from search engine statistics; these, together with the market share of the search engine and the redundancy flow of the website, jointly determine the bearable access total of the target website. The redundancy flow of a website refers to its spare capacity for bearing visits; it can be obtained from long-term monitoring of, for example, the website's visit peaks, or from empirical values. For instance, suppose the daily number of visits to a website by jumping from a search engine is 1,500,000 and the market share of this search engine is 15%. Suppose further that half of the website's traffic consists of direct user visits, that is, direct visits and visits by jumping from search engines are roughly equal, and that the website has 50% redundancy flow. The bearable access total of this website per unit time (per day) can then be determined as:

1,500,000 visits/day ÷ 15% ÷ 50% × 150% = 30,000,000 visits/day

The bearable access total of this website is thus 30,000,000 visits per day. If the preset crawl pressure coefficient is 5%, the crawl-bearing flow of this website can be determined as:

30,000,000 visits/day × 5% = 1,500,000 visits/day

In this example, because the visited data of the target website was obtained from multiple sources, the resulting bearable access total is closer to the total traffic the website can actually bear, and the preset crawl pressure coefficient is therefore set to a relatively low value: 5% here, compared with 30% in the previous example.
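The worked example for the bearable access total can be sketched as below. The function and parameter names are hypothetical; the arithmetic follows the example figures (15% market share, half of the traffic arriving through search engines, 50% redundancy flow), and other combinations of sources are equally possible.

```python
def bearable_access_total(search_visits: float,
                          market_share: float,
                          search_fraction: float,
                          redundancy: float) -> float:
    """Estimate the total visits per unit time a website can bear.

    search_visits   -- visits per unit time that jump from one search engine
    market_share    -- market share of that search engine, in (0, 1]
    search_fraction -- fraction of all traffic that arrives via search engines
    redundancy      -- redundancy flow as a fraction of current traffic
    """
    all_search_visits = search_visits / market_share   # scale up to all search engines
    all_visits = all_search_visits / search_fraction   # add direct user visits
    return all_visits * (1 + redundancy)                # add spare capacity


total = bearable_access_total(1_500_000, 0.15, 0.5, 0.5)
bearing_flow = total * 0.05  # preset crawl pressure coefficient of 5%
```

With the figures above, `total` comes to about 30,000,000 visits per day and `bearing_flow` to about 1,500,000 visits per day, matching the two calculations in the text.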

As in Table 1, C denotes the number of clicks on the target website in one day, which can be the number of times all pages of the website were clicked in search results on that day. The crawl-bearing flow of the target website can be understood as a function of the parameter C, that is, it can be written as f(C).

S230: obtaining the web page quality distribution of the webpages in the target website;

Through steps S210 and S220, the crawl-bearing flow of the target website has been obtained. This crawl-bearing flow can be understood as a predicted value, derived from the visited data of the website, of the crawling the website can bear from the crawler program. On the other hand, the task situation of the crawler program crawling the website, namely the task flow of crawling the target website, also needs to be known. To obtain the task flow, the embodiment of the present invention relies mainly on the web page quality distribution of the webpages in the target website, the parameter q_i in Table 1. Here, the web page quality distribution q_i can be understood as a score of the quality of each webpage in the target website. The web page quality distribution can be determined from the PageRank of a webpage and/or its link depth: a score is determined for each webpage according to its PageRank and/or link depth, and the scores of the multiple webpages in the target website are then normalized to obtain the quality distribution corresponding to each webpage. Normalization maps the quality scores of the webpages in the target website into the interval (0, 1]. For example, suppose the target website contains the following webpages, with the PageRank (also called the PR value of the webpage, a positive integer from 1 to 10) and the link depth (depth, a positive integer given by the webpage's link depth) of each webpage as shown in Table 2:

Table 2

Webpage	PR value	PR ÷ 10 × 0.7	depth	1/depth × 0.3	Quality distribution
Webpage 1	10	0.7	1	0.3	1
Webpage 2	7	0.49	3	0.1	0.59
Webpage 3	7	0.49	3	0.1	0.59
Webpage 4	6	0.42	5	0.06	0.48
Webpage 5	8	0.56	2	0.15	0.71

Because both the PageRank and the link depth are used to determine the quality distribution, different weights can be assigned to them in the calculation, as in Table 2 (0.7 for the PageRank and 0.3 for the link depth). Since the PR value is a positive integer from 1 to 10 and the link depth is a positive integer, normalizing and weighting them gives the quality distribution of each of webpages 1 to 5. Of course, in practice the quality distribution can also be obtained in other ways, for example by using only the PR value or only the link depth. A rating module can also be preset in the browser so that users score each webpage while browsing; the search engine then collects the users' scores, aggregates them, and normalizes them to obtain the quality distribution. User scoring can likewise be combined with the PageRank and link depth method above to give yet another way of obtaining the quality distribution; its implementation is similar to the example above and is not repeated here.

In addition, depending on the crawl task, the web page quality distribution obtained may cover all webpages in the target website or only those target webpages that need to be crawled; this is described in detail in the subsequent steps.
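The weighted scoring used in Table 2 can be sketched as follows. The function name and the default weights (0.7 and 0.3, taken from the table headers) are illustrative assumptions; an actual implementation may weight or normalize differently.

```python
def quality_score(pr: int, depth: int,
                  pr_weight: float = 0.7, depth_weight: float = 0.3) -> float:
    """Combine a PR value (1..10) and a link depth (>= 1) into a score in (0, 1]."""
    if not 1 <= pr <= 10:
        raise ValueError("PR value must be an integer from 1 to 10")
    if depth < 1:
        raise ValueError("link depth must be a positive integer")
    # PR / 10 and 1 / depth each lie in (0, 1]; the weights sum to 1.
    return pr_weight * pr / 10 + depth_weight / depth


# (PR value, depth) pairs for webpages 1 to 5 in Table 2.
pages = [(10, 1), (7, 3), (7, 3), (6, 5), (8, 2)]
scores = [round(quality_score(pr, depth), 2) for pr, depth in pages]
```

Rounded to two decimals, `scores` reproduces the last column of Table 2.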

S240: determining the task flow of crawling the target website according to the web page quality distribution of the webpages in the target website;

Next, the task flow of crawling the target website can be determined according to the web page quality distribution of the webpages in the target website. Specifically, the sum of the web page quality distribution of the webpages in the target website is obtained first, and the task flow is then determined according to this sum.

Depending on the task strategy or crawl target of the crawler program, the web page quality distribution may be obtained over different ranges of webpages.

The crawl demand of the crawler program can come from two sources. The first is webpages that were crawled in the past and have since been updated: the search engine has already crawled the webpages of the website, some of them have been updated again, and the search engine needs to crawl those updated webpages again. In Table 1, m denotes the number of webpages of the website currently included in the index; if the proportion of these that have been updated is a, the number of previously crawled webpages that have been updated is (a × m). The second is newly discovered webpages that have not yet been crawled, whose number is the parameter n in Table 1. Combining these two sources, the number of webpages the crawl task needs to crawl can be:

(a×m)+n

When the crawl task targets both kinds of webpages of the website, the web page quality distribution q_i of all webpages in the target website can be obtained, and the sum of q_i over all webpages in the target website is then computed, that is:

\sum_{i=1}^{m} q_i

The task flow of crawling the target website is determined according to this sum. The sum can be used directly as the task flow. Alternatively, one or more task scale factors can be obtained, and the task flow is determined according to the product of the sum and the one or more task scale factors. Depending on its nature, each task scale factor plays a different role in determining the task flow.

The one or more task scale factors can be the following:

the ratio of the number of webpages to be crawled in the target website to the total number of webpages in the target website;

and/or,

the ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website.

The first task scale factor, the ratio of the number of webpages to be crawled to the total number of webpages in the target website, can be expressed as:

\frac{(a \times m) + n}{m}

When crawling the target website, the crawler program may crawl only the previously crawled webpages that have been updated and the newly discovered webpages that have not yet been crawled. The ratio of these two kinds of webpages to the total number of webpages can serve as a task scale factor and be multiplied by the sum of the web page quality distribution q_i of all webpages in the target website:

\frac{(a \times m) + n}{m} \times \sum_{i=1}^{m} q_i

The product is used as the task flow of crawling the target website and reflects more accurately the flow required by the task. In an actual crawl task, updated historical webpages and newly produced webpages do not necessarily both exist in the target website. When obtaining this task scale factor, the ratio to the total number of webpages in the target website can therefore be computed, according to the actual situation, from the number of previously crawled webpages that have been updated and/or the number of newly produced webpages.

In addition, the ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website, the parameter u in Table 1, can be obtained as another task scale factor. When crawling the target website, the crawler program can usually identify duplicate pages and crawl each of them only once. This task scale factor therefore filters duplicate pages out of the task flow, making the flow required by the task more accurate. With this factor added, the task flow of crawling the target website can be determined according to:

\frac{(a \times m) + n}{m} \times u \times \sum_{i=1}^{m} q_i

Specifically, when obtaining the ratio of the number of non-duplicate webpages to the total number of webpages in the target website (the parameter u in Table 1), webpage information fingerprinting can be used: the information fingerprints of the webpages crawled in the crawl history of the target website are obtained and compared, and, according to the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints is taken as the ratio of the number of non-duplicate webpages to the total number of webpages in the target website.
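The fingerprint comparison can be sketched as below. The embodiment does not specify a fingerprinting algorithm; a SHA-256 digest of the page content is used here purely as a stand-in assumption, and it detects only exact duplicates. The function names are hypothetical.

```python
import hashlib
from typing import Iterable


def fingerprint(content: str) -> str:
    """Stand-in information fingerprint: SHA-256 of the page content (an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def non_duplicate_ratio(pages: Iterable[str]) -> float:
    """Ratio u of distinct fingerprints to all fingerprints in the crawl history."""
    fingerprints = [fingerprint(page) for page in pages]
    if not fingerprints:
        return 0.0
    return len(set(fingerprints)) / len(fingerprints)


# Hypothetical crawl history: four crawled pages, two of them identical.
u = non_duplicate_ratio(["page A", "page B", "page A", "page C"])
```

In this hypothetical history three of the four fingerprints are distinct, so `u` is 0.75.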

In addition, a crawl task on the target website usually takes some time to complete, whereas the flow required by the task is usually the flow allocated to crawling the target website per unit time. A unit time coefficient determined from the time required by the crawl task can therefore be introduced. If the time required for the crawler program to complete the task of crawling the target website is T (for example, the task takes 10 days), then the task flow of crawling the target website in each unit time (for example, each day) can be determined according to:

\frac{1}{T} \times \frac{(a \times m) + n}{m} \times u \times \sum_{i=1}^{m} q_i

Because the crawl-bearing flow of the target website is also described as flow per unit time, the task flow can be expressed in the same unit for ease of comparison; for example, both can be expressed in units of ten thousand visits per day.
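The per-unit-time task flow formula above can be sketched as follows. The parameter names a, m, n, u and T follow the symbols used in the text; the function name and the sample values are hypothetical.

```python
from typing import Sequence


def task_flow(quality: Sequence[float], a: float, m: int, n: int,
              u: float, T: float) -> float:
    """Task flow per unit time: (1/T) * ((a*m + n)/m) * u * sum(q_i).

    quality -- web page quality distribution q_i of the webpages
    a       -- proportion of included webpages that have been updated
    m       -- number of webpages currently included
    n       -- number of newly discovered, not yet crawled webpages
    u       -- ratio of non-duplicate webpages
    T       -- time needed to complete the crawl task, in unit times
    """
    if m <= 0 or T <= 0:
        raise ValueError("m and T must be positive")
    to_crawl_ratio = (a * m + n) / m
    return (1 / T) * to_crawl_ratio * u * sum(quality)


# Hypothetical inputs: the five scores from Table 2, 20% of pages updated,
# one new page, 90% non-duplicate pages, and a 10-day crawl task.
flow = task_flow([1.0, 0.59, 0.59, 0.48, 0.71], a=0.2, m=5, n=1, u=0.9, T=10)
```

Setting `u=1` and `T=1` reduces the expression to the earlier form that uses only the first task scale factor.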

S250: determining the flow quota for crawling webpages on the target website according to the crawl-bearing flow of the target website and the task flow of crawling the target website;

It should be noted that determining the crawl-bearing flow of the target website and determining the task flow of crawling the target website can be performed in any order: steps S210 and S220 can be performed first, followed by steps S230 and S240, or steps S230 and S240 first, followed by steps S210 and S220. In either order, both the crawl-bearing flow and the task flow are obtained.

When a crawler program crawls webpages, its crawl flow or frequency on a target website usually needs to be limited to some extent, so that unrestricted crawling of a single website does not interfere with normal access to that website; a flow quota is one such limit. The flow quota for crawling webpages on the target website can be understood as a limit on the crawl flow or frequency that the crawler program applies to a single website when performing a crawl task, for example limiting the flow on the target website to 3,000,000 per day. In the method provided in Embodiment 1 of the present invention, the flow quota for crawling webpages on the target website can be determined according to the crawl-bearing flow of the target website and the task flow of crawling the target website.

After the crawl-bearing flow of the target website and the task flow of crawling the target website have been obtained, the flow quota for crawling webpages on the target website can be determined from the two. Specifically, the two can be compared and the smaller one taken as the flow quota. For example, if the crawl-bearing flow of the target website is denoted f(C) and the task flow of crawling the target website is expressed as:

\frac{1}{T} \times \frac{(a \times m) + n}{m} \times u \times \sum_{i=1}^{m} q_i

then:

\min\left(f(C),\ \frac{1}{T} \times \frac{(a \times m) + n}{m} \times u \times \sum_{i=1}^{m} q_i\right)

can be taken as the flow quota for crawling webpages on the target website, where min compares two or more parameters and returns the smallest of them as the result.
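The comparison can be sketched in a few lines. The function name and the sample values are hypothetical; both inputs are assumed to be expressed in the same unit.

```python
def flow_quota(crawl_bearing_flow: float, task_flow: float) -> float:
    """Return the smaller of the crawl-bearing flow f(C) and the task flow."""
    return min(crawl_bearing_flow, task_flow)


# Hypothetical values in the same unit (for example, pages per day).
quota = flow_quota(crawl_bearing_flow=450_000, task_flow=300_000)
```

Here the task flow is the smaller value, so the crawler is given no more quota than the task actually needs.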

In addition, because the flow a website can bear usually has some elasticity, the task flow can be used as the flow quota when the crawl-bearing flow and the task flow of the target website differ only slightly. That is, when the task flow is greater than the crawl-bearing flow and the difference between them is less than a preset threshold, the task flow can be determined as the flow quota for crawling webpages on the target website. When the task flow is greater than the crawl-bearing flow and the difference is greater than the preset threshold, the task flow can be adjusted by adjusting the task scale factors and/or the unit time coefficient until the task flow is less than or equal to the crawl-bearing flow or the difference is less than the preset threshold. Adjusting the task scale factors in effect trims the crawl demand or screens the data to be crawled more strictly, while adjusting the unit time coefficient in effect adjusts the time over which the crawler program performs the task of crawling the target website.

The method for determining the website crawl flow quota provided by Embodiment 2 of the present invention has been described in detail above. With this method, the crawl-bearing flow that the target website can bear from the crawler program can be determined according to the visited data of the target website to be crawled; the task flow of crawling the target website can be determined according to the web page quality distribution of the webpages in the target website; and the flow quota for crawling webpages on the target website can then be determined according to the crawl-bearing flow and the task flow. This solves the problem of a crawler program occupying too many site resources through unrestricted crawling. The web data of a website is crawled effectively within the crawl pressure the website allows, which reduces conflict between the search engine's crawler program and the crawled website and strikes a reasonable balance between the crawling behavior of the crawler program and the update needs of the search engine.
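The elastic rule and one possible adjustment can be sketched as follows. All names are hypothetical. The text leaves open the case where the difference exactly equals the threshold; this sketch treats it as requiring adjustment. Only the unit time coefficient is adjusted here, by lengthening the task duration; adjusting the task scale factors would be an alternative.

```python
from typing import Optional


def elastic_flow_quota(bearing_flow: float, task_flow: float,
                       threshold: float) -> Optional[float]:
    """Return the task flow as quota if it fits or nearly fits, else None."""
    if task_flow <= bearing_flow:
        return task_flow
    if task_flow - bearing_flow < threshold:
        return task_flow  # slightly above, but within the elastic margin
    return None  # the task flow must be adjusted first


def stretch_task_duration(total_task_flow: float, bearing_flow: float,
                          threshold: float, days: int,
                          max_days: int = 365) -> int:
    """Lengthen the task duration until the per-day task flow is acceptable."""
    while days < max_days and total_task_flow / days - bearing_flow >= threshold:
        days += 1
    return days


# Hypothetical task: 1,500 units of flow in total, planned for 10 days,
# against a crawl-bearing flow of 100 per day and a threshold of 10.
new_days = stretch_task_duration(1500, bearing_flow=100, threshold=10, days=10)
quota = elastic_flow_quota(100, 1500 / new_days, threshold=10)
```

In this hypothetical case the duration grows from 10 to 14 days, at which point the daily task flow exceeds the crawl-bearing flow by less than the threshold and is accepted as the quota.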

Embodiment 3

Referring to Fig. 3, which is a flowchart of the method for determining crawl flow provided by Embodiment 3 of the present invention, the method can comprise the following steps.

S310: obtaining a task scale factor according to attribute features of the target website;

S320: determining the task flow of crawling the target website based on the task scale factor and the sum of the web page quality distribution in the target website.

The task scale factor obtained can be the ratio of the number of webpages to be crawled in the target website to the total number of webpages in the target website, and/or the ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website. The ratio of the number of webpages to be crawled to the total number of webpages can be obtained as the ratio of the number of previously crawled webpages that have been updated, and/or the number of newly produced webpages in the target website, to the total number of webpages in the target website. The ratio of the number of non-duplicate webpages to the total number of webpages can be obtained by obtaining and comparing, in the crawl history of the target website, the information fingerprints of the crawled webpages, and, according to the comparison result, taking the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints as the ratio of the number of non-duplicate webpages to the total number of webpages in the target website.

The task flow of crawling the target website can then be determined based on the product of the one or more task scale factors and the sum of the web page quality distribution in the target website. The sum of the web page quality distribution can be determined as follows: a score is determined for each webpage according to the PageRank of the webpage in the target website and/or the link depth of the webpage; the scores of the multiple webpages in the target website are normalized to obtain the quality distribution corresponding to each webpage; and the sum of the web page quality distribution is determined from the quality distribution obtained for each webpage.

In addition, a unit time coefficient can be determined according to the total task time for crawling the target website. When the task flow of crawling the target website is determined based on the task scale factor and the sum of the web page quality distribution in the target website, it can be determined according to the product of the sum of the web page quality distribution, the one or more task scale factors, and the unit time coefficient.

After the task flow of crawling the target website has been obtained, webpages on the target website can be crawled according to that task flow. When the determined task flow is too large, it can be adjusted by adjusting the task scale factors and/or the unit time coefficient, so that the task flow of crawling the target website falls within the range the target website can bear. Adjusting the task scale factors in effect trims the crawl demand or screens the data to be crawled more strictly, while adjusting the unit time coefficient in effect adjusts the time over which the crawler program performs the task of crawling the target website.

The method for determining crawl flow provided by Embodiment 3 of the present invention has been introduced above; for more specific implementations and application examples of the method, reference can be made to Embodiment 2. With this method, a task scale factor can be obtained according to attribute features of the target website, and the task flow of crawling the target website can be determined based on the task scale factor and the sum of the web page quality distribution in the target website. The crawl flow required when the crawler program crawls the website is thus determined accurately, and the web data of the website is crawled effectively, which reduces conflict between the search engine's crawler program and the crawled website.

Embodiment 4

The foregoing embodiments describe specific implementations taking a website as the starting point. Many websites, however, contain multiple sub-sites or sub-channels. In that case a sub-site or sub-channel can be regarded as an independent website and the methods provided by the embodiments of the present invention applied in a similar way: a channel bearing flow is obtained for each sub-site or sub-channel of the website, that is, the sub-channel bearing flow of each sub-channel is obtained from the visited data of that sub-channel, and the sub-channel task flow of each sub-channel is determined according to the web page quality distribution of the webpages in that sub-channel. The difference is that the crawl weight of each sub-channel is now determined according to the sub-channel bearing flow and the sub-channel task flow, and the channel quota of each sub-channel is determined from the flow quota of the whole target website together with the crawl weight of each sub-channel. Finally, the webpages in each sub-channel are crawled according to the channel quota corresponding to that sub-channel. This is described in detail below.

Referring to Fig. 4, which is a flowchart of the method for determining the crawl flow quota of sub-channels in a website provided by Embodiment 4 of the present invention, the method can comprise the following steps:

S410: obtaining the bearing flow of each sub-channel of the target website;

In a specific implementation, a sub-channel of the target website can be treated as an independent website, and the visited data of each sub-channel within the same target website, such as its user visit volume, can be counted separately. The crawl-bearing flow of a sub-channel can therefore be obtained in the same way as the crawl-bearing flow of the target website is obtained in steps S110 and S120 of Embodiment 1. The bearing flow of each sub-channel of the target website can also be obtained according to the visited data of each sub-channel; specifically, it can be obtained according to the visited data of each sub-channel as counted by the search engine. When determining the sub-channel bearing flow, the bearable access total of each sub-channel can be determined according to the visited data of that sub-channel, and the bearing flow of each sub-channel is then determined according to its bearable access total and a preset channel pressure coefficient. For the detailed implementation, see Embodiment 1 or Embodiment 2; it is not repeated here.

S420: determining the task flow of each sub-channel according to the web page quality distribution of the webpages in each sub-channel;

The sub-channel task flow for crawling a sub-channel is in fact a predicted value of the flow of the task of crawling that sub-channel, obtained from the previous crawl history and the web page quality. Before the sub-channel task flow is determined, the web page quality distribution of the webpages in each sub-channel can likewise be obtained first, in the same way as the web page quality distribution is obtained in step S230 in Embodiment 1. The task flow of each sub-channel can then be determined in the same way as the task flow of the target website is determined in step S140. For example, a score can be determined for each webpage in each sub-channel according to the PageRank of the webpage and/or its link depth; the scores of the multiple webpages in each sub-channel are normalized to obtain the quality distribution corresponding to each webpage; and the task flow of each sub-channel is determined according to the web page quality distribution obtained for the webpages in that sub-channel. For the specific implementation, see the introduction in Embodiment 1; it is not repeated here.

S430: calculating the crawl weight corresponding to each sub-channel according to the sub-channel bearing flow and the sub-channel task flow;

After the sub-channel bearing flow and the sub-channel task flow have been obtained, Embodiment 4 of the present invention can further calculate the crawl weight corresponding to each sub-channel. That is, for each sub-channel, the smaller of its bearing flow and its predicted sub-channel task flow is selected as a reference value, so that each sub-channel has one reference value. This reference value is not used directly as the flow quota of the sub-channel; instead, the weight of each sub-channel is first calculated from the reference values. For example, the reference values of all sub-channels can be added together, and the weight of each sub-channel equals the proportion of its own reference value in that sum. For example, for sub-channels 1, 2 and 3 with reference values n1, n2 and n3, the weight of sub-channel 1 is n1/(n1+n2+n3), the weight of sub-channel 2 is n2/(n1+n2+n3), and the weight of sub-channel 3 is n3/(n1+n2+n3).
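The weight calculation can be sketched as below. The function name, the sub-channel names and the flow values are hypothetical; both dictionaries are assumed to use the same keys and the same unit.

```python
from typing import Dict


def crawl_weights(bearing_flow: Dict[str, float],
                  task_flow: Dict[str, float]) -> Dict[str, float]:
    """Weight of each sub-channel: its reference value over the sum of all reference values."""
    # Reference value: the smaller of the bearing flow and the task flow.
    reference = {name: min(bearing_flow[name], task_flow[name])
                 for name in bearing_flow}
    total = sum(reference.values())
    if total == 0:
        return {name: 0.0 for name in reference}
    return {name: value / total for name, value in reference.items()}


# Hypothetical sub-channels with their bearing flows and task flows.
weights = crawl_weights(
    bearing_flow={"news": 100, "sports": 50, "blog": 30},
    task_flow={"news": 60, "sports": 80, "blog": 30},
)
```

The reference values here are 60, 50 and 30, so the weights are 60/140, 50/140 and 30/140 and sum to 1.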

S440: determining the quota of each sub-channel according to the total flow quota of the target website and the crawl weight of each sub-channel.

After the weight of each sub-channel has been calculated, the weight is multiplied by the total flow quota of the website to which the sub-channel belongs to obtain the quota of that sub-channel. For the total flow quota of the target website, see the description in Embodiment 1; it is not repeated here.

In a specific implementation, a unit time coefficient can also be determined according to the total task time for crawling each sub-channel; the product of the total flow quota of the target website, the weight of the sub-channel, and the unit time coefficient is then used as the sub-channel quota for crawling the corresponding sub-channel. Finally, the webpages in each sub-channel can be crawled according to the quota of that sub-channel.

With the method for determining the crawl flow quota of sub-channels in a website provided by Embodiment 4 of the present invention, the crawl weight corresponding to each sub-channel can be calculated according to the obtained sub-channel bearing flow and sub-channel task flow, and the quota of each sub-channel can be determined according to the total flow quota of the target website and the crawl weight of each sub-channel. While conflict between the search engine's crawler program and the crawled website is reduced, crawl flow is allocated to each sub-channel more reasonably, so that crawling is distributed more reasonably across the sub-channels of the target website.
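The allocation of the total flow quota to sub-channels can be sketched as follows. The function name and the sample weights are hypothetical, and the unit time coefficient defaults to 1 so that the plain weight-times-quota case is covered as well.

```python
from typing import Dict


def sub_channel_quotas(weights: Dict[str, float], total_quota: float,
                       unit_time_coefficient: float = 1.0) -> Dict[str, float]:
    """Quota of each sub-channel: total quota x weight x unit time coefficient."""
    return {name: total_quota * weight * unit_time_coefficient
            for name, weight in weights.items()}


# Hypothetical weights (summing to 1) and a total flow quota of 1,000.
quotas = sub_channel_quotas({"news": 0.5, "sports": 0.25, "blog": 0.25},
                            total_quota=1000)
```

With a unit time coefficient of 1 the sub-channel quotas add up to the total flow quota of the website, here 500, 250 and 250.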

Corresponding to the webpage crawling method provided by Embodiment 1 of the present invention, Embodiment 1 also provides webpage crawling equipment. Referring to Fig. 5, the equipment can comprise:

a dynamic flow quota value acquiring unit 510, adapted to obtain the dynamic flow quota value for crawling webpages on the target website;

a webpage crawling unit 520, adapted to crawl the webpages on the target website according to the dynamic flow quota value.

The dynamic flow quota value acquiring unit 510 can comprise:

a website visited data acquiring unit, adapted to obtain the visited data of the target website;

a website bearing capacity determining unit, adapted to determine the crawl-bearing flow of the target website according to the visited data;

a web page quality distribution acquiring unit, adapted to obtain the web page quality distribution of the webpages in the target website; and a task flow acquiring unit, adapted to determine the task flow of crawling the target website according to the web page quality distribution of the webpages in the target website;

In this implementation, the dynamic flow quota value acquiring unit 510 can determine the dynamic flow quota value for crawling webpages on the target website according to the crawl-bearing flow of the target website and the task flow of crawling the target website.

Specifically, the website visited data acquiring unit can determine the visited data of the target website according to the search engine's access statistics for the target website.

The website holding capacity determining unit can comprise:

A visit capacity determining subunit, suitable for determining the bearable access total of the target website according to the visited data;

The website holding capacity determining unit can then determine the capture bearing flow of the target website according to the bearable access total and a preset capture pressure coefficient.

Under this implementation, the visit capacity determining subunit can determine the bearable access total of the target website according to the search engine's access statistics for the target website, the market share of the search engine, the direct visit volume of users, and the website redundancy flow.
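As a minimal sketch of the two steps above, the following Python estimates the bearable access total and then scales it by the capture pressure coefficient. The text names the inputs but not the formula, so the combination used here (search-engine visits divided by market share, plus direct visits, plus redundancy flow) is an assumption, as are all function and parameter names.

```python
def bearable_access_total(se_visits, market_share, direct_visits, redundancy_flow):
    """Estimate how many visits the site can bear per unit time.

    ASSUMPTION: the text lists these four inputs but not how they combine.
    Here the search engine's visits are extrapolated to all search traffic
    via its market share, then direct visits and spare capacity are added.
    """
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return se_visits / market_share + direct_visits + redundancy_flow


def capture_bearing_flow(bearable_total, pressure_coefficient):
    """Share of the bearable access total that the crawler may consume."""
    if not 0 <= pressure_coefficient <= 1:
        raise ValueError("pressure_coefficient must be in [0, 1]")
    return bearable_total * pressure_coefficient


# Hypothetical numbers, for illustration only.
total = bearable_access_total(se_visits=1000, market_share=0.25,
                              direct_visits=500, redundancy_flow=200)
bearing = capture_bearing_flow(total, pressure_coefficient=0.125)
```

Keeping the pressure coefficient as a separate preset parameter lets an operator tighten or loosen the crawler's share without re-estimating the site's capacity.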

Specifically, the web page quality distribution acquiring unit can determine the score of each webpage according to the pagerank of the webpage in the target website and/or the link depth of the webpage, and then normalize the scores of the plurality of webpages in the target website to obtain the quality distribution corresponding to each webpage.
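The scoring and normalization can be sketched as follows. Both the scoring rule (pagerank discounted by link depth) and the normalization (dividing by the highest score so that every value lies in [0, 1]) are assumptions chosen for illustration; the text only states that pagerank and/or link depth are used and that scores are normalized.

```python
def page_score(pagerank, link_depth):
    """ASSUMED rule: higher pagerank and shallower link depth score higher."""
    return pagerank / (1 + link_depth)


def quality_distribution(pages):
    """Map each URL to a normalized quality value in [0, 1].

    pages: dict of url -> (pagerank, link_depth).
    ASSUMPTION: normalization divides by the maximum score.
    """
    scores = {url: page_score(pr, depth) for url, (pr, depth) in pages.items()}
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return {url: 0.0 for url in scores}
    return {url: score / top for url, score in scores.items()}


# Hypothetical site with three pages.
dist = quality_distribution({
    "/": (0.8, 0),
    "/a": (0.4, 1),
    "/a/b": (0.2, 3),
})
```

Normalizing by the maximum rather than by the sum keeps the later sum over all pages meaningful, since a sum of values that already add up to one would carry no information about site size.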

The web page quality distribution acquiring unit can comprise:

A web page quality distribution acquiring subunit, suitable for obtaining the web page quality distribution of all webpages in the target website.

Under this implementation, the task flow acquiring unit can comprise:

A task flow acquiring subunit, suitable for obtaining the sum of the web page quality distributions of all webpages in the target website, and determining the task flow for capturing the target website according to that sum.

Under this implementation, the equipment can also comprise a task scale factor acquiring unit, suitable for obtaining one or more task scale factors.

In this case, the task flow acquiring subunit is suitable for:

Determining the task flow for capturing the target website according to the product of the sum of the web page quality distributions and the one or more task scale factors.

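The task scale factor acquiring unit can obtain the ratio of the number of webpages to be captured in the target website to the total number of webpages in the target website;

And/or,

The ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website.

In a specific implementation, the task scale factor acquiring unit can obtain the ratio of the number of webpages updated in the capture history, and/or the number of webpages newly produced in the target website, to the total number of webpages in the target website.

In a specific implementation, the task scale factor acquiring unit can also obtain and compare the information fingerprints of captured webpages in the capture history of the target website; according to the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints is taken as the ratio of non-duplicate webpages to the total number of webpages in the target website.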

Under another implementation, the equipment can also comprise a unit time coefficient acquiring unit, suitable for determining a unit time coefficient according to the total task time for capturing the target website.

In this case, the task flow acquiring subunit can determine the task flow for capturing the target website according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit time coefficient.

In addition, the equipment can also comprise a task flow adjusting unit. When the task flow is greater than the capture bearing flow and the difference between the two is greater than a preset threshold, this unit adjusts the task flow by adjusting the task scale factor and/or the unit time coefficient, until the task flow is less than or equal to the capture bearing flow or the difference between the two is less than the preset threshold.

When the task flow is greater than the capture bearing flow and the difference between the two is less than the preset threshold, the dynamic flow quota value acquiring unit 510 can determine the task flow as the dynamic flow quota value for capturing webpages on the target website.

Alternatively, the dynamic flow quota value acquiring unit 510 can comprise:

A web page quality distribution acquiring unit, suitable for obtaining the web page quality distribution of the webpages in the target website;

A task flow acquiring unit, suitable for determining the task flow for capturing the target website according to the web page quality distribution of the webpages in the target website.

In this case, the dynamic flow quota value acquiring unit 510 can determine the dynamic flow quota value for capturing webpages on the target website according to the task flow for capturing the target website.

Corresponding to the method for determining a website capturing flow quota provided by Embodiment Two of the present invention, Embodiment Two also provides equipment for determining a website capturing flow quota. Referring to Fig. 6, the equipment can comprise:

A website visitation data acquiring unit 610, which obtains the visited data of the target website to be captured;

A website holding capacity determining unit 620, which determines the capture bearing flow of the target website according to the visited data;

A web page quality distribution acquiring unit 630, which obtains the web page quality distribution of the webpages in the target website;

A task flow acquiring unit 640, which determines the task flow for capturing the target website according to the web page quality distribution of the webpages in the target website;

A flow quota determining unit 650, which determines the flow quota for capturing webpages on the target website according to the capture bearing flow of the target website and the task flow for capturing the target website.

Under another implementation, the website visitation data acquiring unit 610 is suitable for:

Determining the visited data of the target website according to the search engine's access statistics for the target website.

In addition, the website holding capacity determining unit 620 can also comprise a visit capacity determining subunit, suitable for determining the bearable access total of the target website according to the visited data.

Under this implementation, the website holding capacity determining unit 620 can determine the capture bearing flow of the target website according to the bearable access total and a preset capture pressure coefficient.

The visit capacity determining subunit can also determine the bearable access total of the target website according to the search engine's access statistics for the target website, the market share of the search engine, the direct visit volume of users, and the website redundancy flow.

In practical applications, the web page quality distribution acquiring unit 630 can determine the score of each webpage according to the pagerank of the webpage in the target website and/or the link depth of the webpage;

It then normalizes the scores of the plurality of webpages in the target website to obtain the quality distribution corresponding to each webpage.

Under another implementation, the web page quality distribution acquiring unit 630 can also comprise:

A web page quality distribution acquiring subunit, which can obtain the web page quality distribution of all webpages in the target website;

In this case, the task flow acquiring unit 640 can comprise:

A task flow acquiring subunit, suitable for obtaining the sum of the web page quality distributions of all webpages in the target website, and determining the task flow for capturing the target website according to that sum.

Under this implementation, the equipment can also comprise a task scale factor acquiring unit, suitable for obtaining one or more task scale factors;

In this case, the task flow acquiring subunit can determine the task flow for capturing the target website according to the product of the sum of the web page quality distributions and the one or more task scale factors.

The task scale factor acquiring unit can obtain different task scale factors, for example:

The ratio of the number of webpages to be captured in the target website to the total number of webpages in the target website;

And/or,

The ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website.

Under this implementation, the task scale factor acquiring unit can obtain the ratio of the number of webpages updated in the capture history, and/or the number of webpages newly produced in the target website, to the total number of webpages in the target website.

The task scale factor acquiring unit can also obtain and compare the information fingerprints of captured webpages in the capture history of the target website; according to the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints is taken as the ratio of non-duplicate webpages to the total number of webpages in the target website.
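The fingerprint comparison can be illustrated with a short Python sketch. The text does not say how an information fingerprint is computed, so a SHA-1 digest of the page content is assumed here; the function names are likewise hypothetical.

```python
import hashlib


def fingerprint(content):
    """ASSUMPTION: the information fingerprint is a SHA-1 digest of the page text."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()


def non_duplicate_ratio(captured_pages):
    """Ratio of distinct fingerprints to total fingerprints in the capture history.

    captured_pages: iterable of page contents (strings) captured so far.
    Returns 0.0 for an empty history.
    """
    prints = [fingerprint(page) for page in captured_pages]
    if not prints:
        return 0.0
    return len(set(prints)) / len(prints)


# Hypothetical capture history in which one page body appears twice.
ratio = non_duplicate_ratio(["home", "news", "home", "about"])
```

An exact digest only detects byte-identical pages; a production crawler would more likely use a near-duplicate fingerprint, which this sketch does not attempt.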

Under another implementation, the equipment can also comprise a unit time coefficient acquiring unit, which determines a unit time coefficient according to the total task time for capturing the target website;

In this case, the task flow acquiring unit 640 can determine the task flow for capturing the target website according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit time coefficient.
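The product described above can be written out as a small Python sketch. The relation between total task time and the unit time coefficient is not given in the text; taking the coefficient as the reciprocal of the total task time is an assumption, as are the names used.

```python
from math import prod


def unit_time_coefficient(total_task_time):
    """ASSUMPTION: spread the task evenly, so the coefficient is 1 / total time."""
    if total_task_time <= 0:
        raise ValueError("total_task_time must be positive")
    return 1.0 / total_task_time


def task_flow(quality_values, scale_factors=(), time_coefficient=1.0):
    """Task flow = sum of quality values x product of scale factors x time coefficient.

    With no scale factors, prod(()) is 1, so the result reduces to the
    quality sum times the time coefficient.
    """
    return sum(quality_values) * prod(scale_factors) * time_coefficient


# Hypothetical inputs: three pages, two scale factors, a four-unit task window.
flow = task_flow([1.0, 0.25, 0.0625],
                 scale_factors=[0.5, 0.5],
                 time_coefficient=unit_time_coefficient(4))
```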

In addition, the equipment can also comprise a task flow adjusting unit. When the task flow is greater than the capture bearing flow and the difference between the two is greater than a preset threshold, this unit adjusts the task flow by adjusting the task scale factor and/or the unit time coefficient, until the task flow is less than or equal to the capture bearing flow or the difference between the two is less than the preset threshold.

When the task flow is greater than the capture bearing flow and the difference between the two is less than the preset threshold, the flow quota determining unit 650 can determine the task flow as the flow quota for capturing webpages on the target website.
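One way to picture the adjustment loop and the final quota decision is the sketch below. The text does not say how the factors are adjusted, so shrinking the unit time coefficient by a fixed ratio per round is an assumption, as is treating a task flow at or below the capture bearing flow as the quota; the names and the round limit are hypothetical.

```python
def determine_flow_quota(quality_sum, scale_factor, time_coefficient,
                         bearing_flow, threshold, shrink=0.9, max_rounds=1000):
    """Shrink the task flow until it fits the site, then return it as the quota.

    ASSUMPTIONS: only the unit time coefficient is adjusted, by a fixed
    ratio `shrink` per round; `max_rounds` guards against endless loops.
    Returns (quota, adjusted_time_coefficient).
    """
    if not 0 < shrink < 1:
        raise ValueError("shrink must be in (0, 1)")
    flow = quality_sum * scale_factor * time_coefficient
    rounds = 0
    # Keep adjusting while the task flow exceeds the bearing flow by more
    # than the preset threshold.
    while flow - bearing_flow > threshold and rounds < max_rounds:
        time_coefficient *= shrink
        flow = quality_sum * scale_factor * time_coefficient
        rounds += 1
    return flow, time_coefficient


# Hypothetical numbers: the raw task flow (1000) is twice the bearing flow.
quota, coeff = determine_flow_quota(quality_sum=1000, scale_factor=1.0,
                                    time_coefficient=1.0, bearing_flow=500,
                                    threshold=50, shrink=0.5)
```

The threshold leaves a tolerance band above the bearing flow, so the loop can stop as soon as the overshoot is small instead of forcing the task flow strictly below the site's capacity.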

The equipment for determining a website capturing flow quota provided by the embodiment of the present invention has been described in detail above. With this equipment, the website visitation data acquiring unit 610 obtains the visited data of the target website to be captured; the website holding capacity determining unit 620 determines the capture bearing flow of the target website according to the visited data; the web page quality distribution acquiring unit 630 obtains the web page quality distribution of the webpages in the target website; the task flow acquiring unit 640 determines the task flow for capturing the target website according to the web page quality distribution of the webpages in the target website; and the flow quota determining unit 650 determines the flow quota for capturing webpages on the target website according to the capture bearing flow of the target website and the task flow for capturing the target website. In this way, before the capturing task starts, both the holding capacity of the website and the flow required by the capturing task can be predicted accurately, which solves the problem of crawler programs capturing without limit and occupying too many website resources. The webpage data of the website is captured effectively within the capturing pressure that the website allows, which reduces the conflict between the search engine's crawler program and the captured website.

Corresponding to the method for determining capturing flow provided by Embodiment Three of the present invention, Embodiment Three also provides equipment for determining capturing flow. Referring to Fig. 7, the equipment can comprise:

A task scale factor acquiring unit 710, which obtains a task scale factor according to attribute features of the target website;

A task flow acquiring unit 720, suitable for determining the task flow for capturing the target website based on the task scale factor and the sum of the web page quality distributions in the target website.

As the task scale factor, the task scale factor acquiring unit 710 can obtain the ratio of the number of webpages to be captured in the target website to the total number of webpages in the target website;

And/or,

The ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website.

Under this implementation, the task scale factor acquiring unit 710 can obtain the ratio of the number of webpages updated in the capture history, and/or the number of webpages newly produced in the target website, to the total number of webpages in the target website.

Alternatively, the task scale factor acquiring unit can obtain and compare the information fingerprints of captured webpages in the capture history of the target website;

According to the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints is taken as the ratio of non-duplicate webpages to the total number of webpages in the target website.

Specifically, the task flow acquiring unit 720 can determine the task flow for capturing the target website based on the product of the one or more task scale factors and the sum of the web page quality distributions in the target website.

The sum of the web page quality distributions can be determined by the following units:

A score determining unit, which determines the score of each webpage according to the pagerank of the webpage in the target website and/or the link depth of the webpage;

A normalizing unit, suitable for normalizing the scores of the plurality of webpages in the target website to obtain the quality distribution corresponding to each webpage; and a summing unit, which determines the sum of the web page quality distributions according to the quality distribution obtained for each webpage.

Under another implementation, the equipment for determining capturing flow can also comprise a unit time coefficient acquiring unit, suitable for determining a unit time coefficient according to the total task time for capturing the target website.

In this case, the task flow acquiring unit 720 can determine the task flow for capturing the target website according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit time coefficient.

The equipment for determining capturing flow can also comprise:

A webpage capturing unit, which captures the webpages of the target website according to the task flow for capturing the target website.

With the above equipment for determining capturing flow, a task scale factor can be obtained according to attribute features of the target website, and the task flow for capturing the target website can be determined based on the task scale factor and the sum of the web page quality distributions in the target website. The capturing flow required when a crawler program captures the website is thus determined accurately, the webpage data of the website is captured effectively, and the conflict between the search engine's crawler program and the captured website is reduced.

Corresponding to the method for determining capturing flow quotas of website sub-channels provided by Embodiment Four of the present invention, Embodiment Four also provides equipment for determining capturing flow quotas of website sub-channels. Referring to Fig. 8, the equipment can comprise:

A channel holding capacity acquiring unit 810, which obtains the bearing flow of each sub-channel of the target website;

A channel task amount acquiring unit 820, which determines the task flow of each sub-channel according to the web page quality distribution of the webpages in that sub-channel;

A capture weight acquiring unit 830, which calculates the capture weight corresponding to each sub-channel according to the sub-channel bearing flow and the sub-channel task flow;

A quota determining unit 840, which determines the quota of each sub-channel according to the total flow quota of the target website and the capture weight of each sub-channel.

Specifically, the channel holding capacity acquiring unit 810 can obtain the bearing flow of each sub-channel of the target website according to the visited data of each sub-channel.

Under this implementation, the channel holding capacity acquiring unit 810 can obtain the bearing flow of each sub-channel according to the visited data of each sub-channel of the target website as counted by the search engine.

More specifically, the channel holding capacity acquiring unit can determine the bearable channel access total of each sub-channel of the target website according to the visited data of that sub-channel, and then determine the bearing flow of each sub-channel according to the bearable channel access total and a preset channel pressure coefficient.

Under another implementation, the channel task amount acquiring unit 820 can determine the score of each webpage in each sub-channel according to the pagerank of the webpage and/or the link depth of the webpage; it then normalizes the scores of the plurality of webpages in each sub-channel to obtain the quality distribution corresponding to each webpage, and determines the task flow of each sub-channel according to the web page quality distribution obtained for the webpages in that sub-channel.

In addition, the quota determining unit 840 can determine the capture bearing flow of the target website according to the website visitation data of the target website;

Determine the website task flow for capturing the target website according to the web page quality distribution of the webpages in the target website;

Determine the total flow quota for capturing webpages on the target website according to the capture bearing flow of the target website and the website task flow for capturing the target website; and,

Determine the quota of each sub-channel according to the total flow quota of the target website determined in the above steps and the capture weight of each sub-channel.

The equipment for determining capturing flow quotas of website sub-channels can also comprise:

A channel time coefficient determining unit, suitable for determining a channel unit time coefficient according to the total task time for capturing each sub-channel;

Under this implementation, the quota determining unit 840 can take the product of the total flow quota of the target website, the weight share of each sub-channel, and the channel unit time coefficient as the sub-channel quota for capturing the corresponding sub-channel.
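The allocation across sub-channels can be sketched in Python as follows. The text states that the capture weight is calculated from the sub-channel bearing flow and the sub-channel task flow but gives no formula, so using the smaller of the two as the weight is an assumption; the channel names and numbers are hypothetical.

```python
def capture_weights(bearing_flows, task_flows):
    """ASSUMED weight: the smaller of a sub-channel's bearing flow and task flow,
    so a channel is never weighted beyond what it needs or what it can bear."""
    return {ch: min(bearing_flows[ch], task_flows[ch]) for ch in bearing_flows}


def sub_channel_quotas(total_quota, weights, time_coefficients=None):
    """Quota = total quota x weight share x channel unit time coefficient.

    time_coefficients: optional dict of channel -> coefficient (default 1.0).
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        return {ch: 0.0 for ch in weights}
    coeffs = time_coefficients or {}
    return {
        ch: total_quota * w / total_weight * coeffs.get(ch, 1.0)
        for ch, w in weights.items()
    }


# Hypothetical site with two sub-channels.
weights = capture_weights({"news": 100, "blog": 50}, {"news": 60, "blog": 200})
quotas = sub_channel_quotas(220, weights)
```

With all coefficients at 1.0 the sub-channel quotas add up to the total quota, so the site-level limit is respected regardless of how the weights are chosen.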

In addition, the equipment for determining capturing flow quotas of website sub-channels can also comprise:

A channel webpage capturing unit, which can capture the webpages in each sub-channel according to the quota of that sub-channel.

With the equipment for determining capturing flow quotas of website sub-channels provided by Embodiment Four of the present invention, the capture weight corresponding to each sub-channel can be calculated according to the obtained sub-channel bearing flow and sub-channel task flow, and the quota of each sub-channel can be determined according to the total flow quota of the target website and the capture weight of each sub-channel. This reduces the conflict between the search engine's crawler program and the captured website, and distributes the capturing flow to the sub-channels more reasonably.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other equipment. Various general-purpose systems can also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. In addition, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described herein can be implemented in various programming languages, and that the description of a specific language above is given to disclose the best mode of the present invention.

In the specification provided herein, numerous specific details are described. However, it can be understood that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art can understand that the modules in the equipment of an embodiment can be adaptively changed and arranged in one or more pieces of equipment different from that embodiment. The modules, units, or components in the embodiments can be combined into one module, unit, or component, and can furthermore be divided into a plurality of submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or equipment so disclosed can be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) can be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

In addition, those skilled in the art can understand that although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

The various component embodiments of the invention can be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or digital signal processor (DSP) can be used in practice to implement some or all of the functions of some or all of the components in the equipment for determining a website capturing flow quota according to the embodiments of the invention. The invention can also be implemented as equipment or device programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the invention can be stored on a computer-readable medium, or can take the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words can be interpreted as names.

The application can be applied to a computer system/server, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with the computer system/server include but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.

The computer system/server can be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules can include routines, programs, object programs, components, logic, data structures, and so on, which perform specific tasks or implement specific abstract data types. The computer system/server can be implemented in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules can be located on local or remote computing system storage media that include storage devices.

Claims

1. A method for determining a website capturing flow quota, comprising:

Obtaining the visited data of a target website to be captured;

2. The method as claimed in claim 1, wherein obtaining the visited data of the target website to be captured comprises:

3. The method as claimed in claim 1 or 2, wherein determining the capture bearing flow of the target website according to the visited data comprises:

4. The method as claimed in any one of claims 1-3, wherein determining the bearable access total of the target website according to the visited data comprises:

5. The method as claimed in any one of claims 1-4, wherein obtaining the web page quality distribution of the webpages in the target website comprises:

6. Equipment for determining a website capturing flow quota, comprising:

7. The equipment as claimed in claim 6, wherein the website visitation data acquiring unit is suitable for:

8. The equipment as claimed in claim 6 or 7, wherein the website holding capacity determining unit comprises:

9. The equipment as claimed in any one of claims 6-8, wherein the visit capacity determining subunit is suitable for:

10. The equipment as claimed in any one of claims 6-9, wherein the web page quality distribution acquiring unit is suitable for: