CN103544278B - Method and equipment for identifying website capturing flow quota - Google Patents
Method and equipment for identifying website capturing flow quota
- Publication number: CN103544278B (application CN201310500682.8A)
- Authority: CN (China)
- Prior art keywords: targeted website, webpage, flow, website, crawl
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method and equipment for determining a website crawl flow quota. The method includes: acquiring access data of a target website to be crawled; determining the crawl-bearing flow of the target website according to the access data; acquiring the web page quality distribution of web pages in the target website; determining the task flow for crawling the target website according to that web page quality distribution; and determining the flow quota for crawling web pages on the target website according to the crawl-bearing flow of the target website and the task flow for crawling the target website. With this method, when a search engine crawler program crawls the web pages of a website, the flow quota for web page crawling on the target website can be allocated well, conflict between the crawler program and the crawled website is reduced, and the crawling behavior of the crawler program is reasonably balanced against the update requirements of the search engine.
Description
Technical field
The present invention relates to the technical field of search engines, and in particular to a method and equipment for determining a website crawl flow quota.
Background
A search engine is a tool for retrieving information on the Internet. It collects a large amount of web page information from the Internet, processes it, and builds an information database and an index database. A user enters a query term at the entry point the search engine provides and obtains the search results that the search engine returns for that query term. As search engine technology has developed and matured, the services it provides have become increasingly complete, and search engines have become a very common and convenient tool for finding needed information on the vast Internet.
To download web pages from the Internet so that their data can be analyzed and indexed, a search engine generally uses a program that crawls web pages, commonly known as a "crawler program" or "spider". Because new web pages are constantly produced on the Internet and existing pages are constantly updated, the crawler program must work without stopping to ensure that the search engine has up-to-date web page data. To provide better search results, the crawler program of a search engine always tries to quickly collect new pages and updated existing pages. However, web page resources reside on website hosts across the network, and crawling them inevitably consumes the service resources of those hosts, such as hardware and software processing resources and bandwidth. If the crawl workload exceeds what a website host can tolerate, it affects normal access by the website's users, and the crawling behavior becomes unfriendly to the website; in serious cases it can cause website responses to time out or even crash the web server. Moreover, to protect their stability, websites usually monitor crawler access and restrict, or even block, crawler programs that behave in an unfriendly way. Once a crawler program is restricted or blocked, the search engine's crawling efficiency drops, and it may be unable to update or download the website's web page resources at all, which ultimately harms the search service.
Meanwhile, in the prior art the flow or frequency at which a crawler program may crawl a website is usually set manually. Although this reduces conflict between the search engine's crawler program and the crawled website, it does not maximize the updating of web page data, so the crawling behavior of the crawler program is not reasonably balanced against the need to update website data.
Summary of the invention
In view of the above problems, the present invention is proposed to provide equipment for determining a website crawl flow quota, and a corresponding method for determining a website crawl flow quota, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for determining a website crawl flow quota is provided, comprising:
Obtaining access data of a target website to be crawled;
Determining the crawl-bearing flow of the target website according to the access data;
Obtaining the web page quality distribution of web pages in the target website;
Determining the task flow for crawling the target website according to the web page quality distribution of the web pages in the target website;
Determining the flow quota for crawling web pages on the target website according to the crawl-bearing flow of the target website and the task flow for crawling the target website.
Optionally, obtaining the access data of the target website to be crawled comprises:
Determining the access data of the target website according to the search engine's access statistics for the target website.
Optionally, determining the crawl-bearing flow of the target website according to the access data comprises:
Determining the total bearable access volume of the target website according to the access data;
Determining the crawl-bearing flow of the target website according to the total bearable access volume and a preset crawl pressure coefficient.
Optionally, determining the total bearable access volume of the target website according to the access data comprises:
Determining the total bearable access volume of the target website according to the search engine's access statistics for the target website, the market share of the search engine, the direct user visit volume, and the website redundant flow.
Optionally, obtaining the web page quality distribution of the web pages in the target website comprises:
Determining a score for each web page according to the pagerank of the web pages in the target website and/or the link depth of the web pages;
Normalizing the scores of multiple web pages in the target website to obtain the quality distribution corresponding to each web page.
Optionally, obtaining the web page quality distribution of the web pages in the target website comprises:
Obtaining the web page quality distribution of all web pages in the target website;
Determining the task flow for crawling the target website according to the web page quality distribution of the web pages in the target website comprises:
Obtaining the sum of the web page quality distributions of all web pages in the target website, and determining the task flow for crawling the target website according to that sum.
Optionally, the method further comprises:
Obtaining one or more task scale factors;
Determining the task flow for crawling the target website according to the sum of the web page quality distributions of all web pages in the target website comprises:
Determining the task flow for crawling the target website according to the product of the sum of the web page quality distributions and the one or more task scale factors.
Optionally, obtaining the one or more task scale factors comprises:
Obtaining the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website;
And/or,
Obtaining the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
Optionally, obtaining the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website comprises:
Obtaining the ratio of the number of web pages in the target website that were updated in the crawl history, and/or the number of newly produced web pages in the target website, to the total number of web pages in the target website.
Optionally, obtaining the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website comprises:
Obtaining and comparing the information fingerprints of the crawled web pages in the crawl history of the target website;
Obtaining, from the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints, as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
Optionally, the method further comprises:
Determining a unit-time coefficient according to the total time of the task of crawling the target website;
Determining the task flow for crawling the target website according to the sum of the web page quality distributions of all web pages in the target website comprises:
Determining the task flow for crawling the target website according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit-time coefficient.
Optionally, the method further comprises:
When the task flow is greater than the crawl-bearing flow and the difference between the two is greater than a preset threshold, adjusting the task flow by adjusting the task scale factor and/or the unit-time coefficient, until the task flow is less than or equal to the crawl-bearing flow or the difference between the two is less than the preset threshold.
Optionally, determining the flow quota for crawling web pages on the target website according to the crawl-bearing flow of the target website and the task flow for crawling the target website comprises:
When the task flow is greater than the crawl-bearing flow and the difference between the two is less than the preset threshold, determining the task flow as the flow quota for crawling web pages on the target website.
According to another aspect of the present invention, equipment for determining a website crawl flow quota is provided, comprising:
A website access data acquiring unit, adapted to obtain access data of a target website to be crawled;
A website bearing capacity determining unit, adapted to determine the crawl-bearing flow of the target website according to the access data;
A web page quality distribution acquiring unit, adapted to obtain the web page quality distribution of web pages in the target website;
A task flow acquiring unit, adapted to determine the task flow for crawling the target website according to the web page quality distribution of the web pages in the target website;
A flow quota determining unit, adapted to determine the flow quota for crawling web pages on the target website according to the crawl-bearing flow of the target website and the task flow for crawling the target website.
Optionally, the website access data acquiring unit is adapted to:
Determine the access data of the target website according to the search engine's access statistics for the target website.
Optionally, the website bearing capacity determining unit comprises:
An access volume determining subunit, adapted to determine the total bearable access volume of the target website according to the access data;
The website bearing capacity determining unit is adapted to determine the crawl-bearing flow of the target website according to the total bearable access volume and a preset crawl pressure coefficient.
Optionally, the access volume determining subunit is adapted to:
Determine the total bearable access volume of the target website according to the search engine's access statistics for the target website, the market share of the search engine, the direct user visit volume, and the website redundant flow.
Optionally, the web page quality distribution acquiring unit is adapted to:
Determine a score for each web page according to the pagerank of the web pages in the target website and/or the link depth of the web pages;
Normalize the scores of multiple web pages in the target website to obtain the quality distribution corresponding to each web page.
Optionally, the web page quality distribution acquiring unit comprises:
A web page quality distribution acquiring subunit, adapted to obtain the web page quality distribution of all web pages in the target website;
The task flow acquiring unit comprises:
A task flow acquiring subunit, adapted to obtain the sum of the web page quality distributions of all web pages in the target website and to determine the task flow for crawling the target website according to that sum.
Optionally, the equipment further comprises:
A task scale factor acquiring unit, adapted to obtain one or more task scale factors;
The task flow acquiring subunit is adapted to:
Determine the task flow for crawling the target website according to the product of the sum of the web page quality distributions and the one or more task scale factors.
Optionally, the task scale factor acquiring unit is adapted to:
Obtain the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website;
And/or,
Obtain the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
Optionally, the task scale factor acquiring unit is adapted to:
Obtain the ratio of the number of web pages in the target website that were updated in the crawl history, and/or the number of newly produced web pages in the target website, to the total number of web pages in the target website.
Optionally, the task scale factor acquiring unit is adapted to:
Obtain and compare the information fingerprints of the crawled web pages in the crawl history of the target website;
Obtain, from the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints, as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
Optionally, the equipment further comprises:
A unit-time coefficient acquiring unit, adapted to determine a unit-time coefficient according to the total time of the task of crawling the target website;
The task flow acquiring subunit is adapted to:
Determine the task flow for crawling the target website according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit-time coefficient.
Optionally, the equipment further comprises:
A task flow adjusting unit, adapted to, when the task flow is greater than the crawl-bearing flow and the difference between the two is greater than a preset threshold, adjust the task flow by adjusting the task scale factor and/or the unit-time coefficient, until the task flow is less than or equal to the crawl-bearing flow or the difference between the two is less than the preset threshold.
Optionally, the flow quota determining unit is adapted to:
When the task flow is greater than the crawl-bearing flow and the difference between the two is less than the preset threshold, determine the task flow as the flow quota for crawling web pages on the target website.
With the method for determining a website crawl flow quota according to the present invention, the crawl-bearing flow that the target website can bear when the search engine's crawler program crawls it can be determined from the access data of the target website to be crawled; the task flow of the task of crawling the target website can be determined from the web page quality distribution of the web pages in the target website; and the flow quota for crawling web pages on the target website can then be determined from the crawl-bearing flow of the target website and the task flow for crawling the target website. This solves the problem of unrestricted crawling by crawler programs consuming excessive website resources. The web page data of a website is crawled effectively within the crawl pressure the website allows, which reduces conflict between the search engine's crawler program and the crawled website and reasonably balances the crawling behavior of the crawler program against the update requirements of the search engine.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flow chart of a method for web page crawling according to an embodiment of the invention;
Fig. 2 shows a flow chart of a method for determining a website crawl flow quota according to an embodiment of the invention;
Fig. 3 shows a flow chart of a method for determining crawl flow according to an embodiment of the invention;
Fig. 4 shows a flow chart of a method for determining crawl flow quotas for website sub-channels according to an embodiment of the invention;
Fig. 5 shows a schematic diagram of equipment for web page crawling according to an embodiment of the invention;
Fig. 6 shows a schematic diagram of equipment for determining a website crawl flow quota according to an embodiment of the invention;
Fig. 7 shows a schematic diagram of equipment for determining crawl flow according to an embodiment of the invention;
Fig. 8 shows a schematic diagram of equipment for determining crawl flow quotas for website sub-channels according to an embodiment of the invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
For convenience of description, the parameters and their explanations are first defined as shown in Table 1:
Table 1
Embodiment one
Refer to Fig. 1, a flow chart of the method for web page crawling provided by an embodiment of the present invention. As shown in the figure, the method may include the following steps:
S110: obtain the dynamic flow quota value for crawling web pages on the target website;
When a crawler program crawls the web pages of a target website, its crawl flow or frequency on that website usually needs to be limited, to avoid unrestricted crawling of the same website and consequences such as disrupting normal access to the website. The dynamic flow quota value is one such limit on the crawler program's crawl flow on the target website. It can be understood as the limit on the flow with which the crawler program crawls the same website per unit time while executing a crawl task; for example, the dynamic flow quota value may be limited to 3,000,000 per day. In step S110, the dynamic flow quota value for crawling web pages on the target website is obtained.
The dynamic flow quota value for crawling web pages on the target website can be obtained as follows:
First obtain the access data of the target website, then determine the crawl-bearing flow of the target website according to the access data; obtain the web page quality distribution of the web pages in the target website; determine the task flow for crawling the target website according to that web page quality distribution; and then determine the dynamic flow quota value for crawling web pages on the target website according to the crawl-bearing flow of the target website and the task flow for crawling the target website.
The access data of the target website can be determined from the search engine's access statistics for the target website. When determining the crawl-bearing flow of the target website according to the access data, the total bearable access volume of the target website can be determined first from the access data, and the crawl-bearing flow can then be determined from that total and a preset crawl pressure coefficient. Specifically, the total bearable access volume of the target website can be determined jointly from the access statistics for the target website collected by the search engine, the market share of the search engine, the direct user visit volume, and the website redundant flow; this total multiplied by the preset crawl pressure coefficient is taken as the crawl-bearing flow of the target website.
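As a rough illustration of this two-stage calculation, the sketch below first estimates the total bearable access volume and then scales it by the preset crawl pressure coefficient. The text only says that the four inputs jointly determine the total; the specific combination used here (search-engine visits divided by market share, plus direct visits, plus redundant flow), the sample numbers, and all function and parameter names are assumptions made for illustration.

```python
def total_bearable_access(search_visits, market_share, direct_visits, redundant_flow):
    """Estimate the total bearable access volume of a site (visits per day).

    ASSUMPTION: the source does not give the formula. Here the visits that
    arrive via one search engine are extrapolated by its market share, then
    direct visits and the site's redundant flow are added.
    """
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return search_visits / market_share + direct_visits + redundant_flow


def crawl_bearing_flow(total_access, pressure_coefficient):
    """Crawl-bearing flow = total bearable access volume x preset coefficient."""
    return total_access * pressure_coefficient


# Hypothetical sample values, not taken from the source.
total = total_bearable_access(
    search_visits=500_000, market_share=0.25,
    direct_visits=700_000, redundant_flow=300_000,
)
bearing = crawl_bearing_flow(total, pressure_coefficient=0.1)
print(total, bearing)
```

Keeping the pressure coefficient as a separate step makes it easy to tune how much of a site's capacity the crawler may claim without touching the capacity estimate itself.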
The web page quality distribution qi of the web pages in the target website can be understood as a scoring of the web page quality of those pages. It can be determined from the pagerank of the web pages and/or the link depth of the web pages: for example, a score is determined for each web page according to its pagerank and/or link depth, and the scores of multiple web pages in the target website are then normalized to obtain the quality distribution corresponding to each web page. Normalization maps the quality scores of the web pages in the target website into the interval (0, 1].
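One simple way to land every score in (0, 1] is to divide by the largest raw score, as in the sketch below. The raw scoring rule (pagerank divided by one plus link depth), the function names, and the sample pages are assumptions; the source only states that pagerank and/or link depth determine the score and that the result is normalized.

```python
def page_score(pagerank, link_depth):
    """Raw page score.

    ASSUMPTION: higher pagerank and shallower link depth give a higher score;
    the source does not specify how the two are combined.
    """
    return pagerank / (1 + link_depth)


def quality_distribution(pages):
    """Normalize raw scores into (0, 1] by dividing by the largest score.

    `pages` maps URL -> (pagerank, link_depth); it must be non-empty and
    every pagerank must be positive so that no score is zero.
    """
    scores = {url: page_score(pr, depth) for url, (pr, depth) in pages.items()}
    top = max(scores.values())
    return {url: score / top for url, score in scores.items()}


# Hypothetical pages: (pagerank, link depth from the home page).
pages = {
    "/": (8.0, 0),
    "/news": (6.0, 1),
    "/news/item": (2.0, 3),
}
q = quality_distribution(pages)
print(q)
```

Dividing by the maximum keeps the best page at exactly 1 and preserves the relative ordering of the others; other normalizations would also satisfy the stated interval.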
For the search engine, the web page quality distribution of all web pages in the target website can be obtained, the sum of those distributions computed, and the task flow for crawling the target website determined from that sum. Specifically, one or more task scale factors can be obtained, such as the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website, and/or the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website. The task flow for crawling the target website is then determined from the product of the sum of the web page quality distributions and the one or more task scale factors.
The ratio of the number of web pages to be crawled to the total number of web pages in the target website can be obtained as the ratio of the number of web pages updated in the crawl history, and/or the number of newly produced web pages in the target website, to the total number of web pages in the target website. To obtain the ratio of non-duplicate web pages to the total, the information fingerprints of the crawled web pages in the crawl history of the target website can be obtained and compared; the ratio of the number of non-duplicate information fingerprints found by the comparison to the total number of fingerprints is taken as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
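The two ratios can be sketched as follows. The source does not say how an information fingerprint is computed, so a SHA-256 digest of the page body stands in for it here; the function names and sample data are likewise hypothetical.

```python
import hashlib


def to_crawl_ratio(updated_pages, new_pages, total_pages):
    """Share of the site's pages that were updated or newly produced."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return (updated_pages + new_pages) / total_pages


def fingerprint(body: str) -> str:
    """Information fingerprint of a page.

    ASSUMPTION: a SHA-256 digest of the page body; the source does not name
    a fingerprint algorithm.
    """
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def non_duplicate_ratio(crawled_bodies):
    """Distinct fingerprints divided by total fingerprints in the crawl history."""
    prints = [fingerprint(body) for body in crawled_bodies]
    if not prints:
        return 0.0
    return len(set(prints)) / len(prints)


# Hypothetical crawl history: the first and third pages are identical.
history = ["<html>a</html>", "<html>b</html>", "<html>a</html>", "<html>c</html>"]
ratio = non_duplicate_ratio(history)
print(ratio, to_crawl_ratio(updated_pages=20, new_pages=5, total_pages=100))
```

An exact digest only catches byte-identical pages; detecting near-duplicates would need a different kind of fingerprint, which the source does not discuss.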
In addition, a unit-time coefficient can be determined according to the total time of the task of crawling the target website. The task flow for crawling the target website can then be determined from the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit-time coefficient.
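The product described here can be written down directly. In the sketch below the unit-time coefficient is simply passed in, because the source does not state how it is derived from the total task time; the function name and sample values are illustrative assumptions.

```python
from math import prod


def task_flow(quality, scale_factors=(), unit_time_coefficient=1.0):
    """Task flow = sum(quality) x product(scale factors) x unit-time coefficient.

    `quality` holds the per-page quality distribution values in (0, 1].
    With no scale factors the product is 1, so the sum is used as is.
    """
    return sum(quality) * prod(scale_factors) * unit_time_coefficient


# Hypothetical values: four pages and two task scale factors
# (for example a to-crawl ratio and a non-duplicate ratio).
quality = [1.0, 0.5, 0.25, 0.25]
factors = [0.5, 0.75]
flow = task_flow(quality, factors, unit_time_coefficient=2.0)
print(flow)
```

Because every term is a plain multiplier, lowering any single factor lowers the task flow proportionally, which is what makes the later adjustment step straightforward.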
When the dynamic flow quota value is determined from the task flow for crawling the target website and the crawl-bearing flow of the website: if the task flow is greater than the crawl-bearing flow and the difference between the two is greater than a preset threshold, the task flow can be adjusted by adjusting the task scale factor and/or the unit-time coefficient, until the task flow is less than or equal to the crawl-bearing flow or the difference between the two is less than the preset threshold. This achieves dynamic adjustment of the dynamic flow quota value. If the task flow is greater than the crawl-bearing flow and the difference between the two is less than the preset threshold, the task flow can be determined as the dynamic flow quota value for crawling web pages on the target website.
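The adjustment loop can be sketched as below. How the scale factor is changed on each round is not specified in the source, so multiplying it by a fixed `step` is an assumption, as are the function name, the round limit, and the sample numbers. Returning the task flow as the quota when it is already at or below the crawl-bearing flow is also an assumption; the source states the quota only for the case where the task flow exceeds the crawl-bearing flow by less than the threshold.

```python
def settle_quota(task, bearing, threshold, scale_factor, step=0.9, max_rounds=1000):
    """Shrink the task flow until it fits the crawl-bearing flow.

    `task` is the task flow computed with `scale_factor`. The loop stops when
    the task flow is at most the bearing flow, or exceeds it by less than
    `threshold`.
    ASSUMPTION: each round multiplies the scale factor, and therefore the
    task flow, by `step`.
    Returns (quota, adjusted_scale_factor).
    """
    for _ in range(max_rounds):
        if task <= bearing or (task - bearing) < threshold:
            return task, scale_factor
        scale_factor *= step
        task *= step
    raise RuntimeError("task flow did not settle within max_rounds")


# Hypothetical numbers: the task flow starts well above what the site can bear.
quota, factor = settle_quota(
    task=500_000, bearing=300_000, threshold=10_000, scale_factor=1.0
)
print(quota, factor)
```

A smaller `step` converges in fewer rounds but may undershoot the crawl-bearing flow by more, leaving capacity unused.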
Alternatively, in another approach, the dynamic flow quota value for crawling web pages on the target website can be obtained from the task flow for crawling the target website alone. In that case, the web page quality distribution of the web pages in the target website is obtained first; the task flow for crawling the target website is determined according to that distribution; and the dynamic flow quota value for crawling web pages on the target website is determined according to the task flow.
For a more specific implementation of each step of Embodiment one, refer to the description of determining the website crawl flow quota in steps S210 to S240 of Embodiment two; the corresponding parts of the method for determining crawl flow in Embodiment three may also be cross-referenced.
S120: crawl the web pages on the target website according to the dynamic flow quota value.
After the dynamic flow quota value for crawling web pages on the target website has been determined, the target website can be crawled with a flow limited by that value. If a crawl demand is much larger than the website's crawl-bearing flow, crawling can proceed after the crawl demand is reduced or the data to be crawled is screened more strictly.
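A crawler can honor a daily quota by spacing its requests evenly, as sketched below. This assumes the quota counts requests per day, which the source does not state (it gives only "3,000,000 per day" as an example value); `fetch` is a hypothetical caller-supplied function, and the sleep function is injectable so the pacing can be exercised without waiting.

```python
import time

SECONDS_PER_DAY = 86_400


def request_interval(quota_per_day):
    """Minimum number of seconds between requests under a daily quota."""
    if quota_per_day <= 0:
        raise ValueError("quota_per_day must be positive")
    return SECONDS_PER_DAY / quota_per_day


def crawl(urls, quota_per_day, fetch, sleep=time.sleep):
    """Fetch `urls` in order, pausing between requests to respect the quota.

    `fetch` is a hypothetical callable that downloads one URL.
    """
    interval = request_interval(quota_per_day)
    results = []
    for index, url in enumerate(urls):
        if index:  # no pause before the first request
            sleep(interval)
        results.append(fetch(url))
    return results


# Seconds between requests at the example quota of 3,000,000 per day.
print(request_interval(3_000_000))
```

Even spacing is the simplest policy; a production crawler would more likely use a token bucket so that short bursts are allowed while the daily total stays within the quota.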
The method for web page crawling provided by Embodiment one has been described in detail above. With this method, the dynamic flow quota value for crawling web pages on the target website is obtained and the web pages on the target website are crawled according to it, so that the web page data of the website is crawled effectively within the crawl pressure the website allows, reducing conflict between the search engine's crawler program and the crawled website.
Embodiment two
Refer to Fig. 2, which is a flow chart of the method for determining a website crawl flow quota provided by Embodiment Two of the present invention. As shown in the figure, the method may include the following steps:
S210: obtain the access data of the target website to be crawled;
The access data of the target website to be crawled can be obtained first. The access data may be the click volume of the website in one day, such as parameter c in Table 1. After the access data is obtained, the access capacity of the target website can be inferred from it.
The access data of the target website can be obtained from several sources; for example, it can be taken from published website ranking data. In addition, because users usually browse webpages through browser software, the webpages that users browse can be counted by the browser, and the access capacity of the website can then be determined from the browser's current market share. For example, if a browser counts 1,500,000 daily visits to a website and the browser's current market share is 15%, the total daily visits to the website can be determined to be 10,000,000; that is, the access capacity of the website is at least 10,000,000 visits per day.
The access data of the target website can also be determined from a search engine's access statistics for the target website. Users often reach webpages through a search engine while browsing, that is, by following the search results the search engine provides; the search engine can count the webpages visited in this way and thus count the clicks through which the website is reached. Specifically, the number of visits to the target website made through the search engine can be divided by the market share of that search engine, and the result used as the access data of the website. For example, if 1,500,000 daily visits to a website are counted as arriving via a search engine and the search engine's current market share is 15%, the total daily visits to the website can be determined to be 10,000,000; that is, the access capacity of the website is at least 10,000,000 visits per day.
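The market-share extrapolation described above can be sketched as follows. The function name `estimate_total_visits` and its parameter names are illustrative assumptions and do not appear in the described method; the sketch only divides the observed visits by the market share.

```python
def estimate_total_visits(observed_visits: float, market_share: float) -> float:
    """Extrapolate total visits from the visits seen by one browser or search engine."""
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return observed_visits / market_share


# Example from the text: 1,500,000 observed daily visits at a 15% market share.
total = estimate_total_visits(1_500_000, 0.15)
```

With the figures from the text, the estimate comes out at about 10,000,000 visits per day, which the text treats as a lower bound on the access capacity of the website.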
Several methods or sources can also be combined to obtain more accurate access data for the target website. For example, the two methods above can be combined: statistics from client browser software and statistics from the search engine together show both the visits that arrive via search-engine redirection and the visits that do not, and combining the two gives more accurate access data for the target website. It should be noted that the access data of a website is generally expressed as the number of visits to the website per unit time. The description above uses daily visits, but other units of time, such as visits per hour, may be used depending on the application; the present invention is not limited in this respect.
S220: determine the bearable crawl flow of the target website according to the access data;
After the access data of the target website is obtained, the bearable crawl flow of the target website can be determined from it. The bearable crawl flow of a website can be understood as the crawl flow from a crawler that the website can bear per unit time. The unit time again depends on the application; the method is described below with one day as the unit time.
In practice, the visits to the website per unit time could be used directly as the bearable crawl flow of the website. However, a website primarily serves users browsing it, and if the visits per unit time were used directly as the bearable crawl flow, the crawling might exceed what the website can bear. Therefore, the obtained access data of the target website is multiplied by a preset crawl pressure coefficient to obtain the bearable crawl flow of the target website. The preset crawl pressure coefficient may be a percentage coefficient with a value range of (0, 1). For example, if a website receives 1,500,000 daily visits redirected from a search engine and the preset crawl pressure coefficient is 30%, the bearable crawl flow finally determined for the target website is 450,000 visits per day.
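As a minimal sketch of this step, the access data can be scaled by the crawl pressure coefficient, with the (0, 1) range enforced. The function and parameter names are hypothetical and chosen only for illustration.

```python
def bearable_crawl_flow(visits_per_unit_time: float, pressure_coefficient: float) -> float:
    """Scale the access data by a crawl pressure coefficient in the open interval (0, 1)."""
    if not 0 < pressure_coefficient < 1:
        raise ValueError("pressure_coefficient must be in (0, 1)")
    return visits_per_unit_time * pressure_coefficient


# Example from the text: 1,500,000 search-redirected daily visits, coefficient 30%.
flow = bearable_crawl_flow(1_500_000, 0.30)
```

With the figures from the text this gives about 450,000 visits per day.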
The preset crawl pressure coefficient can be set flexibly according to the source of the access data. In the example above, the access data is the daily visits redirected from a search engine, which is actually only part of the total access the website can bear, so the preset crawl pressure coefficient can be set to a relatively high value. If more accurate access data can be obtained, closer to the total access the website can actually bear, the preset crawl pressure coefficient can be set to a relatively low value.
In another implementation, the total bearable access of the target website can be determined from its access data, and the bearable crawl flow of the target website can then be determined from the total bearable access and the preset crawl pressure coefficient. To obtain a total bearable access that is close to the actual situation, a reasonable method is to draw the access data from as many sources as possible. For example, the total bearable access of the target website can be determined jointly from: the visits users make directly to the website, obtained from browser statistics; the visits users make by following search results, obtained from search engine statistics; the market share of the search engine; and the redundant flow of the website. The redundant flow of a website refers to its spare access capacity, which can be obtained from long-term monitoring of the website's access peaks or from empirical values. For example, suppose a website receives 1,500,000 daily visits redirected from a search engine and the search engine's market share is 15%; in addition, half of the website's flow is direct user visits, that is, direct visits are equal in volume to visits arriving via search engines; and the website has 50% redundant flow. The total bearable access of the website per unit time (per day) can then be determined as:
150 ÷ 15% ÷ 50% × 150% = 3000 (ten thousand visits/day)
That is, the total bearable access of the website is 30,000,000 visits per day. If the preset crawl pressure coefficient is 5%, the bearable crawl flow of the website can be determined as:
3000 × 5% = 150 (ten thousand visits/day)
In this example, because the access data of the target website is drawn from several sources, the total bearable access obtained is closer to the total flow the website can actually bear, and the preset crawl pressure coefficient is set to a relatively low value: 5% in this example, compared with 30% in the previous one.
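The worked example above can be reproduced with a short sketch. It reads the "÷ 50%" term as the share of the website's flow that arrives via search engines and the "× 150%" term as one plus the 50% redundant flow; this reading, and all function and parameter names, are assumptions made for illustration.

```python
def total_bearable_access(
    search_visits: float,
    search_market_share: float,
    search_traffic_fraction: float,
    redundancy: float,
) -> float:
    """Estimate the total bearable access per unit time from search-engine statistics."""
    # Visits via all search engines, extrapolated from one engine's market share.
    all_search_visits = search_visits / search_market_share
    # All visits, given the fraction of flow that arrives via search engines.
    all_visits = all_search_visits / search_traffic_fraction
    # Add the website's redundant (spare) capacity.
    return all_visits * (1 + redundancy)


# Units: ten thousand visits per day, as in the text.
total = total_bearable_access(150, 0.15, 0.50, 0.50)
crawl_flow = total * 0.05  # preset crawl pressure coefficient of 5%
```

With the inputs from the text, `total` is about 3000 and `crawl_flow` about 150, both in units of ten thousand visits per day.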
In Table 1, c denotes the click volume of the target website in one day, which may be the number of times all pages of the website are clicked in that day's search results. The bearable crawl flow of the target website can be understood as a function of parameter c and can be written f(c).
S230: obtain the webpage quality distribution of the webpages in the target website;
Steps S210 and S220 give the bearable crawl flow of the target website, which can be understood as a prediction, obtained from the website's access data, of the crawling the website can bear. It is also necessary to know the crawler's task for the website, that is, the task flow of crawling the target website. To obtain this task flow, the embodiment of the present invention relies mainly on the webpage quality distribution of the webpages in the target website, such as parameter qi in Table 1. Here, the webpage quality distribution qi can be understood as a score of the quality of each webpage in the target website. The webpage quality distribution can be determined from the PageRank of a webpage and/or its link depth. For example, a score can be determined for each webpage from its PageRank and/or link depth, and the scores of the webpages in the target website can then be normalized to obtain the quality distribution of each webpage. Normalization maps the quality scores of the webpages in the target website into the interval (0, 1]. For example, suppose the target website contains the following webpages, with their PageRank (also called the PR value, a positive integer from 1 to 10) and link depth (depth, a positive integer), as shown in Table 2:
Table 2
Webpage | PR value | PR ÷ 10 × 0.7 | Depth | 1 ÷ depth × 0.3 | Quality distribution |
Webpage 1 | 10 | 0.7 | 1 | 0.3 | 1 |
Webpage 2 | 7 | 0.49 | 3 | 0.1 | 0.59 |
Webpage 3 | 7 | 0.49 | 3 | 0.1 | 0.59 |
Webpage 4 | 6 | 0.42 | 5 | 0.06 | 0.48 |
Webpage 5 | 8 | 0.56 | 2 | 0.15 | 0.71 |
Because Table 2 uses both PageRank and link depth to determine the quality distribution of a webpage, different weights can be assigned to the two in the calculation. The PR value is a positive integer from 1 to 10 and the link depth is a positive integer, so normalizing each and applying the weights (0.7 for PageRank and 0.3 for link depth in Table 2) gives the quality distribution of webpages 1 to 5. In practice, the quality distribution of webpages can also be obtained in other ways, for example by using only the PR value or only the link depth. A scoring module can also be preset in the browser so that users score each webpage while browsing; the search engine then collects the users' scores, and the scores are aggregated and normalized to obtain the quality distribution of the webpages. This user scoring can of course also be combined with the PageRank and link depth method described above to give yet another way of obtaining the quality distribution; the implementation is similar to the example above and is not repeated here.
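The calculation behind Table 2 can be sketched as below. The weights 0.7 and 0.3 and the page values come from the table; the function name `quality_score` and the dictionary layout are illustrative assumptions.

```python
def quality_score(pr: int, depth: int, pr_weight: float = 0.7, depth_weight: float = 0.3) -> float:
    """Weighted, normalized quality score in (0, 1] from PR value and link depth."""
    if not 1 <= pr <= 10:
        raise ValueError("pr must be an integer from 1 to 10")
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    return pr / 10 * pr_weight + 1 / depth * depth_weight


# (PR value, link depth) for each webpage in Table 2.
pages = {
    "Webpage 1": (10, 1),
    "Webpage 2": (7, 3),
    "Webpage 3": (7, 3),
    "Webpage 4": (6, 5),
    "Webpage 5": (8, 2),
}

# Rounded to two decimals, as the table is.
scores = {name: round(quality_score(pr, depth), 2) for name, (pr, depth) in pages.items()}
```

Rounded to two decimals, the scores match the last column of Table 2 (1, 0.59, 0.59, 0.48, 0.71).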
In addition, depending on the crawl task, the webpage quality distribution obtained may cover all webpages in the target website or only those target webpages in the website that need to be crawled, as described in detail in the subsequent steps.
S240: determine the task flow of crawling the target website according to the webpage quality distribution of the webpages in the target website;
Next, the task flow of crawling the target website can be determined from the webpage quality distribution of the webpages in the target website. Specifically, the sum of the webpage quality distribution of the webpages in the target website is obtained first, and the task flow of crawling the target website is determined from this sum.
Depending on the crawler's task strategy or crawl target, the webpage quality distribution can be obtained over different scopes.
The crawler's crawl demand may come from two sources. The first is webpages in the crawl history that have been updated: the search engine has already crawled the webpages of the website, some of them have since been updated, and the search engine needs to crawl these updated webpages again. If, as in Table 1, m denotes the number of webpages of the website currently indexed and a is the proportion of them that have been updated, then the number of updated webpages in the crawl history is (a × m). The second is newly discovered webpages that have not yet been crawled, whose number is parameter n in Table 1. Combining the two, the number of webpages the crawl task needs to crawl may be:
(a×m)+n
When the crawl task is directed at both kinds of webpages, the webpage quality distribution qi of all webpages in the target website can be obtained, and the sum of the webpage quality distribution qi of all webpages in the target website is then obtained, that is:
Σ qi, summed over all webpages i in the target website.
The task flow of crawling the target website is determined from the sum of the webpage quality distribution of all webpages in the target website. The sum can be used directly as the task flow. Alternatively, one or more task scale factors can be obtained, and the task flow determined from the product of the sum and the one or more task scale factors. Task scale factors differ in nature and therefore play different roles in determining the task flow of crawling the target website.
The one or more task scale factors obtained may be the following:
The ratio of the number of webpages to be crawled in the target website to the total number of webpages in the target website;
And/or,
The ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website.
The first task scale factor, the ratio of the number of webpages to be crawled to the total number of webpages in the target website, can be expressed as the ratio of ((a × m) + n) to the total number of webpages in the target website. This means that, when crawling the target website, the crawler may crawl only the webpages updated in the crawl history and the newly discovered webpages not yet crawled. The ratio of these two kinds of webpages to the total number of webpages can then be used as a task scale factor and multiplied by the sum of the webpage quality distribution qi of all webpages in the target website. The product is used as the task flow of crawling the target website and more accurately reflects the flow this crawl task requires. In an actual crawl task, updated webpages in the crawl history and newly generated webpages do not necessarily both exist, so this task scale factor can be obtained, as the case requires, as the ratio of the number of webpages updated in the crawl history and/or the number of newly generated webpages to the total number of webpages in the target website.
In addition, the ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website, parameter u in Table 1, can be obtained as another task scale factor. When crawling the target website, the crawler can usually identify duplicate pages and crawl each only once. This task scale factor therefore further filters duplicate pages out of the task flow, making the flow required by the task more accurate. In this case, the task flow of crawling the target website is determined with this factor, the ratio of non-duplicate webpages to the total number of webpages in the target website, included in the product.
Specifically, the ratio of non-duplicate webpages to the total number of webpages in the target website (parameter u in Table 1) can be obtained with webpage information fingerprinting: the information fingerprints of the crawled webpages in the crawl history of the target website are obtained and compared; the number of non-duplicate information fingerprints is obtained from the comparison; and its ratio to the total number of fingerprints is used as the ratio of non-duplicate webpages to the total number of webpages in the target website.
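A minimal sketch of the fingerprint comparison follows. The text does not name a fingerprint algorithm, so hashing the raw page content with SHA-256 is an assumption made purely for illustration, as are the function name and the sample pages.

```python
import hashlib
from typing import Sequence


def non_duplicate_ratio(page_contents: Sequence[str]) -> float:
    """Ratio of distinct fingerprints to total fingerprints (parameter u)."""
    if not page_contents:
        raise ValueError("at least one crawled page is required")
    # Assumption: the fingerprint is a SHA-256 digest of the page content.
    fingerprints = [hashlib.sha256(c.encode("utf-8")).hexdigest() for c in page_contents]
    return len(set(fingerprints)) / len(fingerprints)


# Four crawled pages, one of which repeats an earlier page.
u = non_duplicate_ratio(["<html>A</html>", "<html>B</html>", "<html>A</html>", "<html>C</html>"])
```

Here three of the four fingerprints are distinct, so `u` is 0.75. An exact hash only detects byte-identical pages; a production system would more likely use a near-duplicate fingerprint.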
In addition, a crawler usually needs a period of time to complete the task of crawling the target website, whereas the flow required by the task is usually the flow allocated to crawling the target website per unit time. A unit time coefficient can therefore be introduced, determined from the time the crawl task requires. If the task of crawling the target website needs time t (for example, 10 days), the unit time coefficient determined from t is included in the product to determine the task flow of crawling the target website in each unit time (for example, per day). Because the bearable crawl flow of the target website is expressed as flow per unit time, the task flow can be expressed in the same unit for comparison, for example ten thousand visits per day for both.
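The task flow described in step S240 can be sketched as a product of the quality sum, the task scale factors, and the unit time coefficient. The factor values below are hypothetical, and taking the unit time coefficient as 1 / t is an assumption: the text only says the coefficient is determined from the time t the task requires.

```python
from math import prod
from typing import Iterable


def task_flow(
    quality_scores: Iterable[float],
    scale_factors: Iterable[float] = (),
    unit_time_coefficient: float = 1.0,
) -> float:
    """Product of the quality sum, the task scale factors, and the unit time coefficient."""
    return sum(quality_scores) * prod(scale_factors) * unit_time_coefficient


# Quality distributions of webpages 1 to 5 from Table 2.
scores = [1.0, 0.59, 0.59, 0.48, 0.71]

# Hypothetical factors: half of the pages need crawling, 80% are non-duplicate.
factors = [0.5, 0.8]

# Assumption: a task that takes t = 10 days uses 1 / t as its unit time coefficient.
per_day = task_flow(scores, factors, unit_time_coefficient=1 / 10)
```

With no scale factors and the default coefficient, `task_flow(scores)` reduces to the plain quality sum, which the text also allows as the task flow.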
S250: determine the flow quota for crawling webpages on the target website according to the bearable crawl flow of the target website and the task flow of crawling the target website;
It should be noted that determining the bearable crawl flow of the target website and determining the task flow of crawling the target website can be performed in either order: steps S210 and S220 can be performed first and then steps S230 and S240, or steps S230 and S240 first and then steps S210 and S220. In either order, both the bearable crawl flow and the task flow are obtained.
When a crawler crawls webpages, to avoid unrestricted crawling of one website that would affect normal access to it, the crawl flow or frequency of the crawler on the target website usually needs to be limited, and a flow quota is one such limit. The flow quota for crawling webpages on the target website can be understood as a limit on the flow or frequency with which the crawler crawls one website per unit time while executing a crawl task; for example, the flow quota for the target website may be limited to 3,000,000 per day. In the method provided by this embodiment of the present invention, the flow quota for crawling webpages on the target website is determined from the bearable crawl flow of the target website and the task flow of crawling the target website.
After the bearable crawl flow of the target website and the task flow of crawling the target website are obtained, the flow quota for crawling webpages on the target website can be determined from the two. Specifically, the two can be compared and the smaller one used as the flow quota. For example, if the bearable crawl flow of the target website is denoted f(c), the flow quota for crawling webpages on the target website can be taken as min(f(c), task flow), where the task flow is the product determined in step S240 and min compares two or more parameters and returns the smallest as the result.
In addition, because the flow a website can bear usually has some elasticity, when the bearable crawl flow of the target website differs little from the task flow, the task flow can be used as the flow quota. That is, when the task flow is greater than the bearable crawl flow and the difference between the two is less than a preset threshold, the task flow can be determined as the flow quota for crawling webpages on the target website. When the task flow is greater than the bearable crawl flow of the website and the difference is greater than the preset threshold, the task flow can be adjusted by adjusting the task scale factors and/or the unit time coefficient until the task flow is less than or equal to the bearable crawl flow or the difference is less than the preset threshold. Adjusting the task scale factors essentially reduces the crawl demand or screens the data to be crawled more strictly; adjusting the unit time coefficient essentially adjusts the time over which the crawler performs the task of crawling the target website.
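The quota decision in step S250 can be sketched as below. The function name and the returned adjustment flag are illustrative assumptions; in particular, capping the quota at the bearable crawl flow while signalling that the task flow should be adjusted is one possible reading of the case where the difference exceeds the threshold.

```python
from typing import Tuple


def flow_quota(bearable_flow: float, task_flow: float, threshold: float) -> Tuple[float, bool]:
    """Return (quota, needs_adjustment) for one target website."""
    if task_flow <= bearable_flow:
        # The task flow is the smaller value: min(f(c), task flow).
        return task_flow, False
    if task_flow - bearable_flow < threshold:
        # Slightly above the bearable flow, within the website's elasticity.
        return task_flow, False
    # Far above the bearable flow: cap the quota (assumption) and signal that
    # the task scale factors and/or unit time coefficient should be adjusted.
    return bearable_flow, True


# Units: ten thousand visits per day; the threshold of 10 is hypothetical.
within = flow_quota(150, 120, 10)
elastic = flow_quota(150, 155, 10)
too_large = flow_quota(150, 300, 10)
```

The three calls cover the three cases in the text: task flow below the bearable flow, slightly above it, and far above it.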
The method for determining a website crawl flow quota provided by Embodiment Two of the present invention has been described in detail above. With this method, the bearable crawl flow of the target website, that is, the crawling by a crawler that the target website can bear, can be determined from the access data of the target website to be crawled; the task flow of crawling the target website can be determined from the webpage quality distribution of the webpages in the target website; and the flow quota for crawling webpages on the target website can then be determined from the two. This solves the problem of unrestricted crawling by a crawler consuming excessive website resources. The webpage data of the website is crawled effectively within the crawl pressure the website allows, reducing conflict between the search engine's crawler and the crawled website and striking a reasonable balance between the crawler's crawling behavior and the search engine's update demand.
Embodiment three
Refer to Fig. 3, which is a flow chart of the method for determining crawl flow provided by Embodiment Three of the present invention. As shown in the figure, the method may include the following steps:
S310: obtain a task scale factor according to attribute features of the target website;
S320: determine the task flow of crawling the target website based on the task scale factor and the sum of the webpage quality distribution in the target website.
The task scale factor obtained may be the ratio of the number of webpages to be crawled in the target website to the total number of webpages in the target website, and/or the ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website. The first ratio may be obtained as the ratio of the number of webpages updated in the crawl history of the target website and/or the number of newly generated webpages in the target website to the total number of webpages in the target website. The second ratio may be obtained by obtaining and comparing the information fingerprints of the crawled webpages in the crawl history of the target website, obtaining the number of non-duplicate information fingerprints from the comparison, and using its ratio to the total number of fingerprints as the ratio of non-duplicate webpages to the total number of webpages in the target website.
The task flow of crawling the target website can then be determined from the product of the one or more task scale factors and the sum of the webpage quality distribution in the target website. The sum of the webpage quality distribution can be determined as follows: determine a score for each webpage from the PageRank and/or link depth of the webpages in the target website; normalize the scores of the webpages in the target website to obtain the quality distribution of each webpage; and determine the sum of the webpage quality distribution from the quality distributions obtained.
In addition, a unit time coefficient can be determined from the total time of the task of crawling the target website. When the task flow is determined based on the task scale factor and the sum of the webpage quality distribution in the target website, it can then be determined from the product of the sum of the webpage quality distribution, the one or more task scale factors, and the unit time coefficient.
After the task flow of crawling the target website is obtained, webpages on the target website can be crawled according to it. When the determined task flow is too large, it can be adjusted, by adjusting the task scale factors and/or the unit time coefficient, to within the range the target website can bear. Adjusting the task scale factors essentially reduces the crawl demand or screens the data to be crawled more strictly; adjusting the unit time coefficient essentially adjusts the time over which the crawler performs the task of crawling the target website.
The method for determining crawl flow provided by Embodiment Three of the present invention has been described above; its more specific implementation and application examples can be cross-referenced with Embodiment Two. With this method, a task scale factor can be obtained according to attribute features of the target website, and the task flow of crawling the target website determined based on the task scale factor and the sum of the webpage quality distribution in the target website. The crawl flow required when a crawler crawls a website is thus determined accurately, and the webpage data of the website is crawled effectively, reducing conflict between the search engine's crawler and the crawled website.
Embodiment four
The foregoing embodiments all describe specific implementations starting from a website as a whole. Many websites, however, contain multiple sub-sites or sub-channels. In that case, each sub-site or sub-channel can be regarded as an independent website and a method similar to those provided above by the embodiments of the present invention applied to it: a sub-channel bearable flow is obtained for each sub-channel from the access data of that sub-channel, and a sub-channel task flow is determined for each sub-channel from the webpage quality distribution of the webpages in it. The difference is that a crawl weight is now determined for each sub-channel from its sub-channel bearable flow and sub-channel task flow, and the channel quota of each sub-channel is determined from the flow quota of the whole target website together with the crawl weight of each sub-channel. Finally, the webpages in each sub-channel are crawled according to the channel quota of that sub-channel. This is described in detail below.
Refer to Fig. 4, which is a flow chart of the method for determining the crawl flow quota of website sub-channels provided by Embodiment Four of the present invention. As shown in the figure, the method may include the following steps:
S410: obtain the sub-channel bearable flow of each sub-channel in the target website;
In a specific implementation, each sub-channel of the target website can be treated as an independent website, and access data such as the user visits to each sub-channel of the same target website can be counted separately. The sub-channel bearable flow can therefore be obtained in the same way as the bearable crawl flow of the target website described in steps S210 and S220 of Embodiment Two. That is, the sub-channel bearable flow of each sub-channel in the target website can be obtained from the access data of each sub-channel; specifically, from the access data of each sub-channel of the target website as counted by a search engine. When the sub-channel bearable flow is determined, the total bearable access of each sub-channel in the target website can be determined from the access data of that sub-channel, and the sub-channel bearable flow then determined from the total bearable access of the sub-channel and a preset channel pressure coefficient. For the specific implementation, see Embodiment One or Embodiment Two; it is not repeated here.
S420: determine the sub-channel task flow of each sub-channel according to the webpage quality distribution of the webpages in each sub-channel;
The sub-channel task flow for crawling a sub-channel is in effect a prediction of the flow of the task of crawling that sub-channel, obtained from the past crawl history and webpage quality. Before the sub-channel task flow is determined, the webpage quality distribution of the webpages in each sub-channel can likewise be obtained first, in the same way as the webpage quality distribution is obtained in step S230 of Embodiment Two. The sub-channel task flow can be determined in the same way as the task flow of the target website in step S240. For example, a score can be determined for the webpages in each sub-channel from their PageRank and/or link depth; the scores of the webpages in each sub-channel are normalized to obtain the quality distribution of each webpage; and the sub-channel task flow of each sub-channel is determined from the webpage quality distribution obtained. For the specific implementation, see the description in Embodiment Two; it is not repeated here.
S430: calculate the crawl weight corresponding to each subchannel according to the subchannel bearable flow and the subchannel task flow;
After the subchannel bearable flow and the subchannel task flow have been obtained, Embodiment Four of the present invention may further calculate the crawl weight corresponding to each subchannel. That is, for each subchannel, the smaller of its bearable flow and its predicted subchannel task flow is selected as a reference value, so that each subchannel has one reference value. This reference value is not used directly as the flow quota of the subchannel; instead, the weight of each subchannel is first calculated from the reference values. For example, the reference values of all subchannels may be added together, and the weight of each subchannel is then the ratio of its own reference value to that sum. For example, given subchannels 1, 2 and 3 with reference values n1, n2 and n3 respectively, the weight of subchannel 1 is n1/(n1+n2+n3), the weight of subchannel 2 is n2/(n1+n2+n3), and the weight of subchannel 3 is n3/(n1+n2+n3).
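The weight calculation of step S430 can be sketched as follows. The function name `crawl_weights` and the dictionary-based inputs are assumptions made for illustration; the arithmetic follows the n1/(n1+n2+n3) rule given above.

```python
def crawl_weights(bearable: dict[str, float],
                  task: dict[str, float]) -> dict[str, float]:
    # Reference value of a subchannel: the smaller of its bearable flow
    # and its predicted task flow.
    refs = {name: min(bearable[name], task[name]) for name in bearable}
    total = sum(refs.values())
    if total == 0:
        # No subchannel can take any flow; return zero weights.
        return {name: 0.0 for name in refs}
    # Weight of a subchannel: its reference value over the sum of all.
    return {name: ref / total for name, ref in refs.items()}
```

For instance, with bearable flows of 100, 50 and 80 and task flows of 60, 70 and 90, the reference values are 60, 50 and 80, and the weights are 60/190, 50/190 and 80/190.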
S440: determine the quota of each subchannel according to the total flow quota of the target website and the crawl weight of each subchannel.
After the weight of each subchannel has been calculated, each weight may be multiplied by the total flow quota of the website to which the subchannel belongs to obtain the quota of that subchannel. For the total flow quota of the target website, see the description in Embodiment One, which is not repeated here.
In a specific implementation, a unit time coefficient may also be determined according to the total time of the task of crawling each subchannel; the product of the total flow quota of the target website, the weight of a subchannel, and the unit time coefficient is then used as the subchannel quota for crawling that subchannel. Finally, the web pages in each subchannel may be crawled according to the subchannel quotas.
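Step S440 reduces to a product per subchannel, which can be sketched as follows. The function name `subchannel_quotas` is hypothetical, and how the unit time coefficient is derived from the total task time is not specified in the text, so it is simply passed in as a parameter here.

```python
def subchannel_quotas(total_quota: float,
                      weights: dict[str, float],
                      unit_time_coefficient: float = 1.0) -> dict[str, float]:
    # Quota of a subchannel = website total flow quota
    #                         x subchannel crawl weight
    #                         x unit time coefficient (1.0 when unused).
    return {
        name: total_quota * weight * unit_time_coefficient
        for name, weight in weights.items()
    }
```

With a unit time coefficient of 1.0 and weights that sum to 1, the subchannel quotas add up to the total flow quota of the website.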
With the method for determining website subchannel crawl flow quotas provided by Embodiment Four of the present invention, the crawl weight corresponding to each subchannel can be calculated from the obtained subchannel bearable flow and subchannel task flow, and the quota of each subchannel can be determined from the total flow quota of the target website and the crawl weight of each subchannel. This reduces conflicts between the search engine crawler and the crawled website while distributing the crawl flow among the subchannels more reasonably, so that crawling is allocated more reasonably across the subchannels of the target website.
Corresponding with the method for the webpage capture that the embodiment of the present invention one provides, the embodiment of the present invention one additionally provides webpage
The equipment of crawl, refers to Fig. 5, this equipment may include that
Dynamic flow quota value acquiring unit 510, is suitable to obtain the dynamic flow carrying out webpage capture on targeted website
Quota value;
Webpage capture unit 520, is suitable to, according to dynamic flow quota value, the webpage on targeted website be captured.
The dynamic flow quota value acquiring unit 510 may include:
A website access data acquiring unit, adapted to obtain the access data of the target website;
A website bearing capacity determining unit, adapted to determine the bearable crawl flow of the target website according to the access data;
A web page quality distribution acquiring unit, adapted to obtain the web page quality distribution of the web pages in the target website; and a task flow acquiring unit, adapted to determine the task flow for crawling the target website according to the web page quality distribution of the web pages in the target website;
In this implementation, the dynamic flow quota value acquiring unit 510 may determine the dynamic flow quota value for crawling web pages on the target website according to the bearable crawl flow of the target website and the task flow for crawling the target website.
Specifically, the website access data acquiring unit may determine the access data of the target website according to the search engine's access statistics for the target website.
The website bearing capacity determining unit may include:
An access volume determining subunit, adapted to determine the total bearable access volume of the target website according to the access data;
The website bearing capacity determining unit may determine the bearable crawl flow of the target website according to the total bearable access volume and a preset crawl pressure coefficient.
In this implementation, the access volume determining subunit may determine the total bearable access volume of the target website according to the search engine's access statistics for the target website, the market share of the search engine, the volume of direct user visits, and the redundant flow of the website.
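The two-stage estimate described above can be sketched as follows. The text lists the inputs but this passage does not state how they are combined, so the combination used here (scaling the search engine's referred visits up by its market share, then adding direct visits and redundant flow) is an assumption, as are the function names `bearable_access_total` and `bearable_crawl_flow`.

```python
def bearable_access_total(engine_visits: float,
                          market_share: float,
                          direct_visits: float,
                          redundant_flow: float) -> float:
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    # Assumption: visits referred by one search engine are scaled up
    # by its market share to estimate visits from all search engines.
    all_engine_visits = engine_visits / market_share
    # Assumption: direct visits and spare (redundant) capacity are added.
    return all_engine_visits + direct_visits + redundant_flow


def bearable_crawl_flow(access_total: float,
                        pressure_coefficient: float) -> float:
    # Bearable crawl flow = total bearable access volume
    #                       x preset crawl pressure coefficient.
    return access_total * pressure_coefficient
```

Only the second function reflects a relation stated in the text; the first is one plausible way to combine the listed inputs.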
Specifically, the web page quality distribution acquiring unit may determine a score for each web page according to the PageRank of the web pages in the target website and/or the link depth of the web pages, and normalize the scores of the multiple web pages in the target website to obtain the quality distribution corresponding to each web page.
The web page quality distribution acquiring unit may include:
A web page quality distribution acquiring subunit, adapted to obtain the web page quality distribution of all web pages in the target website;
In this implementation, the task flow acquiring unit may include:
A task flow acquiring subunit, adapted to obtain the sum of the web page quality distributions of all web pages in the target website, and to determine the task flow for crawling the target website according to that sum.
In this implementation, the device may further include:
A task scale factor acquiring unit, adapted to obtain one or more task scale factors;
The task flow acquiring subunit is adapted to:
Determine the task flow for crawling the target website according to the product of the sum of the web page quality distributions and the one or more task scale factors.
The task scale factor acquiring unit may obtain the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website;
And/or,
Obtain the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
In a specific implementation, the task scale factor acquiring unit may obtain the ratio of the number of web pages in the target website that were updated in the crawl history, and/or the number of newly generated web pages in the target website, to the total number of web pages in the target website.
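The to-be-crawled ratio can be sketched as a simple fraction. The function name `to_crawl_ratio` is hypothetical, and two details are assumptions: that the total page count already includes the new pages, and that the result is capped at 1.0.

```python
def to_crawl_ratio(updated_pages: int, new_pages: int,
                   total_pages: int) -> float:
    # Pages worth crawling again: those updated in the crawl history
    # plus those newly generated on the site.
    if total_pages <= 0:
        return 0.0
    # Assumed cap at 1.0 in case the counts overlap or are inconsistent.
    return min(1.0, (updated_pages + new_pages) / total_pages)
```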
In a specific implementation, the task scale factor acquiring unit may obtain and compare the information fingerprints of the web pages crawled in the crawl history of the target website;
From the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints is obtained and used as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
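The fingerprint comparison can be sketched as follows. The text does not say how an information fingerprint is computed; a SHA-256 hash of the page body is used here purely as a stand-in, and the names `fingerprint` and `non_duplicate_ratio` are hypothetical.

```python
import hashlib


def fingerprint(body: str) -> str:
    # Assumed fingerprint: SHA-256 of the page body. A production
    # crawler might use a near-duplicate fingerprint instead.
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def non_duplicate_ratio(page_bodies: list[str]) -> float:
    # Ratio of distinct fingerprints to all fingerprints, used as the
    # ratio of non-duplicate pages to total pages.
    prints = [fingerprint(body) for body in page_bodies]
    if not prints:
        return 0.0
    return len(set(prints)) / len(prints)
```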
In another implementation, the device may further include:
A unit time coefficient acquiring unit, adapted to determine a unit time coefficient according to the total time of the task of crawling the target website;
In this case, the task flow acquiring subunit may determine the task flow for crawling the target website according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit time coefficient.
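The product described above can be written out directly. The function name `task_flow` and its parameter names are assumptions; the formula itself is the one stated in the text.

```python
import math


def task_flow(quality_sum: float,
              scale_factors: list[float],
              unit_time_coefficient: float = 1.0) -> float:
    # Task flow = sum of web page quality distributions
    #             x product of all task scale factors
    #             x unit time coefficient (1.0 when unused).
    return quality_sum * math.prod(scale_factors) * unit_time_coefficient
```

An empty list of scale factors leaves the quality sum unchanged, since `math.prod` of an empty sequence is 1.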
The device may further include a task flow adjustment unit, adapted to, when the task flow is greater than the bearable crawl flow and the difference between the two is greater than a preset threshold, adjust the task flow by adjusting the task scale factors and/or the unit time coefficient, until the task flow is less than or equal to the bearable crawl flow or the difference between the two is less than the preset threshold.
When the task flow is greater than the bearable crawl flow and the difference between the two is less than the preset threshold, the dynamic flow quota value acquiring unit 510 may determine the task flow as the dynamic flow quota value for crawling web pages on the target website.
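The adjustment loop can be sketched as follows. Because the task flow is proportional to the task scale factors and the unit time coefficient, reducing either one is modelled here as multiplying the task flow by a fixed `shrink` ratio; that ratio, the round limit, and the name `settle_quota` are assumptions. Returning the task flow as the quota when it is already within the bearable crawl flow is also an assumption, since the text only states the near-threshold case.

```python
def settle_quota(task: float, bearable: float, threshold: float,
                 shrink: float = 0.9, max_rounds: int = 1000) -> float:
    if not 0 < shrink < 1:
        raise ValueError("shrink must be in (0, 1)")
    for _ in range(max_rounds):
        # Stop once the task flow fits, or exceeds the bearable crawl
        # flow by less than the preset threshold.
        if task <= bearable or task - bearable < threshold:
            return task
        # Assumed adjustment: scale the task flow down, standing in
        # for a reduced scale factor or unit time coefficient.
        task *= shrink
    raise RuntimeError("task flow did not settle within max_rounds")
```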
In addition, the dynamic flow quota value acquiring unit 510 may include:
A web page quality distribution acquiring unit, adapted to obtain the web page quality distribution of the web pages in the target website;
A task flow acquiring unit, adapted to determine the task flow for crawling the target website according to the web page quality distribution of the web pages in the target website;
The dynamic flow quota value acquiring unit 510 may then determine the dynamic flow quota value for crawling web pages on the target website according to the task flow for crawling the target website.
Corresponding to the method for determining a website crawl flow quota provided by Embodiment Two of the present invention, Embodiment Two also provides a device for determining a website crawl flow quota. Referring to Fig. 6, the device may include:
A website access data acquiring unit 610, adapted to obtain the access data of a target website to be crawled;
A website bearing capacity determining unit 620, adapted to determine the bearable crawl flow of the target website according to the access data;
A web page quality distribution acquiring unit 630, adapted to obtain the web page quality distribution of the web pages in the target website;
A task flow acquiring unit 640, adapted to determine the task flow for crawling the target website according to the web page quality distribution of the web pages in the target website;
A flow quota determining unit 650, adapted to determine the flow quota for crawling web pages on the target website according to the bearable crawl flow of the target website and the task flow for crawling the target website.
In another implementation, the website access data acquiring unit 610 is adapted to:
Determine the access data of the target website according to the search engine's access statistics for the target website.
In addition, the website bearing capacity determining unit 620 may further include:
An access volume determining subunit, adapted to determine the total bearable access volume of the target website according to the access data;
In this implementation, the website bearing capacity determining unit 620 may determine the bearable crawl flow of the target website according to the total bearable access volume and a preset crawl pressure coefficient.
The access volume determining subunit may also be used to determine the total bearable access volume of the target website according to the search engine's access statistics for the target website, the market share of the search engine, the volume of direct user visits, and the redundant flow of the website.
In practical applications, the web page quality distribution acquiring unit 630 may determine a score for each web page according to the PageRank of the web pages in the target website and/or the link depth of the web pages;
And normalize the scores of the multiple web pages in the target website to obtain the quality distribution corresponding to each web page.
In another implementation, the web page quality distribution acquiring unit 630 may further include:
A web page quality distribution acquiring subunit, which may obtain the web page quality distribution of all web pages in the target website;
In this case, the task flow acquiring unit 640 may include:
A task flow acquiring subunit, adapted to obtain the sum of the web page quality distributions of all web pages in the target website, and to determine the task flow for crawling the target website according to that sum.
In this implementation, the device may further include a task scale factor acquiring unit, adapted to obtain one or more task scale factors;
In this case, the task flow acquiring subunit may determine the task flow for crawling the target website according to the product of the sum of the web page quality distributions and the one or more task scale factors.
The task scale factor acquiring unit may obtain different task scale factors, for example:
Obtain the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website;
And/or,
Obtain the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
In this implementation, the task scale factor acquiring unit may obtain the ratio of the number of web pages in the target website that were updated in the crawl history, and/or the number of newly generated web pages in the target website, to the total number of web pages in the target website.
The task scale factor acquiring unit may also obtain and compare the information fingerprints of the web pages crawled in the crawl history of the target website, and, from the comparison result, obtain the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
In another implementation, the device may further include a unit time coefficient acquiring unit, adapted to determine a unit time coefficient according to the total time of the task of crawling the target website;
In this case, the task flow acquiring subunit of the task flow acquiring unit 640 may determine the task flow for crawling the target website according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit time coefficient.
In addition, the device may further include a task flow adjustment unit, adapted to, when the task flow is greater than the bearable crawl flow and the difference between the two is greater than a preset threshold, adjust the task flow by adjusting the task scale factors and/or the unit time coefficient, until the task flow is less than or equal to the bearable crawl flow or the difference between the two is less than the preset threshold.
When the task flow is greater than the bearable crawl flow and the difference between the two is less than the preset threshold, the flow quota determining unit 650 may determine the task flow as the flow quota for crawling web pages on the target website.
The device for determining a website crawl flow quota provided by the embodiments of the present invention has been described in detail above. In this device, the website access data acquiring unit 610 obtains the access data of the target website to be crawled; the website bearing capacity determining unit 620 determines the bearable crawl flow of the target website according to the access data; the web page quality distribution acquiring unit 630 obtains the web page quality distribution of the web pages in the target website; the task flow acquiring unit 640 determines the task flow for crawling the target website according to that web page quality distribution; and the flow quota determining unit 650 determines the flow quota for crawling web pages on the target website according to the bearable crawl flow of the target website and the task flow for crawling the target website. With this device, the bearing capacity of the website and the flow required by the crawl task can be predicted before the crawl task starts, which addresses the problem of a crawler occupying excessive site resources through unrestricted crawling. The web page data of the website is crawled effectively within the crawl pressure the website allows, reducing conflicts between the search engine crawler and the crawled website.
Corresponding to the method for determining crawl flow provided by Embodiment Three of the present invention, Embodiment Three also provides a device for determining crawl flow. Referring to Fig. 7, the device may include:
A task scale factor acquiring unit 710, adapted to obtain task scale factors according to attribute features of the target website;
A task flow acquiring unit 720, adapted to determine the task flow for crawling the target website based on the task scale factors and the sum of the web page quality distributions in the target website.
The task scale factor acquiring unit 710 may obtain, as a task scale factor, the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website;
And/or,
Obtain, as a task scale factor, the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
In this implementation, the task scale factor acquiring unit 710 may obtain the ratio of the number of web pages in the target website that were updated in the crawl history, and/or the number of newly generated web pages in the target website, to the total number of web pages in the target website.
Alternatively, the task scale factor acquiring unit may obtain and compare the information fingerprints of the web pages crawled in the crawl history of the target website;
From the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints is obtained and used as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
Specifically, the task flow acquiring unit 720 may determine the task flow for crawling the target website based on the product of the one or more task scale factors and the sum of the web page quality distributions in the target website.
The sum of the web page quality distributions may be determined by the following units:
A scoring unit, adapted to determine a score for each web page according to the PageRank of the web pages in the target website and/or the link depth of the web pages;
A normalization unit, adapted to normalize the scores of the multiple web pages in the target website to obtain the quality distribution corresponding to each web page; and a summing unit, adapted to determine the sum of the web page quality distributions from the quality distribution obtained for each web page.
In another implementation, the device for determining crawl flow may further include:
A unit time coefficient acquiring unit, adapted to determine a unit time coefficient according to the total time of the task of crawling the target website;
In this case, the task flow acquiring unit 720 may determine the task flow for crawling the target website according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit time coefficient.
The device for determining crawl flow may further include:
A web page crawling unit, adapted to crawl web pages on the target website according to the task flow for crawling the target website.
With the above device for determining crawl flow, task scale factors can be obtained according to attribute features of the target website, and the task flow for crawling the target website can be determined based on the task scale factors and the sum of the web page quality distributions in the target website. The crawl flow required when the crawler crawls the website is thus determined accurately, the web page data of the website is crawled effectively, and conflicts between the search engine crawler and the crawled website are reduced.
Corresponding to the method for determining website subchannel crawl flow quotas provided by Embodiment Four of the present invention, Embodiment Four also provides a device for determining website subchannel crawl flow quotas. Referring to Fig. 8, the device may include:
A channel bearing capacity acquiring unit 810, adapted to obtain the bearable flow of each subchannel in the target website;
A channel task amount acquiring unit 820, adapted to determine the task flow of each subchannel according to the web page quality distribution of the web pages in that subchannel;
A crawl weight acquiring unit 830, adapted to calculate the crawl weight corresponding to each subchannel according to the subchannel bearable flow and the subchannel task flow;
A quota determining unit 840, adapted to determine the quota of each subchannel according to the total flow quota of the target website and the crawl weight of each subchannel.
Specifically, the channel bearing capacity acquiring unit 810 may obtain the bearable flow of each subchannel in the target website according to the access data of each subchannel in the target website.
In this implementation, the channel bearing capacity acquiring unit 810 may obtain the bearable flow of each subchannel of the target website according to the access data of each subchannel as counted by the search engine.
In this implementation, specifically, the channel bearing capacity acquiring unit may determine the total bearable access volume of each subchannel in the target website according to the access data of each subchannel, and then determine the bearable flow of each subchannel according to that total bearable access volume and a preset channel pressure coefficient.
In another implementation, the channel task amount acquiring unit 820 may determine a score for each web page in a subchannel according to the PageRank of the web pages in that subchannel and/or the link depth of the web pages; normalize the scores of the multiple web pages in each subchannel to obtain the quality distribution corresponding to each web page; and determine the task flow of each subchannel according to the resulting web page quality distribution of the web pages in that subchannel.
In addition, the quota determining unit 840 may determine the bearable crawl flow of the target website according to the website access data of the target website;
Determine the website task flow for crawling the target website according to the web page quality distribution of the web pages in the target website;
Determine the total flow quota of the target website for crawling web pages on the target website according to the bearable crawl flow of the target website and the website task flow for crawling the target website; and,
Determine the quota of each subchannel according to the total flow quota of the target website determined in the above steps and the crawl weight of each subchannel.
The device for determining website subchannel crawl flow quotas may further include:
A channel time coefficient determining unit, adapted to determine a channel unit time coefficient according to the total time of the task of crawling each subchannel;
In this implementation, the quota determining unit 840 may use the product of the total flow quota of the target website, the weight of each subchannel, and the channel unit time coefficient as the subchannel quota for crawling the corresponding subchannel.
In addition, the device for determining website subchannel crawl flow quotas may further include:
A channel web page crawling unit, which may crawl the web pages in each subchannel according to the quota of each subchannel.
With the device for determining website subchannel crawl flow quotas provided by Embodiment Four of the present invention, the crawl weight corresponding to each subchannel can be calculated from the obtained subchannel bearable flow and subchannel task flow, and the quota of each subchannel can be determined from the total flow quota of the target website and the crawl weight of each subchannel. This reduces conflicts between the search engine crawler and the crawled website while distributing the crawl flow among the subchannels more reasonably.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided herein. It should be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and may furthermore be divided into multiple submodules or subunits or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include some features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for determining a website crawl flow quota according to embodiments of the invention. The invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order. These words may be interpreted as names.
This application may be applied to a computer system/server, which can operate with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments and/or configurations suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.
The computer system/server may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures and so on, which perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media that include storage devices.
Claims (26)
1. A method for determining a website crawl flow quota, comprising:
Obtaining access data of a target website to be crawled;
Determining the bearable crawl flow of the target website according to the access data;
Obtaining the web page quality distribution of the web pages in the target website;
Determining the task flow for crawling the target website according to the web page quality distribution of the web pages in the target website;
Determining the flow quota for crawling web pages on the target website according to the bearable crawl flow of the target website and the task flow for crawling the target website.
2. The method of claim 1, wherein obtaining the access data of the target website to be crawled comprises:
Determining the access data of the target website according to the search engine's access statistics for the target website.
3. The method of claim 1 or 2, wherein determining the bearable crawl flow of the target website according to the access data comprises:
Determining the total bearable access volume of the target website according to the access data;
Determining the bearable crawl flow of the target website according to the total bearable access volume and a preset crawl pressure coefficient.
4. The method of claim 3, wherein determining the total bearable access volume of the target website according to the access data comprises:
Determining the total bearable access volume of the target website according to the search engine's access statistics for the target website, the market share of the search engine, the volume of direct user visits, and the redundant flow of the website.
5. The method of claim 1, wherein obtaining the web page quality distribution of the web pages in the target website comprises:
determining a score for each web page according to the PageRank of the web page in the target website and/or the link depth of the web page; and
normalizing the scores of multiple web pages in the target website to obtain the quality distribution corresponding to each web page.
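The scoring and normalization of claim 5 might look like the sketch below. The claim fixes neither the scoring function nor the normalization, so both are assumptions: the score divides PageRank by one plus the link depth, and normalization divides by the highest score so that every page's quality falls in [0, 1]. Normalizing to a sum of one was avoided because the sum over all pages used in claim 6 would then always equal 1. All names are hypothetical.

```python
def page_score(pagerank, link_depth):
    """Assumed score: higher PageRank raises it, deeper pages lower it."""
    return pagerank / (1 + link_depth)


def quality_distribution(pages):
    """Map each URL to a quality value in [0, 1].

    pages: dict of url -> (pagerank, link_depth)
    """
    scores = {url: page_score(pr, depth) for url, (pr, depth) in pages.items()}
    top = max(scores.values(), default=0.0)
    if top <= 0:
        # No page has a positive score, so every quality value is zero.
        return {url: 0.0 for url in scores}
    # Scale by the best score so that the best page gets quality 1.0.
    return {url: score / top for url, score in scores.items()}


if __name__ == "__main__":
    site = {"/": (0.8, 0), "/a": (0.4, 1), "/a/b": (0.2, 3)}
    print(quality_distribution(site))
```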
6. The method of claim 1, wherein obtaining the web page quality distribution of the web pages in the target website comprises:
obtaining the web page quality distribution of all web pages in the target website;
and wherein determining the task flow for crawling the target website according to the web page quality distribution of the web pages in the target website comprises:
obtaining the sum of the web page quality distributions of all web pages in the target website, and determining the task flow for crawling the target website according to that sum.
7. The method of claim 6, further comprising:
obtaining one or more task scale factors;
wherein determining the task flow for crawling the target website according to the sum of the web page quality distributions of all web pages in the target website comprises:
determining the task flow for crawling the target website according to the product of the sum of the web page quality distributions and the one or more task scale factors.
8. The method of claim 7, wherein obtaining the one or more task scale factors comprises:
obtaining the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website;
and/or
obtaining the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
9. The method of claim 8, wherein obtaining the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website comprises:
obtaining the ratio of the number of web pages updated in the crawl history of the target website, and/or the number of web pages newly produced in the target website, to the total number of web pages in the target website.
10. The method of claim 8, wherein obtaining the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website comprises:
obtaining and comparing, from the crawl history of the target website, the information fingerprints of the crawled web pages; and
obtaining, from the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints, and using it as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
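The fingerprint comparison of claim 10 can be sketched as follows. The claim does not say how an information fingerprint is computed, so a SHA-256 digest of the page content stands in for it here, and "non-duplicate" is read as "distinct"; both are assumptions, and the function names are hypothetical.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Stand-in information fingerprint: SHA-256 of the page content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def non_duplicate_ratio(crawled_contents):
    """Ratio of distinct fingerprints to all fingerprints in the crawl history."""
    prints = [fingerprint(content) for content in crawled_contents]
    if not prints:
        # Nothing has been crawled yet, so there is no evidence either way.
        return 0.0
    return len(set(prints)) / len(prints)


if __name__ == "__main__":
    history = ["home page", "about us", "home page", "contact"]
    print(non_duplicate_ratio(history))  # 3 distinct pages out of 4
```

An exact-match digest only detects byte-identical pages; a production crawler would more likely use a near-duplicate fingerprint, which this sketch does not attempt.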
11. The method of claim 6, further comprising:
determining a unit time coefficient according to the total task time for crawling the target website;
wherein determining the task flow for crawling the target website according to the sum of the web page quality distributions of all web pages in the target website comprises:
determining the task flow for crawling the target website according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit time coefficient.
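Claims 6, 7, and 11 describe the task flow as a product of three kinds of terms, which the sketch below computes directly. Treating the result as the plain product with no further constants is an assumption, as are the function and parameter names; the scale factors would be ratios such as the to-be-crawled ratio and the non-duplicate ratio of claim 8.

```python
from math import prod


def task_flow(quality_by_page, scale_factors=(), unit_time_coefficient=1.0):
    """Task flow as (sum of page qualities) x (scale factors) x (time coefficient).

    quality_by_page:       dict of url -> normalized quality value
    scale_factors:         iterable of ratios in [0, 1]; empty means no scaling
    unit_time_coefficient: converts the whole task into a per-unit-time amount
    """
    total_quality = sum(quality_by_page.values())
    # prod(()) is 1, so an empty factor list leaves the sum unchanged.
    return total_quality * prod(scale_factors) * unit_time_coefficient


if __name__ == "__main__":
    qualities = {"/": 1.0, "/a": 0.25, "/a/b": 0.0625}
    print(task_flow(qualities, scale_factors=(0.5, 0.8), unit_time_coefficient=2.0))
```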
12. The method of claim 11, further comprising:
when the task flow is greater than the bearable crawl flow and the difference between the two is greater than a preset threshold, adjusting the task flow by adjusting the task scale factors and/or the unit time coefficient, until the task flow is less than or equal to the bearable crawl flow or the difference between the two is less than the preset threshold.
13. The method of any one of claims 1, 2, and 5-11, wherein determining the flow quota for crawling web pages on the target website according to the bearable crawl flow of the target website and the task flow for crawling the target website comprises:
when the task flow is greater than the bearable crawl flow and the difference between the two is less than a preset threshold, determining the task flow as the flow quota for crawling web pages on the target website.
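The adjustment loop of claim 12 and the acceptance rule of claim 13 can be combined into one hedged sketch. The claims do not say how the coefficients are adjusted, so shrinking the unit time coefficient by a fixed ratio per round is an assumption. The claims also leave two cases open, which this sketch resolves by assumption: a task flow at or below the bearable crawl flow is accepted as is, and a difference exactly equal to the threshold is accepted. All names are hypothetical.

```python
def determine_quota(base_flow, bearable_flow, threshold,
                    unit_time_coefficient=1.0, shrink=0.9, max_rounds=1000):
    """Return (flow_quota, final_unit_time_coefficient).

    base_flow:     task flow before the unit time coefficient is applied
    bearable_flow: crawl flow the target website can bear
    threshold:     largest overshoot above bearable_flow that is still accepted
    shrink:        assumed per-round multiplier for the coefficient, in (0, 1)
    """
    if not 0 < shrink < 1:
        raise ValueError("shrink must be in (0, 1)")
    coefficient = unit_time_coefficient
    flow = base_flow * coefficient
    rounds = 0
    # Keep adjusting while the task flow overshoots by more than the threshold.
    while flow > bearable_flow and flow - bearable_flow > threshold:
        if rounds >= max_rounds:
            raise RuntimeError("task flow did not converge")
        coefficient *= shrink
        flow = base_flow * coefficient
        rounds += 1
    # Here the flow is either bearable or within the tolerated overshoot.
    return flow, coefficient


if __name__ == "__main__":
    print(determine_quota(2000.0, 1300.0, 50.0))
```

In the example run the task flow is reduced over four rounds (1800, 1620, 1458, about 1312.2) and is then accepted because it exceeds the bearable crawl flow by less than the threshold of 50.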
14. A device for determining a website crawl flow quota, comprising:
a website access data acquiring unit, adapted to obtain access data of a target website to be crawled;
a website bearing capacity determining unit, adapted to determine a bearable crawl flow of the target website according to the access data;
a web page quality distribution acquiring unit, adapted to obtain a web page quality distribution of the web pages in the target website;
a task flow acquiring unit, adapted to determine a task flow for crawling the target website according to the web page quality distribution of the web pages in the target website; and
a flow quota determining unit, adapted to determine a flow quota for crawling web pages on the target website according to the bearable crawl flow of the target website and the task flow for crawling the target website.
15. The device of claim 14, wherein the website access data acquiring unit is adapted to:
determine the access data of the target website according to a search engine's access statistics for the target website.
16. The device of claim 14 or 15, wherein the website bearing capacity determining unit comprises:
an access volume determining subunit, adapted to determine a total bearable access volume of the target website according to the access data;
and wherein the website bearing capacity determining unit is adapted to determine the bearable crawl flow of the target website according to the total bearable access volume and a preset crawl pressure coefficient.
17. The device of claim 16, wherein the access volume determining subunit is adapted to:
determine the total bearable access volume of the target website according to a search engine's access statistics for the target website, the market share of the search engine, the volume of direct user visits, and the website's redundant flow.
18. The device of claim 14, wherein the web page quality distribution acquiring unit is adapted to:
determine a score for each web page according to the PageRank of the web page in the target website and/or the link depth of the web page; and
normalize the scores of multiple web pages in the target website to obtain the quality distribution corresponding to each web page.
19. The device of claim 14, wherein the web page quality distribution acquiring unit comprises:
a web page quality distribution acquiring subunit, adapted to obtain the web page quality distribution of all web pages in the target website;
and wherein the task flow acquiring unit comprises:
a task flow acquiring subunit, adapted to obtain the sum of the web page quality distributions of all web pages in the target website, and to determine the task flow for crawling the target website according to that sum.
20. The device of claim 19, further comprising:
a task scale factor acquiring unit, adapted to obtain one or more task scale factors;
wherein the task flow acquiring subunit is adapted to:
determine the task flow for crawling the target website according to the product of the sum of the web page quality distributions and the one or more task scale factors.
21. The device of claim 20, wherein the task scale factor acquiring unit is adapted to:
obtain the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website;
and/or
obtain the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
22. The device of claim 21, wherein the task scale factor acquiring unit is adapted to:
obtain the ratio of the number of web pages updated in the crawl history of the target website, and/or the number of web pages newly produced in the target website, to the total number of web pages in the target website.
23. The device of claim 21, wherein the task scale factor acquiring unit is adapted to:
obtain and compare, from the crawl history of the target website, the information fingerprints of the crawled web pages; and
obtain, from the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints, and use it as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
24. The device of claim 19, further comprising:
a unit time coefficient acquiring unit, adapted to determine a unit time coefficient according to the total task time for crawling the target website;
wherein the task flow acquiring subunit is adapted to:
determine the task flow for crawling the target website according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit time coefficient.
25. The device of claim 24, further comprising:
a task flow adjusting unit, adapted to, when the task flow is greater than the bearable crawl flow and the difference between the two is greater than a preset threshold, adjust the task flow by adjusting the task scale factors and/or the unit time coefficient, until the task flow is less than or equal to the bearable crawl flow or the difference between the two is less than the preset threshold.
26. The device of any one of claims 14, 15, and 18-24, wherein the flow quota determining unit is adapted to:
when the task flow is greater than the bearable crawl flow and the difference between the two is less than a preset threshold, determine the task flow as the flow quota for crawling web pages on the target website.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310500682.8A CN103544278B (en) | 2013-10-22 | 2013-10-22 | Method and equipment for identifying website capturing flow quota |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310500682.8A CN103544278B (en) | 2013-10-22 | 2013-10-22 | Method and equipment for identifying website capturing flow quota |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103544278A (en) | 2014-01-29 |
CN103544278B (en) | 2017-02-01 |
Family
ID=49967730
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310500682.8A Active CN103544278B (en) | 2013-10-22 | 2013-10-22 | Method and equipment for identifying website capturing flow quota |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103544278B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104392000B (en) * | 2014-12-15 | 2016-10-12 | 北京奇虎科技有限公司 | Determine the method and apparatus that mobile site captures quota |
CN111985086B (en) * | 2020-07-24 | 2024-04-09 | 西安理工大学 | Community detection method integrating priori information and sparse constraint |
CN113486229B (en) * | 2021-07-05 | 2023-11-07 | 北京百度网讯科技有限公司 | Control method and device for grabbing pressure, electronic equipment and readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063477A (en) * | 2010-12-13 | 2011-05-18 | 百度在线网络技术(北京)有限公司 | Website data extraction device and method |
JP2012059295A (en) * | 2011-12-19 | 2012-03-22 | Intec Inc | Internet site information analysis method and apparatus |
CN102469132A (en) * | 2010-11-15 | 2012-05-23 | 北大方正集团有限公司 | Method and system for grabbing web pages from servers with different IPs (Internet Protocols) in website |
Also Published As
Publication number | Publication date |
---|---|
CN103544278A (en) | 2014-01-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103530390B (en) | The method and apparatus of webpage capture | |
CN110019396A (en) | A kind of data analysis system and method based on distributed multidimensional analysis | |
DE202014010893U1 (en) | Rufwegsucher | |
Zhou et al. | Ranking scientific publications with similarity-preferential mechanism | |
CN107862022A (en) | Cultural resource commending system | |
Griffith | Modeling spatial autocorrelation in spatial interaction data: empirical evidence from 2002 Germany journey-to-work flows | |
CN110928739B (en) | Process monitoring method and device and computing equipment | |
CN103970753B (en) | The method for pushing and device of association knowledge | |
CN109101607B (en) | Method, apparatus and storage medium for searching blockchain data | |
CN103530392B (en) | Determine the method and apparatus of crawl flow | |
CN105373546B (en) | A kind of information processing method and system for knowledge services | |
US20170109413A1 (en) | Search System and Method for Updating a Scoring Model of Search Results based on a Normalized CTR | |
CN104391953B (en) | Detect the method and device of webpage renewal | |
CN103544278B (en) | Method and equipment for identifying website capturing flow quota | |
CN106599299A (en) | Determining method and device of website key words | |
DE102018010163A1 (en) | Automatic generation of useful user segments | |
CN106446179A (en) | Hot topic generation method and device | |
CN104462554A (en) | Method and device for recommending question and answer page related questions | |
CN106326297A (en) | Application recommendation method and device | |
CN107153702A (en) | A kind of data processing method and device | |
CN109543092A (en) | Financial product recommended method, device, storage medium and computer equipment | |
CN107203623B (en) | Load balancing and adjusting method of web crawler system | |
CN103530393B (en) | Determine that website sub channel captures the method and apparatus of flow quota | |
CN113792084A (en) | Data heat analysis method, device, equipment and storage medium | |
CN102929948B (en) | list page identification system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 2022-07-27
Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015
Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.
Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)
Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.
Patentee before: Qizhi software (Beijing) Co.,Ltd.