CN103544278B - Method and equipment for identifying website capturing flow quota - Google Patents

Publication number
CN103544278B
Authority
CN
China
Prior art keywords
targeted website
webpage
flow
website
crawl
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310500682.8A
Other languages
Chinese (zh)
Other versions
CN103544278A (en)
Inventor
魏少俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201310500682.8A
Publication of CN103544278A
Application granted
Publication of CN103544278B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and equipment for determining a website capture flow quota. The method includes: obtaining accessed data of a target website to be captured; determining the capture-bearing flow of the target website according to the accessed data; obtaining the web page quality distribution of the web pages in the target website; determining the task flow for capturing the target website according to that web page quality distribution; and determining the flow quota for web page capture on the target website according to the capture-bearing flow of the target website and the task flow for capturing the target website. With this method, when a search engine's crawler program captures a website's web pages, the flow quota for web page capture on the target website can be determined well, conflict between the crawler program and the captured website is reduced, and the crawler program's capture behavior is reasonably balanced against the search engine's update needs.

Description

Method and equipment for determining website capture flow quota
Technical field
The present invention relates to the field of search engine technology, and in particular to a method and equipment for determining a website capture flow quota.
Background
A search engine is an Internet information platform. It collects a large amount of web page information from the Internet, processes it, and builds an information database and an index database. A user enters a query term at the entry point the search engine provides and obtains the search results the search engine returns for that query term. As search engine technology has developed and matured, the services it provides have become increasingly complete, and search engines have become a very common and convenient tool for obtaining needed information from the vast Internet.
To download web pages from the Internet for analysis and indexing, a search engine typically uses a program that captures web pages, commonly known as a "crawler program" or "spider". Because new web pages are constantly produced on the Internet and existing pages are constantly updated, the crawler program must work without stopping so that the search engine can obtain up-to-date web page data. To provide better search results, the crawler program of a search engine always aims to include new and updated web pages as quickly as possible. However, web page resources reside on website hosts across the network, and capturing them inevitably consumes the service resources of those hosts, such as hardware and software processing resources and bandwidth. If the capture task exceeds what a website host can tolerate, it affects normal access by the website's users, and the crawler program's capture behavior becomes unfriendly to the website; in serious cases it can cause website response timeouts or even a web server crash. Moreover, to protect their stability, websites usually monitor crawler access and restrict, or even block, crawler programs that behave in an unfriendly way. Once a crawler program is restricted or blocked, the search engine's capture efficiency drops, or it cannot update or download that website's web page resources at all, which ultimately harms the search service.
Meanwhile, in the prior art the flow or frequency at which a crawler program may capture a website is usually set manually. Although this reduces conflict between the search engine's crawler program and the captured website, it does not maximize web page data updates, so the crawler program's capture behavior and the demand for website data updates are not reasonably balanced.
Summary of the invention
In view of the above problems, the present invention is proposed to provide equipment for determining a website capture flow quota, and a corresponding method for determining a website capture flow quota, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for determining a website capture flow quota is provided, comprising:
obtaining accessed data of a target website to be captured;
determining the capture-bearing flow of the target website according to the accessed data;
obtaining the web page quality distribution of the web pages in the target website;
determining the task flow for capturing the target website according to the web page quality distribution of the web pages in the target website;
determining the flow quota for web page capture on the target website according to the capture-bearing flow of the target website and the task flow for capturing the target website.
Optionally, obtaining the accessed data of the target website to be captured comprises:
determining the accessed data of the target website according to the search engine's access statistics for the target website.
Optionally, determining the capture-bearing flow of the target website according to the accessed data comprises:
determining the bearable total access volume of the target website according to the accessed data;
determining the capture-bearing flow of the target website according to the bearable total access volume and a preset capture pressure coefficient.
Optionally, determining the bearable total access volume of the target website according to the accessed data comprises:
determining the bearable total access volume of the target website according to the search engine's access statistics for the target website, the market share of the search engine, the direct user visit volume, and the website redundant flow.
Optionally, obtaining the web page quality distribution of the web pages in the target website comprises:
determining a score for each web page according to the PageRank of the web pages in the target website and/or the link depth of the web pages;
normalizing the scores of the multiple web pages in the target website to obtain the quality distribution corresponding to each web page.
Optionally, obtaining the web page quality distribution of the web pages in the target website comprises:
obtaining the web page quality distribution of all web pages in the target website;
and determining the task flow for capturing the target website according to the web page quality distribution of the web pages in the target website comprises:
obtaining the sum of the web page quality distributions of all web pages in the target website, and determining the task flow for capturing the target website according to that sum.
Optionally, the method further comprises:
obtaining one or more task scale factors;
wherein determining the task flow for capturing the target website according to the sum of the web page quality distributions of all web pages in the target website comprises:
determining the task flow for capturing the target website according to the product of the sum of the web page quality distributions and the one or more task scale factors.
Optionally, obtaining the one or more task scale factors comprises:
obtaining the ratio of the number of web pages to be captured in the target website to the total number of web pages in the target website;
and/or,
obtaining the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
Optionally, obtaining the ratio of the number of web pages to be captured in the target website to the total number of web pages in the target website comprises:
obtaining the ratio of the number of web pages in the target website that were updated in the capture history, and/or the number of newly produced web pages in the target website, to the total number of web pages in the target website.
Optionally, obtaining the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website comprises:
obtaining and comparing, in the capture history of the target website, the information fingerprints of the captured web pages;
obtaining, from the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints, and using it as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
Optionally, the method further comprises:
determining a unit time coefficient according to the total task time for capturing the target website;
wherein determining the task flow for capturing the target website according to the sum of the web page quality distributions of all web pages in the target website comprises:
determining the task flow for capturing the target website according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit time coefficient.
Optionally, the method further comprises:
when the task flow is greater than the capture-bearing flow and the difference between the two is greater than a preset threshold, adjusting the task flow by adjusting the task scale factor and/or the unit time coefficient, until the task flow is less than or equal to the capture-bearing flow or the difference between the two is less than the preset threshold.
Optionally, determining the flow quota for web page capture on the target website according to the capture-bearing flow of the target website and the task flow for capturing the target website comprises:
when the task flow is greater than the capture-bearing flow and the difference between the two is less than the preset threshold, determining the task flow as the flow quota for web page capture on the target website.
According to another aspect of the present invention, equipment for determining a website capture flow quota is provided, comprising:
a website access data acquiring unit, adapted to obtain accessed data of a target website to be captured;
a website bearing capacity determining unit, adapted to determine the capture-bearing flow of the target website according to the accessed data;
a web page quality distribution acquiring unit, adapted to obtain the web page quality distribution of the web pages in the target website;
a task flow acquiring unit, adapted to determine the task flow for capturing the target website according to the web page quality distribution of the web pages in the target website;
a flow quota determining unit, adapted to determine the flow quota for web page capture on the target website according to the capture-bearing flow of the target website and the task flow for capturing the target website.
Optionally, the website access data acquiring unit is adapted to:
determine the accessed data of the target website according to the search engine's access statistics for the target website.
Optionally, the website bearing capacity determining unit comprises:
an access volume determining subunit, adapted to determine the bearable total access volume of the target website according to the accessed data;
the website bearing capacity determining unit being adapted to determine the capture-bearing flow of the target website according to the bearable total access volume and a preset capture pressure coefficient.
Optionally, the access volume determining subunit is adapted to:
determine the bearable total access volume of the target website according to the search engine's access statistics for the target website, the market share of the search engine, the direct user visit volume, and the website redundant flow.
Optionally, the web page quality distribution acquiring unit is adapted to:
determine a score for each web page according to the PageRank of the web pages in the target website and/or the link depth of the web pages;
normalize the scores of the multiple web pages in the target website to obtain the quality distribution corresponding to each web page.
Optionally, the web page quality distribution acquiring unit comprises:
a web page quality distribution acquiring subunit, adapted to obtain the web page quality distribution of all web pages in the target website;
and the task flow acquiring unit comprises:
a task flow acquiring subunit, adapted to obtain the sum of the web page quality distributions of all web pages in the target website, and to determine the task flow for capturing the target website according to that sum.
Optionally, the equipment further comprises:
a task scale factor acquiring unit, adapted to obtain one or more task scale factors;
the task flow acquiring subunit being adapted to:
determine the task flow for capturing the target website according to the product of the sum of the web page quality distributions and the one or more task scale factors.
Optionally, the task scale factor acquiring unit is adapted to:
obtain the ratio of the number of web pages to be captured in the target website to the total number of web pages in the target website;
and/or,
obtain the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
Optionally, the task scale factor acquiring unit is adapted to:
obtain the ratio of the number of web pages in the target website that were updated in the capture history, and/or the number of newly produced web pages in the target website, to the total number of web pages in the target website.
Optionally, the task scale factor acquiring unit is adapted to:
obtain and compare, in the capture history of the target website, the information fingerprints of the captured web pages;
obtain, from the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints, and use it as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
Optionally, the equipment further comprises:
a unit time coefficient acquiring unit, adapted to determine a unit time coefficient according to the total task time for capturing the target website;
the task flow acquiring subunit being adapted to:
determine the task flow for capturing the target website according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit time coefficient.
Optionally, the equipment further comprises:
a task flow adjustment unit, adapted to, when the task flow is greater than the capture-bearing flow and the difference between the two is greater than a preset threshold, adjust the task flow by adjusting the task scale factor and/or the unit time coefficient, until the task flow is less than or equal to the capture-bearing flow or the difference between the two is less than the preset threshold.
Optionally, the flow quota determining unit is adapted to:
when the task flow is greater than the capture-bearing flow and the difference between the two is less than the preset threshold, determine the task flow as the flow quota for web page capture on the target website.
With the method for determining a website capture flow quota according to the present invention, the capture-bearing flow that the target website can bear when the search engine's crawler program captures it can be determined according to the accessed data of the target website to be captured; the task flow of the task of capturing the target website can be determined according to the web page quality distribution of the web pages in the target website; and the flow quota for web page capture on the target website can then be determined according to the capture-bearing flow of the target website and the task flow for capturing the target website. This solves the problem of unrestricted capture by crawler programs consuming excessive website resources. Within the capture pressure the website allows, the website's web page data is captured effectively, which reduces conflict between the search engine's crawler program and the captured website, so that the crawler program's capture behavior and the search engine's update needs are reasonably balanced.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the present invention are more readily apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference signs denote the same parts. In the drawings:
Fig. 1 shows a flow chart of a web page capture method according to an embodiment of the present invention;
Fig. 2 shows a flow chart of a method for determining a website capture flow quota according to an embodiment of the present invention;
Fig. 3 shows a flow chart of a method for determining capture flow according to an embodiment of the present invention;
Fig. 4 shows a flow chart of a method for determining capture flow quotas for website sub-channels according to an embodiment of the present invention;
Fig. 5 shows a schematic diagram of web page capture equipment according to an embodiment of the present invention;
Fig. 6 shows a schematic diagram of equipment for determining a website capture flow quota according to an embodiment of the present invention;
Fig. 7 shows a schematic diagram of equipment for determining capture flow according to an embodiment of the present invention;
Fig. 8 shows a schematic diagram of equipment for determining capture flow quotas for website sub-channels according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
For ease of description, the parameters and their meanings are first defined as shown in Table 1:
Table 1
Embodiment one
Refer to Fig. 1, be the flow chart of the method for webpage capture provided in an embodiment of the present invention, as illustrated, the present invention The method of the webpage capture that embodiment provides may comprise steps of:
S110: obtain the dynamic flow quota value that webpage capture is carried out on targeted website;
During crawlers capture to the webpage in targeted website, in order to avoid unrestricted to same website Crawl, and lead to affect website normal access situations such as generation, it usually needs to crawlers on targeted website Crawl flow or frequency carry out certain restriction, and dynamic flow quota value is the crawl on targeted website to crawlers A kind of restriction of flow.The dynamic flow quota value of webpage capture is carried out on targeted website it can be understood as in crawlers During execution crawl task, the limit to the flow that the carrying out of same website captures within the unit interval, for example will be to dynamic flow Quota value is limited to 3,000,000/day.In this step s110, can obtain and the dynamic of webpage capture is carried out on targeted website Flow quota value.
When obtaining the dynamic flow quota value that webpage capture is carried out on targeted website, can be real by the following method Existing:
First obtain targeted website by access data, then can according to described by access data, determine targeted website Crawl bear flow;Obtain the web page quality distribution of webpage in targeted website;Web page quality according to webpage in targeted website Distribution, determines the task flow of crawl targeted website;And then flow is born according to the crawl of targeted website, and crawl target network The task flow stood, determines the dynamic flow quota value carrying out webpage capture on targeted website;
The accessed data of the target website can be determined according to the search engine's access statistics for the target website. When determining the capture-bearing flow of the target website according to the accessed data, the bearable total access volume of the target website can first be determined from the accessed data; the capture-bearing flow can then be determined according to the bearable total access volume and a preset capture pressure coefficient. Specifically, the bearable total access volume of the target website can be determined jointly from the access statistics for the target website collected by the search engine, the market share of the search engine, the direct user visit volume, and the website redundant flow, and then multiplied by the preset capture pressure coefficient to give the capture-bearing flow of the target website.
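The calculation described above can be sketched as follows. This is a minimal illustration, not the patented formula: the text says only that the four quantities "jointly" determine the bearable total access volume, so the additive combination below, the function names, and the example pressure coefficient are all assumptions.

```python
def bearable_total_access(search_visits, market_share,
                          direct_visits=0.0, redundant_flow=0.0):
    """Estimate the bearable total access volume of a website.

    Assumption: visits observed via the search engine are scaled up by the
    engine's market share, then direct visits and redundant flow are added.
    The source does not spell out how the four inputs are combined.
    """
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return search_visits / market_share + direct_visits + redundant_flow


def capture_bearing_flow(total_access, pressure_coefficient):
    """Multiply the bearable total by the preset capture pressure coefficient."""
    return total_access * pressure_coefficient


# Hypothetical figures: 1.5M visits via a 15%-share engine,
# 2M direct visits, 0.5M redundant flow, pressure coefficient 0.1.
total = bearable_total_access(1_500_000, 0.15, 2_000_000, 500_000)
bearing = capture_bearing_flow(total, 0.1)
```

The pressure coefficient is what keeps the crawler's share of the site's capacity small; a site operator's tolerance would decide its actual value.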
The web page quality distribution qi of the web pages in the target website can be understood as a score of the quality of the web pages in the target website. It can be determined from the PageRank of the web pages and/or their link depth: a score is determined for each web page according to its PageRank and/or link depth, and the scores of the multiple web pages in the target website are then normalized to obtain the quality distribution corresponding to each web page. Normalization maps the quality scores of the web pages in the target website into the interval (0, 1].
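A minimal sketch of scoring and normalization is given below. The scoring rule (PageRank divided by one plus link depth) and dividing by the maximum score are assumptions chosen for illustration; the text states only that the score depends on PageRank and/or link depth and that the normalized values fall in (0, 1].

```python
def page_score(pagerank, link_depth):
    """Hypothetical score: higher PageRank and shallower depth score higher."""
    return pagerank / (1 + link_depth)


def quality_distribution(pages):
    """Normalize per-page scores into (0, 1] by dividing by the top score.

    `pages` maps a URL to a (pagerank, link_depth) pair; every PageRank
    is assumed to be positive so that no normalized score is 0.
    """
    scores = {url: page_score(pr, depth) for url, (pr, depth) in pages.items()}
    top = max(scores.values())
    if top <= 0:
        raise ValueError("scores must be positive")
    return {url: score / top for url, score in scores.items()}


q = quality_distribution({"/": (8.0, 0), "/a": (4.0, 1), "/a/b": (3.0, 2)})
```

With max-normalization the best page always gets exactly 1, and the other pages are scored relative to it.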
For the search engine, the web page quality distribution of all web pages in the target website can be obtained, the sum of those distributions computed, and the task flow for capturing the target website determined according to that sum. Specifically, one or more task scale factors can be obtained, for example the ratio of the number of web pages to be captured in the target website to the total number of web pages in the target website, and/or the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website. The task flow for capturing the target website is then determined according to the product of the sum of the web page quality distributions and the one or more task scale factors.
To obtain the ratio of the number of web pages to be captured in the target website to the total number of web pages in the target website, the ratio of the number of web pages updated in the capture history, and/or the number of newly produced web pages in the target website, to the total number of web pages in the target website can be obtained. To obtain the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website, the information fingerprints of the captured web pages can be obtained and compared in the capture history of the target website; the ratio of the number of non-duplicate information fingerprints, obtained from the comparison, to the total number of fingerprints is used as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
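The two task scale factors can be sketched as below. Using a SHA-1 digest of the page content as the "information fingerprint" is an assumption made for this example; the text does not say how fingerprints are computed, and the function names are hypothetical.

```python
import hashlib


def fingerprint(content):
    """Hypothetical information fingerprint: SHA-1 digest of the page text."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()


def non_duplicate_ratio(captured_contents):
    """Distinct fingerprints divided by total fingerprints in the history."""
    prints = [fingerprint(c) for c in captured_contents]
    if not prints:
        return 0.0
    return len(set(prints)) / len(prints)


def to_capture_ratio(updated_pages, new_pages, total_pages):
    """Share of the site's pages that were updated or newly produced."""
    return (updated_pages + new_pages) / total_pages


dup_factor = non_duplicate_ratio(["a", "b", "a", "c"])  # 3 distinct of 4
cap_factor = to_capture_ratio(200, 50, 1000)
```

An exact digest only detects byte-identical duplicates; a production system would more likely use a near-duplicate fingerprint, which is outside what the text specifies.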
In addition, a unit time coefficient can be determined according to the total task time for capturing the target website. The task flow for capturing the target website can then be determined according to the product of the sum of the web page quality distributions, the one or more task scale factors, and the unit time coefficient.
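Putting the pieces together, the task flow is a product of three terms. The sketch below follows that product directly; defining the unit time coefficient as the reciprocal of the task's total duration is an assumption, since the text says only that the coefficient is derived from the total task time.

```python
from math import prod


def unit_time_coefficient(task_total_days):
    """Hypothetical coefficient: spread the task evenly over its duration."""
    return 1.0 / task_total_days


def task_flow(quality_values, scale_factors=(), unit_time_coeff=1.0):
    """Sum of quality distributions x task scale factors x unit time coefficient."""
    return sum(quality_values) * prod(scale_factors) * unit_time_coeff


# Three pages with normalized quality 1.0, 0.5, 0.5; two scale factors of 0.5.
flow = task_flow([1.0, 0.5, 0.5], (0.5, 0.5), unit_time_coefficient(4))
```

With no scale factors supplied, `prod(())` is 1, so the function reduces to the plain sum of quality values described in the simpler variant.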
When the task flow for capturing the target website is settled against the website's capture-bearing flow, if the task flow is greater than the capture-bearing flow and the difference between the two is greater than a preset threshold, the task flow can be adjusted by adjusting the task scale factor and/or the unit time coefficient, until the task flow is less than or equal to the capture-bearing flow or the difference between the two is less than the preset threshold. This achieves dynamic adjustment of the dynamic flow quota value. When the task flow is greater than the capture-bearing flow and the difference between the two is less than the preset threshold, the task flow can be determined as the dynamic flow quota value for web page capture on the target website.
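The adjustment loop can be sketched as follows. Shrinking a single multiplier by a fixed step stands in for "adjusting the task scale factor and/or the unit time coefficient"; the step size, the round limit, and the function name are assumptions, not part of the source.

```python
def settle_quota(task_flow, bearing_flow, threshold, shrink=0.9, max_rounds=1000):
    """Reduce the task flow until it fits the capture-bearing flow.

    Stops when the task flow is at or below the bearing flow, or exceeds it
    by less than `threshold`; the settled task flow becomes the quota.
    """
    factor = 1.0  # stands in for the adjustable scale factor / time coefficient
    for _ in range(max_rounds):
        current = task_flow * factor
        if current <= bearing_flow or current - bearing_flow < threshold:
            return current
        factor *= shrink
    raise RuntimeError("task flow did not settle within max_rounds")


quota = settle_quota(1000.0, 500.0, 50.0)
```

A task flow that already fits is returned unchanged, so the same function covers both the adjustment case and the direct-assignment case described above.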
Alternatively, in another approach, the dynamic flow quota value for web page capture on the target website can be obtained from the task flow for capturing the target website alone. In that case, the web page quality distribution of the web pages in the target website is obtained first; the task flow for capturing the target website is determined according to that distribution; and the dynamic flow quota value for web page capture on the target website is determined according to the task flow.
For a more specific implementation of each step of embodiment one, refer to the description of determining the website capture flow quota in steps S210 to S240 of embodiment two; the corresponding parts of the method for determining capture flow in embodiment three may also be cross-referenced.
S120: capture the web pages on the target website according to the dynamic flow quota value.
After the dynamic flow quota value for web page capture on the target website has been determined, capture can be carried out on the target website at the flow limited by that value. If a capture demand is much larger than the website's capture-bearing flow, the capture demand can be simplified, or the data to be captured can be screened more strictly, before capture is carried out.
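One simple way to enforce such a limit is a per-period counter, sketched below. The class name and the idea of keying the counter on a period label (for example a date string) are assumptions for illustration; the text does not describe how the crawler program enforces the quota.

```python
class QuotaGate:
    """Allow at most `quota_per_period` captures of one site per period."""

    def __init__(self, quota_per_period):
        self.quota = quota_per_period
        self.used = 0
        self.period = None

    def allow(self, period):
        """Return True if one more capture fits in the given period."""
        if period != self.period:  # a new period resets the counter
            self.period = period
            self.used = 0
        if self.used >= self.quota:
            return False
        self.used += 1
        return True


gate = QuotaGate(2)
results = [gate.allow("day-1"), gate.allow("day-1"),
           gate.allow("day-1"), gate.allow("day-2")]
```

Passing the period in explicitly keeps the sketch deterministic; a real crawler would derive it from the clock and would likely also smooth requests within the period.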
The web page capture method provided by embodiment one has been described in detail above. With this method, the dynamic flow quota value for web page capture on the target website is obtained and the web pages on the target website are captured according to it, so that, within the capture pressure the website allows, the website's web page data is captured effectively, which reduces conflict between the search engine's crawler program and the captured website.
Embodiment two
Refer to Fig. 2, the flow chart capturing the method for flow quota for the determination website that the embodiment of the present invention two provides, such as Shown in figure, the method that determination website provided in an embodiment of the present invention captures flow quota may comprise steps of:
S210: obtain targeted website to be captured by access data;
First, the access data of the target website to be crawled can be obtained. The access data may be the number of clicks the website receives in one day, such as parameter c in Table 1. Once the access data has been obtained, the access capacity that the target website can bear can be inferred from it.
The access data of the target website can be obtained in several ways, for example from published website ranking data. In addition, because users usually browse webpages through browser software, the webpages that users visit through a browser can be counted, and the access capacity of the website can then be determined from the browser's current market share. For example, if the browser records 1,500,000 visits per day to a certain website and the browser's current market share is 15%, the total daily visits to the website can be determined to be 10,000,000; that is, the access capacity of the website is at least 10,000,000 visits per day.
The access data of the target website can also be determined from a search engine's access statistics for the target website. Users often reach webpages through a search engine, that is, by following the search results it provides. The search engine can count the webpages visited in this way and thus count the clicks that reach the website through the search engine. Specifically, the number of visits to the target website made through the search engine is divided by the search engine's market share, and the result is taken as the access data of the website. For example, if users are counted as reaching a certain website through the search engine 1,500,000 times per day and the search engine's current market share is 15%, the total daily visits to the website can be determined to be 10,000,000; that is, the access capacity of the website is at least 10,000,000 visits per day.
In addition, several methods or sources can be combined to obtain more accurate access data for the target website. For example, combining the two methods above, that is, the statistics of the client browser software and the statistics of the search engine, makes it possible to determine both the visits that reach the target website through search engine redirection and those that do not; combining the two yields more accurate access data. Note that the access data of a website is generally expressed as the number of visits per unit time. The description above uses visits per day, but other units of time, such as visits per hour, may be used depending on the application; the present invention is not limited in this respect.
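The scaling used in the browser and search engine examples can be sketched in Python as follows. This is a minimal illustration, assuming that visits observed through one browser or search engine are representative of all traffic; the function name `estimate_total_visits` and its parameters are hypothetical and do not appear in the original text.

```python
def estimate_total_visits(observed_visits: float, market_share: float) -> float:
    """Scale visits seen through one channel up to an estimated site-wide total.

    observed_visits: visits per unit time counted by one browser or search engine.
    market_share: that browser's or search engine's market share, in (0, 1].
    """
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return observed_visits / market_share


# Figures from the example in the text: 1,500,000 visits/day at a 15% share.
daily_total = estimate_total_visits(1_500_000, 0.15)
print(round(daily_total))
```

The result is treated in the text as a lower bound on the access capacity, since the website is known to have served at least that many visits.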
S220: Determine the bearable crawl flow of the target website according to the access data;
After the access data of the target website has been obtained, the bearable crawl flow of the target website can be determined from it. The bearable crawl flow of a website can be understood as the crawl flow from a crawler that the website can bear per unit time. The unit of time can again depend on the application; the method is described below with one day as the unit.
In practice, the number of visits to the website per unit time could be used directly as the bearable crawl flow of the website. However, a website mainly serves users who browse it, so using the per-unit-time visit count directly as the bearable crawl flow may exceed the upper limit of crawling that the website can bear. Therefore, the access data of the target website is multiplied by a preset crawl pressure coefficient to obtain the bearable crawl flow. The preset crawl pressure coefficient can be a percentage whose value lies in the range (0, 1). For example, if a website receives 1,500,000 visits per day through search engine redirection and the preset crawl pressure coefficient is 30%, the bearable crawl flow finally determined for the target website is 450,000 per day.
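The multiplication described above can be sketched as follows. This is only an illustration under the stated range (0, 1) for the coefficient; the name `bearable_crawl_flow` is a hypothetical label, not a term defined by the original text.

```python
def bearable_crawl_flow(access_per_unit_time: float,
                        crawl_pressure_coefficient: float) -> float:
    """Bearable crawl flow = access data x preset crawl pressure coefficient."""
    if not 0 < crawl_pressure_coefficient < 1:
        raise ValueError("crawl_pressure_coefficient must be in (0, 1)")
    return access_per_unit_time * crawl_pressure_coefficient


# Figures from the example in the text: 1,500,000 visits/day and a 30% coefficient.
print(round(bearable_crawl_flow(1_500_000, 0.30)))
```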
The preset crawl pressure coefficient can be set flexibly according to the source of the access data. In the example above, the access data is the number of daily visits redirected from a search engine, which is actually only part of the total access the website can bear, so the preset crawl pressure coefficient can be set to a relatively high value. If more accurate access data closer to the total access the website can actually bear is available, the coefficient can be set to a relatively low value.
In another implementation, the total bearable access volume of the target website can be determined from its access data, and the bearable crawl flow can then be determined from the total bearable access volume and the preset crawl pressure coefficient. To obtain a total bearable access volume that is close to the actual situation, a reasonable approach is to draw the access data from as many sources as possible. For example, the number of visits made directly by users can be obtained from browser statistics; the number of visits made by following search engine results can be obtained from search engine statistics; and these, together with the search engine's market share and the redundant flow of the website, jointly determine the total bearable access volume. The redundant flow of a website refers to its spare access capacity, which can be obtained from long-term monitoring of the website's access peaks or from empirical values. For example, suppose a certain website receives 1,500,000 visits per day through search engine redirection and the market share of this search engine is 15%. Suppose further that half of the website's flow comes from direct user visits, that is, the flow from direct visits is comparable to the flow redirected from search engines, and that the website has 50% redundant flow. The total bearable access volume of the website per unit time (per day) can then be determined as:
$$150 \div 15\% \div 50\% \times 150\% = 3000 \ \text{(ten thousand visits/day)}$$
That is, the total access volume the website can bear is 30,000,000 visits per day. If the preset crawl pressure coefficient is 5%, the bearable crawl flow of the website can be determined as:
$$3000 \times 5\% = 150 \ \text{(ten thousand visits/day)}$$
In this example, because the access data of the target website is drawn from several sources, the total bearable access volume obtained is closer to the total flow the website can actually bear, and the preset crawl pressure coefficient is therefore set to a relatively low value: 5% here, compared with 30% in the previous example.
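The multi-source calculation in this example can be reproduced with the short sketch below. The function and parameter names are hypothetical, and the reading of each factor (dividing by the share of traffic that arrives through search engines, then adding the redundant flow) is an interpretation of the worked figures rather than a definition given in the text.

```python
def total_bearable_access(search_visits: float,
                          search_market_share: float,
                          search_fraction_of_traffic: float,
                          redundancy: float) -> float:
    """Estimate the total access volume a website can bear per unit time."""
    # Scale one search engine's referrals up to all search engines.
    all_search_visits = search_visits / search_market_share
    # Scale search-referred visits up to all visits (search plus direct).
    all_visits = all_search_visits / search_fraction_of_traffic
    # Add the spare capacity the website is assumed to have.
    return all_visits * (1 + redundancy)


# 1,500,000 visits/day, 15% market share, half of traffic via search, 50% redundancy.
total = total_bearable_access(1_500_000, 0.15, 0.5, 0.5)
crawl_flow = total * 0.05  # preset crawl pressure coefficient of 5%
print(round(total), round(crawl_flow))
```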
In Table 1, c denotes the number of clicks the target website receives in one day, which can be the number of times all pages of the website are clicked in that day's search results. The bearable crawl flow of the target website can then be understood as a function of parameter c and written as f(c).
S230: Obtain the webpage quality distribution of the webpages in the target website;
Through steps S210 and S220, the bearable crawl flow of the target website has been obtained. This value can be understood as a prediction, derived from the website's access data, of how much crawling by a crawler the website can bear. On the other hand, it is also necessary to know the task the crawler has to perform on the website, that is, the task flow of crawling the target website. In the embodiments of the present invention, the task flow is obtained mainly from the webpage quality distribution of the webpages in the target website, such as parameter qi in Table 1. Here, the webpage quality distribution qi can be understood as a score of the webpage quality of each webpage in the target website. It can be determined from the PageRank of a webpage and/or the link depth of the webpage: for example, a score is determined for each webpage from its PageRank and/or link depth, and the scores of the webpages in the target website are then normalized to obtain the quality distribution corresponding to each webpage. Normalization maps the quality scores of the webpages in the target website into the interval (0, 1]. For example, suppose the target website contains the following webpages, with the PageRank (also called the PR value, a positive integer from 1 to 10) and link depth (depth, a positive integer given by the link depth of the webpage) of each webpage as shown in Table 2:
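Table 2

| Webpage | PR value | PR ÷ 10 × 0.7 | depth | 1/depth × 0.3 | Quality distribution |
|---|---|---|---|---|---|
| Webpage 1 | 10 | 0.7 | 1 | 0.3 | 1 |
| Webpage 2 | 7 | 0.49 | 3 | 0.1 | 0.59 |
| Webpage 3 | 7 | 0.49 | 3 | 0.1 | 0.59 |
| Webpage 4 | 6 | 0.42 | 5 | 0.06 | 0.48 |
| Webpage 5 | 8 | 0.56 | 2 | 0.15 | 0.71 |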
In Table 2, both PageRank and link depth are used to determine the quality distribution of a webpage, so different weights can be assigned to the two in the calculation: here 0.7 for PageRank and 0.3 for link depth. Because the PR value is a positive integer from 1 to 10 and the link depth is a positive integer, normalizing each term (PR ÷ 10 and 1/depth) and applying the weights yields the quality distribution of each of webpages 1 to 5. Of course, in practice the quality distribution can also be obtained in other ways, for example from the PR value alone or the link depth alone. A scoring module can also be preset in the browser so that users score each webpage while browsing; the search engine then collects the users' scores, and the quality distribution is obtained by aggregating and normalizing them. User scoring can also be combined with the PageRank and link depth method above to provide yet another way of obtaining the quality distribution; the process is similar to the example above and is not repeated here.
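The weighted, normalized score used in Table 2 can be sketched as follows. The weights 0.7 and 0.3 and the page values come from Table 2; the function name `quality_score` and the dictionary layout are hypothetical choices made for this illustration.

```python
def quality_score(pagerank: int, depth: int,
                  pr_weight: float = 0.7, depth_weight: float = 0.3) -> float:
    """Quality distribution of one webpage, in (0, 1].

    pagerank: PR value, a positive integer from 1 to 10.
    depth: link depth, a positive integer (1 is the shallowest).
    """
    if not 1 <= pagerank <= 10 or depth < 1:
        raise ValueError("pagerank must be in 1..10 and depth must be >= 1")
    return pagerank / 10 * pr_weight + (1 / depth) * depth_weight


# (PR value, link depth) for the five webpages of Table 2.
pages = {
    "webpage 1": (10, 1),
    "webpage 2": (7, 3),
    "webpage 3": (7, 3),
    "webpage 4": (6, 5),
    "webpage 5": (8, 2),
}
scores = {name: quality_score(pr, depth) for name, (pr, depth) in pages.items()}
for name, score in scores.items():
    print(name, round(score, 2))
```

Summing these scores gives the quantity that the later steps call the sum of the webpage quality distributions.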
In addition, depending on the crawl task, the webpage quality distribution may be obtained for all webpages in the target website, or only for those webpages in the target website that need to be crawled; this is described in detail in the subsequent steps.
S240: Determine the task flow of crawling the target website according to the webpage quality distribution of the webpages in the target website;
Next, the task flow of crawling the target website can be determined from the webpage quality distribution of the webpages in the target website. Specifically, the sum of the webpage quality distributions of the webpages in the target website is obtained first, and the task flow is then determined from this sum.
Depending on the crawler's task strategy or crawl target, the webpage quality distribution may be obtained over different ranges of webpages.
The crawl demand of the crawler can come from two sources. The first is webpages in the crawl history that have been updated: the search engine has already crawled the webpages in the website, some of them have since been updated, and the search engine needs to crawl the updated webpages again. In Table 1, m denotes the number of webpages of the website currently indexed; if the proportion of updated webpages is a, the number of updated webpages in the crawl history is (a × m). The second is newly discovered webpages that have not yet been crawled, whose number is parameter n in Table 1. Combining the two, the number of webpages the crawl task needs to crawl may be:
(a×m)+n
When the crawl task covers both kinds of webpages, the webpage quality distribution qi of all webpages in the target website can be obtained, and the sum of qi over all webpages in the target website can then be computed, that is:
$$\sum_{i=1}^{m} q_i$$
The task flow of crawling the target website is determined from the sum of the webpage quality distributions of all webpages in the target website. The sum can be used directly as the task flow. Alternatively, one or more task scale factors can be obtained, and the task flow can be determined from the product of the sum and the one or more task scale factors. Depending on its nature, each task scale factor plays a different role in determining the task flow.
The one or more task scale factors may be the following:
the ratio of the number of webpages to be crawled in the target website to the total number of webpages in the target website;
and/or,
the ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website.
The task scale factor given by the ratio of the number of webpages to be crawled to the total number of webpages in the target website can be expressed as:
$$\frac{(a \times m) + n}{m}$$
When crawling the webpages of the target website, the crawler may crawl only the updated webpages in the crawl history and the newly discovered webpages that have not yet been crawled. The ratio of these two kinds of webpages to the total number of webpages can then be used as a task scale factor and multiplied by the sum of the webpage quality distributions qi of all webpages in the target website, that is:
$$\frac{(a \times m) + n}{m} \times \sum_{i=1}^{m} q_i$$
The product is used as the task flow of crawling the target website and reflects more accurately the flow required by this crawl task. In an actual crawl task, updated webpages in the crawl history and newly generated webpages in the target website do not necessarily both exist. Therefore, when obtaining this task scale factor, the ratio can be computed, according to the actual situation, from the number of updated webpages in the crawl history and/or the number of newly generated webpages in the target website, relative to the total number of webpages in the target website.
In addition, the ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website, that is, parameter u in Table 1, can be obtained as another task scale factor. While crawling the target website, the crawler can usually identify duplicate pages and crawl each duplicated page only once. This task scale factor therefore further filters duplicate pages out of the task flow of crawling the target website, making the flow required by the crawl task more accurate. With this factor included, the task flow of crawling the target website can be determined according to:
$$\frac{(a \times m) + n}{m} \times u \times \sum_{i=1}^{m} q_i$$
Specifically, the ratio of the number of non-duplicate webpages to the total number of webpages in the target website (parameter u in Table 1) can be obtained with webpage information fingerprint technology: the information fingerprints of the webpages crawled in the crawl history of the target website are obtained and compared; the number of non-duplicate information fingerprints is obtained from the comparison; and its ratio to the total number of fingerprints is taken as the ratio of the number of non-duplicate webpages to the total number of webpages in the target website.
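A minimal sketch of the fingerprint comparison is shown below. The text does not specify how an information fingerprint is computed, so the use of a SHA-256 digest of the page body is purely an assumption for illustration, as are the function name `unique_page_ratio` and its input format.

```python
import hashlib


def unique_page_ratio(page_bodies: list[str]) -> float:
    """Ratio u: number of distinct fingerprints / total number of fingerprints.

    Assumption: a page's fingerprint is the SHA-256 digest of its body text.
    Real systems may use near-duplicate fingerprints instead of exact hashes.
    """
    if not page_bodies:
        raise ValueError("page_bodies must not be empty")
    fingerprints = [
        hashlib.sha256(body.encode("utf-8")).hexdigest() for body in page_bodies
    ]
    return len(set(fingerprints)) / len(fingerprints)


# Four crawled pages, two of which have identical content.
crawled = ["<html>home</html>", "<html>news</html>",
           "<html>home</html>", "<html>about</html>"]
print(unique_page_ratio(crawled))
```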
In addition, a crawl task on the target website usually takes some time to complete, whereas the flow required by the task is normally the flow allocated to crawling the target website per unit time. A unit time coefficient can therefore be introduced, determined by the time the crawl task requires. If the crawl task on the target website requires time t (for example, 10 days), the task flow of crawling the target website in each unit of time (for example, each day) can be determined according to:
$$\frac{1}{t} \times \frac{(a \times m) + n}{m} \times u \times \sum_{i=1}^{m} q_i$$
Because the bearable crawl flow of the target website is expressed as flow per unit time, the task flow of crawling the target website can use the same unit to make comparison easy, for example ten thousand times per day for both.
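The per-unit-time task flow formula can be evaluated with the sketch below. The symbols a, m, n, u, t and qi follow the text; the function name `task_flow` and the sample values of a, n, u and t are hypothetical, chosen only to exercise the formula.

```python
def task_flow(quality_scores: list[float], a: float, n: int,
              u: float, t: float) -> float:
    """Evaluate (1/t) * ((a*m + n)/m) * u * sum(q_i).

    quality_scores: q_i for the m indexed webpages of the target website.
    a: proportion of indexed webpages that have been updated.
    n: number of newly discovered webpages not yet crawled.
    u: ratio of non-duplicate webpages to all webpages.
    t: number of time units the crawl task is spread over.
    """
    m = len(quality_scores)
    if m == 0 or t <= 0:
        raise ValueError("need at least one webpage and a positive duration")
    crawl_ratio = (a * m + n) / m
    return (1 / t) * crawl_ratio * u * sum(quality_scores)


# Quality distributions from Table 2; a, n, u and t are illustrative values.
q = [1.0, 0.59, 0.59, 0.48, 0.71]
print(task_flow(q, a=0.4, n=1, u=0.8, t=10))
```

Setting `t=1` and `u=1` recovers the simpler forms of the formula given earlier in this step.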
S250: Determine the flow quota for crawling webpages on the target website according to the bearable crawl flow of the target website and the task flow of crawling the target website;
Note that determining the bearable crawl flow of the target website and determining the task flow of crawling the target website can be performed in either order: steps S210 and S220 can be performed first, followed by steps S230 and S240, or steps S230 and S240 can be performed first, followed by steps S210 and S220. In either order, both the bearable crawl flow of the target website and the task flow of crawling the target website are obtained.
When a crawler crawls webpages, the crawl flow or frequency on a target website usually needs to be limited, so that unrestricted crawling of one website does not interfere with normal access to it. A flow quota is one such limit. The flow quota for crawling webpages on the target website can be understood as a limit on the flow or frequency with which the crawler crawls one website per unit time while executing a crawl task, for example limiting the flow quota for the target website to 3,000,000 per day. In the method provided by Embodiment One of the present invention, the flow quota for crawling webpages on the target website can be determined from the bearable crawl flow of the target website and the task flow of crawling the target website.
After the bearable crawl flow of the target website and the task flow of crawling the target website have been obtained, the flow quota for crawling webpages on the target website can be determined from the two. Specifically, the two can be compared and the smaller one used as the flow quota. For example, when the bearable crawl flow of the target website is written as f(c) and the task flow of crawling the target website is written as:
$$\frac{1}{t} \times \frac{(a \times m) + n}{m} \times u \times \sum_{i=1}^{m} q_i$$
the flow quota for crawling webpages on the target website can be taken as:
$$\min\left(f(c),\ \frac{1}{t} \times \frac{(a \times m) + n}{m} \times u \times \sum_{i=1}^{m} q_i\right)$$
Here min denotes comparing two or more parameters and taking the smallest as the result.
In addition, the flow a website can bear usually has some elasticity. When the bearable crawl flow of the target website and the task flow differ only slightly, the task flow can be used as the flow quota for crawling webpages on the target website. That is, when the task flow exceeds the bearable crawl flow and the difference between the two is less than a preset threshold, the task flow can be taken as the flow quota. When the task flow exceeds the bearable crawl flow and the difference is greater than the preset threshold, the task flow can be adjusted by adjusting the task scale factors and/or the unit time coefficient until the task flow is less than or equal to the bearable crawl flow or the difference is less than the preset threshold. Adjusting the task scale factors essentially simplifies the crawl demand or screens the data to be crawled more strictly, while adjusting the unit time coefficient essentially changes the time the crawler takes to execute the crawl task on the target website.
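The selection rule of step S250, including the elasticity threshold, can be sketched as follows. The function name `flow_quota` and its parameter names are hypothetical. The final branch simply falls back to the bearable crawl flow (the min rule); the text instead describes adjusting the task scale factors or the unit time coefficient in that case, which this sketch does not attempt.

```python
def flow_quota(bearable: float, task: float,
               elasticity_threshold: float = 0.0) -> float:
    """Choose the flow quota from the bearable crawl flow and the task flow."""
    if task <= bearable:
        # The task needs no more than the website can bear: min() picks the task flow.
        return task
    if task - bearable < elasticity_threshold:
        # Slightly over the bearable flow: rely on the website's elasticity.
        return task
    # Far over the bearable flow: cap at what the website can bear (fallback).
    return bearable


# Flows in ten thousand times/day; the threshold of 20 is an illustrative value.
print(flow_quota(150, 100))       # task flow below the bearable flow
print(flow_quota(150, 160, 20))   # over by less than the threshold
print(flow_quota(150, 300, 20))   # over by more than the threshold
```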
The method for determining a website crawl flow quota provided by Embodiment Two has been described in detail above. With this method, the bearable crawl flow, that is, the crawling by a crawler that the target website can bear, can be determined from the access data of the target website to be crawled; the task flow of the task of crawling the target website can be determined from the webpage quality distribution of the webpages in the target website; and the flow quota for crawling webpages on the target website can then be determined from the bearable crawl flow and the task flow. This solves the problem of a crawler crawling without limit and occupying too much of a website's resources. The webpage data of the website is crawled effectively within the crawl pressure the website allows, which reduces conflicts between the search engine's crawler and the crawled website and strikes a reasonable balance between crawler behaviour and the search engine's update demand.
Embodiment three
Refer to Fig. 3, a flow chart of the method for determining crawl flow provided by Embodiment Three of the present invention. As shown in the figure, the method may include the following steps.
S310: Obtain a task scale factor according to attribute features of the target website;
S320: Determine the task flow of crawling the target website based on the task scale factor and the sum of the webpage quality distributions in the target website.
The task scale factor obtained may be the ratio of the number of webpages to be crawled in the target website to the total number of webpages in the target website, and/or the ratio of the number of non-duplicate webpages in the target website to the total number of webpages in the target website. The former may be obtained as the ratio of the number of updated webpages in the crawl history and/or the number of newly generated webpages in the target website to the total number of webpages in the target website. The latter may be obtained by obtaining and comparing the information fingerprints of the webpages crawled in the crawl history of the target website, obtaining the number of non-duplicate information fingerprints from the comparison, and taking its ratio to the total number of fingerprints as the ratio of the number of non-duplicate webpages to the total number of webpages in the target website.
Then, the task flow of crawling the target website can be determined from the product of the one or more task scale factors and the sum of the webpage quality distributions in the target website. The sum of the webpage quality distributions can be determined as follows: a score is determined for each webpage from the PageRank and/or link depth of the webpages in the target website; the scores of the webpages in the target website are normalized to obtain the quality distribution corresponding to each webpage; and the sum is determined from the quality distributions obtained.
In addition, a unit time coefficient can be determined from the total time of the task of crawling the target website. When the task flow of crawling the target website is determined based on the task scale factor and the sum of the webpage quality distributions in the target website, it can then be determined from the product of the sum of the webpage quality distributions, the one or more task scale factors and the unit time coefficient.
After the task flow of crawling the target website has been obtained, the target website can be crawled according to that task flow. When the task flow determined is too large, it can be adjusted to within the range the target website can bear by adjusting the task scale factors and/or the unit time coefficient. Adjusting the task scale factors essentially simplifies the crawl demand or screens the data to be crawled more strictly, while adjusting the unit time coefficient essentially changes the time the crawler takes to execute the task of crawling the target website.
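Because the unit time coefficient is 1/t, one possible way to bring an oversized task flow within the bearable range is to lengthen the task. The sketch below computes the shortest whole number of time units for which the per-unit flow fits; the function name `min_task_duration` and the choice of whole time units are assumptions for illustration, not a procedure stated in the text.

```python
import math


def min_task_duration(total_task_flow: float, bearable_crawl_flow: float) -> int:
    """Smallest whole t such that total_task_flow / t <= bearable_crawl_flow.

    total_task_flow: task flow if the whole task ran in one time unit (t = 1).
    bearable_crawl_flow: flow the target website can bear per time unit.
    """
    if total_task_flow < 0 or bearable_crawl_flow <= 0:
        raise ValueError("flows must be non-negative and the bearable flow positive")
    return max(1, math.ceil(total_task_flow / bearable_crawl_flow))


# A task worth 1000 units against a website that can bear 150 units per day.
days = min_task_duration(1000, 150)
print(days, 1000 / days)
```

The alternative adjustment named in the text, lowering the task scale factors, corresponds to crawling fewer pages rather than crawling for longer.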
The method for determining crawl flow provided by Embodiment Three has been described above; for more specific implementations and application examples, cross-reference Embodiment Two. With this method, a task scale factor can be obtained from attribute features of the target website, and the task flow of crawling the target website can be determined based on the task scale factor and the sum of the webpage quality distributions in the target website. The crawl flow required is thus determined accurately when the crawler crawls a website, the webpage data of the website is crawled effectively, and conflicts between the search engine's crawler and the crawled website are reduced.
Embodiment four
The foregoing embodiments all describe specific implementations taking the website as the starting point. Many websites, however, contain multiple sub-sites or subchannels. In that case, a sub-site or subchannel can be regarded as an independent website, and a method similar to those provided by the embodiments above can be applied to it: the subchannel bearable flow of each subchannel is obtained from the access data of that subchannel, and the subchannel task flow of each subchannel is determined from the webpage quality distribution of the webpages in that subchannel. The difference is that the crawl weight of each subchannel is now determined from its subchannel bearable flow and subchannel task flow, and the channel quota of each subchannel is determined from the flow quota of the whole target website together with the crawl weight of each subchannel. Finally, the webpages in each subchannel are crawled according to the channel quota corresponding to that subchannel. This is described in detail below.
Refer to Fig. 4, a flow chart of the method for determining website subchannel crawl flow quotas provided by Embodiment Four of the present invention. As shown in the figure, the method may include the following steps:
S410: Obtain the subchannel bearable flow of each subchannel in the target website;
In a specific implementation, because each subchannel of the target website can be treated as an independent website and the access data of each subchannel of the same target website, such as its user visits, can be counted separately, the bearable crawl flow of a subchannel can be obtained in the same way as the bearable crawl flow of the target website described in steps S110 and S120 of Embodiment One. That is, the subchannel bearable flow of each subchannel can be obtained from the access data of that subchannel, for example from the access data for each subchannel of the target website counted by a search engine. Specifically, the total bearable access volume of each subchannel can be determined from its access data, and the subchannel bearable flow can then be determined from that total and a preset channel pressure coefficient. For the specific implementation process, see Embodiment One or Embodiment Two; it is not repeated here.
S420: Determine the subchannel task flow of each subchannel according to the webpage quality distribution of the webpages in that subchannel;
The subchannel task flow for crawling a subchannel is in fact a prediction of the flow needed to crawl the subchannel, derived from past crawl history and webpage quality. Before the subchannel task flow is determined, the webpage quality distribution of the webpages in each subchannel can likewise be obtained first, in the same way as the webpage quality distribution is obtained in step S230 of Embodiment One. The subchannel task flow can then be determined in the same way as the task flow of the target website is determined in step S140. For example, a score can be determined for each webpage in a subchannel from its PageRank and/or link depth; the scores of the webpages in the subchannel are normalized to obtain the quality distribution corresponding to each webpage; and the subchannel task flow is determined from the webpage quality distributions obtained. For the specific implementation, see the description in Embodiment One; it is not repeated here.
S430: Calculate the crawl weight corresponding to each subchannel according to the subchannel bearable flow and the subchannel task flow;
After the subchannel bearable flow and the subchannel task flow have been obtained, Embodiment Four can also calculate the crawl weight corresponding to each subchannel. That is, for each subchannel, the smaller of its bearable flow and its predicted subchannel task flow is selected as a reference value, so that each subchannel has one reference value. The reference value is not used directly as the flow quota of the subchannel; instead, the weight of each subchannel is first calculated from the reference values. For example, the reference values of all subchannels can be added together, and the weight of each subchannel is the proportion of its own reference value in this sum. For instance, for subchannels 1, 2 and 3 with reference values n1, n2 and n3 respectively, the weight of subchannel 1 is n1/(n1+n2+n3), the weight of subchannel 2 is n2/(n1+n2+n3), and the weight of subchannel 3 is n3/(n1+n2+n3).
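The weight calculation of step S430 can be sketched as follows. The function name `crawl_weights`, the dictionary layout and the subchannel names and figures are hypothetical, chosen only to illustrate the min-then-normalize rule described above.

```python
def crawl_weights(subchannels: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Crawl weight of each subchannel.

    subchannels maps a subchannel name to (bearable flow, task flow).
    The reference value of a subchannel is the smaller of the two, and its
    weight is that reference value divided by the sum of all reference values.
    """
    references = {
        name: min(bearable, task) for name, (bearable, task) in subchannels.items()
    }
    total = sum(references.values())
    if total <= 0:
        raise ValueError("the sum of reference values must be positive")
    return {name: ref / total for name, ref in references.items()}


# (bearable flow, task flow) per subchannel; illustrative figures only.
weights = crawl_weights({
    "news": (100, 60),
    "sports": (30, 50),
    "blog": (10, 10),
})
print(weights)
```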
S440: Determine the quota of each subchannel from the total flow quota of the target website and the crawl weight of each subchannel.
After the weight of each subchannel has been calculated, each weight can be multiplied by the total flow quota of the website to which the subchannel belongs to obtain the subchannel's quota. For the total flow quota of the target website, see the description in Embodiment 1; it is not repeated here.
In a specific implementation, a unit time coefficient can also be determined from the total task time for crawling each subchannel; the product of the target website's total flow quota, the subchannel's weight and the unit time coefficient is then used as the subchannel quota for crawling that subchannel. Finally, the web pages in each subchannel can be crawled according to each subchannel's quota.
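A minimal sketch of the quota calculation follows. The text does not say how the unit time coefficient is derived from the total task time; the sketch assumes it is the length of one time unit divided by the total task time. That definition, the function names and the sample values are all assumptions.

```python
def unit_time_coefficient(total_task_seconds, unit_seconds):
    """Assumed definition: the fraction of the whole task that fits in one time unit."""
    return unit_seconds / total_task_seconds


def subchannel_quotas(total_quota, weights, unit_coeff=1.0):
    """Quota per subchannel = total quota * subchannel weight * unit time coefficient."""
    return {name: total_quota * w * unit_coeff for name, w in weights.items()}


# Hypothetical values: a 2-hour crawl task, with quotas expressed per hour.
coeff = unit_time_coefficient(total_task_seconds=7200, unit_seconds=3600)
quotas = subchannel_quotas(
    1000, {"news": 0.5, "sports": 0.25, "forum": 0.25}, unit_coeff=coeff
)
print(quotas)
```

With the coefficient left at its default of 1.0, the function reduces to the plain weight-times-total-quota rule of step S440.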
With the method for determining website subchannel crawl flow quotas provided by Embodiment 4, the crawl weight of each subchannel can be calculated from the obtained subchannel bearable flow and subchannel task flow, and the quota of each subchannel can be determined from the target website's total flow quota and the crawl weights. This reduces conflict between the search engine's crawler and the crawled website while distributing crawl flow more reasonably among the subchannels of the target website.
Corresponding to the web page crawling method provided by Embodiment 1, Embodiment 1 also provides a web page crawling device. Referring to Fig. 5, the device may include:
A dynamic flow quota value acquiring unit 510, adapted to obtain the dynamic flow quota value for crawling web pages on the target website;
A web page crawling unit 520, adapted to crawl the web pages on the target website according to the dynamic flow quota value.
The dynamic flow quota value acquiring unit 510 may include:
A website access data acquiring unit, adapted to obtain the access data of the target website;
A website capacity determining unit, adapted to determine the bearable crawl flow of the target website from the access data;
A web page quality distribution acquiring unit, adapted to obtain the web page quality distribution of the web pages in the target website; and a task flow acquiring unit, adapted to determine the task flow for crawling the target website from that web page quality distribution.
In this implementation, the dynamic flow quota value acquiring unit 510 can determine the dynamic flow quota value for crawling web pages on the target website from the bearable crawl flow of the target website and the task flow for crawling the target website.
Specifically, the website access data acquiring unit can determine the access data of the target website from the search engine's access statistics for the target website.
The website capacity determining unit may include:
An access amount determining subunit, adapted to determine the bearable access total of the target website from the access data;
The website capacity determining unit can then determine the bearable crawl flow of the target website from the bearable access total and a preset crawl pressure coefficient.
In this implementation, the access amount determining subunit can determine the bearable access total of the target website from the search engine's access statistics for the target website, the search engine's market share, the users' direct access amount, and the website's redundant flow.
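The passage names the inputs of the bearable access total but does not give a formula. The sketch below is therefore only one plausible reading and is an assumption throughout: search-engine referrals are scaled up by the engine's market share to estimate all search traffic, direct visits and redundant capacity are added, and the result is multiplied by the crawl pressure coefficient. Every function name, parameter name and number is hypothetical.

```python
def bearable_access_total(engine_referrals, market_share, direct_visits, redundant_flow):
    """Assumed formula: estimated search visits + direct visits + redundant capacity."""
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    # Scale one engine's referrals up to an estimate of all search traffic.
    estimated_search_visits = engine_referrals / market_share
    return estimated_search_visits + direct_visits + redundant_flow


def bearable_crawl_flow(access_total, crawl_pressure_coeff):
    """Bearable crawl flow = bearable access total * preset crawl pressure coefficient."""
    return access_total * crawl_pressure_coeff


# Hypothetical daily figures, for illustration only.
total = bearable_access_total(
    engine_referrals=2500, market_share=0.25, direct_visits=4000, redundant_flow=6000
)
crawl_flow = bearable_crawl_flow(total, crawl_pressure_coeff=0.125)
print(total, crawl_flow)
```

Only the last step, multiplying the bearable access total by a preset crawl pressure coefficient, is taken directly from the text; the way the four inputs are combined is a guess.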
Specifically, the web page quality distribution acquiring unit can determine a score for each web page from the PageRank and/or link depth of the web pages in the target website, and then normalize the scores of the multiple web pages in the target website to obtain the quality distribution of each web page.
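As an illustration of scoring followed by normalization, the sketch below makes two assumptions that the text does not specify: the score is PageRank discounted by link depth, and normalization means dividing each score by the largest score so that values fall in [0, 1]. The function names and sample pages are hypothetical.

```python
def page_score(pagerank, link_depth):
    """Assumed scoring rule: deeper pages count for less."""
    return pagerank / (1 + link_depth)


def quality_distribution(pages):
    """pages: url -> (pagerank, link_depth). Returns url -> normalized quality."""
    scores = {url: page_score(pr, depth) for url, (pr, depth) in pages.items()}
    top = max(scores.values(), default=0.0)
    if top == 0:
        return {url: 0.0 for url in scores}
    # Assumed normalization: scale by the maximum score.
    return {url: score / top for url, score in scores.items()}


# Hypothetical pages: (PageRank, link depth).
dist = quality_distribution({"/": (8.0, 0), "/a": (4.0, 1), "/a/b": (3.0, 2)})
print(dist)
```

Scaling by the maximum, rather than making the values sum to 1, keeps the sum of the distribution informative when it is later used to size the crawl task.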
Web page quality distributed acquisition unit may include that
Web page quality distributed acquisition subelement, obtains the web page quality distribution of all webpages in targeted website;
Under this implementation, task flow acquiring unit may include that
Task flow obtains subelement, is suitable to the web page quality distribution of all webpages in the targeted website of acquisition Summation, the summation of the web page quality distribution according to webpages all in targeted website, determine the task of crawl targeted website Flow.
Under this implementation, this equipment can also include:
Task scale factor acquiring unit, is suitable to obtain one or more task scale factors;
Described task flow obtains subelement, is suitable to:
Summation according to web page quality distribution and the product of one or more task scale factors, determine crawl targeted website Task flow.
The task scale factor acquiring unit can obtain the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website;
And/or,
Obtain the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
In a specific implementation, the task scale factor acquiring unit can obtain the ratio of the number of web pages updated in the crawl history, and/or the number of newly generated web pages in the target website, to the total number of web pages in the target website.
In a specific implementation, the task scale factor acquiring unit can obtain and compare the information fingerprints of the web pages crawled in the crawl history of the target website;
From the comparison result it obtains the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints, and uses it as the ratio of non-duplicate web pages to the total number of web pages in the target website.
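The fingerprint comparison can be sketched as follows. The text does not say how an information fingerprint is computed; a SHA-1 digest of the page content is used here purely as a stand-in, and the function names and sample pages are assumptions.

```python
import hashlib


def fingerprint(content):
    """Stand-in information fingerprint: SHA-1 digest of the page content."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()


def unique_page_ratio(crawled_pages):
    """Ratio of distinct fingerprints to all fingerprints in the crawl history."""
    prints = [fingerprint(page) for page in crawled_pages]
    if not prints:
        return 0.0
    return len(set(prints)) / len(prints)


# Hypothetical crawl history: the first and third pages are identical.
ratio = unique_page_ratio(["<p>a</p>", "<p>b</p>", "<p>a</p>", "<p>c</p>"])
print(ratio)
```

An exact digest only detects byte-identical pages; a production crawler would more likely use a near-duplicate fingerprint, which is outside what the text describes.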
In another implementation, the device may also include:
A unit time coefficient acquiring unit, adapted to determine a unit time coefficient from the total task time for crawling the target website;
The task flow acquiring subunit can then determine the task flow for crawling the target website from the product of the sum of the web page quality distribution, the one or more task scale factors, and the unit time coefficient.
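The product described here is simple enough to state directly in code. In this sketch the function name, argument names and sample numbers are assumptions; the quality values stand for the normalized quality distribution of all pages in the target website.

```python
from math import prod


def task_flow(quality_values, scale_factors, unit_time_coeff=1.0):
    """Task flow = sum of quality distribution * all task scale factors * unit time coefficient."""
    return sum(quality_values) * prod(scale_factors) * unit_time_coeff


# Hypothetical inputs: three pages, two scale factors
# (share of pages to crawl, share of non-duplicate pages).
flow = task_flow([1.0, 0.25, 0.125], scale_factors=[0.5, 0.75], unit_time_coeff=2.0)
print(flow)
```

Because `math.prod` of an empty sequence is 1, passing no scale factors leaves the quality sum unchanged, which matches the variant that uses the quality sum alone.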
The device may also include a task flow adjusting unit, adapted to adjust the task flow when the task flow is greater than the bearable crawl flow and the difference between the two is greater than a preset threshold. It does so by adjusting the task scale factor and/or the unit time coefficient until the task flow is less than or equal to the bearable crawl flow, or the difference between the two is less than the preset threshold.
When the task flow is greater than the bearable crawl flow but the difference between the two is less than the preset threshold, the dynamic flow quota value acquiring unit 510 can take the task flow as the dynamic flow quota value for crawling web pages on the target website.
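The adjustment loop can be sketched as below. The text says only that the task scale factor and/or the unit time coefficient are adjusted; halving the scale factor on each step, the step limit and all names and numbers are assumptions made for illustration.

```python
def settle_task_flow(quality_sum, scale, unit_coeff, bearable, threshold,
                     shrink=0.5, max_steps=100):
    """Shrink the task scale factor until the task flow is acceptable.

    The flow is accepted when it does not exceed the bearable crawl flow,
    or exceeds it by less than the preset threshold.
    """
    flow = quality_sum * scale * unit_coeff
    for _ in range(max_steps):
        if flow <= bearable or flow - bearable < threshold:
            break
        scale *= shrink  # assumed adjustment rule
        flow = quality_sum * scale * unit_coeff
    # If max_steps is exhausted, the last (still too large) flow is returned.
    return flow


# Hypothetical values: the flow shrinks 1000 -> 500 -> 250, which fits under 300.
quota = settle_task_flow(1000, 1.0, 1.0, bearable=300, threshold=10)
print(quota)
```

A task flow that overshoots the bearable crawl flow by less than the threshold is returned unchanged, as in the second condition described above.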
In addition, the dynamic flow quota value acquiring unit 510 may include:
A web page quality distribution acquiring unit, adapted to obtain the web page quality distribution of the web pages in the target website;
A task flow acquiring unit, adapted to determine the task flow for crawling the target website from the web page quality distribution of the web pages in the target website;
The dynamic flow quota value acquiring unit 510 can then determine the dynamic flow quota value for crawling web pages on the target website from the task flow for crawling the target website.
Corresponding to the method for determining a website crawl flow quota provided by Embodiment 2, Embodiment 2 also provides a device for determining a website crawl flow quota. Referring to Fig. 6, the device may include:
A website access data acquiring unit 610, adapted to obtain the access data of the target website to be crawled;
A website capacity determining unit 620, adapted to determine the bearable crawl flow of the target website from the access data;
A web page quality distribution acquiring unit 630, adapted to obtain the web page quality distribution of the web pages in the target website;
A task flow acquiring unit 640, adapted to determine the task flow for crawling the target website from the web page quality distribution of the web pages in the target website;
A flow quota determining unit 650, adapted to determine the flow quota for crawling web pages on the target website from the bearable crawl flow of the target website and the task flow for crawling the target website.
In another implementation, the website access data acquiring unit 610 is adapted to:
Determine the access data of the target website from the search engine's access statistics for the target website.
In addition, the website capacity determining unit 620 may also include:
An access amount determining subunit, adapted to determine the bearable access total of the target website from the access data;
In this implementation, the website capacity determining unit 620 can determine the bearable crawl flow of the target website from the bearable access total and a preset crawl pressure coefficient.
The access amount determining subunit can also determine the bearable access total of the target website from the search engine's access statistics for the target website, the search engine's market share, the users' direct access amount, and the website's redundant flow.
In practical applications, the web page quality distribution acquiring unit 630 can determine a score for each web page from the PageRank and/or link depth of the web pages in the target website;
It then normalizes the scores of the multiple web pages in the target website to obtain the quality distribution of each web page.
In another implementation, the web page quality distribution acquiring unit 630 may also include:
A web page quality distribution acquiring subunit, which can obtain the web page quality distribution of all web pages in the target website;
The task flow acquiring unit 640 may then include:
A task flow acquiring subunit, adapted to obtain the sum of the web page quality distribution of all web pages in the target website and to determine the task flow for crawling the target website from that sum.
In this implementation, the device may also include a task scale factor acquiring unit, adapted to obtain one or more task scale factors;
The task flow acquiring subunit can then determine the task flow for crawling the target website from the product of the sum of the web page quality distribution and the one or more task scale factors.
The task scale factor acquiring unit can obtain different task scale factors, for example:
The ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website;
And/or,
The ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
In this implementation, the task scale factor acquiring unit can obtain the ratio of the number of web pages updated in the crawl history, and/or the number of newly generated web pages in the target website, to the total number of web pages in the target website.
The task scale factor acquiring unit can also obtain and compare the information fingerprints of the web pages crawled in the crawl history of the target website; from the comparison result it obtains the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints, and uses it as the ratio of non-duplicate web pages to the total number of web pages in the target website.
In another implementation, the device may also include a unit time coefficient acquiring unit, which determines a unit time coefficient from the total task time for crawling the target website;
The task flow acquiring subunit of the task flow acquiring unit 640 can then determine the task flow for crawling the target website from the product of the sum of the web page quality distribution, the one or more task scale factors, and the unit time coefficient.
The device may also include a task flow adjusting unit which, when the task flow is greater than the bearable crawl flow and the difference between the two is greater than a preset threshold, adjusts the task flow by adjusting the task scale factor and/or the unit time coefficient until the task flow is less than or equal to the bearable crawl flow, or the difference between the two is less than the preset threshold.
When the task flow is greater than the bearable crawl flow but the difference between the two is less than the preset threshold, the flow quota determining unit 650 can take the task flow as the flow quota for crawling web pages on the target website.
The device for determining a website crawl flow quota provided by the embodiment of the present invention has been described in detail above. In this device, the website access data acquiring unit 610 obtains the access data of the target website to be crawled; the website capacity determining unit 620 determines the bearable crawl flow of the target website from the access data; the web page quality distribution acquiring unit 630 obtains the web page quality distribution of the web pages in the target website; the task flow acquiring unit 640 determines the task flow for crawling the target website from that web page quality distribution; and the flow quota determining unit 650 determines the flow quota for crawling web pages on the target website from the bearable crawl flow and the task flow. Before a crawl task starts, the device can thus predict both the capacity of the website and the flow required by the crawl task, which solves the problem of a crawler occupying excessive site resources through unrestricted crawling. The web page data of the website is crawled effectively within the crawl pressure the website allows, reducing conflict between the search engine's crawler and the crawled website.
Corresponding to the method for determining crawl flow provided by Embodiment 3, Embodiment 3 also provides a device for determining crawl flow. Referring to Fig. 7, the device may include:
A task scale factor acquiring unit 710, adapted to obtain a task scale factor from attribute features of the target website;
A task flow acquiring unit 720, adapted to determine the task flow for crawling the target website from the task scale factor and the sum of the web page quality distribution in the target website.
The task scale factor acquiring unit 710 can obtain the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website;
And/or,
Obtain the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website, as the task scale factor.
In this implementation, the task scale factor acquiring unit 710 can obtain the ratio of the number of web pages updated in the crawl history, and/or the number of newly generated web pages in the target website, to the total number of web pages in the target website.
Alternatively, the task scale factor acquiring unit can obtain and compare the information fingerprints of the web pages crawled in the crawl history of the target website;
From the comparison result it obtains the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints, and uses it as the ratio of non-duplicate web pages to the total number of web pages in the target website.
Specifically, the task flow acquiring unit 720 can determine the task flow for crawling the target website from the product of the one or more task scale factors and the sum of the web page quality distribution in the target website.
The sum of the web page quality distribution can be determined by the following units:
A score determining unit, adapted to determine a score for each web page from the PageRank and/or link depth of the web pages in the target website;
A normalization unit, adapted to normalize the scores of the multiple web pages in the target website to obtain the quality distribution of each web page; and a summing unit, adapted to determine the sum of the web page quality distribution from the quality distribution obtained for each web page.
In another implementation, the device for determining crawl flow may also include:
A unit time coefficient acquiring unit, adapted to determine a unit time coefficient from the total task time for crawling the target website;
The task flow acquiring unit 720 can then determine the task flow for crawling the target website from the product of the sum of the web page quality distribution, the one or more task scale factors, and the unit time coefficient.
The device for determining crawl flow may also include:
A web page crawling unit, adapted to crawl the web pages of the target website according to the task flow for crawling the target website.
With the above device for determining crawl flow, a task scale factor can be obtained from attribute features of the target website, and the task flow for crawling the target website can be determined from the task scale factor and the sum of the web page quality distribution in the target website. The crawl flow required when a crawler crawls the website is thereby determined accurately, the web page data of the website is crawled effectively, and conflict between the search engine's crawler and the crawled website is reduced.
Corresponding to the method for determining website subchannel crawl flow quotas provided by Embodiment 4, Embodiment 4 also provides a device for determining website subchannel crawl flow quotas. Referring to Fig. 8, the device may include:
A channel capacity acquiring unit 810, adapted to obtain the bearable flow of each subchannel in the target website;
A channel task amount acquiring unit 820, adapted to determine the task flow of each subchannel from the web page quality distribution of the web pages in that subchannel;
A crawl weight acquiring unit 830, adapted to calculate the crawl weight of each subchannel from the subchannel bearable flow and the subchannel task flow;
A quota determining unit 840, adapted to determine the quota of each subchannel from the total flow quota of the target website and the crawl weight of each subchannel.
Specifically, the channel capacity acquiring unit 810 can obtain the bearable flow of each subchannel in the target website from the access data of each subchannel.
In this implementation, the channel capacity acquiring unit 810 can obtain the bearable flow of each subchannel from the access data of each subchannel of the target website as counted by the search engine.
In this implementation, specifically, the channel capacity acquiring unit can determine the bearable channel access total of each subchannel in the target website from the access data of that subchannel, and then determine the bearable flow of each subchannel from the bearable channel access total and a preset channel pressure coefficient.
In another implementation, the channel task amount acquiring unit 820 can determine a score for each web page in a subchannel from the page's PageRank and/or link depth; normalize the scores of the multiple web pages in each subchannel to obtain the quality distribution of each web page; and then determine the task flow of each subchannel from the resulting web page quality distribution of the web pages in that subchannel.
In addition, the quota determining unit 840 can determine the bearable crawl flow of the target website from the website access data of the target website;
Determine the website task flow for crawling the target website from the web page quality distribution of the web pages in the target website;
Determine the target website total flow quota for crawling web pages on the target website from the bearable crawl flow of the target website and the website task flow for crawling the target website; and,
Determine the quota of each subchannel from the target website total flow quota determined in the above steps and the crawl weight of each subchannel.
The device for determining website subchannel crawl flow quotas may also include:
A channel time coefficient determining unit, adapted to determine a channel unit time coefficient from the total task time for crawling each subchannel;
In this implementation, the quota determining unit 840 can use the product of the target website total flow quota, the weight of each subchannel, and the channel unit time coefficient as the subchannel quota for crawling the corresponding subchannel.
In addition, the device for determining website subchannel crawl flow quotas may also include:
A channel web page crawling unit, which can crawl the web pages in each subchannel according to each subchannel's quota.
With the device for determining website subchannel crawl flow quotas provided by Embodiment 4, the crawl weight of each subchannel can be calculated from the obtained subchannel bearable flow and subchannel task flow, and the quota of each subchannel can be determined from the target website total flow quota and the crawl weights. This reduces conflict between the search engine's crawler and the crawled website while distributing crawl flow more reasonably among the subchannels.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the description above of a specific language is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided herein. It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may also be divided into multiple submodules, subunits or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the device for determining a website crawl flow quota according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words may be interpreted as names.
The application may be applied to a computer system/server, which is operable with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments and/or configurations suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.
The computer system/server may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures and so on, which perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.

Claims (26)

1. A method for determining a website crawl flow quota, comprising:
Obtaining access data of a target website to be crawled;
Determining a bearable crawl flow of the target website according to the access data;
Obtaining a web page quality distribution of the web pages in the target website;
Determining a task flow for crawling the target website according to the web page quality distribution of the web pages in the target website;
Determining a flow quota for crawling web pages on the target website according to the bearable crawl flow of the target website and the task flow for crawling the target website.
2. The method of claim 1, wherein obtaining the access data of the target website to be crawled comprises:
Determining the access data of the target website according to a search engine's access statistics for the target website.
3. The method of claim 1 or 2, wherein determining the bearable crawl flow of the target website according to the access data comprises:
Determining a bearable access total of the target website according to the access data;
Determining the bearable crawl flow of the target website according to the bearable access total and a preset crawl pressure coefficient.
4. The method of claim 3, wherein determining the bearable access total of the target website according to the access data comprises:
Determining the bearable access total of the target website according to the search engine's access statistics for the target website, the market share of the search engine, the users' direct access amount, and the website's redundant flow.
5. The method of claim 1, wherein obtaining the web page quality distribution of the web pages in the target website comprises:
Determining a score for each web page according to the PageRank and/or link depth of the web pages in the target website;
Normalizing the scores of multiple web pages in the target website to obtain the quality distribution of each web page.
6. The method of claim 1, wherein obtaining the web page quality distribution of the web pages in the target website comprises:
Obtaining the web page quality distribution of all web pages in the target website;
And wherein determining the task flow for crawling the target website according to the web page quality distribution of the web pages in the target website comprises:
Obtaining the sum of the web page quality distribution of all web pages in the target website, and determining the task flow for crawling the target website according to that sum.
7. The method as claimed in claim 6, further comprising:
obtaining one or more task scale factors;
wherein determining the task flow for crawling the target website according to the sum of the web page quality distribution of all web pages in the target website comprises:
determining the task flow for crawling the target website according to the product of the sum of the web page quality distribution and the one or more task scale factors.
8. The method as claimed in claim 7, wherein obtaining the one or more task scale factors comprises:
obtaining the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website;
and/or,
obtaining the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
9. The method as claimed in claim 8, wherein obtaining the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website comprises:
obtaining the ratio of the number of web pages updated in the crawl history, and/or the number of newly generated web pages in the target website, to the total number of web pages in the target website.
10. The method as claimed in claim 8, wherein obtaining the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website comprises:
obtaining and comparing, in the crawl history of the target website, the information fingerprints of the crawled web pages;
obtaining, according to the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints, and using it as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
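The fingerprint comparison of claim 10 can be illustrated with a short Python sketch. The claim does not define the information fingerprint, so an MD5 digest of the page content is used here as a stand-in, and "non-duplicate fingerprints" is read as the number of distinct fingerprints; both are assumptions for the example.

```python
import hashlib


def fingerprint(content):
    """Stand-in information fingerprint: MD5 digest of the page text."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def non_duplicate_ratio(crawled_contents):
    """Ratio of distinct fingerprints to all fingerprints in crawl history."""
    fingerprints = [fingerprint(c) for c in crawled_contents]
    if not fingerprints:
        return 0.0
    return len(set(fingerprints)) / len(fingerprints)


# Hypothetical crawl history: four pages, two of which are identical.
history = ["page one", "page two", "page one", "page three"]
ratio = non_duplicate_ratio(history)  # 3 distinct fingerprints out of 4
```

A production crawler would more likely use a near-duplicate fingerprint such as SimHash so that pages differing only in boilerplate still count as duplicates.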
11. The method as claimed in claim 6, further comprising:
determining a unit time coefficient according to the total time of the task of crawling the target website;
wherein determining the task flow for crawling the target website according to the sum of the web page quality distribution of all web pages in the target website comprises:
determining the task flow for crawling the target website according to the product of the sum of the web page quality distribution, the one or more task scale factors, and the unit time coefficient.
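The product described in claims 6, 7 and 11 can be written out as a small Python sketch. How the unit time coefficient is derived from the total task time is not specified in the claims; taking it as the reciprocal of the total task time in seconds is an assumption, and all names are hypothetical.

```python
from math import prod


def unit_time_coefficient(total_task_seconds):
    """Assumed form: spread the whole task evenly over its total duration."""
    if total_task_seconds <= 0:
        raise ValueError("total_task_seconds must be positive")
    return 1.0 / total_task_seconds


def task_flow(quality, scale_factors=(), time_coefficient=1.0):
    """Task flow = sum of page qualities x scale factors x time coefficient.

    `quality` maps each URL to its normalized quality value. With no scale
    factors, prod(()) is 1, so the result reduces to the claim 6 sum.
    """
    return sum(quality.values()) * prod(scale_factors) * time_coefficient


# Hypothetical inputs: three pages, a to-be-crawled ratio of 0.5,
# a non-duplicate ratio of 0.5, and a 100-second crawl task.
quality = {"/": 1.0, "/a": 0.5, "/b": 0.5}
flow = task_flow(quality, scale_factors=(0.5, 0.5),
                 time_coefficient=unit_time_coefficient(100))
```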
12. The method as claimed in claim 11, further comprising:
when the task flow is greater than the bearable crawl flow and the difference between the two is greater than a preset threshold, adjusting the task flow by adjusting the task scale factors and/or the unit time coefficient, until the task flow is less than or equal to the bearable crawl flow, or the difference between the two is less than the preset threshold.
13. The method as claimed in any one of claims 1, 2 and 5-11, wherein determining the flow quota for crawling web pages on the target website according to the bearable crawl flow of the target website and the task flow for crawling the target website comprises:
when the task flow is greater than the bearable crawl flow and the difference between the two is less than a preset threshold, determining the task flow as the flow quota for crawling web pages on the target website.
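The adjustment loop of claim 12 and the quota decision of claim 13 can be sketched together. The claims say only that the task scale factors and/or unit time coefficient are adjusted; shrinking the task flow by a fixed multiplicative step per round is an assumption, as is returning the task flow unchanged when it already fits within the bearable crawl flow.

```python
def determine_flow_quota(task_flow, bearable_flow, threshold,
                         shrink=0.9, max_rounds=1000):
    """Shrink the task flow until it is acceptable, then use it as the quota.

    Acceptable means: task flow <= bearable crawl flow, or the excess over
    the bearable crawl flow is below the preset threshold. `shrink` stands
    in for adjusting a task scale factor or the unit time coefficient.
    """
    if not 0 < shrink < 1:
        raise ValueError("shrink must be in the interval (0, 1)")
    for _ in range(max_rounds):
        if task_flow <= bearable_flow or task_flow - bearable_flow < threshold:
            return task_flow
        task_flow *= shrink  # one adjustment round
    raise RuntimeError("task flow did not converge within max_rounds")


# Hypothetical values: the task flow starts at 100 against a bearable crawl
# flow of 80 with a threshold of 5, and is reduced by 10% per round.
quota = determine_flow_quota(100, 80, 5)
```

In this example the task flow drops from 100 to 90 and then to about 81, at which point its excess over 80 is below the threshold and it is accepted as the quota.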
14. Equipment for determining a website crawl flow quota, comprising:
a website accessed-data obtaining unit, adapted to obtain accessed data of a target website to be crawled;
a website bearing-capacity determining unit, adapted to determine a bearable crawl flow of the target website according to the accessed data;
a web page quality distribution obtaining unit, adapted to obtain a web page quality distribution of web pages in the target website;
a task flow obtaining unit, adapted to determine a task flow for crawling the target website according to the web page quality distribution of the web pages in the target website;
a flow quota determining unit, adapted to determine a flow quota for crawling web pages on the target website according to the bearable crawl flow of the target website and the task flow for crawling the target website.
15. The equipment as claimed in claim 14, wherein the website accessed-data obtaining unit is adapted to:
determine the accessed data of the target website according to a search engine's access statistics for the target website.
16. The equipment as claimed in claim 14 or 15, wherein the website bearing-capacity determining unit comprises:
an access volume determining subunit, adapted to determine a bearable total access volume of the target website according to the accessed data;
the website bearing-capacity determining unit being adapted to determine the bearable crawl flow of the target website according to the bearable total access volume and a preset crawl pressure coefficient.
17. The equipment as claimed in claim 16, wherein the access volume determining subunit is adapted to:
determine the bearable total access volume of the target website according to a search engine's access statistics for the target website, the market share of the search engine, the direct user visit volume, and the redundant flow of the website.
18. The equipment as claimed in claim 14, wherein the web page quality distribution obtaining unit is adapted to:
determine a score for each web page according to the PageRank of the web page in the target website and/or the link depth of the web page;
normalize the scores of a plurality of web pages in the target website to obtain the quality distribution corresponding to each web page.
19. The equipment as claimed in claim 14, wherein the web page quality distribution obtaining unit comprises:
a web page quality distribution obtaining subunit, adapted to obtain the web page quality distribution of all web pages in the target website;
and the task flow obtaining unit comprises:
a task flow obtaining subunit, adapted to obtain the sum of the web page quality distribution of all web pages in the target website, and to determine the task flow for crawling the target website according to that sum.
20. The equipment as claimed in claim 19, further comprising:
a task scale factor obtaining unit, adapted to obtain one or more task scale factors;
wherein the task flow obtaining subunit is adapted to:
determine the task flow for crawling the target website according to the product of the sum of the web page quality distribution and the one or more task scale factors.
21. The equipment as claimed in claim 20, wherein the task scale factor obtaining unit is adapted to:
obtain the ratio of the number of web pages to be crawled in the target website to the total number of web pages in the target website;
and/or,
obtain the ratio of the number of non-duplicate web pages in the target website to the total number of web pages in the target website.
22. The equipment as claimed in claim 21, wherein the task scale factor obtaining unit is adapted to:
obtain the ratio of the number of web pages updated in the crawl history, and/or the number of newly generated web pages in the target website, to the total number of web pages in the target website.
23. The equipment as claimed in claim 21, wherein the task scale factor obtaining unit is adapted to:
obtain and compare, in the crawl history of the target website, the information fingerprints of the crawled web pages;
obtain, according to the comparison result, the ratio of the number of non-duplicate information fingerprints to the total number of fingerprints, and use it as the ratio of the number of non-duplicate web pages to the total number of web pages in the target website.
24. The equipment as claimed in claim 19, further comprising:
a unit time coefficient obtaining unit, adapted to determine a unit time coefficient according to the total time of the task of crawling the target website;
wherein the task flow obtaining subunit is adapted to:
determine the task flow for crawling the target website according to the product of the sum of the web page quality distribution, the one or more task scale factors, and the unit time coefficient.
25. The equipment as claimed in claim 24, further comprising:
a task flow adjusting unit, adapted to, when the task flow is greater than the bearable crawl flow and the difference between the two is greater than a preset threshold, adjust the task flow by adjusting the task scale factors and/or the unit time coefficient, until the task flow is less than or equal to the bearable crawl flow, or the difference between the two is less than the preset threshold.
26. The equipment as claimed in any one of claims 14, 15 and 18-24, wherein the flow quota determining unit is adapted to:
when the task flow is greater than the bearable crawl flow and the difference between the two is less than a preset threshold, determine the task flow as the flow quota for crawling web pages on the target website.
CN201310500682.8A 2013-10-22 2013-10-22 Method and equipment for identifying website capturing flow quota Active CN103544278B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310500682.8A CN103544278B (en) 2013-10-22 2013-10-22 Method and equipment for identifying website capturing flow quota

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310500682.8A CN103544278B (en) 2013-10-22 2013-10-22 Method and equipment for identifying website capturing flow quota

Publications (2)

Publication Number Publication Date
CN103544278A CN103544278A (en) 2014-01-29
CN103544278B true CN103544278B (en) 2017-02-01

Family

ID=49967730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310500682.8A Active CN103544278B (en) 2013-10-22 2013-10-22 Method and equipment for identifying website capturing flow quota

Country Status (1)

Country Link
CN (1) CN103544278B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104392000B * 2014-12-15 2016-10-12 北京奇虎科技有限公司 Method and apparatus for determining mobile site crawl quota
CN111985086B (en) * 2020-07-24 2024-04-09 西安理工大学 Community detection method integrating priori information and sparse constraint
CN113486229B (en) * 2021-07-05 2023-11-07 北京百度网讯科技有限公司 Control method and device for grabbing pressure, electronic equipment and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063477A (en) * 2010-12-13 2011-05-18 百度在线网络技术(北京)有限公司 Website data extraction device and method
JP2012059295A (en) * 2011-12-19 2012-03-22 Intec Inc Internet site information analysis method and apparatus
CN102469132A (en) * 2010-11-15 2012-05-23 北大方正集团有限公司 Method and system for grabbing web pages from servers with different IPs (Internet Protocols) in website

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102469132A (en) * 2010-11-15 2012-05-23 北大方正集团有限公司 Method and system for grabbing web pages from servers with different IPs (Internet Protocols) in website
CN102063477A (en) * 2010-12-13 2011-05-18 百度在线网络技术(北京)有限公司 Website data extraction device and method
JP2012059295A (en) * 2011-12-19 2012-03-22 Intec Inc Internet site information analysis method and apparatus

Also Published As

Publication number Publication date
CN103544278A (en) 2014-01-29

Similar Documents

Publication Publication Date Title
CN103530390B (en) The method and apparatus of webpage capture
CN110019396A (en) A kind of data analysis system and method based on distributed multidimensional analysis
DE202014010893U1 (en) Rufwegsucher
Zhou et al. Ranking scientific publications with similarity-preferential mechanism
CN107862022A (en) Cultural resource commending system
Griffith Modeling spatial autocorrelation in spatial interaction data: empirical evidence from 2002 Germany journey-to-work flows
CN110928739B (en) Process monitoring method and device and computing equipment
CN103970753B (en) The method for pushing and device of association knowledge
CN109101607B (en) Method, apparatus and storage medium for searching blockchain data
CN103530392B Method and apparatus for determining crawl flow
CN105373546B Information processing method and system for knowledge services
US20170109413A1 Search System and Method for Updating a Scoring Model of Search Results based on a Normalized CTR
CN104391953B Method and device for detecting webpage updates
CN103544278B (en) Method and equipment for identifying website capturing flow quota
CN106599299A (en) Determining method and device of website key words
DE102018010163A1 (en) Automatic generation of useful user segments
CN106446179A (en) Hot topic generation method and device
CN104462554A (en) Method and device for recommending question and answer page related questions
CN106326297A (en) Application recommendation method and device
CN107153702A (en) A kind of data processing method and device
CN109543092A (en) Financial product recommended method, device, storage medium and computer equipment
CN107203623B (en) Load balancing and adjusting method of web crawler system
CN103530393B Method and apparatus for determining website sub-channel crawl flow quota
CN113792084A (en) Data heat analysis method, device, equipment and storage medium
CN102929948B (en) list page identification system and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220727

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.