CN104657355A - Web concurrent crawling method and system - Google Patents

Web concurrent crawling method and system

Info

Publication number
CN104657355A
Authority
CN
China
Prior art keywords
concurrent
tps
parameter
crawl
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310575226.XA
Other languages
Chinese (zh)
Other versions
CN104657355B (en)
Inventor
金伟
孟凡光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201310575226.XA
Publication of CN104657355A
Application granted
Publication of CN104657355B
Legal status: Active

Abstract

The invention provides a concurrent web crawling method and system. The method comprises: concurrently processing pending crawl requests and monitoring the processing event messages corresponding to the processed crawl requests; analyzing the processing event messages to obtain the current crawl index parameters; and reducing the concurrency level of the concurrent web crawl when a current crawl index parameter exceeds a preset safe range. The method and system can improve the response speed of websites while they are being crawled concurrently.

Description

A concurrent web crawling method and system
Technical field
The present application relates to the field of network technology, and in particular to a concurrent web crawling method and system.
Background
A search engine is a system that uses specific computer programs to gather information from the Internet according to a certain strategy, organizes and processes that information, provides a retrieval service to users, and displays the information relevant to a user's search to that user. To gather information from the Internet, a search engine relies on web crawlers that crawl the relevant websites.
A web crawler is a program that automatically fetches web page content and is an important component of a search engine.
In the prior art, a traditional crawler for a general-purpose search engine starts from the URLs (Uniform Resource Locators) of one or several initial pages, obtains the URLs on those pages, and, while crawling, continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met.
In the prior art, web crawlers analyze web page content poorly and can only crawl website information mechanically and continuously, often issuing tens or hundreds of concurrent requests in a repeating loop. Because most websites have limited processing capacity, a large number of concurrent requests can easily make a website respond slowly or even crash.
Summary of the invention
The technical problem to be solved by the present application is to provide a concurrent web crawling method and system that can improve the response speed of a website during concurrent web crawling.
To solve the above problem, the present application discloses a concurrent web crawling method, comprising:
Concurrently processing pending crawl requests, and monitoring the processing event messages corresponding to the processed crawl requests;
Analyzing the processing event messages to obtain the current crawl index parameter;
When the current crawl index parameter exceeds a preset safe range, reducing the concurrency level of the concurrent web crawl.
Preferably, the method further comprises:
When the current crawl index parameter is below the preset safe range, increasing the concurrency level of the concurrent web crawl.
Preferably, the step of analyzing the processing event messages to obtain the current crawl index parameter comprises:
Obtaining the processing event messages corresponding to the crawl requests processed in each time period;
Analyzing on their own the processing event messages corresponding to the crawl requests processed in the current time period, and/or analyzing the processing event messages corresponding to the crawl requests processed in adjacent time periods, to obtain the current crawl index parameter.
Preferably, before the step of concurrently processing pending crawl requests, the method further comprises:
When the current actual transactions per second (TPS) does not exceed the highest safe upper-limit TPS, admitting pending crawl requests for processing;
The step of concurrently processing pending crawl requests is then specifically: concurrently processing the admitted pending crawl requests.
Preferably, the crawl index parameter comprises one or more of a website response index parameter and a network condition parameter.
Preferably, the website response index parameter comprises one or more of a response time parameter and a response ratio parameter;
Wherein the response time parameter represents the website's response time to a processed crawl request, and the response ratio parameter represents, for each time period, the proportion of processed crawl requests whose response time falls within the preset safe range among all crawl requests processed in that time period.
Preferably, the network condition parameter between the website and the crawler comprises one or more of an error count parameter, an error rate parameter, a crawl speed parameter, and a crawl speed ratio parameter;
Wherein the error count parameter represents the number of processed crawl requests for which an exception or error occurred, the error rate parameter represents the rate of increase of the error count parameter in the current time period relative to the error count parameter in the previous time period, and the crawl speed ratio parameter represents the proportion that is less than or equal to the current crawl speed parameter.
Preferably, the step of reducing the concurrency level of the concurrent web crawl when the current crawl index parameter exceeds the preset safe range comprises:
Reducing the number of concurrent threads used for concurrent crawling, and/or reducing the TPS of concurrent crawling.
Preferably, the step of reducing the TPS of concurrent crawling comprises:
Reducing the TPS of concurrent crawling according to the difference between the highest safe upper-limit TPS and the current TPS; wherein the highest safe upper-limit TPS represents the historical highest TPS reached while the crawl index parameter stayed within the preset safe range.
Preferably, before the step of concurrently processing pending crawl requests, the method further comprises the following steps for obtaining the highest safe upper-limit TPS:
Setting the initial value of the safe upper-limit TPS to a preset large value;
Gradually increasing the number of concurrent threads used for concurrent crawling until it reaches the maximum number of concurrent threads;
Concurrently processing pending crawl requests according to the current safe upper-limit TPS;
When the current crawl index parameter does not exceed the preset safe range, increasing the current safe upper-limit TPS;
When the current crawl index parameter exceeds the preset safe range, reducing the current safe upper-limit TPS;
Recording the safe upper-limit TPS after each increase or reduction;
Selecting the highest TPS among the recorded safe upper-limit TPS values as the highest safe upper-limit TPS.
In another aspect, the present application also discloses a concurrent web crawling system, comprising:
A request processing module, configured to concurrently process pending crawl requests;
A message monitoring module, configured to monitor the processing event messages corresponding to the processed crawl requests;
A message analysis module, configured to analyze the processing event messages to obtain the current crawl index parameter; and
A concurrency reduction module, configured to reduce the concurrency level of the concurrent web crawl when the current crawl index parameter exceeds the preset safe range.
Preferably, the system further comprises:
A concurrency increase module, configured to increase the concurrency level of the concurrent web crawl when the current crawl index parameter is below the preset safe range.
Preferably, the message analysis module comprises:
A message obtaining submodule, configured to obtain the processing event messages corresponding to the crawl requests processed in each time period;
A message analysis submodule, configured to analyze on their own the processing event messages corresponding to the crawl requests processed in the current time period, and/or to analyze the processing event messages corresponding to the crawl requests processed in adjacent time periods, to obtain the current crawl index parameter.
Preferably, the system further comprises:
An admission module, configured to admit pending crawl requests for processing, before the operation of concurrently processing pending crawl requests, when the current actual transactions per second (TPS) does not exceed the highest safe upper-limit TPS;
The request processing module is then specifically configured to concurrently process the admitted pending crawl requests.
Preferably, the crawl index parameter comprises one or more of a website response index parameter and a network condition parameter.
Preferably, the website response index parameter comprises one or more of a response time parameter and a response ratio parameter;
Wherein the response time parameter represents the website's response time to a processed crawl request, and the response ratio parameter represents, for each time period, the proportion of processed crawl requests whose response time falls within the preset safe range among all crawl requests processed in that time period.
Preferably, the network condition parameter between the website and the crawler comprises one or more of an error count parameter, an error rate parameter, a crawl speed parameter, and a crawl speed ratio parameter;
Wherein the error count parameter represents the number of processed crawl requests for which an exception or error occurred, the error rate parameter represents the rate of increase of the error count parameter in the current time period relative to the error count parameter in the previous time period, and the crawl speed ratio parameter represents the proportion that is less than or equal to the current crawl speed parameter.
Preferably, the concurrency reduction module comprises:
A first reduction submodule, configured to reduce the number of concurrent threads used for concurrent crawling; and/or
A second reduction submodule, configured to reduce the TPS of concurrent crawling.
Preferably, the second reduction submodule is specifically configured to reduce the TPS of concurrent crawling according to the difference between the highest safe upper-limit TPS and the current TPS; wherein the highest safe upper-limit TPS represents the historical highest TPS reached while the crawl index parameter stayed within the preset safe range.
Preferably, the system further comprises an upper-limit TPS obtaining module, configured to obtain the highest safe upper-limit TPS before the operation of concurrently processing pending crawl requests;
The upper-limit TPS obtaining module comprises:
A setting submodule, configured to set the initial value of the safe upper-limit TPS to a preset large value;
A gradual increase submodule, configured to gradually increase the number of concurrent threads used for concurrent crawling until it reaches the maximum number of concurrent threads;
A concurrent processing submodule, configured to concurrently process pending crawl requests according to the current safe upper-limit TPS;
An increase submodule, configured to increase the current safe upper-limit TPS when the current crawl index parameter does not exceed the preset safe range;
A reduction submodule, configured to reduce the current safe upper-limit TPS when the current crawl index parameter exceeds the preset safe range;
A recording submodule, configured to record the safe upper-limit TPS after each increase or reduction; and
A selection submodule, configured to select the highest TPS among the recorded safe upper-limit TPS values as the highest safe upper-limit TPS.
Compared with the prior art, the present application has the following advantages:
When the current crawl index parameter exceeds the preset safe range, the present application reduces the concurrency level of the concurrent web crawl, where the crawl index parameter measures the website's load state during concurrent web crawling. Reducing the concurrency level means reducing the number of crawl requests, that is, reducing the frequency with which the website is requested, which lowers the load on the website's server. By reducing the concurrency level, the present application can therefore keep the crawl index parameter, and hence the website load, within the preset safe range. This avoids the situation in which a large number of concurrent requests makes the website respond slowly or even crash, and so improves the website's response speed during concurrent web crawling;
Secondly, when the current crawl index parameter is below the preset safe range, the present application can increase the concurrency level of the concurrent web crawl. Increasing the concurrency level means increasing the number of crawl requests, that is, increasing the frequency with which the website is requested, so the crawl index parameter can be brought close to, but not beyond, the preset safe range; for example, the crawl speed can be made to approach an ideal value. The present application can therefore safeguard both the website's response speed and the crawler's crawl speed;
Further, the present application can admit pending crawl requests for processing only when the current actual transactions per second (TPS) does not exceed the highest safe upper-limit TPS. This admission mechanism refuses to process pending crawl requests that would exceed the highest safe upper-limit TPS, so the current actual TPS is kept strictly within the highest safe upper-limit TPS. Website load is thereby controlled further, and the website's response speed during concurrent web crawling is improved further.
Description of the drawings
Fig. 1 is a flowchart of Embodiment 1 of a concurrent web crawling method of the present application;
Fig. 2 is a flowchart of Embodiment 2 of a concurrent web crawling method of the present application;
Fig. 3 is a flowchart of Embodiment 3 of a concurrent web crawling method of the present application;
Fig. 4 is a structural diagram of an embodiment of a concurrent web crawling system of the present application;
Fig. 5 is a structural schematic of an embodiment of a concurrent web crawling system of the present application.
Detailed description
To make the above objects, features, and advantages of the present application easier to understand, the present application is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, which shows a flowchart of Embodiment 1 of a concurrent web crawling method of the present application, the method may specifically comprise:
Step 101: concurrently process pending crawl requests, and monitor the processing event messages corresponding to the processed crawl requests;
In the embodiments of the present application, a pending crawl request denotes a crawl request that has not yet been processed. During concurrent web crawling, pending crawl requests can be generated from the new URLs extracted from the current page and placed in a request queue. A pending crawl request is taken from the request queue and checked, before processing, to see whether it has already been processed; if so it is discarded, otherwise it is processed concurrently.
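The queue handling described above can be sketched as follows. This is a minimal illustration only: the class name `RequestQueue` and its methods are hypothetical and do not come from the application, and a request is identified here simply by its URL string.

```python
from collections import deque


class RequestQueue:
    """FIFO queue of pending crawl requests that skips URLs already processed.

    Hypothetical helper for illustration; not part of the application.
    """

    def __init__(self):
        self._pending = deque()   # URLs waiting to be crawled
        self._processed = set()   # URLs that have already been handed out

    def put(self, url):
        """Enqueue a pending crawl request generated from an extracted URL."""
        self._pending.append(url)

    def next_unprocessed(self):
        """Return the next URL not yet processed, or None if the queue is empty."""
        while self._pending:
            url = self._pending.popleft()
            if url in self._processed:
                continue  # already processed: discard
            self._processed.add(url)
            return url
        return None
```

A worker would call `next_unprocessed()` in a loop and hand each returned URL to the concurrent fetcher.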
In practice, concurrently processing pending crawl requests can be implemented by having the web crawler send the pending crawl requests to the corresponding websites to crawl the corresponding content. It will be appreciated that the present application does not limit the specific processing method.
The embodiments of the present application monitor the communication between the web crawler and the website. The website can return processing event messages to the web crawler; these may include a response time message, a crawl success message, or a crawl exception message corresponding to a processed crawl request, among others. The present application does not limit the specific processing event messages corresponding to processed crawl requests.
Step 102: analyze the processing event messages to obtain the current crawl index parameter;
In the embodiments of the present application, the crawl index parameter measures the load state of the website during concurrent web crawling. If it is within the preset safe range, the website's load state is good, the website can process crawl requests normally, and the concurrency level does not need to be reduced. If it exceeds the preset safe range, the website is fully loaded and cannot handle more concurrent requests, and further concurrency could easily crash the website, so the concurrency level needs to be reduced.
In a preferred embodiment of the present application, the crawl index parameter may comprise one or more of a website response index parameter and a network condition parameter. The website response index parameter is used to assess whether the website's responsiveness is normal; keeping it within the preset safe range avoids the situation in which a large number of concurrent requests makes the website respond slowly or even crash, and therefore improves the website's response speed. The network condition parameter is used to assess whether the network condition between the website and the crawler is normal; keeping it within the preset safe range avoids large numbers of crawl requests going unprocessed when the network is abnormal, and therefore improves crawl speed.
In a preferred embodiment of the present application, the website response index parameter may comprise one or more of a response time parameter and a response ratio parameter. The response time parameter represents the website's response time to a processed crawl request, and the response ratio parameter represents, for each time period, the proportion of processed crawl requests whose response time falls within the preset safe range among all crawl requests processed in that time period.
In another preferred embodiment of the present application, the network condition parameter between the website and the crawler may comprise one or more of an error count parameter, an error rate parameter, a crawl speed parameter, and a crawl speed ratio parameter. The error count parameter represents the number of processed crawl requests for which an exception or error occurred, the error rate parameter represents the rate of increase of the error count parameter in the current time period relative to the error count parameter in the previous time period, and the crawl speed ratio parameter represents the proportion that is less than or equal to the current crawl speed parameter.
In one application example of the present application, suppose the length of a time period is 1 minute, the concurrency level of the concurrent web crawl in the previous time period was 100, and the concurrency level in the current time period is 120:
1) Suppose the preset safe range for the response time parameter is 200 ms; then a processed crawl request whose response time is within 200 ms indicates that the website's responsiveness is normal;
2) Suppose the preset safe range for the response ratio parameter is 80%; then, if in a given time period the processed crawl requests whose response time is within 200 ms account for at least 80% of all crawl requests processed in that period, the website's responsiveness can be considered normal;
3) Suppose the preset safe range for the error count parameter is 20; then, if in a given time period no more than 20 processed crawl requests return abnormally because of a timeout, a server error, or similar causes, the network condition between the website and the crawler can be considered normal;
4) Suppose the preset safe range for the error rate parameter is 10% and the error count of processed crawl requests in the first time period is 20; then, if the error count of processed crawl requests in the second time period is greater than 22, the network condition between the website and the crawler can be considered abnormal.
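The four checks in this example can be sketched in code. The sketch below is illustrative only: the `Event` record, the function names, and the choice to count failed requests as falling outside the response-time range are assumptions made here, while the thresholds (200 ms, 80%, 20 errors, 10% growth) are the ones used in the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical processing event for one processed crawl request."""
    response_ms: float  # response time reported for the request
    ok: bool            # False if the request ended in a timeout or server error


def window_metrics(events, prev_error_count, max_ms=200.0):
    """Compute crawl index parameters for one time period."""
    total = len(events)
    errors = sum(1 for e in events if not e.ok)
    # Assumption: a failed request never counts as "within the response time range".
    fast = sum(1 for e in events if e.ok and e.response_ms <= max_ms)
    response_ratio = fast / total if total else 1.0
    if prev_error_count:
        error_growth = (errors - prev_error_count) / prev_error_count
    else:
        error_growth = 0.0  # no baseline to compare against
    return {
        "response_ratio": response_ratio,
        "error_count": errors,
        "error_growth": error_growth,
    }


def out_of_safe_range(m, min_ratio=0.80, max_errors=20, max_growth=0.10):
    """True if any parameter leaves its preset safe range."""
    return (
        m["response_ratio"] < min_ratio
        or m["error_count"] > max_errors
        or m["error_growth"] > max_growth
    )
```

Any single parameter leaving its range is enough to flag the period here; a deployment could equally require several parameters to agree.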
It will be appreciated that the above response time parameter and response ratio parameter are only preferred examples of crawl index parameters and should not be understood as limiting the present application; those skilled in the art can adopt various crawl index parameters according to actual needs.
In a preferred embodiment of the present application, step 102 of analyzing the processing event messages to obtain the current crawl index parameter may specifically comprise:
Sub-step S101: obtain the processing event messages corresponding to the crawl requests processed in each time period;
Sub-step S102: analyze on their own the processing event messages corresponding to the crawl requests processed in the current time period, and/or analyze the processing event messages corresponding to the crawl requests processed in adjacent time periods, to obtain the current crawl index parameter.
Those skilled in the art can determine the length of the time period according to actual conditions, for example half a minute, 1 minute, or 2 minutes; the present application does not limit the specific length.
In one application example of the present application, the current crawl speed can be obtained by analyzing a single time period on its own, from the concurrency level in that time period and the length of the time period.
In another application example of the present application, the current error rate parameter can be obtained by comparing the error counts of the crawl requests processed in the current time period and in the previous time period.
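A small sketch of the two kinds of analysis follows: one that looks at a single time period (crawl speed) and one that compares adjacent periods (error growth). The function names and the use of numeric timestamps in seconds are assumptions for illustration.

```python
from collections import defaultdict


def bucket_by_window(timestamps, window_s=60):
    """Count processed requests per time period, keyed by period index."""
    counts = defaultdict(int)
    for t in timestamps:
        counts[int(t // window_s)] += 1
    return dict(counts)


def crawl_speed(count, window_s=60):
    """Single-period analysis: requests processed per second in one period."""
    return count / window_s


def error_growth(current_errors, previous_errors):
    """Adjacent-period analysis: relative increase in the error count.

    Returns None when the previous period had no errors, since the ratio
    is undefined in that case.
    """
    if previous_errors == 0:
        return None
    return (current_errors - previous_errors) / previous_errors
```

With a 1-minute period, 120 processed requests give a crawl speed of 2 requests per second, and going from 20 to 22 errors gives a growth of 10%.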
Step 103: when the current crawl index parameter exceeds the preset safe range, reduce the concurrency level of the concurrent web crawl.
The concurrency level of the concurrent web crawl represents the number of crawl requests the web crawler sends to the website for processing. Reducing the concurrency level means reducing the number of crawl requests, that is, reducing the frequency with which the website is requested, which lowers the load on the website's server during concurrent web crawling. The present application can therefore keep the crawl index parameter within the preset safe range by reducing the concurrency level, that is, keep the website load within the preset safe range.
In practice, each crawl index parameter can have its own preset safe range. Further, those skilled in the art can adopt one or more crawl index parameters during concurrent web crawling according to actual needs; clearly, the more crawl index parameters are used, the stricter the conditions that govern reduction.
The present application can provide the following technical schemes for reducing the concurrency level of the concurrent web crawl:
Technical scheme 1:
Reduce the number of concurrent threads used for concurrent crawling.
Reducing the number of concurrent threads reduces the number of crawl requests being processed, and therefore reduces the concurrency level of the concurrent web crawl.
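One way to make the number of concurrent workers adjustable at run time is a limit that workers must acquire before fetching. The sketch below is an assumption about how this could be done, not the application's implementation; `ConcurrencyLimit` and its methods are hypothetical names.

```python
import threading


class ConcurrencyLimit:
    """Adjustable cap on the number of crawl requests in flight at once."""

    def __init__(self, limit):
        self._limit = limit
        self._active = 0
        self._lock = threading.Lock()

    def set_limit(self, limit):
        """Lower or raise the cap; requests already in flight are not cancelled."""
        with self._lock:
            self._limit = max(1, limit)

    def try_acquire(self):
        """Reserve a slot if one is free; return False otherwise."""
        with self._lock:
            if self._active < self._limit:
                self._active += 1
                return True
            return False

    def release(self):
        """Give the slot back once the request has finished."""
        with self._lock:
            self._active -= 1
```

Lowering the limit takes effect gradually: in-flight requests finish normally, and new ones are refused until the active count drops below the new cap.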
Technical scheme 2:
Reduce the TPS of concurrent crawling.
TPS (Transactions Per Second) is a unit of measurement for software test results. In the embodiments of the present application, a transaction denotes the process in which the web crawler sends a crawl request to the website's server and the server then responds. Specifically, timing can start when the crawl request is sent and stop when the server's response is received; this yields the response time and the number of transactions completed in a time period.
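The timing scheme just described can be sketched as below. `TpsMeter` is a hypothetical helper, the `fetch` callable stands in for whatever actually sends the crawl request, and the injectable `clock` parameter is an assumption added so the logic can be exercised without real network calls.

```python
import time


class TpsMeter:
    """Times each transaction and reports completed transactions per second."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._finished = []  # completion time of each transaction

    def timed(self, fetch, url):
        """Run fetch(url); return (result, response time in seconds)."""
        start = self._clock()        # timing starts when the request is sent
        result = fetch(url)
        end = self._clock()          # timing stops when the response arrives
        self._finished.append(end)
        return result, end - start

    def tps(self, window_s=1.0):
        """Transactions completed within the last window_s seconds, per second."""
        now = self._clock()
        recent = [t for t in self._finished if now - t <= window_s]
        return len(recent) / window_s
```

A long-running crawler would also prune old entries from the completion list; that is omitted here for brevity.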
In a preferred embodiment of the present application, the step of reducing the TPS of concurrent crawling may specifically comprise:
Sub-step S201: reduce the TPS of concurrent crawling according to the difference between the highest safe upper-limit TPS and the current TPS; wherein the highest safe upper-limit TPS represents the historical highest TPS reached while the crawl index parameter stayed within the preset safe range.
For example, in one application example of the present application, the reduction can be expressed as:
TPS after reduction = (highest safe upper-limit TPS - current TPS) / 2    (1)
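As a worked instance, expression (1) can be evaluated directly. The function name and the sample values below are illustrative assumptions, and the expression is taken literally as written.

```python
def reduced_tps(highest_safe_tps, current_tps):
    """Expression (1), read literally: half the gap between the two TPS values."""
    return (highest_safe_tps - current_tps) / 2


# Sample values chosen for illustration only.
print(reduced_tps(100, 60))  # 20.0
```

With a highest safe upper-limit TPS of 100 and a current TPS of 60, the expression gives (100 - 60) / 2 = 20.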
In a preferred embodiment of the present application, before step 101 of concurrently processing pending crawl requests, the method may further comprise the following steps for obtaining the highest safe upper-limit TPS:
Step S301: set the initial value of the safe upper-limit TPS to a preset large value;
Step S302: gradually increase the number of concurrent threads used for concurrent crawling until it reaches the maximum number of concurrent threads;
Step S303: concurrently process pending crawl requests according to the current safe upper-limit TPS;
Step S304: when the current crawl index parameter does not exceed the preset safe range, increase the current safe upper-limit TPS;
Step S305: when the current crawl index parameter exceeds the preset safe range, reduce the current safe upper-limit TPS;
Step S306: record the safe upper-limit TPS after each increase or reduction;
Step S307: select the highest TPS among the recorded safe upper-limit TPS values as the highest safe upper-limit TPS.
Steps S304 and S305 above are an adjustment process carried out according to the network environment while the number of concurrent threads is fixed at the maximum number of concurrent threads. Those skilled in the art can decide how long to spend on the adjustment according to actual conditions; for example, N minutes can be spent on the adjustment before step 101 of concurrently processing pending crawl requests in order to obtain the highest safe upper-limit TPS, where N is a natural number.
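The adjustment loop of steps S301 to S307 can be sketched as follows. Several things here are assumptions rather than part of the application: the fixed step size, the fixed number of rounds standing in for the N-minute budget, the `is_safe_at` callback that stands in for a real crawl round (S303) plus index check, and the choice to record only limits at which the index parameters stayed in range, which follows the definition of the highest safe upper-limit TPS as a historical maximum under safe conditions.

```python
def find_highest_safe_tps(is_safe_at, initial_limit, step, rounds):
    """Probe for the highest safe upper-limit TPS.

    is_safe_at(limit) is a hypothetical callback: it crawls at the given
    limit and returns True if the crawl index parameters stayed within
    their preset safe ranges.
    """
    limit = initial_limit                 # S301: start from a preset large value
    safe_limits = []
    for _ in range(rounds):
        if is_safe_at(limit):             # S303 plus index check
            safe_limits.append(limit)     # S306 (safe values only, see lead-in)
            limit += step                 # S304: raise the limit
        else:
            limit = max(step, limit - step)  # S305: lower the limit
    return max(safe_limits) if safe_limits else None  # S307
```

For a site that copes with up to 50 TPS, starting at 100 with a step of 10 walks down to 50, then oscillates between 50 and 60, and the function returns 50.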
In summary, when the current crawl index parameter exceeds the preset safe range, the present application reduces the concurrency level of the concurrent web crawl, where the crawl index parameter measures the website's load state during concurrent web crawling. Reducing the concurrency level means reducing the number of crawl requests, that is, reducing the frequency with which the website is requested, which lowers the load on the website's server. By reducing the concurrency level, the present application can therefore keep the crawl index parameter, and hence the website load, within the preset safe range. This avoids the situation in which a large number of concurrent requests makes the website respond slowly or even crash, and so improves the website's response speed during concurrent web crawling.
Referring to Fig. 2, which shows a flowchart of Embodiment 2 of a concurrent web crawling method of the present application, the method may specifically comprise:
Step 201: concurrently process pending crawl requests, and monitor the processing event messages corresponding to the processed crawl requests;
Step 202: analyze the processing event messages to obtain the current crawl index parameter;
Step 203: when the current crawl index parameter exceeds the preset safe range, reduce the concurrency level of the concurrent web crawl;
Step 204: when the current crawl index parameter is below the preset safe range, increase the concurrency level of the concurrent web crawl.
Compared with Embodiment 1, Embodiment 2 can increase the concurrency level of the concurrent web crawl when the current crawl index parameter is below the preset safe range. Increasing the concurrency level means increasing the number of crawl requests, that is, increasing the frequency with which the website is requested, so the crawl index parameter can be brought close to, but not beyond, the preset safe range; for example, the crawl speed can be made to approach an ideal value. The present application can therefore safeguard both the website's response speed and the crawler's crawl speed.
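Steps 203 and 204 together form a simple feedback rule, sketched below. The function name, the fixed step, the floor and ceiling, and the treatment of the safe range as a band with a lower and an upper bound on a single index parameter (for example, average response time in milliseconds) are all assumptions for illustration.

```python
def adjust_concurrency(current, index_value, safe_low, safe_high,
                       step=10, floor=1, ceiling=200):
    """Return the new concurrency level for the next time period."""
    if index_value > safe_high:
        # Step 203: index parameter above the safe range, so back off.
        return max(floor, current - step)
    if index_value < safe_low:
        # Step 204: index parameter below the safe range, so speed up.
        return min(ceiling, current + step)
    return current  # within the safe range: leave the level unchanged
```

For example, with a safe band of 150 to 200 ms, a measured 250 ms lowers a level of 100 to 90, while a measured 120 ms raises it to 110.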
Referring to Fig. 3, a flowchart of Embodiment 3 of the concurrent web crawling method of the present application is shown, which may specifically comprise:
Step 301: when the current actual number of transactions processed per second (TPS) does not exceed the highest safe upper-limit TPS, permit processing of the pending crawl requests;
Step 302: process the permitted pending crawl requests concurrently, and listen for the processing event messages corresponding to the processed crawl requests;
Step 303: analyze the processing event messages to obtain the current crawl index parameter;
Step 304: when the current crawl index parameter exceeds the preset safe range, turn down the concurrency of the concurrent web crawl.
Compared with Embodiment 1, Embodiment 3 permits processing of a pending crawl request only when the current actual TPS does not exceed the highest safe upper-limit TPS. This admission mechanism refuses pending crawl requests that would go beyond the highest safe upper-limit TPS, so the current actual TPS is kept strictly within that limit. Compared with Embodiment 1, the website load is therefore controlled more tightly, and the response speed of the website during the concurrent crawl is improved further.
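The admission mechanism of step 301 could, for instance, be approximated with a one-second sliding window. The following Python sketch is not taken from the application: the class `TpsGate`, its methods, and the injectable `clock` argument are hypothetical names chosen for illustration.

```python
import time
from collections import deque


class TpsGate:
    """Admit a crawl request only while the measured TPS is under a ceiling."""

    def __init__(self, highest_safe_tps: float, clock=time.monotonic):
        self.highest_safe_tps = highest_safe_tps
        self._clock = clock
        self._admitted = deque()  # timestamps of requests admitted in the last second

    def current_tps(self) -> int:
        now = self._clock()
        # Forget admissions that are one second old or older.
        while self._admitted and now - self._admitted[0] >= 1.0:
            self._admitted.popleft()
        return len(self._admitted)

    def try_admit(self) -> bool:
        # Refuse when one more request would push the TPS past the ceiling.
        if self.current_tps() >= self.highest_safe_tps:
            return False
        self._admitted.append(self._clock())
        return True
```

A request processing module would call `try_admit()` before handing each pending crawl request to the crawler and would hold back any request for which it returns `False`.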
In practice, the highest safe upper-limit TPS can be obtained before step 301 by following the flow of the aforementioned steps S301 to S307.
It will be appreciated that, as a preferred embodiment, Embodiments 2 and 3 may also be combined; that is, the method flow of Embodiment 2 may further comprise steps 301 and 302. The present application does not limit how specific embodiments are combined.
Corresponding to the foregoing method embodiments, the present application also discloses a concurrent web crawling system. Referring to the structural diagram shown in Fig. 4, the system may specifically comprise:
Request processing module 401, configured to process pending crawl requests concurrently;
Message listening module 402, configured to listen for the processing event messages corresponding to the processed crawl requests;
Message analysis module 403, configured to analyze the processing event messages to obtain the current crawl index parameter; and
Concurrency turn-down module 404, configured to turn down the concurrency of the concurrent web crawl when the current crawl index parameter exceeds the preset safe range.
In a preferred embodiment of the present application, the system may further comprise: a concurrency turn-up module, configured to turn up the concurrency of the concurrent web crawl when the current crawl index parameter is below the preset safe range.
In a preferred embodiment of the present application, the message analysis module 403 may specifically comprise:
Message obtaining submodule, configured to obtain the processing event messages corresponding to the crawl requests processed in each time period; and
Message analysis submodule, configured to analyze the processing event messages of the crawl requests processed in the current time period on their own, and/or to analyze the processing event messages of the crawl requests processed in adjacent time periods, so as to obtain the current crawl index parameter.
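A minimal sketch of how processing event messages might be grouped into time periods before analysis is given below. It assumes each message carries a timestamp in seconds; the function name `group_by_period` and the `(timestamp, payload)` message shape are illustrative assumptions, not part of the application.

```python
from collections import defaultdict


def group_by_period(events, period_seconds):
    """Group (timestamp, payload) events into consecutive fixed-length periods.

    Returns a dict that maps a period index to the payloads seen in that period.
    """
    buckets = defaultdict(list)
    for timestamp, payload in events:
        # Period 0 covers [0, period_seconds), period 1 the next interval, and so on.
        buckets[int(timestamp // period_seconds)].append(payload)
    return dict(buckets)
```

The current period can then be analyzed alone, or compared with the bucket of the adjacent period index.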
In another preferred embodiment of the present application, the system may further comprise:
Admission module, configured to permit processing of the pending crawl requests, before the pending crawl requests are processed concurrently, when the current actual number of transactions processed per second (TPS) does not exceed the highest safe upper-limit TPS;
The request processing module 401 may then be specifically configured to process the permitted pending crawl requests concurrently.
In the embodiments of the present application, preferably, the crawl index parameter may specifically comprise one or more of a website response index parameter and a network condition parameter.
In a preferred embodiment of the present application, the website response index parameter may specifically comprise one or more of a response time parameter and a response ratio parameter;
Wherein the response time parameter may represent the time the website takes to respond to a processed crawl request, and the response ratio parameter may represent, for each time period, the proportion of all crawl requests processed in that period whose response time falls within the preset safe range.
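As a hedged illustration, the response ratio parameter for one time period might be computed as follows. The function name `response_ratio`, the millisecond unit, and the use of a single upper bound as the safe range are assumptions of this sketch.

```python
def response_ratio(response_times_ms, max_safe_ms):
    """Share of the requests processed in one period that responded within the safe bound.

    Returns None when no request was processed in the period.
    """
    if not response_times_ms:
        return None
    within = sum(1 for t in response_times_ms if t <= max_safe_ms)
    return within / len(response_times_ms)
```

For example, with response times of 100, 200, 900 and 1500 ms and a safe bound of 1000 ms, three of the four requests are within range.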
In another preferred embodiment of the present application, the network condition parameter between the website and the crawler may specifically comprise one or more of an error count parameter, an error ratio parameter, a crawl speed parameter, and a crawl speed ratio parameter;
Wherein the error count parameter may represent the number of processed crawl requests in which an exception occurred; the error ratio parameter may represent the ratio by which the error count in the current time period has increased relative to the error count in the previous time period; and the crawl speed ratio parameter may represent a ratio that is less than or equal to the current crawl speed parameter.
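The error ratio parameter and the crawl speed parameter can be illustrated with the short Python sketch below. The function names and the choice to return `None` when the previous period had no errors are assumptions; the application does not state how that case is handled.

```python
def error_increase_ratio(previous_errors, current_errors):
    """Relative growth of the error count from the previous period to the current one."""
    if previous_errors == 0:
        return None  # the ratio is undefined without a baseline
    return (current_errors - previous_errors) / previous_errors


def crawl_speed(processed_requests, period_seconds):
    """Crawl requests processed per second during one time period."""
    return processed_requests / period_seconds
```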
In another preferred embodiment of the present application, the concurrency turn-down module 404 may specifically comprise:
First turn-down submodule, configured to turn down the number of concurrent threads of the concurrent crawl; and/or
Second turn-down submodule, configured to turn down the TPS of the concurrent crawl.
In the embodiments of the present application, preferably, the second turn-down submodule may be specifically configured to turn down the TPS of the concurrent crawl according to the difference between the highest safe upper-limit TPS and the current TPS; wherein the highest safe upper-limit TPS represents the historical maximum TPS reached while the crawl index parameter did not exceed the preset safe range.
In a preferred embodiment of the present application, the system may further comprise: an upper-limit TPS obtaining module, configured to obtain the highest safe upper-limit TPS before the pending crawl requests are processed concurrently;
The upper-limit TPS obtaining module may specifically comprise:
Setting submodule, configured to set the initial value of the safe upper-limit TPS to a preset large value;
Progressive increase submodule, configured to progressively increase the number of concurrent threads of the concurrent crawl until it reaches the maximum number of concurrent threads;
Concurrent processing submodule, configured to process pending crawl requests concurrently according to the current safe upper-limit TPS;
Turn-up submodule, configured to turn up the current safe upper-limit TPS when the current crawl index parameter does not exceed the preset safe range;
Turn-down submodule, configured to turn down the current safe upper-limit TPS when the current crawl index parameter exceeds the preset safe range;
Recording submodule, configured to record the safe upper-limit TPS after it has been turned up or down; and
Selecting submodule, configured to select the highest TPS among the recorded safe upper-limit TPS values as the highest safe upper-limit TPS.
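One possible reading of this probing procedure is sketched below in Python. It is a simplification under stated assumptions: `run_round` is a hypothetical callback that crawls for one round at the given limit and reports whether the crawl index parameter stayed within the preset safe range; the multiplicative factors `up` and `down` are invented for the sketch; the thread ramp-up is omitted; and only limits observed to be safe are recorded, in line with the definition of the highest safe upper-limit TPS as a historical maximum under safe conditions.

```python
def probe_highest_safe_tps(run_round, initial_tps, rounds, up=1.1, down=0.5):
    """Search for the highest TPS limit at which the crawl stayed within range.

    run_round(limit) -> bool: True if the crawl index parameter stayed in range.
    Returns None if no round stayed within range.
    """
    limit = initial_tps            # start from a preset large value
    safe_history = []
    for _ in range(rounds):
        if run_round(limit):
            safe_history.append(limit)  # record a limit that proved safe
            limit *= up                 # in range: turn the limit up
        else:
            limit *= down               # out of range: turn the limit down
    return max(safe_history) if safe_history else None
```

With a site that tolerates at most 100 TPS, a start value of 1000, `up=2.0` and `down=0.5`, the limit falls through 500, 250 and 125 to 62.5 and then alternates between 62.5 and 125, so the recorded maximum is 62.5.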
To help those skilled in the art better understand the present application, Fig. 5 shows a structural schematic of a concurrent web crawling system of the present application. The system may specifically comprise an initialization module 501, a requirement analysis module 502, a request processing module 503, a crawler module 504, a message listening module 505, a message analysis module 506, an admission judgment module 507, a concurrency turn-down module 508, and a termination module 509. The corresponding crawl flow may specifically comprise:
Step S1: the initialization module 501 reads the crawl configuration from a configuration file. The configuration may specifically include the crawl entry (such as a URL), the preset safe range for each crawl index parameter, and so on. The module submits the crawl entry to the requirement analysis module 502 as a pending crawl request;
Step S2: the requirement analysis module 502 receives the pending crawl request submitted by the initialization module 501, or obtains a pending crawl request from the pending request queue;
Step S3: the requirement analysis module 502 determines by analysis whether the current pending crawl request has already been processed; if so, it discards the request; otherwise it submits the pending crawl request to the request processing module 503;
Step S4: on receiving an unprocessed crawl request, the request processing module 503 sends a processing permission request to the admission judgment module 507;
Step S5: on receiving the processing permission request, the admission judgment module 507 judges whether the current actual number of transactions processed per second (TPS) exceeds the highest safe upper-limit TPS; if not, it returns a permission message to the request processing module 503;
Step S6: according to the permission message returned by the admission judgment module 507, the request processing module 503 submits the pending crawl request to the crawler module 504;
Step S7: the crawler module 504 processes the pending crawl requests concurrently, generates new pending crawl requests during processing and saves them to the pending request queue, and sends processing result event messages to the message listening module 505;
Step S8: the message listening module 505 listens for the processing event messages, sent by the crawler module 504, that correspond to the processed crawl requests;
Step S9: the message analysis module 506 analyzes the processing event messages received by the listener to obtain the current crawl index parameter;
Step S10: when the current crawl index parameter exceeds the preset safe range, the concurrency turn-down module 508 turns down the concurrency of the concurrent web crawl;
Step S11: the requirement analysis module 502 judges whether the pending request queue is empty; if so, it sends a crawl-finished event message to the message listening module 505;
Step S12: the message listening module 505 forwards the crawl-finished event message it has received to the termination module 509;
Step S13: the termination module 509 ends the crawl flow according to the crawl-finished event message.
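The control flow of steps S1 to S13 can be condensed into the following Python sketch. It runs sequentially rather than concurrently so that it stays short, and the callables `fetch`, `admit` and `on_event` are hypothetical stand-ins for the crawler module, the admission judgment module and the message listening module.

```python
from collections import deque


def crawl(entry_urls, fetch, admit, on_event):
    """Run a sequential version of the S1-S13 flow and return the processed URLs.

    fetch(url) -> iterable of newly discovered URLs (stand-in for the crawler module)
    admit() -> bool, the admission decision of step S5
    on_event(dict) receives processing and crawl-finished events
    """
    pending = deque(entry_urls)       # S1: crawl entries become pending requests
    processed = set()
    while pending:                    # S11: the crawl ends when the queue is empty
        url = pending.popleft()       # S2: take the next pending request
        if url in processed:          # S3: discard requests already processed
            continue
        if not admit():               # S4-S5: ask for permission to process
            pending.append(url)       # refused: retry later (a real crawler would wait)
            continue
        new_urls = list(fetch(url))   # S6-S7: crawl the page
        processed.add(url)
        pending.extend(new_urls)      # S7: queue the newly generated requests
        on_event({"type": "processed", "url": url, "new": len(new_urls)})  # S7-S8
    on_event({"type": "finished"})    # S11-S13: signal the end of the crawl
    return processed
```

Steps S9 and S10 would hang off `on_event`: a listener can collect the processed events, derive the crawl index parameter, and adjust whatever limit `admit` enforces.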
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and for the parts that are the same or similar the embodiments may be referred to one another. Since the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, see the description of the method embodiments.
Embodiments of the present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices. Embodiments of the present invention are preferably applied in embedded systems.
Embodiments of the present invention may be described in the general context of computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. Embodiments of the present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices. In a typical configuration, the computer device comprises one or more processors (CPUs), an input/output interface, a network interface, and memory. The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The concurrent web crawling method and system provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the above embodiments is intended only to help in understanding the method of the application and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of the application, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (20)

1. A concurrent web crawling method, characterized by comprising:
processing pending crawl requests concurrently, and listening for the processing event messages corresponding to the processed crawl requests;
analyzing the processing event messages to obtain a current crawl index parameter;
when the current crawl index parameter exceeds a preset safe range, turning down the concurrency of the concurrent web crawl.
2. The method of claim 1, characterized by further comprising:
when the current crawl index parameter is below the preset safe range, turning up the concurrency of the concurrent web crawl.
3. The method of claim 1, characterized in that the step of analyzing the processing event messages to obtain the current crawl index parameter comprises:
obtaining the processing event messages corresponding to the crawl requests processed in each time period;
analyzing the processing event messages of the crawl requests processed in the current time period on their own, and/or analyzing the processing event messages of the crawl requests processed in adjacent time periods, to obtain the current crawl index parameter.
4. The method of claim 1, characterized in that, before the step of processing the pending crawl requests concurrently, the method further comprises:
when the current actual number of transactions processed per second (TPS) does not exceed the highest safe upper-limit TPS, permitting processing of the pending crawl requests;
the step of processing the pending crawl requests concurrently then being, specifically, processing the permitted pending crawl requests concurrently.
5. The method of claim 1, 2 or 3, characterized in that the crawl index parameter comprises one or more of a website response index parameter and a network condition parameter.
6. The method of claim 5, characterized in that the website response index parameter comprises one or more of a response time parameter and a response ratio parameter;
wherein the response time parameter represents the time the website takes to respond to a processed crawl request, and the response ratio parameter represents, for each time period, the proportion of all crawl requests processed in that period whose response time falls within the preset safe range.
7. The method of claim 5, characterized in that the network condition parameter between the website and the crawler comprises one or more of an error count parameter, an error ratio parameter, a crawl speed parameter, and a crawl speed ratio parameter;
wherein the error count parameter represents the number of processed crawl requests in which an exception occurred, the error ratio parameter represents the ratio by which the error count in the current time period has increased relative to the error count in the previous time period, and the crawl speed ratio parameter represents a ratio that is less than or equal to the current crawl speed parameter.
8. The method of claim 1, characterized in that the step of turning down the concurrency of the concurrent web crawl when the current crawl index parameter exceeds the preset safe range comprises:
turning down the number of concurrent threads of the concurrent crawl, and/or turning down the TPS of the concurrent crawl.
9. The method of claim 8, characterized in that the step of turning down the TPS of the concurrent crawl comprises:
turning down the TPS of the concurrent crawl according to the difference between the highest safe upper-limit TPS and the current TPS; wherein the highest safe upper-limit TPS represents the historical maximum TPS reached while the crawl index parameter did not exceed the preset safe range.
10. The method of claim 4 or 9, characterized in that, before the step of processing the pending crawl requests concurrently, the method further comprises the following steps of obtaining the highest safe upper-limit TPS:
setting the initial value of the safe upper-limit TPS to a preset large value;
progressively increasing the number of concurrent threads of the concurrent crawl until it reaches the maximum number of concurrent threads;
processing pending crawl requests concurrently according to the current safe upper-limit TPS;
when the current crawl index parameter does not exceed the preset safe range, turning up the current safe upper-limit TPS;
when the current crawl index parameter exceeds the preset safe range, turning down the current safe upper-limit TPS;
recording the safe upper-limit TPS after it has been turned up or down;
selecting the highest TPS among the recorded safe upper-limit TPS values as the highest safe upper-limit TPS.
11. A concurrent web crawling system, characterized by comprising:
a request processing module, configured to process pending crawl requests concurrently;
a message listening module, configured to listen for the processing event messages corresponding to the processed crawl requests;
a message analysis module, configured to analyze the processing event messages to obtain a current crawl index parameter; and
a concurrency turn-down module, configured to turn down the concurrency of the concurrent web crawl when the current crawl index parameter exceeds a preset safe range.
12. The system of claim 11, characterized by further comprising:
a concurrency turn-up module, configured to turn up the concurrency of the concurrent web crawl when the current crawl index parameter is below the preset safe range.
13. The system of claim 11, characterized in that the message analysis module comprises:
a message obtaining submodule, configured to obtain the processing event messages corresponding to the crawl requests processed in each time period;
a message analysis submodule, configured to analyze the processing event messages of the crawl requests processed in the current time period on their own, and/or to analyze the processing event messages of the crawl requests processed in adjacent time periods, to obtain the current crawl index parameter.
14. The system of claim 11, characterized in that the system further comprises:
an admission module, configured to permit processing of the pending crawl requests, before the pending crawl requests are processed concurrently, when the current actual number of transactions processed per second (TPS) does not exceed the highest safe upper-limit TPS;
the request processing module then being specifically configured to process the permitted pending crawl requests concurrently.
15. The system of claim 11, 12 or 13, characterized in that the crawl index parameter comprises one or more of a website response index parameter and a network condition parameter.
16. The system of claim 15, characterized in that the website response index parameter comprises one or more of a response time parameter and a response ratio parameter;
wherein the response time parameter represents the time the website takes to respond to a processed crawl request, and the response ratio parameter represents, for each time period, the proportion of all crawl requests processed in that period whose response time falls within the preset safe range.
17. The system of claim 15, characterized in that the network condition parameter between the website and the crawler comprises one or more of an error count parameter, an error ratio parameter, a crawl speed parameter, and a crawl speed ratio parameter;
wherein the error count parameter represents the number of processed crawl requests in which an exception occurred, the error ratio parameter represents the ratio by which the error count in the current time period has increased relative to the error count in the previous time period, and the crawl speed ratio parameter represents a ratio that is less than or equal to the current crawl speed parameter.
18. The system of claim 11, characterized in that the concurrency turn-down module comprises:
a first turn-down submodule, configured to turn down the number of concurrent threads of the concurrent crawl; and/or
a second turn-down submodule, configured to turn down the TPS of the concurrent crawl.
19. The system of claim 16, characterized in that the second turn-down submodule is specifically configured to turn down the TPS of the concurrent crawl according to the difference between the highest safe upper-limit TPS and the current TPS; wherein the highest safe upper-limit TPS represents the historical maximum TPS reached while the crawl index parameter did not exceed the preset safe range.
20. The system of claim 14 or 19, characterized in that the system further comprises: an upper-limit TPS obtaining module, configured to obtain the highest safe upper-limit TPS before the pending crawl requests are processed concurrently;
the upper-limit TPS obtaining module comprising:
a setting submodule, configured to set the initial value of the safe upper-limit TPS to a preset large value;
a progressive increase submodule, configured to progressively increase the number of concurrent threads of the concurrent crawl until it reaches the maximum number of concurrent threads;
a concurrent processing submodule, configured to process pending crawl requests concurrently according to the current safe upper-limit TPS;
a turn-up submodule, configured to turn up the current safe upper-limit TPS when the current crawl index parameter does not exceed the preset safe range;
a turn-down submodule, configured to turn down the current safe upper-limit TPS when the current crawl index parameter exceeds the preset safe range;
a recording submodule, configured to record the safe upper-limit TPS after it has been turned up or down; and
a selecting submodule, configured to select the highest TPS among the recorded safe upper-limit TPS values as the highest safe upper-limit TPS.
CN201310575226.XA 2013-11-15 2013-11-15 A kind of concurrent grasping means of webpage and system Active CN104657355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310575226.XA CN104657355B (en) 2013-11-15 2013-11-15 A kind of concurrent grasping means of webpage and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310575226.XA CN104657355B (en) 2013-11-15 2013-11-15 A kind of concurrent grasping means of webpage and system

Publications (2)

Publication Number Publication Date
CN104657355A true CN104657355A (en) 2015-05-27
CN104657355B CN104657355B (en) 2018-10-23

Family

ID=53248504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310575226.XA Active CN104657355B (en) 2013-11-15 2013-11-15 A kind of concurrent grasping means of webpage and system

Country Status (1)

Country Link
CN (1) CN104657355B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106059849A (en) * 2016-05-09 2016-10-26 上海斐讯数据通信技术有限公司 Automatic trigger packet capture system and method
CN106921695A (en) * 2015-12-24 2017-07-04 阿里巴巴集团控股有限公司 Resource encapsulation method and device and assets packaging method
CN108632325A (en) * 2017-03-24 2018-10-09 中国移动通信集团浙江有限公司 A kind of call method and device of application

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6961341B1 (en) * 1996-07-02 2005-11-01 Microsoft Corporation Adaptive bandwidth throttling for network services
CN101719377A (en) * 2009-11-24 2010-06-02 成都市华为赛门铁克科技有限公司 Method and device for controlling power consumption
CN102811258A (en) * 2012-07-27 2012-12-05 北京星网锐捷网络技术有限公司 Data parallel-downloading method, apparatus and network device
CN102868573A (en) * 2012-09-12 2013-01-09 北京航空航天大学 Method and device for Web service load cloud test

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106921695A (en) * 2015-12-24 2017-07-04 阿里巴巴集团控股有限公司 Resource encapsulation method and device and assets packaging method
CN106059849A (en) * 2016-05-09 2016-10-26 上海斐讯数据通信技术有限公司 Automatic trigger packet capture system and method
CN106059849B (en) * 2016-05-09 2019-10-22 上海斐讯数据通信技术有限公司 A kind of automatic trigger packet snapping system and method
CN108632325A (en) * 2017-03-24 2018-10-09 中国移动通信集团浙江有限公司 A kind of call method and device of application

Also Published As

Publication number Publication date
CN104657355B (en) 2018-10-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant