CN106445966A - Data acquisition method and apparatus - Google Patents

Publication number
CN106445966A
Authority
CN
China
Prior art keywords
data
crawls
failure
time
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510489158.4A
Other languages
Chinese (zh)
Inventor
李新国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201510489158.4A priority Critical patent/CN106445966A/en
Publication of CN106445966A publication Critical patent/CN106445966A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

Embodiments of the invention provide a data acquisition method and apparatus, and belong to the field of networks. The method comprises the steps of acquiring a failed data crawling task, wherein the data crawling task at least contains a data crawling failure frequency and data crawling failure time; determining data re-crawling time of the failed data crawling task according to the data crawling failure frequency and/or the data crawling failure time; and executing a data re-crawling task for the failed data crawling task according to the data re-crawling time. By determining the data re-crawling time of the failed data crawling task according to the data crawling failure frequency and/or the data crawling failure time, the data acquisition reliability is improved and the data acquisition efficiency is enhanced.

Description

Data acquisition method and apparatus
Technical field
The present application relates to the field of networks, and in particular to a data acquisition method and apparatus.
Background
When a web crawler is used to acquire data, the data may be unobtainable because of a network failure, a temporary website outage, or an invalid URL (Uniform Resource Locator). A data acquisition method is therefore needed that can still acquire the data when a network failure, a temporary website outage, or an invalid URL occurs.
In the prior art, the page URL is recorded and stored in a queue, and the crawl is retried once every fixed interval. When the number of retries reaches a threshold and the data still cannot be obtained, the URL is judged to be invalid and acquisition of that data stops.
With the prior-art method, however, no suitable interval can be found. If the interval is too long, tasks whose data could not be obtained because of a network failure pile up in the queue and increase the memory burden. If the interval is too short, tasks whose data could not be obtained because of an invalid URL or a temporary website outage are retried frequently, which increases the burden on the server and the network. Moreover, because the recovery time of a temporarily unavailable website cannot be predicted, a URL may reach the retry limit and be discarded before the website recovers, so data that was only temporarily unobtainable is lost. The prior-art method therefore reduces both the reliability and the efficiency of data acquisition.
Summary of the invention
To improve the reliability and the efficiency of data acquisition, the embodiments of the present application provide a data acquisition method and apparatus. The technical solution is as follows:
The present application provides a data acquisition method, the method comprising:
acquiring a failed data crawling task, wherein the data crawling task contains at least the number of data crawling failures and the data crawling failure time;
determining, according to the number of data crawling failures and/or the data crawling failure time, the time at which the failed data crawling task is to be re-crawled;
executing the data crawling task again for the failed data crawling task according to the re-crawling time.
The present application also provides a data acquisition apparatus, the apparatus comprising:
an acquisition module, configured to acquire a failed data crawling task, wherein the data crawling task contains at least the number of data crawling failures and the data crawling failure time;
a first processing module, configured to determine, according to the number of data crawling failures and/or the data crawling failure time, the time at which the failed data crawling task is to be re-crawled;
a second processing module, configured to execute the data crawling task again for the failed data crawling task according to the re-crawling time.
The embodiments of the present application provide a data acquisition method and apparatus, comprising: acquiring a failed data crawling task, wherein the data crawling task contains at least the number of data crawling failures and the data crawling failure time; determining, according to the number of failures and/or the failure time, the time at which the failed task is to be re-crawled; and executing the data crawling task again for the failed task according to the re-crawling time. Because the re-crawling time is determined from the number of failures and/or the failure time, it can be adjusted for each failed task. This avoids missing data that was unobtainable because of, for example, a temporary website outage, and so ensures the reliability of data acquisition. It also prevents the web crawler from repeatedly crawling such data within a short time, along with the resulting burden on network resources and on the system resources of the apparatus, which further improves the reliability and the efficiency of data acquisition.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a data acquisition method provided by an embodiment of the present application;
Fig. 2 is a flowchart of a data acquisition method provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a data acquisition apparatus provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a data acquisition apparatus provided by an embodiment of the present application.
Detailed description
To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present application.
The embodiments of the present application provide a data acquisition method. Its application scenarios include a web crawler acquiring web page data by executing a crawling task, where the crawling task includes at least the identifier of the web page on which the data to be acquired is located. The method can also be applied to other scenarios in which web page data is acquired according to web page identifier information. The web page identifier may be the URL (Uniform Resource Locator) of the web page. The embodiments of the present application do not limit the specific application scenario.
Embodiment one is a data acquisition method provided by an embodiment of the present application. Referring to Fig. 1, the method includes:
101. Acquire a failed data crawling task.
The data crawling task contains at least the number of data crawling failures and the data crawling failure time. The failed data crawling task can be acquired by real-time monitoring, or from a preset database that stores failed data crawling tasks.
In the first case, when a data crawling task is observed to fail, the failure time of the task and the number of times the executing entity is configured to re-execute the crawling task after a failure are obtained. The failure time may be the time at which the crawl first failed or the time at which the executing entity detected the failure; the present application does not limit this. Then, according to the requirements of the actual crawling target, the number of data crawling failures and the data crawling failure time are stored in the preset database. The crawling target may be, for example, the crawling of website visit counts or of microblog click counts.
Note that if the system has not set a number of times to re-execute the crawling task after a failure, the number of data crawling failures may default to zero.
In the second case, for a system (that is, an executing entity) with low crawling efficiency, the failed data crawling task of step 101 may instead be acquired from the preset database that stores failed data crawling tasks, which improves the efficiency with which the executing entity executes data crawling tasks.
Specifically, when the executing entity executes a data crawling task taken from the task pool and the crawl fails, the failed task is stored in the preset database. The failed task includes information such as the number of data crawling failures, the data crawling failure time, and the environment in which the crawl failed.
The preset database may be set up in advance on a server, in the cache of the apparatus, or in a database of the apparatus; the embodiments of the present application do not limit where the preset database is located.
The failed data crawling tasks contained in the preset database may be sorted by the time at which they were added.
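The record and storage described in step 101 can be sketched as follows. This is a minimal illustration only: the names FailedCrawlTask and FailedTaskStore, the field names, and the use of an in-memory list in place of the preset database are assumptions made for the example and do not come from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailedCrawlTask:
    """Hypothetical record for one failed data crawling task."""
    url: str                   # identifier of the web page to crawl
    failure_count: int = 0     # number of data crawling failures (defaults to zero)
    failure_time: float = 0.0  # data crawling failure time, in seconds
    added_time: float = 0.0    # time the record was added to the store


class FailedTaskStore:
    """In-memory stand-in for the preset database (an assumption)."""

    def __init__(self) -> None:
        self._tasks: List[FailedCrawlTask] = []

    def add(self, task: FailedCrawlTask) -> None:
        # Keep the records sorted by the time at which they were added.
        self._tasks.append(task)
        self._tasks.sort(key=lambda t: t.added_time)

    def tasks(self) -> List[FailedCrawlTask]:
        return list(self._tasks)


store = FailedTaskStore()
store.add(FailedCrawlTask("http://example.com/b", 2, 100.0, 105.0))
store.add(FailedCrawlTask("http://example.com/a", 1, 90.0, 95.0))
```

A real deployment would presumably replace the list with a server-side database or cache, as the description allows; the sorted order is the only property the sketch relies on.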
102. Determine, according to the number of data crawling failures and/or the data crawling failure time, the time at which the failed data crawling task is to be re-crawled.
Specifically, the data crawling interval can first be determined according to the crawling target of the executed task, the number of data crawling failures, or the data crawling failure time. The crawling target may be, for example, the crawling of website visit counts or of microblog click counts; the present application does not limit this. Then, according to the data crawling interval and the number of data crawling failures and/or the data crawling failure time, the time at which the failed task is to be re-crawled is generated.
103. According to the re-crawling time, execute the data crawling task again for the failed data crawling task. If the re-crawling time is met, the failed data crawling task is set as the task that the web crawler is about to execute.
Specifically, whether the re-crawling time meets the time at which the web crawler is about to execute can be judged in order to decide whether to re-execute the failed task. The time at which the web crawler is about to execute can be set according to the environment of the crawl to be performed; the present application does not limit this.
Further, when the re-crawling time does not meet the time at which the web crawler is about to execute, step 103 may also include: storing the failed data crawling task in the preset database to await the next judgment of the re-crawling time.
The embodiments of the present application provide a data acquisition method in which the time at which a failed data crawling task is to be re-crawled is determined according to the number of data crawling failures and/or the data crawling failure time, so that the re-crawling time can be adjusted for each failed task. This avoids missing data that was unobtainable because of, for example, a temporary website outage, and so ensures the reliability of data acquisition. It also prevents the web crawler from repeatedly crawling such data within a short time, along with the resulting burden on network resources and on the system resources of the apparatus, which further improves the reliability and the efficiency of data acquisition. In addition, the method judges whether the re-crawling time meets the time at which the web crawler is about to execute and, if so, sets the failed task as the task that the web crawler is about to execute. This further limits repeated crawling within a short time, and it ensures that data that was unobtainable because of a temporary website outage is acquired promptly once the outage is resolved.
Embodiment two is a data acquisition method provided by an embodiment of the present application. Referring to Fig. 2, the method includes:
201. Acquire a failed data crawling task, wherein the data crawling task contains at least the number of data crawling failures and the data crawling failure time.
Specifically, after each preset period, a failed data crawling task is acquired from the preset database.
The process of acquiring a failed data crawling task from the preset database may be as follows:
After each preset period, the failed data crawling tasks contained in the preset database are scanned and, following the order in which they were added to the preset database, the tasks that were added earlier are acquired first.
Because failed tasks are acquired in the order in which they were added, the tasks that have waited longest for a retry are crawled first. This ensures that data that was unobtainable because of, for example, a temporary website outage is acquired promptly once it becomes obtainable again, which further improves the reliability and the efficiency of data acquisition.
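The earliest-added-first scan of step 201 can be sketched with a priority queue. The function name scan_earliest_first and the (added_time, url) tuples standing in for failed-task records are hypothetical; the application does not prescribe a data structure, and the caller is assumed to invoke the scan once per preset period.

```python
import heapq
from typing import Iterator, List, Tuple


def scan_earliest_first(
    records: List[Tuple[float, str]]
) -> Iterator[Tuple[float, str]]:
    """Yield (added_time, url) records, earliest added_time first."""
    heap = list(records)   # copy so the caller's list is left untouched
    heapq.heapify(heap)    # min-heap ordered by added_time
    while heap:
        yield heapq.heappop(heap)


records = [
    (105.0, "http://example.com/b"),
    (95.0, "http://example.com/a"),
    (120.0, "http://example.com/c"),
]
order = [url for _, url in scan_earliest_first(records)]
```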
202. Determine, according to the number of data crawling failures and/or the data crawling failure time, the time at which the failed data crawling task is to be re-crawled.
Specifically, the time at which the failed task is to be re-crawled is generated according to the determined data crawling interval and the number of data crawling failures and/or the data crawling failure time.
To further illustrate the method of the embodiments of the present application, assume that the number of data crawling failures is $M$, the data crawling interval is $N$, the data crawling failure time is $T_F$, and the time at which the failed task is to be re-crawled is $T_S$.
The re-crawling time can be generated from the data crawling interval and the number of data crawling failures according to a first preset function $T_{S1}$, which may be as shown in formula [1]:
$T_{S1} = N \times a^{M}$  [1]
The re-crawling time can also be generated from the data crawling interval and the data crawling failure time according to a second preset function $T_{S2}$, which may be as shown in formula [2]:
$T_{S2} = T_F + N \times a$  [2]
The re-crawling time can further be generated from the data crawling interval, the data crawling failure time, and the number of data crawling failures according to a third preset function $T_{S3}$, which may be as shown in formula [3]:
$T_{S3} = T_F + N \times a^{M}$  [3]
In formulas [1] to [3], $a$ is a preset coefficient and $a > 1$.
Note that the first, second, and third preset functions are only examples. Other preset functions may also be used, provided that they are non-decreasing; the embodiments of the present application do not limit the specific preset function.
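The three preset functions can be tried out with a short sketch. It assumes that the flattened term "aM" in formulas [1] and [3] denotes the power a^M, which is a reading of the source rather than something it states explicitly; the function names ts1, ts2, ts3 and the sample values are likewise illustrative only.

```python
def _check_coefficient(a: float) -> None:
    # The description requires the preset coefficient a to be greater than 1.
    if a <= 1:
        raise ValueError("preset coefficient a must be greater than 1")


def ts1(n: float, a: float, m: int) -> float:
    """Formula [1]: interval N scaled by a to the power of the failure count M."""
    _check_coefficient(a)
    return n * a ** m


def ts2(tf: float, n: float, a: float) -> float:
    """Formula [2]: failure time TF plus interval N scaled by a."""
    _check_coefficient(a)
    return tf + n * a


def ts3(tf: float, n: float, a: float, m: int) -> float:
    """Formula [3]: failure time TF plus interval N scaled by a to the power M."""
    _check_coefficient(a)
    return tf + n * a ** m


# Sample values (assumptions): N = 60 s, a = 2, M = 3 failures, TF = 1000 s.
first = ts1(60, 2, 3)         # 60 * 2**3 = 480
second = ts2(1000, 60, 2)     # 1000 + 60 * 2 = 1120
third = ts3(1000, 60, 2, 3)   # 1000 + 60 * 2**3 = 1480
```

Under this reading, each additional failure multiplies the added delay by a, so a task that keeps failing is retried less and less often.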
Generating the re-crawling time from the determined data crawling interval and the number of data crawling failures and/or the data crawling failure time ensures that a failed task with more failures is executed later than a failed task with fewer failures. The more often a task has failed, the lower the probability that it will execute successfully, so tasks with a higher probability of success are executed first. This ensures timely acquisition of data and further improves the reliability and the efficiency of data acquisition. In addition, generating the re-crawling time from the data crawling failure time ensures that data that was unobtainable because of, for example, a temporary website outage is acquired promptly once it becomes obtainable again. Generating the re-crawling time from the number of data crawling failures also prevents the web crawler from repeatedly crawling such data within a short time, along with the resulting burden on network resources and on the system resources of the apparatus.
203. Judge whether the re-crawling time meets the time at which the web crawler is about to execute. If it does, perform step 204; if it does not, perform step 205.
The time at which the web crawler is about to execute may be set to an integer multiple of the data crawling interval.
Judging whether the re-crawling time is earlier than the time at which the web crawler is about to execute prevents the web crawler from repeatedly crawling, within a short time, data that is unobtainable because of, for example, a temporary website outage, along with the resulting burden on network resources and on the system resources of the apparatus. This further improves the reliability and the efficiency of data acquisition.
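The judgment in step 203 can be sketched as below. The sketch assumes that the crawler's next execution time is the smallest integer multiple of the crawling interval that is not earlier than the current time, and that "meets" means the re-crawling time is not later than that execution time. Both points are interpretations; the names next_execution_time and is_due are hypothetical.

```python
import math


def next_execution_time(now: float, interval: float) -> float:
    """Smallest integer multiple of the interval that is >= now (an assumption)."""
    return math.ceil(now / interval) * interval


def is_due(recrawl_time: float, now: float, interval: float) -> bool:
    """True when the re-crawling time is not later than the next execution time."""
    return recrawl_time <= next_execution_time(now, interval)


# With a 60 s interval and the clock at 130 s, the next execution is at 180 s.
upcoming = next_execution_time(130, 60)
```

A task for which is_due returns True would go to step 204, and any other task to step 205.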
204. Set the failed data crawling task as the task that the web crawler is about to execute.
Specifically, the failed data crawling task is added to the task pool of the web crawler.
Setting the failed task as the task that the web crawler is about to execute ensures that the failed task is re-crawled promptly and therefore that the data is acquired promptly, which further improves the reliability and the efficiency of data acquisition.
Note that steps 203 and 204 are one way of executing the data crawling task again for the failed task according to the re-crawling time. The process can also be realized in other ways; the embodiments of the present application do not limit the specific way.
Because the re-crawling time is generated from the number of data crawling failures and/or the data crawling failure time, a failed task with more failures is executed later than a failed task with fewer failures. The more often a task has failed, the lower the probability that it will execute successfully, so tasks with a higher probability of success are executed first. This ensures timely acquisition of data and further improves the reliability and the efficiency of data acquisition.
205. Store the failed data crawling task in the preset database to await the next judgment of the re-crawling time, and acquire the next failed data crawling task that follows this task in the preset database.
Storing the failed task in the preset database when its re-crawling time does not meet the time at which the web crawler is about to execute prevents the web crawler from repeatedly crawling, within a short time, data that is unobtainable because of, for example, a temporary website outage, along with the resulting burden on network resources and on the system resources of the apparatus. This further improves the reliability and the efficiency of data acquisition.
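One pass over steps 203 to 205 can be sketched as follows. The dictionary keys added_time and recrawl_time, the function name run_pass, and the two returned lists standing in for the crawler's task pool and the preset database are assumptions made for the illustration.

```python
from typing import Dict, List, Tuple


def run_pass(
    failed_tasks: List[Dict], next_run: float
) -> Tuple[List[Dict], List[Dict]]:
    """Split failed tasks into those to execute now and those left waiting."""
    task_pool: List[Dict] = []       # step 204: tasks the crawler will execute
    still_waiting: List[Dict] = []   # step 205: tasks kept in the database
    # Tasks added earlier are examined first, as in step 201.
    for task in sorted(failed_tasks, key=lambda t: t["added_time"]):
        if task["recrawl_time"] <= next_run:   # step 203 (assumed comparison)
            task_pool.append(task)
        else:
            still_waiting.append(task)
    return task_pool, still_waiting


tasks = [
    {"url": "http://example.com/b", "added_time": 105.0, "recrawl_time": 1480.0},
    {"url": "http://example.com/a", "added_time": 95.0, "recrawl_time": 1120.0},
    {"url": "http://example.com/c", "added_time": 110.0, "recrawl_time": 1150.0},
]
pool, waiting = run_pass(tasks, next_run=1200.0)
```

Tasks left in the waiting list are simply examined again on the next pass, which mirrors storing them back in the preset database.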
Executing the method of the embodiments of the present application prevents the web crawler from repeatedly crawling, within a short time, data that is unobtainable because of, for example, a temporary website outage, along with the resulting burden on network resources and on the system resources of the apparatus. It also ensures that the data is acquired promptly once the website recovers to a state in which the data is obtainable, which improves the reliability and the efficiency of data acquisition.
The embodiments of the present application provide a data acquisition method in which a second task time, at which a crawling task awaiting retry is to be retried, is generated according to the number of retries of that task and the first task time of its last retry. The retry time can thus be adjusted according to the number of retries and the time of the last retry. This avoids missing data that was unobtainable because of, for example, a temporary website outage, and so ensures the reliability of data acquisition. Generating the second task time in this way also prevents the web crawler from repeatedly crawling such data within a short time, along with the resulting burden on network resources and on the system resources of the apparatus, which further improves the reliability and the efficiency of data acquisition. In addition, the method judges whether the second task time meets a preset condition. If it does, the crawling task awaiting retry is set as the task that the web crawler is about to execute. Otherwise, the next crawling task awaiting retry is acquired, and the current task is not set as the task that the web crawler is about to execute. This further limits repeated crawling within a short time, and it ensures that data that was unobtainable because of a temporary website outage is acquired promptly once the outage is resolved.
Embodiment three is a data acquisition apparatus provided by an embodiment of the present application. Referring to Fig. 3, the data acquisition apparatus includes:
an acquisition module 31, configured to acquire a failed data crawling task, wherein the data crawling task contains at least the number of data crawling failures and the data crawling failure time;
a first processing module 32, configured to determine, according to the number of data crawling failures and/or the data crawling failure time, the time at which the failed data crawling task is to be re-crawled;
a second processing module 33, configured to execute the data crawling task again for the failed data crawling task according to the re-crawling time.
The acquisition module is configured to: for the acquired failed data crawling task, obtain the number of data crawling failures and the data crawling failure time; and store the number of data crawling failures and the data crawling failure time in the preset database.
Optionally, the first processing module 32 may include:
a determination submodule, configured to determine the data crawling interval;
a generation submodule, configured to generate the time at which the failed task is to be re-crawled according to the determined data crawling interval and the number of data crawling failures and/or the data crawling failure time.
Optionally, the second processing module 33 is specifically configured to:
judge whether the re-crawling time meets the time at which the web crawler is about to execute;
if it does, set the failed data crawling task as the task that the web crawler is about to execute.
Optionally, the second processing module 33 is further configured to:
store the failed data crawling task in the preset database to await the next judgment of the re-crawling time.
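The division of work among the three modules can be sketched as plain classes. This is only one possible arrangement under stated assumptions: the class and method names are hypothetical, a Python list stands in for the preset database, and the re-crawling time uses the form TF + N × a^M with an assumed coefficient of 2.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailedTask:
    url: str
    failure_count: int
    failure_time: float
    recrawl_time: Optional[float] = None


class AcquisitionModule:
    """Acquires failed tasks from the (stand-in) preset database."""

    def __init__(self, database: List[FailedTask]) -> None:
        self.database = database

    def acquire(self) -> List[FailedTask]:
        tasks = list(self.database)
        self.database.clear()   # acquired tasks are re-stored only if not due
        return tasks


class FirstProcessingModule:
    """Determines the re-crawling time of a failed task."""

    def __init__(self, interval: float, coefficient: float = 2.0) -> None:
        self.interval = interval
        self.coefficient = coefficient

    def determine(self, task: FailedTask) -> FailedTask:
        delay = self.interval * self.coefficient ** task.failure_count
        task.recrawl_time = task.failure_time + delay
        return task


class SecondProcessingModule:
    """Queues due tasks for the crawler and stores the rest back."""

    def __init__(self, database: List[FailedTask]) -> None:
        self.database = database
        self.task_pool: List[FailedTask] = []

    def process(self, task: FailedTask, next_run: float) -> bool:
        if task.recrawl_time is not None and task.recrawl_time <= next_run:
            self.task_pool.append(task)
            return True
        self.database.append(task)
        return False


db = [FailedTask("http://example.com/a", 1, 1000.0),
      FailedTask("http://example.com/b", 3, 1000.0)]
acquirer = AcquisitionModule(db)
first = FirstProcessingModule(interval=60.0)
second = SecondProcessingModule(db)
for failed in acquirer.acquire():
    second.process(first.determine(failed), next_run=1200.0)
```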
The embodiments of the present application provide a data acquisition apparatus that determines the time at which a failed data crawling task is to be re-crawled according to the number of data crawling failures and/or the data crawling failure time, so that the re-crawling time can be adjusted for each failed task. This avoids missing data that was unobtainable because of, for example, a temporary website outage, and so ensures the reliability of data acquisition. It also prevents the web crawler from repeatedly crawling such data within a short time, along with the resulting burden on network resources and on the system resources of the apparatus, which further improves the reliability and the efficiency of data acquisition. In addition, the apparatus judges whether the re-crawling time meets the time at which the web crawler is about to execute and, if so, sets the failed task as the task that the web crawler is about to execute. This further limits repeated crawling within a short time, and it ensures that data that was unobtainable because of a temporary website outage is acquired promptly once the outage is resolved.
Embodiment 4 is a data acquisition apparatus 4 provided by an embodiment of the present application. Referring to Fig. 4, the data acquisition apparatus 4 includes a memory 41 and a processor 42 connected to the memory 41. The memory 41 is configured to store a set of program code, and the processor 42 calls the program code stored in the memory 41 to perform the following operations:
Acquire a failed data crawling task, wherein the data crawling task includes at least: the number of data crawling failures and the data crawling failure time;
Determine, according to the number of data crawling failures and/or the data crawling failure time, the data re-crawling time of the failed data crawling task;
Execute, according to the data re-crawling time, a data re-crawling task for the failed data crawling task.
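The record that these operations act on can be sketched as follows. This is a minimal illustration only: the class name `FailedCrawlTask`, the `url` field, and the helper `record_failure` are hypothetical and do not appear in the application, which only requires that the task carry a failure count and a failure time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FailedCrawlTask:
    """Hypothetical record for a failed data crawling task."""
    url: str              # crawl target (assumed field, not named in the source)
    failure_count: int    # number of data crawling failures
    failed_at: datetime   # time of the most recent data crawling failure


def record_failure(task: FailedCrawlTask, now: datetime) -> FailedCrawlTask:
    """Return a new record reflecting one more failed crawl attempt at `now`."""
    return FailedCrawlTask(task.url, task.failure_count + 1, now)
```

Keeping the record immutable means each failed attempt produces a new value, so the previous failure count and time stay available for logging or comparison.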
Optionally, the processor 42 calls the program code stored in the memory 41 to perform the following operation:
Store the number of data crawling failures and the data crawling failure time in a preset database.
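One way to picture the storage operation is with an embedded SQLite database. The application does not say which database is used, so the choice of `sqlite3`, the table name `failed_tasks`, its columns, and the function names below are all assumptions made for illustration.

```python
import sqlite3
from datetime import datetime


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (assumed) preset database and create the table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS failed_tasks ("
        "url TEXT PRIMARY KEY, "
        "failure_count INTEGER NOT NULL, "
        "failed_at TEXT NOT NULL)"
    )
    return conn


def save_failure(conn: sqlite3.Connection, url: str,
                 failure_count: int, failed_at: datetime) -> None:
    """Store the failure count and failure time for one crawl target."""
    conn.execute(
        "INSERT OR REPLACE INTO failed_tasks (url, failure_count, failed_at) "
        "VALUES (?, ?, ?)",
        (url, failure_count, failed_at.isoformat()),
    )
    conn.commit()
```

Storing the failure time as an ISO 8601 string keeps the rows sortable by time with a plain text comparison.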
Optionally, the processor 42 calls the program code stored in the memory 41 to perform the following operation:
Generate the data re-crawling time of the failed data crawling task according to a preset time interval, the number of data crawling failures, and/or the data crawling failure time.
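This passage does not fix a formula for combining the three inputs. The sketch below assumes one simple possibility, a linear back-off in which the wait grows with each failure: re-crawl time = failure time + preset interval × failure count. The function name `next_crawl_time` and the formula itself are illustrative assumptions, not the method stated in the application.

```python
from datetime import datetime, timedelta


def next_crawl_time(failed_at: datetime, failure_count: int,
                    interval: timedelta) -> datetime:
    """Assumed linear back-off: wait `interval` once per recorded failure."""
    if failure_count < 1:
        raise ValueError("a failed task must have at least one failure")
    return failed_at + interval * failure_count


# Example: third failure at 12:00 with a 10-minute preset interval.
retry_at = next_crawl_time(datetime(2015, 8, 11, 12, 0), 3, timedelta(minutes=10))
```

Under this assumption a site that keeps failing is retried less and less often, which matches the stated goal of not crawling an unreachable site repeatedly within a short time.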
Optionally, the processor 42 calls the program code stored in the memory 41 to perform the following operations:
Judge whether the data re-crawling time meets the time at which the web crawler is currently about to execute;
If it does, set the failed data crawling task as the task that the web crawler is currently about to execute.
Optionally, the processor 42 calls the program code stored in the memory 41 to perform the following operation:
Store the failed data crawling task in the preset database to await the next judgment of the data re-crawling time.
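The judgment and the two outcomes above can be sketched as a single pass over the pending failed tasks. The sketch assumes that a re-crawling time "meets" the crawler's current execution time when it is not later than that time; the function name `split_due` and the `(url, retry_at)` tuple layout are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

PendingTask = Tuple[str, datetime]  # (url, data re-crawling time), assumed layout


def split_due(tasks: List[PendingTask],
              now: datetime) -> Tuple[List[str], List[PendingTask]]:
    """Separate tasks that are due now from tasks that must keep waiting.

    Returns (urls to hand to the crawler now, tasks to store back in the
    database for the next judgment).
    """
    run_now = [url for url, retry_at in tasks if retry_at <= now]
    parked = [(url, retry_at) for url, retry_at in tasks if retry_at > now]
    return run_now, parked
```

A scheduler would call such a function each time the crawler is about to run, dispatch the first list, and write the second list back to the database unchanged.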
This embodiment of the present application provides a data acquisition apparatus. The apparatus determines the data re-crawling time of a failed data crawling task according to the number of data crawling failures and/or the data crawling failure time, so that the data re-crawling time of the failed task can be adjusted according to those values. This avoids omitting data that could not be obtained because of a temporary website outage or a similar cause, which ensures the reliability of data acquisition. It also prevents the web crawler from repeatedly crawling, within a short period, data that cannot be obtained for such reasons, and so avoids the resulting burden on network resources and on the system resources of the apparatus, which further improves the reliability and efficiency of data acquisition. In addition, the apparatus judges whether the data re-crawling time meets the time at which the web crawler is currently about to execute and, if it does, sets the failed data crawling task as the task the web crawler is currently about to execute. This not only further avoids repeated short-interval crawling of unobtainable data and the associated network and system resource burden, but also ensures that data which could not be obtained because of a temporary website outage is obtained promptly once the outage is resolved, further improving the reliability and efficiency of data acquisition.
All of the optional technical solutions above may be combined in any manner to form optional embodiments of the present application, which are not described one by one here.
It should be noted that, when the apparatus provided in the above embodiments performs the data acquisition method, the division into the functional modules described above is used only as an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the data acquisition apparatus and the data acquisition method provided in the above embodiments belong to the same concept. For the specific implementation process, refer to the method embodiments; it is not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are merely preferred embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (10)

1. A data acquisition method, characterized in that the method comprises:
acquiring a failed data crawling task, wherein the data crawling task includes at least: the number of data crawling failures and the data crawling failure time;
determining, according to the number of data crawling failures and/or the data crawling failure time, the data re-crawling time of the failed data crawling task;
executing, according to the data re-crawling time, a data re-crawling task for the failed data crawling task.
2. The method according to claim 1, characterized in that acquiring the failed data crawling task comprises:
for the acquired failed data crawling task, acquiring the number of data crawling failures and the data crawling failure time;
storing the number of data crawling failures and the data crawling failure time in a preset database.
3. The method according to claim 2, characterized in that determining, according to the number of data crawling failures and/or the data crawling failure time, the data re-crawling time of the failed data crawling task comprises:
determining a data crawling time interval;
generating the data re-crawling time of the failed data crawling task according to the data crawling time interval, the number of data crawling failures, and/or the data crawling failure time.
4. The method according to claim 3, characterized in that executing, according to the data re-crawling time, the data re-crawling task for the failed data crawling task comprises:
judging whether the data re-crawling time meets the time at which a web crawler is currently about to execute;
if it does, setting the failed data crawling task as the task that the web crawler is currently about to execute.
5. The method according to claim 4, characterized in that, when the data re-crawling time does not meet the time at which the web crawler is currently about to execute, the method further comprises:
storing the failed data crawling task in the preset database to await the next judgment of the data re-crawling time.
6. A data acquisition apparatus, characterized in that it comprises:
an acquisition module, configured to acquire a failed data crawling task, wherein the data crawling task includes at least: the number of data crawling failures and the data crawling failure time;
a first processing module, configured to determine, according to the number of data crawling failures and/or the data crawling failure time, the data re-crawling time of the failed data crawling task;
a second processing module, configured to execute, according to the data re-crawling time, a data re-crawling task for the failed data crawling task.
7. The apparatus according to claim 6, characterized in that the acquisition module is configured to:
for the acquired failed data crawling task, acquire the number of data crawling failures and the data crawling failure time;
store the number of data crawling failures and the data crawling failure time in a preset database.
8. The apparatus according to claim 7, characterized in that the first processing module comprises:
a determination submodule, configured to determine a data crawling time interval;
a generation submodule, configured to generate the data re-crawling time of the failed data crawling task according to the data crawling time interval, the number of data crawling failures, and/or the data crawling failure time.
9. The apparatus according to claim 8, characterized in that the second processing module is specifically configured to:
judge whether the data re-crawling time meets the time at which a web crawler is currently about to execute;
if it does, set the failed data crawling task as the task that the web crawler is currently about to execute.
10. The apparatus according to claim 9, characterized in that the second processing module is further specifically configured to:
when the data re-crawling time does not meet the time at which the web crawler is currently about to execute, store the failed data crawling task in the preset database to await the next judgment of the data re-crawling time.
CN201510489158.4A 2015-08-11 2015-08-11 Data acquisition method and apparatus Pending CN106445966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510489158.4A CN106445966A (en) 2015-08-11 2015-08-11 Data acquisition method and apparatus

Publications (1)

Publication Number Publication Date
CN106445966A true CN106445966A (en) 2017-02-22

Family

ID=58092802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510489158.4A Pending CN106445966A (en) 2015-08-11 2015-08-11 Data acquisition method and apparatus

Country Status (1)

Country Link
CN (1) CN106445966A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170222