CN106445966A - Data acquisition method and apparatus - Google Patents
- Publication number
- CN106445966A CN201510489158.4A
- Authority
- CN
- China
- Prior art keywords
- data
- crawls
- failure
- time
- task
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
Embodiments of the invention provide a data acquisition method and apparatus, and belong to the field of networks. The method comprises: acquiring a failed data crawling task, wherein the data crawling task contains at least the number of data crawling failures and the data crawling failure time; determining a data re-crawling time for the failed data crawling task according to the number of data crawling failures and/or the data crawling failure time; and re-executing the failed data crawling task according to the data re-crawling time. Determining the data re-crawling time from the number of data crawling failures and/or the data crawling failure time improves both the reliability and the efficiency of data acquisition.
Description
Technical field
The present application relates to the field of networks, and in particular to a data acquisition method and apparatus.
Background
When a web crawler is used to acquire data, the data often cannot be obtained because of a network failure, a temporary website outage, an invalid URL (Uniform Resource Locator), or similar conditions. A data acquisition method is therefore needed that can still acquire the data when the network fails, the website is temporarily down, or the URL is invalid.
In the prior art, the page URL is recorded and stored in a queue, and the crawl is retried once every fixed interval. When the number of retries reaches a threshold and the data still cannot be obtained, the URL is judged to be invalid and acquisition of that data stops.
The prior-art method, however, offers no suitable choice of interval. If the interval is too long, URLs whose data could not be obtained because of a network failure pile up in the queue and increase the memory burden. If the interval is too short, data that cannot be obtained because of an invalid URL or a temporary website outage is retried frequently, which increases the load on the server and the network. Moreover, because the recovery time of a temporarily unavailable website cannot be predicted, a URL may reach the retry limit and be discarded before the website recovers, so data that was only temporarily unavailable is lost. The prior-art method therefore reduces both the reliability and the efficiency of data acquisition.
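The trade-off described above can be illustrated with a small simulation. This is a minimal sketch and not part of the original disclosure: the function name `fixed_interval_retry`, the hypothetical site that recovers at time 100, and the interval values are all assumptions chosen for illustration.

```python
def fixed_interval_retry(is_available, interval, max_retries, first_failure=0):
    """Prior-art style retry: one attempt every `interval`, at most `max_retries` times.

    Returns the time of the first successful attempt, or None if the URL
    is discarded after reaching the retry threshold.
    """
    t = first_failure
    for _ in range(max_retries):
        t += interval
        if is_available(t):
            return t
    return None  # threshold reached: URL judged invalid and dropped


# Hypothetical website that recovers at time 100 (assumption for illustration).
def site_is_up(t):
    return t >= 100


dropped = fixed_interval_retry(site_is_up, interval=10, max_retries=5)
recovered = fixed_interval_retry(site_is_up, interval=60, max_retries=5)
```

With the short interval, all five retries are spent before the site recovers and the URL is dropped. With the long interval the data is eventually obtained, but the task waits in the queue past the recovery time.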
Summary of the invention
To improve the reliability and efficiency of data acquisition, the embodiments of the present application provide a data acquisition method and apparatus. The technical solution is as follows:
The present application provides a data acquisition method, the method comprising:
acquiring a failed data crawling task, wherein the data crawling task includes at least the number of data crawling failures and the data crawling failure time;
determining, according to the number of data crawling failures and/or the data crawling failure time, the data re-crawling time of the failed data crawling task; and
re-executing the failed data crawling task according to the data re-crawling time.
The present application also provides a data acquisition apparatus, the apparatus comprising:
an acquisition module, configured to acquire a failed data crawling task, wherein the data crawling task includes at least the number of data crawling failures and the data crawling failure time;
a first processing module, configured to determine, according to the number of data crawling failures and/or the data crawling failure time, the data re-crawling time of the failed data crawling task; and
a second processing module, configured to re-execute the failed data crawling task according to the data re-crawling time.
The embodiments of the present application provide a data acquisition method and apparatus, including: acquiring a failed data crawling task, wherein the data crawling task includes at least the number of data crawling failures and the data crawling failure time; determining the data re-crawling time of the failed data crawling task according to the number of data crawling failures and/or the data crawling failure time; and re-executing the failed data crawling task according to the data re-crawling time. Because the data re-crawling time is determined from the number of data crawling failures and/or the data crawling failure time, it can be adjusted according to those values. This avoids losing data that could not be obtained for reasons such as a temporary website outage, which ensures the reliability of data acquisition. It also avoids having the web crawler repeatedly crawl such data within a short period, together with the resulting burden on network resources and on the system resources of the apparatus, which further improves the reliability and efficiency of data acquisition.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a data acquisition method provided by an embodiment of the present application;
Fig. 2 is a flow chart of a data acquisition method provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a data acquisition apparatus provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a data acquisition apparatus provided by an embodiment of the present application.
Detailed description
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the present application without creative effort fall within the scope of protection of the present application.
An embodiment of the present application provides a data acquisition method. Its application scenarios include a web crawler acquiring web page data by executing a crawling task, where the crawling task includes at least the web page identifier of the page on which the data to be acquired is located. The method can also be applied to other scenarios in which web page data is acquired according to web page identifier information. The web page identifier may be the URL (Uniform Resource Locator) of the page. The embodiments of the present application do not limit the specific application scenario.
Embodiment one is a data acquisition method provided by an embodiment of the present application. Referring to Fig. 1, the method includes:
101. Acquire a failed data crawling task.
The data crawling task includes at least the number of data crawling failures and the data crawling failure time. The failed data crawling task may be acquired by real-time monitoring, or from a preset database that stores failed data crawling tasks.
In the first case, when a data crawling task is monitored to have failed, the failure time of the task can be obtained (this time may be the time at which the data crawl first failed, or the time at which the executing entity detected the failure by scanning; the present application places no limitation on this), together with the number of times the executing entity is configured to re-execute the crawling task when a data crawl fails. Then, according to the requirements of the actual crawl target, the number of data crawling failures and the data crawling failure time are stored in the preset database. The crawl target may be, for example, crawling website visit counts or crawling microblog click counts.
Note that if the system has not configured a number of times to re-execute the crawling task when a data crawl fails, the number of data crawling failures may default to zero.
In the second case, for a system (that is, an executing entity) with low crawling efficiency, the failed data crawling task of step 101 may instead be obtained from the preset database that stores failed data crawling tasks, which improves the efficiency with which the executing entity executes data crawling tasks.
Specifically, when the executing entity executes a data crawling task obtained from the task pool and the crawl fails, the failed data crawling task is stored in the preset database. The failed data crawling task includes information such as the number of data crawling failures, the data crawling failure time, and the environment in which the crawl failed.
The preset database may be set up in advance on a server, in the cache of the apparatus, or in a database of the apparatus. The embodiments of the present application do not limit the location of the preset database.
The at least one failed data crawling task contained in the preset database may be sorted by the time at which it was added.
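The record kept for each failed task can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `FailedCrawlTask` and `record_failure`, the `added_at` field, the example URLs, and the use of an in-memory dict as the preset database are all hypothetical. Incrementing the count on every failure is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class FailedCrawlTask:
    url: str                    # web page identifier of the task
    failure_count: int = 0      # number of data crawling failures
    failure_time: float = 0.0   # data crawling failure time
    added_at: float = 0.0       # time the task was first added to the database


def record_failure(database, url, now):
    """Store or update a failed task in `database` (a dict keyed by URL)."""
    task = database.get(url)
    if task is None:
        task = FailedCrawlTask(url=url, added_at=now)
        database[url] = task
    task.failure_count += 1
    task.failure_time = now
    return task


db = {}
record_failure(db, "http://example.com/a", now=10.0)
record_failure(db, "http://example.com/b", now=12.0)
record_failure(db, "http://example.com/a", now=40.0)
```

Because a Python dict preserves insertion order, iterating over `db` visits the tasks in the order in which they were first added.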
102. Determine the data re-crawling time of the failed data crawling task according to the number of data crawling failures and/or the data crawling failure time.
Specifically, the time interval for data crawling is first determined from the number of data crawling failures or the data crawling failure time, according to the crawl target of the data crawling task being executed. The crawl target here may be, for example, crawling website visit counts or crawling microblog click counts; the present application places no limitation on this. Then the data re-crawling time of the failed data crawling task is generated from the data crawling time interval together with the number of data crawling failures and/or the data crawling failure time.
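One way to read the first part of step 102 is as a lookup of a base interval per crawl target. The sketch below assumes this reading; the target names, the interval values in seconds, and the function names `crawl_interval` and `re_crawl_time` are hypothetical and do not come from the source.

```python
# Hypothetical base intervals in seconds for each crawl target (assumed values).
CRAWL_INTERVALS = {
    "website_visits": 3600,
    "microblog_clicks": 300,
}
DEFAULT_INTERVAL = 600  # assumed fallback for targets not listed above


def crawl_interval(target):
    """Return the data crawling time interval for a crawl target."""
    return CRAWL_INTERVALS.get(target, DEFAULT_INTERVAL)


def re_crawl_time(target, failure_time):
    """Simplest combination (an assumption): retry one interval after the failure."""
    return failure_time + crawl_interval(target)
```

A fast-changing target such as click counts gets a short interval, while a slowly changing one can wait longer before being retried.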
103. Re-execute the failed data crawling task according to the data re-crawling time.
Specifically, it can be judged whether the data re-crawling time meets the time that the web crawler is currently about to execute. If it does, the failed data crawling task is set as the task that the web crawler is currently about to execute, so that the failed data crawling task is re-executed. The time that the web crawler is currently about to execute can be set according to the environment of the data crawl to be performed; the present application places no limitation on this.
Further, when the data re-crawling time does not meet the time that the web crawler is currently about to execute, step 103 may also include: storing the failed data crawling task in the preset database to await the next judgment of the data re-crawling time.
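The decision in step 103 can be sketched as a single function. The source does not define what it means for the re-crawling time to "meet" the crawler's execution time; the sketch assumes it means the re-crawling time is not later than the current time. The name `dispatch` and the use of plain lists for the task pool and the preset database are hypothetical.

```python
def dispatch(task, re_crawl_time, now, task_pool, database):
    """Run the task if its re-crawling time has arrived, otherwise park it.

    Assumption: "meets" is read as re_crawl_time <= now.
    Returns True if the task was handed to the crawler.
    """
    if re_crawl_time <= now:
        task_pool.append(task)   # becomes the crawler's next task
        return True
    database.append(task)        # wait for the next judgment
    return False


pool, parked = [], []
ran_a = dispatch("task-a", re_crawl_time=50, now=60, task_pool=pool, database=parked)
ran_b = dispatch("task-b", re_crawl_time=90, now=60, task_pool=pool, database=parked)
```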
This embodiment of the present application provides a data acquisition method. Because the data re-crawling time of a failed data crawling task is determined from the number of data crawling failures and/or the data crawling failure time, the re-crawling time can be adjusted according to those values. This avoids losing data that could not be obtained for reasons such as a temporary website outage, which ensures the reliability of data acquisition. It also avoids having the web crawler repeatedly crawl such data within a short period, together with the resulting burden on network resources and on the system resources of the apparatus, which improves the efficiency of data acquisition. In addition, the method judges whether the data re-crawling time meets the time that the web crawler is currently about to execute and, if so, sets the failed data crawling task as the task that the web crawler is currently about to execute. This further avoids repeated crawling within a short period and ensures that data which could not be obtained because of a temporary website outage is acquired promptly once the outage is resolved, further improving the reliability and efficiency of data acquisition.
Embodiment two is a data acquisition method provided by an embodiment of the present application. Referring to Fig. 2, the method includes:
201. Acquire a failed data crawling task, wherein the data crawling task includes at least the number of data crawling failures and the data crawling failure time.
Specifically, after a preset period, the failed data crawling task is acquired from the preset database.
The process of acquiring the failed data crawling task from the preset database may be:
After the preset period, the failed data crawling tasks contained in the preset database are scanned and, following the order in which they were added to the preset database, the failed data crawling task with the earliest addition time is acquired from the preset database.
Acquiring failed data crawling tasks in the order in which they were added to the preset database means that the tasks that have waited longest for a retry are crawled first. This ensures that when data that could not be obtained for reasons such as a temporary website outage becomes obtainable again, it is acquired promptly, which further improves the reliability and efficiency of data acquisition.
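Earliest-added-first retrieval can be sketched with a priority queue. This is an illustrative sketch only: the class name `FailedTaskStore` and its methods are hypothetical, and a heap is one possible choice of structure that the source does not prescribe.

```python
import heapq


class FailedTaskStore:
    """Preset database sketch: hands out failed tasks in order of addition time."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal times keep insertion order

    def add(self, added_at, url):
        heapq.heappush(self._heap, (added_at, self._seq, url))
        self._seq += 1

    def pop_earliest(self):
        """Return (added_at, url) of the earliest-added task, or None if empty."""
        if not self._heap:
            return None
        added_at, _, url = heapq.heappop(self._heap)
        return added_at, url


store = FailedTaskStore()
store.add(30, "c")
store.add(10, "a")
store.add(20, "b")
order = [store.pop_earliest() for _ in range(4)]
```

Whatever order the tasks are inserted in, `pop_earliest` returns the one that has waited longest, and returns `None` once the store is empty.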
202. Determine the data re-crawling time of the failed data crawling task according to the number of data crawling failures and/or the data crawling failure time.
Specifically, the data re-crawling time of the failed data crawling task is generated from the determined data crawling time interval together with the number of data crawling failures and/or the data crawling failure time.
To further illustrate the method of this embodiment, assume that the number of data crawling failures is $M$, the data crawling time interval is $N$, the data crawling failure time is $T_F$, and the data re-crawling time of the failed data crawling task is $T_S$.
The data re-crawling time can be generated from the data crawling time interval and the number of data crawling failures according to a first preset function $T_{S1}$, which may be as shown in formula [1]:
$T_{S1} = N \times a^{M}$ [1]
It can also be generated from the data crawling time interval and the data crawling failure time according to a second preset function $T_{S2}$, which may be as shown in formula [2]:
$T_{S2} = T_F + N \times a$ [2]
In addition, it can be generated from the data crawling time interval, the data crawling failure time and the number of data crawling failures according to a third preset function $T_{S3}$, which may be as shown in formula [3]:
$T_{S3} = T_F + N \times a^{M}$ [3]
Here $a$ in formulas [1] to [3] is a preset coefficient, and $a$ is greater than 1.
Note that the first, second and third preset functions are only examples. Other preset functions that are non-decreasing may also be used; the embodiments of the present application do not limit the specific preset function.
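The three preset functions translate directly into code. The sketch below reads the term in formulas [1] and [3] as a raised to the power M, which is an assumption about the original typesetting. The function names and the sample values (N = 10, a = 2, T_F = 1000) are illustrative only.

```python
def ts1(n, a, m):
    """Formula [1]: T_S1 = N * a**M."""
    return n * a ** m


def ts2(tf, n, a):
    """Formula [2]: T_S2 = T_F + N * a."""
    return tf + n * a


def ts3(tf, n, a, m):
    """Formula [3]: T_S3 = T_F + N * a**M."""
    return tf + n * a ** m


# Illustrative values only: N = 10, a = 2, T_F = 1000, M = 1..4.
schedule = [ts3(1000, 10, 2, m) for m in range(1, 5)]
```

Under this reading, each additional failure pushes the re-crawling time further from the failure time, and with a greater than 1 the function is non-decreasing in M, as the text requires.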
Generating the data re-crawling time from the determined data crawling time interval together with the number of data crawling failures and/or the data crawling failure time ensures that a failed data crawling task with more failures is executed later than one with fewer failures. The more often a task has failed, the lower its probability of executing successfully, so tasks with a higher probability of success are executed first. This ensures timely acquisition of data and further improves the reliability and efficiency of data acquisition. In addition, generating the data re-crawling time from the data crawling failure time ensures that when data that could not be obtained for reasons such as a temporary website outage becomes obtainable again, it is acquired promptly. Generating the data re-crawling time from the number of data crawling failures also avoids having the web crawler repeatedly crawl such data within a short period, together with the resulting burden on network resources and on the system resources of the apparatus, which further improves the reliability and efficiency of data acquisition.
203. Judge whether the data re-crawling time meets the time that the web crawler is currently about to execute. If it does, execute step 204; if not, execute step 205.
The time that the web crawler is currently about to execute may be set to an integral multiple of the data crawling time interval.
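If the crawler's execution times fall on integral multiples of the data crawling time interval, the judgment in step 203 could be sketched as below. Both function names are hypothetical, integer timestamps are assumed, and reading "meets" as "not later than the next execution time" is an assumption rather than something the source states.

```python
def next_execution_time(now, interval):
    """Smallest integral multiple of `interval` that is >= `now` (integer inputs)."""
    return -(-now // interval) * interval  # ceiling division on integers


def is_due(re_crawl_time, now, interval):
    """Assumed reading of 'meets': due if not later than the next execution time."""
    return re_crawl_time <= next_execution_time(now, interval)
```

For example, with an interval of 30 and a current time of 95, the next execution time is 120, so a task whose re-crawling time is 110 would be picked up and one whose re-crawling time is 130 would wait.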
Judging whether the data re-crawling time is less than the time that the web crawler is currently about to execute avoids having the web crawler repeatedly crawl, within a short period, data that cannot be obtained for reasons such as a temporary website outage, together with the resulting burden on network resources and on the system resources of the apparatus. This further improves the reliability and efficiency of data acquisition.
204. Set the failed data crawling task as the task that the web crawler is currently about to execute.
Specifically, the failed data crawling task is added to the task pool of the web crawler.
Setting the failed data crawling task as the task that the web crawler is currently about to execute ensures that the failed task is re-crawled promptly, which in turn ensures timely acquisition of the data and further improves the reliability and efficiency of data acquisition.
Note that steps 203 and 204 implement the process of re-executing the failed data crawling task according to the data re-crawling time. This process can also be implemented in other ways; the embodiments of the present application do not limit the specific way.
Because the data re-crawling time is generated from the number of data crawling failures and/or the data crawling failure time, a failed data crawling task with more failures is executed later than one with fewer failures. The more often a task has failed, the lower its probability of executing successfully, so tasks with a higher probability of success are executed first. This ensures timely acquisition of data and further improves the reliability and efficiency of data acquisition.
205. Store the failed data crawling task in the preset database to await the next judgment of the data re-crawling time, and acquire from the preset database the next failed data crawling task following this one.
Storing the failed data crawling task in the preset database when the data re-crawling time does not meet the time that the web crawler is currently about to execute avoids having the web crawler repeatedly crawl, within a short period, data that cannot be obtained for reasons such as a temporary website outage, together with the resulting burden on network resources and on the system resources of the apparatus. This further improves the reliability and efficiency of data acquisition.
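Steps 201 to 205 can be combined into one scan over the failed tasks. The sketch below is an illustration under stated assumptions rather than the patented implementation: the function name `retry_pass` and the dict keys are hypothetical, the re-crawling time uses formula [3] with the term read as a raised to the power M, and "meets" is read as the re-crawling time being not later than the current time.

```python
def retry_pass(failed_tasks, now, interval, a=2):
    """One scan over failed tasks.

    Each task is a dict with keys 'url', 'failures', 'failed_at', 'added_at'
    (hypothetical field names). Returns (task_pool, still_waiting).
    """
    task_pool, still_waiting = [], []
    # Step 201: visit tasks in order of addition time, earliest first.
    for task in sorted(failed_tasks, key=lambda t: t["added_at"]):
        # Step 202: formula [3], T_S3 = T_F + N * a**M.
        re_crawl_time = task["failed_at"] + interval * a ** task["failures"]
        # Step 203: assumed reading of "meets": the re-crawling time has arrived.
        if re_crawl_time <= now:
            task_pool.append(task["url"])      # step 204: hand to the crawler
        else:
            still_waiting.append(task["url"])  # step 205: keep in the database
    return task_pool, still_waiting


tasks = [
    {"url": "B", "failures": 3, "failed_at": 5, "added_at": 1},
    {"url": "A", "failures": 1, "failed_at": 0, "added_at": 0},
    {"url": "C", "failures": 2, "failed_at": 2, "added_at": 2},
]
pool, waiting = retry_pass(tasks, now=50, interval=10)
```

In this example task B has failed three times, so its re-crawling time (85) is still in the future and it stays in the database, while A and C are handed to the crawler.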
Executing the method of this embodiment not only avoids having the web crawler repeatedly crawl, within a short period, data that cannot be obtained for reasons such as a temporary website outage, together with the resulting burden on network resources and on the system resources of the apparatus. It also ensures that the data is acquired promptly once the website recovers to a state in which the data can be obtained, which improves the reliability and efficiency of data acquisition.
This embodiment of the present application provides a data acquisition method. The second task time, at which a crawling task awaiting retry will be retried, is generated from the number of retries of the task and the first task time of its last retry, so the retry time of the task can be adjusted according to those values. This avoids losing data that could not be obtained for reasons such as a temporary website outage, which ensures the reliability of data acquisition. Generating the second task time in this way also avoids having the web crawler repeatedly crawl such data within a short period, together with the resulting burden on network resources and on the system resources of the apparatus, which further improves the reliability and efficiency of data acquisition. In addition, it is judged whether the second task time meets a preset condition. If it does, the crawling task awaiting retry is set as the task that the web crawler is currently about to execute; otherwise, the next crawling task awaiting retry is acquired and the current one is not set as the task that the web crawler is currently about to execute. This further avoids repeated crawling within a short period and ensures that data which could not be obtained because of a temporary website outage is acquired promptly once the outage is resolved, further improving the reliability and efficiency of data acquisition.
Embodiment three is a data acquisition apparatus provided by an embodiment of the present application. Referring to Fig. 3, the data acquisition apparatus includes:
an acquisition module 31, configured to acquire a failed data crawling task, wherein the data crawling task includes at least the number of data crawling failures and the data crawling failure time;
a first processing module 32, configured to determine, according to the number of data crawling failures and/or the data crawling failure time, the data re-crawling time of the failed data crawling task; and
a second processing module 33, configured to re-execute the failed data crawling task according to the data re-crawling time.
The acquisition module is further configured to: for the acquired failed data crawling task, obtain the number of data crawling failures and the data crawling failure time, and store them in the preset database.
Optionally, the first processing module 32 may include:
a determination submodule, configured to determine the data crawling time interval; and
a generation submodule, configured to generate the data re-crawling time of the failed data crawling task from the determined data crawling time interval together with the number of data crawling failures and/or the data crawling failure time.
Optionally, the second processing module 33 is specifically configured to:
judge whether the data re-crawling time meets the time that the web crawler is currently about to execute; and
if it does, set the failed data crawling task as the task that the web crawler is currently about to execute.
Optionally, the second processing module 33 is further configured to:
store the failed data crawling task in the preset database to await the next judgment of the data re-crawling time.
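The division of work among modules 31, 32 and 33 can be sketched as three small classes. This is a hypothetical illustration of the architecture only: the class and method names, the tuple layout `(failures, failed_at, url)`, the use of formula [3] with the term read as a raised to the power M, and the reading of "meets" as "not later than the current time" are all assumptions.

```python
class AcquisitionModule:
    """Module 31: hands out failed tasks from the preset database."""

    def __init__(self, database):
        self.database = database  # list of (failures, failed_at, url)

    def acquire(self):
        return self.database.pop(0) if self.database else None


class FirstProcessingModule:
    """Module 32: determines the data re-crawling time."""

    def __init__(self, interval, a=2):
        self.interval, self.a = interval, a

    def re_crawl_time(self, failures, failed_at):
        return failed_at + self.interval * self.a ** failures


class SecondProcessingModule:
    """Module 33: re-executes the task or returns it to the database."""

    def __init__(self, task_pool, database):
        self.task_pool, self.database = task_pool, database

    def handle(self, task, re_crawl_time, now):
        if re_crawl_time <= now:
            self.task_pool.append(task)
        else:
            self.database.append(task)


database = [(1, 0, "u1"), (4, 0, "u2")]
task_pool = []
acquisition = AcquisitionModule(database)
first = FirstProcessingModule(interval=10)
second = SecondProcessingModule(task_pool, database)
for _ in range(len(database)):  # one pass over the tasks present at the start
    task = acquisition.acquire()
    failures, failed_at, _url = task
    second.handle(task, first.re_crawl_time(failures, failed_at), now=50)
```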
This embodiment of the present application provides a data acquisition apparatus. Because the apparatus determines the data re-crawling time of a failed data crawling task from the number of data crawling failures and/or the data crawling failure time, the re-crawling time can be adjusted according to those values. This avoids losing data that could not be obtained for reasons such as a temporary website outage, which ensures the reliability of data acquisition. It also avoids having the web crawler repeatedly crawl such data within a short period, together with the resulting burden on network resources and on the system resources of the apparatus, which improves the efficiency of data acquisition. In addition, the apparatus judges whether the data re-crawling time meets the time that the web crawler is currently about to execute and, if so, sets the failed data crawling task as the task that the web crawler is currently about to execute. This further avoids repeated crawling within a short period and ensures that data which could not be obtained because of a temporary website outage is acquired promptly once the outage is resolved, further improving the reliability and efficiency of data acquisition.
A kind of data acquisition facility 4 that example IV position the embodiment of the present application provides, shown in reference Fig. 4, this number
Include memorizer 41 and the processor 42 being connected with memorizer 41 according to acquisition device 4, memorizer 41 is used for depositing
Storage batch processing code, processor 42 calls the program code that memorizer 41 is stored to be used for executing following behaviour
Make:
Acquire a failed data crawling task, wherein the data crawling task includes at least: the
number of data crawling failures and the time of data crawling failure;
Determine, according to the number of data crawling failures and/or the time of data crawling
failure, the time at which the failed data crawling task re-crawls the data;
Execute, according to the time of re-crawling the data, a data re-crawling task for the failed
data crawling task.
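The three operations above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `FailedCrawlTask` record, the `handle_failed_task` function, and the injected `determine_retry_time` and `recrawl` callables are hypothetical names, and updating the failure count and failure time after an unsuccessful retry is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FailedCrawlTask:
    """Hypothetical record for a failed data crawling task."""
    url: str
    failure_count: int    # number of data crawling failures so far
    failure_time: float   # time of the latest failure, in epoch seconds


def handle_failed_task(
    task: FailedCrawlTask,
    determine_retry_time: Callable[[FailedCrawlTask], float],
    recrawl: Callable[[str], bool],
    now: float,
) -> Optional[bool]:
    """Run the three operations for one failed task.

    Returns None when the re-crawl time has not arrived yet, otherwise
    the success flag reported by recrawl().
    """
    # Operation 1: the failed task, with its count and time, is passed in.
    # Operation 2: determine the time at which to crawl the data again.
    retry_time = determine_retry_time(task)
    # Operation 3: execute the re-crawl according to that time.
    if now < retry_time:
        return None
    succeeded = recrawl(task.url)
    if not succeeded:
        # Assumption: record the new failure so the next retry time shifts.
        task.failure_count += 1
        task.failure_time = now
    return succeeded
```

Passing the retry rule in as a callable keeps the sketch neutral about how the re-crawl time is derived from the failure count and failure time.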
Optionally, the processor 42 calls the program code stored in the memory 41 to perform the
following operation:
Store the number of data crawling failures and the time of data crawling failure in a preset
database.
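As an illustration of storing these two values, the sketch below uses Python's standard `sqlite3` module as a stand-in for the preset database. The table name `failed_tasks`, its columns, and the helper functions are assumptions made for this example; the source does not specify a database or schema.

```python
import sqlite3
from typing import Optional, Tuple


def open_failure_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hypothetical store for failed crawl tasks."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS failed_tasks ("
        " url TEXT PRIMARY KEY,"
        " failure_count INTEGER NOT NULL,"
        " failure_time REAL NOT NULL)"
    )
    return conn


def record_failure(conn: sqlite3.Connection, url: str, failure_time: float) -> None:
    """Increment the failure count and update the failure time for a URL."""
    cur = conn.execute(
        "UPDATE failed_tasks"
        " SET failure_count = failure_count + 1, failure_time = ?"
        " WHERE url = ?",
        (failure_time, url),
    )
    if cur.rowcount == 0:
        # First failure for this URL: create the row with a count of 1.
        conn.execute(
            "INSERT INTO failed_tasks (url, failure_count, failure_time)"
            " VALUES (?, 1, ?)",
            (url, failure_time),
        )
    conn.commit()


def load_failure(conn: sqlite3.Connection, url: str) -> Optional[Tuple[int, float]]:
    """Return (failure_count, failure_time) for a URL, or None if unknown."""
    return conn.execute(
        "SELECT failure_count, failure_time FROM failed_tasks WHERE url = ?",
        (url,),
    ).fetchone()
```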
Optionally, the processor 42 calls the program code stored in the memory 41 to perform the
following operation:
Generate, according to a preset time interval, the number of data crawling failures and/or the
time of data crawling failure, the time at which the failed data crawling task re-crawls the
data.
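One possible way to generate the re-crawl time from these inputs is sketched below. The linear rule (failure time plus interval multiplied by failure count) and the optional `max_delay` cap are assumptions chosen for illustration; this passage does not state the exact formula.

```python
from typing import Optional


def next_crawl_time(
    failure_time: float,
    failure_count: int,
    interval: float,
    max_delay: Optional[float] = None,
) -> float:
    """Return the time (epoch seconds) at which to crawl the data again.

    Assumed rule: each additional failure pushes the retry back by one
    more preset interval, optionally capped at max_delay seconds.
    """
    if failure_count < 1:
        raise ValueError("failure_count must be at least 1")
    delay = interval * failure_count
    if max_delay is not None:
        delay = min(delay, max_delay)
    return failure_time + delay
```

Under this rule a task that keeps failing is retried less and less often, which matches the stated goal of not crawling an unavailable site repeatedly within a short time.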
Optionally, the processor 42 calls the program code stored in the memory 41 to perform the
following operations:
Judge whether the time of re-crawling the data meets the time at which the web crawler is
currently about to execute;
If so, set the failed data crawling task as the task that the web crawler is currently about to
execute.
Optionally, the processor 42 calls the program code stored in the memory 41 to perform the
following operation:
Store the failed data crawling task in the preset database to await the next judgment of the
time of re-crawling the data.
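The judgment and the deferral described above might look like the following sketch. It assumes that a re-crawl time "meets" the crawler's upcoming execution time when it is not later than that time; the function name, the list used as the crawler's run queue, and the dictionary standing in for the preset database are all hypothetical.

```python
from typing import Dict, List


def schedule_failed_task(
    task_id: str,
    retry_time: float,
    crawler_next_run: float,
    run_queue: List[str],
    pending_store: Dict[str, float],
) -> bool:
    """Queue the task if its re-crawl time is met, otherwise defer it.

    Returns True when the task was set as a task the crawler is about to
    execute, and False when it was stored to await the next judgment.
    """
    if retry_time <= crawler_next_run:
        # The re-crawl time is met: hand the task to the crawler now.
        pending_store.pop(task_id, None)
        run_queue.append(task_id)
        return True
    # Not yet due: keep it in the store until the next check.
    pending_store[task_id] = retry_time
    return False
```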
The embodiments of the present application provide a data acquisition apparatus. The apparatus
determines, according to the number of data crawling failures and/or the time of data crawling
failure, the time at which the failed data crawling task re-crawls the data, so that this
re-crawl time can be adjusted according to the number of failures and/or the failure time. This
avoids missing data that could not be obtained for reasons such as a temporary website outage
and ensures the reliability of data acquisition. At the same time, it prevents the web crawler
from repeatedly crawling, within a short period, data that cannot be obtained for such reasons,
together with the resulting burden on network resources and on the system resources of the
apparatus, thereby further improving the reliability and efficiency of data acquisition. In
addition, it is judged whether the time of re-crawling the data meets the time at which the web
crawler is currently about to execute; if so, the failed data crawling task is set as the task
that the web crawler is currently about to execute. This not only further avoids such repeated
short-term crawling of unobtainable data and the resulting burden on network resources and
system resources, but also ensures that data that could not be obtained because of a temporary
website outage is obtained promptly once the outage is resolved, thereby further improving the
reliability and efficiency of data acquisition.
All of the above optional technical solutions may be combined in any manner to form optional
embodiments of the present application, which are not described here one by one.
It should be noted that when the apparatus provided by the above embodiments executes the data
acquisition method, the division into the functional modules described above is only an example.
In practical applications, the above functions may be allocated to different functional modules
as required; that is, the internal structure of the device may be divided into different
functional modules to complete all or part of the functions described above. In addition, the
data acquisition apparatus provided by the above embodiments and the data acquisition method
embodiments belong to the same concept; for the specific implementation process, refer to the
method embodiments, which are not repeated here.
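To illustrate how the functions can be allocated to separate modules, the sketch below composes an apparatus from three interchangeable callables that correspond loosely to the acquisition module, the first processing module, and the second processing module named in the claims. The class name, the tuple layout of a task, and the `run_once` method are assumptions made for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical task layout: (url, failure_count, failure_time)
FailedTask = Tuple[str, int, float]


class DataAcquisitionApparatus:
    """Apparatus assembled from three replaceable functional modules."""

    def __init__(
        self,
        acquire: Callable[[], List[FailedTask]],
        determine_time: Callable[[FailedTask], float],
        execute: Callable[[FailedTask, float, float], bool],
    ) -> None:
        self.acquire = acquire                # acquisition module
        self.determine_time = determine_time  # first processing module
        self.execute = execute                # second processing module

    def run_once(self, now: float) -> List[str]:
        """Process every failed task once; return URLs that were re-crawled."""
        recrawled = []
        for task in self.acquire():
            retry_time = self.determine_time(task)
            if self.execute(task, retry_time, now):
                recrawled.append(task[0])
        return recrawled
```

Because each module is passed in, the same functions can be split or merged differently without changing the overall flow.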
A person of ordinary skill in the art will understand that all or part of the steps of the above
embodiments may be completed by hardware, or by a program instructing the relevant hardware. The
program may be stored in a computer-readable storage medium, and the storage medium mentioned
above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only preferred embodiments of the present application and are not intended to
limit it. Any modification, equivalent substitution, improvement, and the like made within the
spirit and principles of the present application shall fall within its scope of protection.
Claims (10)
1. A data acquisition method, characterized in that the method comprises:
acquiring a failed data crawling task, wherein the data crawling task includes at least: a
number of data crawling failures and a time of data crawling failure;
determining, according to the number of data crawling failures and/or the time of data crawling
failure, a time at which the failed data crawling task re-crawls the data;
executing, according to the time of re-crawling the data, a data re-crawling task for the failed
data crawling task.
2. The method according to claim 1, characterized in that acquiring the failed data crawling
task comprises:
for the acquired failed data crawling task, acquiring the number of data crawling failures and
the time of data crawling failure;
storing the number of data crawling failures and the time of data crawling failure in a preset
database.
3. The method according to claim 2, characterized in that determining, according to the number
of data crawling failures and/or the time of data crawling failure, the time at which the failed
data crawling task re-crawls the data comprises:
determining a time interval for data crawling;
generating, according to the time interval for data crawling, the number of data crawling
failures and/or the time of data crawling failure, the time at which the failed data crawling
task re-crawls the data.
4. The method according to claim 3, characterized in that executing, according to the time of
re-crawling the data, the data re-crawling task for the failed data crawling task comprises:
judging whether the time of re-crawling the data meets the time at which a web crawler is
currently about to execute;
if so, setting the failed data crawling task as the task that the web crawler is currently
about to execute.
5. The method according to claim 4, characterized in that, when the time of re-crawling the data
does not meet the time at which the web crawler is currently about to execute, the method
further comprises:
storing the failed data crawling task in the preset database to await the next judgment of the
time of re-crawling the data.
6. A data acquisition apparatus, characterized in that it comprises:
an acquisition module, configured to acquire a failed data crawling task, wherein the data
crawling task includes at least: a number of data crawling failures and a time of data crawling
failure;
a first processing module, configured to determine, according to the number of data crawling
failures and/or the time of data crawling failure, a time at which the failed data crawling task
re-crawls the data;
a second processing module, configured to execute, according to the time of re-crawling the
data, a data re-crawling task for the failed data crawling task.
7. The apparatus according to claim 6, characterized in that the acquisition module is
configured to:
for the acquired failed data crawling task, acquire the number of data crawling failures and the
time of data crawling failure;
store the number of data crawling failures and the time of data crawling failure in a preset
database.
8. The apparatus according to claim 7, characterized in that the first processing module
comprises:
a determination submodule, configured to determine a time interval for data crawling;
a generation submodule, configured to generate, according to the time interval for data
crawling, the number of data crawling failures and/or the time of data crawling failure, the
time at which the failed data crawling task re-crawls the data.
9. The apparatus according to claim 8, characterized in that the second processing module is
specifically configured to:
judge whether the time of re-crawling the data meets the time at which a web crawler is
currently about to execute;
if so, set the failed data crawling task as the task that the web crawler is currently about to
execute.
10. The apparatus according to claim 9, characterized in that the second processing module is
further specifically configured to:
when the time of re-crawling the data does not meet the time at which the web crawler is
currently about to execute, store the failed data crawling task in the preset database to await
the next judgment of the time of re-crawling the data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510489158.4A CN106445966A (en) | 2015-08-11 | 2015-08-11 | Data acquisition method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510489158.4A CN106445966A (en) | 2015-08-11 | 2015-08-11 | Data acquisition method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106445966A (en) | 2017-02-22 |
Family
ID=58092802
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510489158.4A Pending CN106445966A (en) | 2015-08-11 | 2015-08-11 | Data acquisition method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106445966A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101187925A (en) * | 2006-11-17 | 2008-05-28 | 北京酷讯科技有限公司 | Automatic optimized crawler grab method |
CN102469132A (en) * | 2010-11-15 | 2012-05-23 | 北大方正集团有限公司 | Method and system for grabbing web pages from servers with different IPs (Internet Protocols) in website |
US8676783B1 (en) * | 2011-06-28 | 2014-03-18 | Google Inc. | Method and apparatus for managing a backlog of pending URL crawls |
CN103399933A (en) * | 2013-08-08 | 2013-11-20 | 人民搜索网络股份公司 | Method and system for grabbing webpage contents of network print media |
CN103559083A (en) * | 2013-10-11 | 2014-02-05 | 北京奇虎科技有限公司 | Web crawl task scheduling method and task scheduler |
CN103559219A (en) * | 2013-10-18 | 2014-02-05 | 北京京东尚科信息技术有限公司 | Distributed web crawler capture task dispatching method, dispatching-side device and capture nodes |
CN103902386A (en) * | 2014-04-11 | 2014-07-02 | 复旦大学 | Multi-thread network crawler processing method based on connection proxy optimal management |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107544853A (en) * | 2017-08-23 | 2018-01-05 | 万惠投资管理有限公司 | Method and system for retrying interactions with a bank |
CN107526833A (en) * | 2017-09-05 | 2017-12-29 | 广东科杰通信息科技有限公司 | URL management method and system |
CN107526833B (en) * | 2017-09-05 | 2020-03-24 | 广东科杰通信息科技有限公司 | URL management method and system |
CN112579858A (en) * | 2019-09-30 | 2021-03-30 | 北京国双科技有限公司 | Data crawling method and device |
CN112347394A (en) * | 2020-11-30 | 2021-02-09 | 广州至真信息科技有限公司 | Method and device for acquiring webpage information, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170222 |