CN112905866A - Historical data tracing and crawling method and terminal without manual participation - Google Patents

Historical data tracing and crawling method and terminal without manual participation

Info

Publication number
CN112905866A
CN112905866A (application number CN202110147690.3A)
Authority
CN
China
Prior art keywords
url
historical data
time
value
data
Prior art date
Legal status (the legal status is an assumption and is not a legal conclusion)
Granted
Application number
CN202110147690.3A
Other languages
Chinese (zh)
Other versions
CN112905866B (en)
Inventor
刘德建
林琛
Current Assignee (the listed assignees may be inaccurate)
Fujian Tianyi Network Technology Co ltd
Original Assignee
Fujian Tianyi Network Technology Co ltd
Priority date (the priority date is an assumption and is not a legal conclusion)
Filing date
Publication date
Application filed by Fujian Tianyi Network Technology Co ltd
Priority to CN202110147690.3A
Publication of CN112905866A
Application granted
Publication of CN112905866B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a historical data tracing and crawling method and terminal without manual participation. The method comprises the following steps. S1: set a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled in each pass. S2: according to the historical data tracing direction and the first threshold, acquire a plurality of first URLs (uniform resource locators), each corresponding to one pass of historical data to be crawled, and sort the first URLs to obtain a first sequence. S3: sequentially crawl the data on the web page corresponding to each first URL in the first sequence at preset time intervals. Because no manual participation is needed while tracing and crawling historical data, the efficiency of crawling historical data is improved.

Description

Historical data tracing and crawling method and terminal without manual participation
This application is a divisional application of the parent application entitled "A method and terminal for tracing and crawling historical data", application number 201910191973.0, filed on March 14, 2019.
Technical Field
The invention relates to the technical field of data processing, and in particular to a historical data tracing and crawling method and terminal without manual participation.
Background
Historical data is data closely tied to time: individual records may be unrelated in content, but the times at which they are generated are generally linear.
When developing an Internet system, the need to work with massive amounts of historical data is unavoidable. For example, a crawler project may need to obtain a target site's historical data for recent years. If each historical page link, once requested, triggers a large number of secondary link requests, or requires many intermediate processing steps, the task can consume a great deal of time; running the system from start to finish in one go may take days, weeks, or even months. Over such a long period, unexpected events such as a temporary host shutdown or an abnormal interruption of the task process are hard to avoid, which seriously harms the continuity and integrity of the task. Such tasks therefore generally have to be executed in segments, and segmentation requires manually reconfiguring the time request parameters of the target pages according to the time node reached in the previous run so that the segments link up. The whole process is cumbersome and inflexible; if the task must run all year round, it has to be reconfigured manually every day, at great labor cost.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a historical data tracing and crawling method and terminal that require no manual participation while tracing and crawling historical data, thereby improving the efficiency of crawling historical data.
To solve the above technical problem, the invention provides a historical data tracing and crawling method without manual participation, comprising the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled in each pass;
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs (uniform resource locators), each corresponding to one pass of historical data to be crawled, and sorting the first URLs to obtain a first sequence;
S3: sequentially crawling the data on the web page corresponding to each first URL in the first sequence at preset time intervals.
The invention further provides a historical data tracing and crawling terminal without manual participation, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled in each pass;
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs, each corresponding to one pass of historical data to be crawled, and sorting the first URLs to obtain a first sequence;
S3: sequentially crawling the data on the web page corresponding to each first URL in the first sequence at preset time intervals.
The beneficial effects of the invention are as follows:
With the above method and terminal, the first URLs corresponding to the multiple crawl passes can be obtained and sorted into a first sequence from nothing more than the historical data tracing direction and the first threshold, so only a single configuration is needed for the whole tracing and crawling task. The data on the web page corresponding to each first URL in the first sequence is then crawled in order at the preset times until all the historical data to be crawled has been obtained. No manual participation is required at any point, so the efficiency of tracing and crawling historical data is improved.
Drawings
Fig. 1 is a schematic diagram of the main steps of a historical data tracing and crawling method without manual participation according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a historical data tracing and crawling terminal without manual participation according to an embodiment of the present invention.
Description of reference numerals:
1: memory; 2: processor.
Detailed Description
To explain the technical content, objects, and effects of the present invention in detail, the following description is given with reference to the accompanying drawings and the embodiments.
The key concept of the invention is as follows: obtain the historical data tracing direction and the first threshold; from these, derive the first URLs corresponding to the multiple crawl passes; sort all the first URLs; and crawl the data on the corresponding web pages in order at preset time intervals.
Referring to Fig. 1, the invention provides a historical data tracing and crawling method without manual participation, comprising the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled in each pass;
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs, each corresponding to one pass of historical data to be crawled, and sorting the first URLs to obtain a first sequence;
S3: sequentially crawling the data on the web page corresponding to each first URL in the first sequence at preset time intervals.
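The patent does not give an implementation; as a rough Python sketch of step S2, the following builds the day-level links for each crawl pass (the URL scheme, function name, and parameters are all hypothetical):

```python
from datetime import date, timedelta

def build_first_urls(second_time, direction, first_threshold, n_passes,
                     base="https://example.com/archive"):
    """S2 sketch: build one "first URL" (a batch of day-level links) per
    crawl pass. direction is +1 (forward in time) or -1 (backward);
    first_threshold is the number of days covered by each pass."""
    passes = []
    day = second_time
    for _ in range(n_passes):
        batch = [f"{base}/{(day + timedelta(days=direction * i)).isoformat()}"
                 for i in range(first_threshold)]
        passes.append(batch)
        day += timedelta(days=direction * first_threshold)
    return passes

# Three passes of five days each, tracing forward from March 11, 2016.
first_sequence = build_first_urls(date(2016, 3, 11), +1, 5, 3)
```

Each element of `first_sequence` plays the role of one "first URL" containing `first_threshold` sub-URLs, as described later in the disclosure.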
With this method, the first URLs corresponding to the multiple crawl passes can be obtained and sorted into a first sequence from nothing more than the historical data tracing direction and the first threshold, so only a single configuration is needed for the whole tracing and crawling task. The data on the web page corresponding to each first URL in the first sequence is then crawled in order at the preset times until all the historical data to be crawled has been obtained. No manual participation is required at any point, so the efficiency of tracing and crawling historical data is improved.
Further, S3 specifically comprises:
S31: acquiring the first URL ranked first in the first sequence as the second URL whose data is to be crawled; presetting a variable r with an initial value of 1;
S32: crawling the data on the web page corresponding to the second URL;
S33: if the data on the web page corresponding to the second URL has been acquired, setting the preset r-th identification value to a preset first value and storing the r-th identification value and the second URL in a cache, the initial value of each identification value being a preset second value;
S34: letting r = r + 1;
S35: at a preset third time, acquiring the maximum value of r in the cache as a third value; the preset third time is the preset fourth time plus the preset time interval, and the preset fourth time is the time point at which the data on the web page corresponding to the second URL was crawled;
S36: adding one to the third value to obtain a fourth value;
S37: acquiring the first URL ranked at the position given by the fourth value in the first sequence as a third URL, and updating the second URL to the third URL;
S38: repeating steps S32 to S37 until an instruction to end crawling is received or all the historical data has been crawled.
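Steps S31 to S38 can be sketched as a loop that resumes from the largest batch index r recorded in the cache. A plain dict stands in here for the Redis cache the patent prefers, and all names are hypothetical:

```python
def crawl_batches(first_sequence, fetch, cache):
    """S31-S38 sketch: first_sequence is the sorted list of first URLs
    (each a list of day-level sub URLs), fetch crawls one URL, and cache
    is a dict standing in for the Redis cache the patent prefers."""
    # S35/S36: resume from the largest completed batch index r in the cache;
    # integer keys hold identification values, string keys hold sub URLs.
    done = [k for k, v in cache.items() if isinstance(k, int) and v == 1]
    r = (max(done) + 1) if done else 1          # batch indices are 1-based
    while r <= len(first_sequence):
        second_url = first_sequence[r - 1]      # S37: the r-th first URL
        for sub in second_url:                  # S32: crawl each sub URL
            fetch(sub)
            cache[sub] = True                   # remember finished sub URLs
        cache[r] = 1                            # S33: identification value = 1
        r += 1                                  # S34
    return r - 1                                # index of the last finished batch
```

Because the starting index is always recomputed from the cache, a restart after an interruption continues from the first unfinished batch without any manual reconfiguration.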
As can be seen from the above description, this method accurately determines the historical data to be crawled in each pass. Because the maximum value of r is read from the cache before each pass, the URL to be crawled next is determined automatically, which avoids the need, after an unexpected interruption, to manually inspect the breakpoint, make targeted adjustments, and reconfigure the historical data to be crawled.
Preferably, the cache is a Redis cache database; if the task is interrupted during execution, the data in the cache is not lost, which improves the stability of data crawling.
Further, sorting the plurality of first URLs to obtain the first sequence specifically comprises:
sorting all the first URLs according to the historical data tracing direction and the time of the historical data corresponding to each first URL.
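A minimal sketch of this sorting step, assuming each first URL is paired with the ISO date of its historical data (names hypothetical; the direction-to-order mapping follows the first embodiment below):

```python
def sort_first_urls(urls_with_dates, direction):
    """Build the first sequence from (first URL, ISO date) pairs:
    earliest-first when tracing forward (direction = +1),
    latest-first when tracing backward (direction = -1)."""
    ordered = sorted(urls_with_dates, key=lambda pair: pair[1],
                     reverse=(direction < 0))
    return [url for url, _day in ordered]
```

ISO date strings sort lexicographically in chronological order, so no date parsing is needed in this sketch.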
As can be seen from the above description, this sorts each first URL quickly and accurately.
Further, S1 specifically comprises:
acquiring the start time of the task that traces the historical data, as a first time;
acquiring the time origin of the historical data to be traced, as a second time;
acquiring the time direction in which the historical data is traced, as the historical data tracing direction;
acquiring the number of days of historical data traced in each pass, which is the first threshold.
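The four configuration items of S1 can be collected in one small structure; a hypothetical Python sketch (the field names are not from the patent):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class TraceConfig:
    first_time: date       # task start time (current or future)
    second_time: date      # time origin of the data to be traced
    direction: int         # +1 traces forward in time, -1 backward
    first_threshold: int   # days of history crawled per pass

# Example: trace forward from March 11, 2016, five days per pass.
cfg = TraceConfig(first_time=date(2021, 2, 1), second_time=date(2016, 3, 11),
                  direction=1, first_threshold=5)
```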
Further, acquiring, according to the historical data tracing direction and the first threshold, the first URLs corresponding to the multiple crawl passes specifically comprises:
acquiring the first URLs according to the second time, the tracing direction, and the first threshold;
wherein each first URL comprises a plurality of first sub-URLs, and the number of first sub-URLs equals the first threshold.
As can be seen from the above description, this configures the URL for each crawl pass accurately and without manual intervention, which improves the efficiency of tracing and crawling historical data. Moreover, each first URL comprises several first sub-URLs: for example, if each pass traces 5 days of historical data and each day corresponds to one sub-URL, each pass has 5 sub-URLs, which further improves the system's efficiency in tracing and crawling historical data.
Further, S32 specifically comprises:
obtaining a plurality of second sub-URLs from the second URL;
sequentially crawling the data on the web page corresponding to each second sub-URL according to the tracing direction and the time of the historical data corresponding to each second sub-URL.
S33 specifically comprises:
whenever the data on the web page corresponding to a second sub-URL has been acquired, storing that second sub-URL in the cache;
judging whether the data on the web pages corresponding to all the second sub-URLs has been crawled, and if so, setting the preset r-th identification value to the preset first value and storing it in the cache; the initial value of r is 1, and the initial value of each identification value is the preset second value.
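A minimal sketch of this refinement of S33, with a dict standing in for the cache and a hypothetical key scheme for the identification value:

```python
def mark_progress(cache, r, sub_url, all_sub_urls):
    """Record each finished second sub URL immediately; flip the r-th
    identification value from 0 (preset second value) to 1 (preset first
    value) only after every sub URL of the pass is in the cache."""
    cache[sub_url] = True
    if all(s in cache for s in all_sub_urls):
        cache[f"flag:{r}"] = 1   # the "flag:" key prefix is a hypothetical choice
```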
Further, before each crawl pass, it is judged whether the previous pass was interrupted;
if so, the first URL of the previous pass is acquired as a fourth URL;
a plurality of fourth sub-URLs is obtained from the fourth URL;
the fourth sub-URLs not yet stored in the cache are selected as one or more fifth sub-URLs;
a fifth URL is formed from the fifth sub-URLs, and the second URL is updated to the fifth URL;
then step S38 is executed.
As can be seen from the above description, because each sub-URL is stored in the cache as soon as the data on its web page has been obtained, an interruption in the middle of a pass does not force the already-crawled sub-URLs to be fetched again on the next run, avoiding the inefficiency of re-crawling.
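The recovery step, selecting the "fifth sub-URLs" that still need crawling, reduces to filtering the interrupted pass's sub-URLs against the cache; a hypothetical sketch:

```python
def resume_sub_urls(fourth_sub_urls, cache):
    """After an interruption, keep only the sub URLs of the interrupted pass
    that were never recorded in the cache (the "fifth sub URLs")."""
    return [s for s in fourth_sub_urls if s not in cache]
```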
Referring to Fig. 2, the invention provides a historical data tracing and crawling terminal without manual participation, comprising a memory 1, a processor 2, and a computer program stored in the memory 1 and executable on the processor 2, wherein the processor 2, when executing the computer program, implements the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled in each pass;
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs, each corresponding to one pass of historical data to be crawled, and sorting the first URLs to obtain a first sequence;
S3: sequentially crawling the data on the web page corresponding to each first URL in the first sequence at preset time intervals.
With this terminal, the first URLs corresponding to the multiple crawl passes can be obtained and sorted into a first sequence from nothing more than the historical data tracing direction and the first threshold, so only a single configuration is needed for the whole tracing and crawling task. The data on the web page corresponding to each first URL in the first sequence is then crawled in order at the preset times until all the historical data to be crawled has been obtained. No manual participation is required at any point, so the efficiency of tracing and crawling historical data is improved.
Further, in the above terminal, S3 specifically comprises:
S31: acquiring the first URL ranked first in the first sequence as the second URL whose data is to be crawled; presetting a variable r with an initial value of 1;
S32: crawling the data on the web page corresponding to the second URL;
S33: if the data on the web page corresponding to the second URL has been acquired, setting the preset r-th identification value to a preset first value and storing the r-th identification value and the second URL in a cache, the initial value of each identification value being a preset second value;
S34: letting r = r + 1;
S35: at a preset third time, acquiring the maximum value of r in the cache as a third value; the preset third time is the preset fourth time plus the preset time interval, and the preset fourth time is the time point at which the data on the web page corresponding to the second URL was crawled;
S36: adding one to the third value to obtain a fourth value;
S37: acquiring the first URL ranked at the position given by the fourth value in the first sequence as a third URL, and updating the second URL to the third URL;
S38: repeating steps S32 to S37 until an instruction to end crawling is received or all the historical data has been crawled.
As can be seen from the above description, the terminal accurately determines the historical data to be crawled in each pass; because the maximum value of r is read from the cache before each pass, the URL to be crawled next is determined automatically, which avoids the need, after an unexpected interruption, to manually inspect the breakpoint, make targeted adjustments, and reconfigure the tracing and crawling of historical data.
Preferably, the cache is a Redis cache database; if the task is interrupted during execution, the data in the cache is not lost, which improves the stability of data crawling.
Further, in the above terminal, sorting the plurality of first URLs to obtain the first sequence specifically comprises:
sorting all the first URLs according to the historical data tracing direction and the time of the historical data corresponding to each first URL.
As can be seen from the above description, the terminal sorts each first URL quickly and accurately.
Further, in the above terminal, S1 specifically comprises:
acquiring the start time of the task that traces the historical data, as a first time;
acquiring the time origin of the historical data to be traced, as a second time;
acquiring the time direction in which the historical data is traced, as the historical data tracing direction;
acquiring the number of days of historical data traced in each pass, which is the first threshold.
Further, acquiring, according to the historical data tracing direction and the first threshold, the first URLs corresponding to the multiple crawl passes specifically comprises:
acquiring the first URLs according to the second time, the tracing direction, and the first threshold;
wherein each first URL comprises a plurality of first sub-URLs, and the number of first sub-URLs equals the first threshold.
As can be seen from the above description, the terminal configures the URL for each crawl pass accurately and without manual intervention, which improves the efficiency of tracing and crawling historical data. Moreover, each first URL comprises several first sub-URLs: for example, if each pass traces 5 days of historical data and each day corresponds to one sub-URL, each pass has 5 sub-URLs, which further improves the system's efficiency in tracing and crawling historical data.
Further, in the above terminal, S32 specifically comprises:
obtaining a plurality of second sub-URLs from the second URL;
sequentially crawling the data on the web page corresponding to each second sub-URL according to the tracing direction and the time of the historical data corresponding to each second sub-URL.
S33 specifically comprises:
whenever the data on the web page corresponding to a second sub-URL has been acquired, storing that second sub-URL in the cache;
judging whether the data on the web pages corresponding to all the second sub-URLs has been crawled, and if so, setting the preset r-th identification value to the preset first value and storing it in the cache; the initial value of r is 1, and the initial value of each identification value is the preset second value.
Furthermore, before each crawl pass, the terminal judges whether the previous pass was interrupted;
if so, it acquires the first URL of the previous pass as a fourth URL;
obtains a plurality of fourth sub-URLs from the fourth URL;
selects, from all the fourth sub-URLs, those not yet stored in the cache as one or more fifth sub-URLs;
forms a fifth URL from the fifth sub-URLs and updates the second URL to the fifth URL;
and then executes step S38.
As can be seen from the above description, because each sub-URL is stored in the cache as soon as the data on its web page has been obtained, an interruption in the middle of a pass does not force the already-crawled sub-URLs to be fetched again on the next run, avoiding the inefficiency of re-crawling.
Referring to Fig. 1, the first embodiment of the present invention is as follows.
The invention provides a historical data tracing and crawling method without manual participation, comprising the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled in each pass.
S1 specifically comprises:
acquiring the start time of the task that traces the historical data, as a first time;
acquiring the time origin of the historical data to be traced, as a second time;
acquiring the time direction in which the historical data is traced, as the historical data tracing direction;
acquiring the number of days of historical data traced in each pass, which is the first threshold.
In a specific embodiment, the historical data tracing direction has two cases: positive or negative. If positive, the historical data is acquired forward in time from the second time; for example, if the second time is March 11, 2016, the historical data acquired runs from March 11, 2016 to the current date (or to a user-specified time). If negative, the historical data is acquired backward in time from the second time; for example, if the second time is March 11, 2016, the historical data acquired runs from a user-specified time earlier than March 11, 2016 up to March 11, 2016.
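The two tracing directions can be illustrated by generating the sequence of days visited from the second time (a hypothetical sketch; the patent specifies day granularity but gives no code):

```python
from datetime import date, timedelta

def traced_days(second_time, direction, n_days):
    """Days visited when tracing from second_time: +1 moves toward the
    present, -1 moves into the past."""
    return [second_time + timedelta(days=direction * i) for i in range(n_days)]

forward = traced_days(date(2016, 3, 11), +1, 3)   # Mar 11, 12, 13
backward = traced_days(date(2016, 3, 11), -1, 3)  # Mar 11, 10, 9
```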
In a specific embodiment, the first time is the time at which the task starts to execute, which may be the current time or a future time.
In a specific embodiment, the number of days of historical data traced in each pass is the first threshold; for example, if the user sets each pass to trace five days of historical data, the first threshold is 5.
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs, each corresponding to one pass of historical data to be crawled; sorting the first URLs to obtain a first sequence.
A URL is the address corresponding to the historical data.
S2 specifically comprises:
acquiring the first URLs according to the second time, the historical data tracing direction, and the first threshold;
wherein each first URL comprises a plurality of first sub-URLs, and the number of first sub-URLs equals the first threshold;
and sorting all the first URLs according to the tracing direction and the time of the historical data corresponding to each first URL, to obtain the first sequence.
During sorting, all the first URLs are ordered chronologically from earliest to latest when the tracing direction is positive, or from latest to earliest when it is negative.
S3: sequentially crawling the data on the web page corresponding to each first URL in the first sequence at preset time intervals.
S3 specifically comprises:
S31: acquiring the first URL ranked first in the first sequence as the second URL whose data is to be crawled; presetting a variable r with an initial value of 1;
S32: crawling the data on the web page corresponding to the second URL.
S32 specifically comprises:
obtaining a plurality of second sub-URLs from the second URL;
sequentially crawling the data on the web page corresponding to each second sub-URL according to the tracing direction and the time of the historical data corresponding to each second sub-URL.
S33: if the data on the web page corresponding to the second URL has been acquired, setting the preset r-th identification value to a preset first value and storing the r-th identification value and the second URL in a cache; the initial value of each identification value is a preset second value.
S33 specifically comprises:
whenever the data on the web page corresponding to a second sub-URL has been acquired, storing that second sub-URL in the cache;
judging whether the data on the web pages corresponding to all the second sub-URLs has been crawled, and if so, setting the preset r-th identification value to the preset first value and storing it in the cache; the initial value of r is 1, and the initial value of each identification value is the preset second value.
Preferably, the preset first value is 1 and the preset second value is 0; an identification value of 1 indicates that the data on the web pages corresponding to all the second sub-URLs has been crawled.
S34: letting r = r + 1.
S35: at a preset third time, acquiring the maximum value of r in the cache as a third value; the preset third time is the preset fourth time plus the preset time interval, and the preset fourth time is the time point at which the data on the web page corresponding to the second URL was crawled.
The preset third time is a time point; the preset time is a time interval, for example one day; and the preset fourth time is a time point.
S36: adding one to the third value to obtain a fourth value;
s37: according to the fourth value, acquiring a first URL corresponding to the fourth value sequenced in the first sequence to obtain a third URL, and updating the second URL to the third URL;
wherein, the S37 specifically is:
according to the fourth value, acquiring a first URL corresponding to the fourth value sequenced in the first sequence to obtain a third URL;
judging whether the historical data obtained last time has interruption or not;
if so, obtaining a fourth URL according to the third URL, wherein the third URL is the same as the fourth URL; obtaining a plurality of fourth sub-URLs according to the fourth URL; according to all the fourth sub-URLs, acquiring fourth sub-URLs which are not stored in a cache to obtain more than one fifth sub-URLs; obtaining a fifth URL according to more than one fifth sub-URL, and updating the second URL into the fifth URL; step S38 is executed;
if not, the second URL is updated to the third URL, and step S38 is executed.
S38: and repeatedly executing the steps S32-S37 until the crawling data ending instruction is received or all historical data are crawled.
Referring to fig. 2, the second embodiment of the present invention is:
the invention provides a historical data tracing and crawling terminal without manual participation, which comprises a memory 1, a processor 2 and a computer program which is stored in the memory 1 and can be operated on the processor 2, wherein the processor executes the computer program to realize the following steps:
s1: setting a historical data tracing direction and a first threshold corresponding to the amount of the historical data crawled each time;
Wherein, S1 specifically comprises:
acquiring the start time of the task that executes the historical data tracing, to obtain a first time;
acquiring the starting time point of the historical data to be traced, to obtain a second time;
acquiring the time direction in which the historical data is traced, to obtain the historical data tracing direction;
and acquiring the number of days of historical data traced in each run; this number of days is the first threshold.
In a specific embodiment, the historical data tracing direction has two cases, positive or negative. If positive, the historical data is acquired forward from the second time onward; for example, if the second time is March 11, 2016, the historical data acquired spans from March 11, 2016 to the current date (or a time specified by the user). If negative, the historical data is acquired backward from the second time; for example, if the second time is March 11, 2016, the historical data acquired spans from a user-specified time (earlier than March 11, 2016) to March 11, 2016.
In a specific embodiment, the first time is a time when the task starts to be executed, and the time may be a current time or a future time.
In a specific embodiment, the number of days of historical data traced in each run is obtained as the first threshold; for example, if the user sets the historical data to be traced back five days at a time, the first threshold is 5.
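The settings gathered in S1 correspond to the configuration items named in the third embodiment (task_begin_time, data_begin_time, trace_backward, time_num, plus the continuation threshold follow_threshold). A minimal Python sketch, with illustrative values only, might look like:

```python
from datetime import date

# Illustrative values only; the five names follow the third embodiment.
config = {
    "task_begin_time": date(2019, 1, 1),  # first time: reference start of the job
    "data_begin_time": date(2018, 1, 1),  # second time: starting point of the history
    "trace_backward": -1,                 # tracing direction: -1 backward, 1 forward
    "time_num": 5,                        # first threshold: days of history per run
    "follow_threshold": 3,                # splice threshold on breakpoint restart
}
```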
S2: according to the historical data tracing direction and a first threshold value, acquiring a plurality of first URLs (uniform resource locators) corresponding to the historical data to be crawled for multiple times; sequencing the plurality of first URLs to obtain a first sequence;
the URL is an address corresponding to the history data.
Wherein, S2 specifically comprises:
acquiring, according to the second time, the historical data tracing direction and the first threshold, the plurality of first URLs respectively corresponding to the historical data to be crawled in each run;
wherein each first URL comprises a plurality of first sub-URLs, and the number of first sub-URLs equals the first threshold;
and sequencing all the first URLs according to the historical data tracing direction and the time of the historical data corresponding to each first URL, to obtain the first sequence.
In the sorting process, all the first URLs are sorted in time order from far to near (when the historical data tracing direction is positive), or from near to far (when the historical data tracing direction is negative).
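As an illustration of S2, the batch generation and ordering described above can be sketched as follows; the URL template and site are hypothetical stand-ins for the target site's date-keyed history pages:

```python
from datetime import date, timedelta

def build_batches(data_begin, direction, time_num, n_batches,
                  url_tmpl="https://example.com/history?d={:%Y-%m-%d}"):
    """Generate the 'first URLs': one batch of time_num daily sub-URLs
    per run, already in first-sequence order (oldest-first batches when
    direction is positive, newest-first when negative).

    url_tmpl is a hypothetical stand-in for the target site's date field.
    """
    batches = []
    for k in range(n_batches):
        days = [data_begin + timedelta(days=direction * (k * time_num + i))
                for i in range(1, time_num + 1)]
        batches.append([url_tmpl.format(d) for d in days])
    return batches
```

With a negative direction the first batch covers the days immediately before data_begin, and each later batch reaches further into the past, matching the near-to-far ordering above.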
S3: sequentially crawling data on a webpage corresponding to each first URL in the first sequence at preset time intervals;
Wherein, S3 specifically comprises:
S31: acquiring the first URL ranked first in the first sequence as the second URL corresponding to the data to be crawled; presetting a variable r with an initial value of 1;
S32: crawling the data on the webpage corresponding to the second URL;
wherein, S32 specifically comprises:
obtaining a plurality of second sub-URLs according to the second URL;
sequentially crawling the data on the webpage corresponding to each second sub-URL, according to the historical data tracing direction and the time of the historical data corresponding to each second sub-URL;
S33: if the data on the webpage corresponding to the second URL has been completely acquired, setting a preset r-th identification value to a preset first value, and storing the r-th identification value and the second URL in a cache, wherein the initial value of each identification value is a preset second value;
Wherein, S33 specifically comprises:
when the data on the webpage corresponding to a second sub-URL has been acquired, storing that second sub-URL in the cache;
and judging whether the data on the webpages corresponding to all the second sub-URLs has been crawled; if so, setting a preset r-th identification value to a preset first value and storing the r-th identification value in the cache, wherein the initial value of r is 1 and the initial value of each identification value is a preset second value.
Preferably, the preset first value is 1 and the preset second value is 0; an identification value of 1 indicates that the data on the webpages corresponding to all the second sub-URLs has been crawled.
S34: let r = r + 1;
S35: acquiring the maximum r value in the cache at a preset third time to obtain a third value; the preset third time is the preset fourth time plus a preset time; the preset fourth time is the time point corresponding to the data on the webpage corresponding to the second URL;
Here the preset third time and the preset fourth time are time points, while the preset time is a time period, for example one day.
S36: adding one to the third value to obtain a fourth value;
S37: according to the fourth value, acquiring the first URL at the position given by the fourth value in the first sequence to obtain a third URL, and updating the second URL to the third URL;
Wherein, S37 specifically comprises:
according to the fourth value, acquiring the first URL at the position given by the fourth value in the first sequence to obtain a third URL;
judging whether the last acquisition of historical data was interrupted;
if so, obtaining a fourth URL according to the third URL, wherein the fourth URL is the same as the third URL; obtaining a plurality of fourth sub-URLs according to the fourth URL; from all the fourth sub-URLs, acquiring those not yet stored in the cache to obtain one or more fifth sub-URLs; obtaining a fifth URL according to the one or more fifth sub-URLs, and updating the second URL to the fifth URL; then executing step S38;
if not, updating the second URL to the third URL and executing step S38.
S38: repeatedly executing steps S32-S37 until an instruction to end crawling is received or all historical data has been crawled.
The third embodiment of the invention is as follows:
1. Five configuration items are created: a task execution reference time task_begin_time (the first time), which records the start time of the whole job, i.e. the execution time of the first segmented task; a historical data tracing start value data_begin_time (the second time), used as the time starting point for tracing historical data in the first segmented task, with subsequent segmented tasks recalculating their time starting points from it; a tracing direction trace_backward (the historical data tracing direction), which controls whether the traced dates run forward or backward; a per-run time unit amount time_num (the first threshold), which controls the data volume acquired by each segmented task; and a task continuation threshold follow_threshold, by which, when restarting after an interruption, it is decided whether the remaining part of the interrupted task takes an independent day to execute or can be spliced into the next task segment.
2. Fields are created in the cache (e.g. redis) to store the task execution state: a task completion flag r_finish_flag (the r-th identification value), used to judge whether the last task completed, where 0 means incomplete and 1 means complete; a task execution date r_task_time, which records the execution date of the most recent task; a completed-day count r_finish_num for the current task segment, which records how many days of historical data the segment has already acquired. For example, with time_num configured as 5 (meaning 5 days of historical data are to be acquired each day), r_finish_num = 3 during execution indicates that 3 days of historical data have been acquired today; only when r_finish_num equals time_num is r_finish_flag set to 1, indicating that today's task is fully complete. Finally, an execution date correction value r_offset_time is created, initialized to 0 and adjusted for breakpoint resumption when the task is abnormally interrupted.
In addition, a visited-URL set r_finished_urls is created; exploiting the uniqueness of redis's set type, every URL executed so far is registered in it, giving URL deduplication over the whole task cycle.
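A plain-Python stand-in for these redis fields (a dict and a built-in set in place of redis keys and the redis set type) might be initialized as:

```python
# Dict-and-set stand-in for the redis task-state fields; a real
# deployment would presumably use redis-py string and set keys.
task_state = {
    "r_finish_flag": 0,        # 1 once the day's whole batch is done
    "r_task_time": None,       # execution date of the most recent task
    "r_finish_num": 0,         # days of history completed in this batch
    "r_offset_time": 0,        # date-offset correction for breakpoint resume
    "r_finished_urls": set(),  # set uniqueness gives whole-cycle URL dedup
}
```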
3. The time-value characteristics of the URL request fields of the target site's historical data pages are analyzed, and the five configuration items are set accordingly. For example, assume the time field in the target site's URLs uses the date format YYYY-MM-DD, with the day as the basic time unit. If the task is planned to run from January 1, 2019, tracing the site's data back to before January 1, 2018, with 5 days of historical data retrieved each day, then task_begin_time is configured as 2019-01-01, data_begin_time as 2018-01-01 and time_num as 5. Since the data to be traced lies before the starting date, the dates are traced in the negative direction, so trace_backward is set to -1 (it would be 1 for forward tracing). If it is further required that, when an interruption occurs and the interrupted day has completed 3 or more days of historical data, the remaining 2 days of data may be retrieved together with the next task segment, then follow_threshold is set to 3.
4. When a task starts, it first checks r_finish_flag in redis. If the value is 1, the last task finished successfully and the daily mode is entered; if r_finish_flag is 0, the last task execution was abnormal and incomplete, and the breakpoint recovery mode must be entered.
5. The task first enters the URL generation phase; the daily mode and the breakpoint recovery mode differ mainly in how this phase generates the date list.
In the daily mode, the current time now is compared with the (last) task execution date r_task_time in redis. If now - r_task_time is more than 1 day, some intermediate days ran no task; to compensate for the effect of this blank period on locating the target data dates, r_offset_time in redis (initial value 0) is adjusted as
r_offset_time = r_offset_time + (now - r_task_time - 1);
For example, when a task first hits a 1-day blank period, r_offset_time becomes 1, indicating that dates must be corrected by a 1-day shift; when another 1-day blank period occurs, r_offset_time becomes 2, indicating a two-day shift. Then r_task_time in redis is immediately set to the current date, and the actual target data date offset value offset is calculated from now, task_begin_time and r_offset_time as
offset=(now-task_begin_time)+r_offset_time;
The date range of the historical data pages to be requested then runs from:
data_begin_time+trace_backward*(offset*time_num+1);
to:
data_begin_time+trace_backward*(offset*time_num+time_num);
After the date list is generated, r_finish_flag is set to 0.
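A sketch of the daily-mode date computation, combining the blank-period compensation with the offset and range formulas above, using the example configuration from step 3 (the helper name daily_range is ours):

```python
from datetime import date, timedelta

def daily_range(now, r_task_time, r_offset_time, cfg):
    """Daily-mode date computation following the formulas above.

    Returns the adjusted r_offset_time and the two endpoints of the
    requested date range. cfg holds the step-3 example configuration.
    """
    # Compensate for blank days on which no task ran at all.
    gap = (now - r_task_time).days - 1
    if gap > 0:
        r_offset_time += gap
    # offset = (now - task_begin_time) + r_offset_time
    offset = (now - cfg["task_begin_time"]).days + r_offset_time
    step = cfg["trace_backward"]
    first = cfg["data_begin_time"] + timedelta(days=step * (offset * cfg["time_num"] + 1))
    last = cfg["data_begin_time"] + timedelta(days=step * (offset * cfg["time_num"] + cfg["time_num"]))
    return r_offset_time, first, last

# Example configuration from step 3.
cfg = {"task_begin_time": date(2019, 1, 1), "data_begin_time": date(2018, 1, 1),
       "trace_backward": -1, "time_num": 5}
```

With now = 2019-01-03 and no blank period, offset is 2 and the requested range runs from 2017-12-21 back to 2017-12-17, i.e. the third 5-day slice before data_begin_time.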
In breakpoint recovery mode, it is first judged whether the breakpoint date is today (now). If so, execution continues in the daily mode, and URL deduplication in the next phase of the task filters out the URLs completed before the breakpoint, avoiding repeated work. If the breakpoint date is not today, r_offset_time in redis is first adjusted, the actual target data date offset value offset is calculated and r_task_time in redis is set to the current date, exactly as in the daily mode; then the task amount r_finish_num already completed by the last task (when r_finish_num is less than 0, the operation r_finish_num = 0 - r_finish_num is applied first, as explained in case 2 below) is compared with the threshold follow_threshold, and a different continuation strategy is adopted for each of the two cases:
case 1: r _ finish _ num < focus _ threshold, which indicates that the task completion degree is not high and more tasks are accumulated before the last task is interrupted, an independent working day is needed to execute the remaining tasks, and the joined task amount is the remaining task of the breakpoint day. Therefore, the subsequent operation is the same as the daily mode.
Case 2: r_finish_num >= follow_threshold. The last task's completion degree was high and little work remained before it was interrupted, so the next task is allowed to be taken on and executed together. Since the total task interval now spans 2 days, r_finish_num is changed to 0 - r_finish_num and stored as a negative number, expanding the decision interval of r_finish_num from 0..time_num to the actual interval. If another interruption occurs after this, then on the next startup, when r_finish_num is found to be negative, the completed task amount of the re-acquired breakpoint day can again be computed as r_finish_num = 0 - r_finish_num.
Then a task date list covering the 2 working days is generated, and the date range of the historical data pages to be requested runs from:
data_begin_time+trace_backward*(offset*time_num+1);
to:
data_begin_time+trace_backward*(offset*time_num+2*time_num);
Here, the breakpoint date includes an already-completed portion of dates, which is filtered out during the subsequent URL deduplication.
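The continuation decision and the negative-number storage of r_finish_num described in the two cases can be sketched as (the function name is ours):

```python
def splice_remaining(r_finish_num, follow_threshold):
    """Continuation decision on restart after an interruption.

    'independent': completion was low, the leftovers take their own day.
    'spliced': completion was high; the completed count is stored as a
    negative number so that progress over the two-day batch still ends
    at time_num (the 0..time_num decision interval is widened downward).
    """
    if r_finish_num < 0:
        # Interrupted again mid-splice: recover the true completed count.
        r_finish_num = 0 - r_finish_num
    if r_finish_num >= follow_threshold:
        return "spliced", 0 - r_finish_num
    return "independent", r_finish_num
```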
Finally, in both modes, the request URLs are spliced together from the acquired date parameters, in preparation for crawling data.
6. After the URLs are generated, they pass through redis for deduplication screening: each URL's validity is judged by whether it is already registered in the r_finished_urls set; invalid links are discarded, and valid URLs are placed in order into a queue to await requesting.
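The deduplication screen of step 6 amounts to set membership; a built-in set stands in here for redis's r_finished_urls (a real deployment would presumably use redis SADD/SISMEMBER):

```python
def dedup_urls(candidate_urls, finished_urls):
    """Keep only URLs not yet registered in r_finished_urls, preserving
    the original order; a built-in set stands in for the redis set."""
    return [u for u in candidate_urls if u not in finished_urls]

# Hypothetical example: "u2" was already completed before a breakpoint.
queue = dedup_urls(["u1", "u2", "u3"], {"u2"})
```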
7. Next is the data request and acquisition phase. Each time one URL task (i.e. one calendar day's volume of historical data) completes, the task state is registered and checked. First, the completed URL is registered in r_finished_urls and r_finish_num is increased by 1. Then, if r_finish_num equals time_num, the task is fully complete and the status flags are set (r_finish_flag to 1, r_finish_num to 0), ending the task. Otherwise, it is further judged whether r_finish_num is 0; if so, this was the last leftover task of the interrupted day in breakpoint mode, and the next batch of tasks is about to begin. In this special two-day-workload case, the offset value r_offset_time was determined with respect to the first day, so when entering the second batch the offset must be further corrected by r_offset_time = r_offset_time - 1. Operation then continues until r_finish_num equals time_num and the task completes.
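The per-URL bookkeeping of step 7 can be sketched as follows; the state dict stands in for the redis fields:

```python
def register_url_done(state, url, time_num):
    """Bookkeeping after one URL task (one day of history) completes.

    Mirrors step 7: register the URL, advance r_finish_num, set the
    completion flag at time_num, and correct r_offset_time when the
    leftover (negative-counted) breakpoint work has just been cleared.
    """
    state["r_finished_urls"].add(url)
    state["r_finish_num"] += 1
    if state["r_finish_num"] == time_num:
        state["r_finish_flag"] = 1   # today's batch fully done
        state["r_finish_num"] = 0
    elif state["r_finish_num"] == 0:
        # The breakpoint-day leftovers just finished; the offset was
        # anchored to the first day, so correct it for the second batch.
        state["r_offset_time"] -= 1
```

Starting from a spliced state (r_finish_num stored as a negative number), the count rises through 0, triggers the offset correction, and then proceeds to time_num as in a normal day.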
8. All acquired data is cleaned and organized in real time and then stored in a database (e.g. mysql).
9. Through timed configuration such as a task scheduler, the system is started at a fixed time every day, realizing automatic execution of the tasks.
10. If a particular segment of the historical data is required, this can be achieved flexibly by temporarily adjusting the above configuration.
For the description of the parameters in this example, please refer to tables 1 and 2:
Table 1: description of configuration parameters (reproduced as an image in the original publication)
Table 2: description of task state parameters (reproduced as an image in the original publication)
In summary, with the historical data tracing and crawling method and terminal without manual participation provided by the invention, during the tracing and crawling of historical data, the plurality of first URLs respectively corresponding to the historical data to be crawled in each run can be obtained solely from the historical data tracing direction and the first threshold, and sorted to obtain the first sequence. The method thus accurately acquires the historical data to be crawled each time; since each crawl first reads the maximum r value from the cache, the URL to be crawled next can be determined, removing the need, after an unexpected task interruption, to manually inspect the breakpoint, make targeted adjustments and reconfigure the historical data to be crawled. The URL corresponding to each run's historical data is therefore configured accurately, no manual intervention is needed during execution, and the efficiency of tracing and crawling historical data is improved. Moreover, each first URL comprises a plurality of first sub-URLs; for example, with 5 days of historical data traced per run and each day's data corresponding to one sub-URL, there are 5 sub-URLs per run, which further improves the system's efficiency in tracing and crawling historical data.
Furthermore, after the data on the webpage corresponding to each sub-URL is acquired, the sub-URL is stored in the cache. This avoids the inefficiency that would arise if an interruption occurred before the webpage data of all sub-URLs had been crawled and, on re-execution, the data of already-executed sub-URLs had to be acquired again.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to other related technical fields, are included in the scope of the present invention.

Claims (6)

1. A historical data tracing and crawling method without manual participation is characterized by comprising the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled each time;
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs (uniform resource locators) respectively corresponding to the historical data to be crawled in multiple runs; sequencing the plurality of first URLs to obtain a first sequence;
in the sorting process, if the historical data tracing direction is positive, sorting all the first URLs in time order from far to near to obtain the first sequence;
if the historical data tracing direction is negative, sorting all the first URLs in time order from near to far to obtain the first sequence;
S3: sequentially crawling the data on the webpage corresponding to each first URL in the first sequence at preset time intervals;
wherein S3 specifically comprises:
S31: acquiring the first URL ranked first in the first sequence as the second URL corresponding to the data to be crawled; presetting a variable r with an initial value of 1;
S32: crawling the data on the webpage corresponding to the second URL;
S33: if the data on the webpage corresponding to the second URL has been completely acquired, setting a preset r-th identification value to a preset first value, and storing the r-th identification value and the second URL in a cache, wherein the initial value of each identification value is a preset second value;
S34: let r = r + 1;
S35: acquiring the maximum r value in the cache at a preset third time to obtain a third value; the preset third time is the preset fourth time plus a preset time; the preset fourth time is the time point corresponding to the data on the webpage corresponding to the second URL;
S36: adding one to the third value to obtain a fourth value;
S37: according to the fourth value, acquiring the first URL at the position given by the fourth value in the first sequence to obtain a third URL, and updating the second URL to the third URL;
S38: repeatedly executing steps S32-S37 until an instruction to end crawling is received or all historical data has been crawled.
2. The historical data tracing and crawling method without manual participation according to claim 1, wherein S1 specifically comprises:
acquiring task starting time corresponding to execution tracing historical data to obtain first time;
acquiring a time starting point value of historical data to be traced to obtain second time;
obtaining the time direction of tracing the historical data to obtain the historical data tracing direction;
and obtaining the number of days for continuously tracing the historical data each time, namely the number of days is the first threshold value.
3. The historical data tracing and crawling method without manual participation according to claim 2, wherein the acquiring, according to the historical data tracing direction and the first threshold, of the plurality of first URLs respectively corresponding to the historical data to be crawled in multiple runs specifically comprises:
acquiring a plurality of first URLs corresponding to historical data to be crawled for multiple times according to second time, historical data tracing directions and a first threshold;
the first URL comprises a plurality of first sub-URLs, and the number of the first sub-URLs is equal to the first threshold.
4. The historical data tracing and crawling method without manual participation according to claim 3, wherein step S32 specifically comprises:
obtaining a plurality of second sub-URLs according to the second URL;
sequentially crawling data on the webpage corresponding to each second sub URL according to the historical data tracing direction and the time of the historical data corresponding to each second sub URL;
the S33 specifically includes:
when the data on the webpage corresponding to a second sub-URL has been acquired, storing that second sub-URL in a cache;
and judging whether the data on the webpages corresponding to all the second sub-URLs has been crawled; if so, setting a preset r-th identification value to a preset first value and storing the r-th identification value in the cache, wherein the initial value of r is 1 and the initial value of each identification value is a preset second value.
5. The historical data tracing and crawling method without manual participation according to claim 4, wherein, before each crawl of historical data, it is judged whether the historical data acquisition was interrupted;
if yes, acquiring a first URL corresponding to the last crawling history data to obtain a fourth URL;
obtaining a plurality of fourth sub-URLs according to the fourth URL;
according to all the fourth sub-URLs, acquiring fourth sub-URLs which are not stored in a cache to obtain more than one fifth sub-URLs;
obtaining a fifth URL according to more than one fifth sub-URL, and updating the second URL into the fifth URL;
step S38 is executed.
6. A historical data tracing and crawling terminal without manual participation, comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the computer program, realizes the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled each time;
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs (uniform resource locators) respectively corresponding to the historical data to be crawled in multiple runs; sequencing the plurality of first URLs to obtain a first sequence;
in the sorting process, if the historical data tracing direction is positive, sorting all the first URLs in time order from far to near to obtain the first sequence;
if the historical data tracing direction is negative, sorting all the first URLs in time order from near to far to obtain the first sequence;
S3: sequentially crawling the data on the webpage corresponding to each first URL in the first sequence at preset time intervals;
wherein S3 specifically comprises:
S31: acquiring the first URL ranked first in the first sequence as the second URL corresponding to the data to be crawled; presetting a variable r with an initial value of 1;
S32: crawling the data on the webpage corresponding to the second URL;
S33: if the data on the webpage corresponding to the second URL has been completely acquired, setting a preset r-th identification value to a preset first value, and storing the r-th identification value and the second URL in a cache, wherein the initial value of each identification value is a preset second value;
S34: let r = r + 1;
S35: acquiring the maximum r value in the cache at a preset third time to obtain a third value; the preset third time is the preset fourth time plus a preset time; the preset fourth time is the time point corresponding to the data on the webpage corresponding to the second URL;
S36: adding one to the third value to obtain a fourth value;
S37: according to the fourth value, acquiring the first URL at the position given by the fourth value in the first sequence to obtain a third URL, and updating the second URL to the third URL;
S38: repeatedly executing steps S32-S37 until an instruction to end crawling is received or all historical data has been crawled.
CN202110147690.3A 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal without manual participation Active CN112905866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110147690.3A CN112905866B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal without manual participation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910191973.0A CN109992705B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal
CN202110147690.3A CN112905866B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal without manual participation

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910191973.0A Division CN109992705B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal

Publications (2)

Publication Number Publication Date
CN112905866A true CN112905866A (en) 2021-06-04
CN112905866B CN112905866B (en) 2022-06-07

Family

ID=67130603

Family Applications (3)

Application Number Title Priority Date Filing Date
CN201910191973.0A Active CN109992705B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal
CN202110147690.3A Active CN112905866B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal without manual participation
CN202110147715.XA Active CN112905867B (en) 2019-03-14 2019-03-14 Efficient historical data tracing and crawling method and terminal

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201910191973.0A Active CN109992705B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202110147715.XA Active CN112905867B (en) 2019-03-14 2019-03-14 Efficient historical data tracing and crawling method and terminal

Country Status (1)

Country Link
CN (3) CN109992705B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130332443A1 (en) * 2012-06-07 2013-12-12 Google Inc. Adapting content repositories for crawling and serving
CN103870465A (en) * 2012-12-07 2014-06-18 厦门雅迅网络股份有限公司 Non-invasion database crawler implementation method
CN106777043A (en) * 2016-12-09 2017-05-31 宁波大学 A kind of academic resources acquisition methods based on LDA
CN107247789A (en) * 2017-06-16 2017-10-13 成都布林特信息技术有限公司 user interest acquisition method based on internet
CN108415941A (en) * 2018-01-29 2018-08-17 湖北省楚天云有限公司 A kind of spiders method, apparatus and electronic equipment
CN108536691A (en) * 2017-03-01 2018-09-14 中兴通讯股份有限公司 Web page crawl method and apparatus
CN109033195A (en) * 2018-06-28 2018-12-18 上海盛付通电子支付服务有限公司 The acquisition methods of webpage information obtain equipment and computer-readable medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7085787B2 (en) * 2002-07-19 2006-08-01 International Business Machines Corporation Capturing data changes utilizing data-space tracking
US7769742B1 (en) * 2005-05-31 2010-08-03 Google Inc. Web crawler scheduler that utilizes sitemaps from websites
US9082126B2 (en) * 2009-09-25 2015-07-14 National Electronics Warranty, Llc Service plan web crawler
FR3004568A1 (en) * 2013-04-11 2014-10-17 Claude Rivoiron PROJECT MONITORING
CN104750694B (en) * 2013-12-26 2019-02-05 北京亿阳信通科技有限公司 A kind of mobile network information source tracing method and device
CN109284287B (en) * 2018-08-22 2024-02-02 平安科技(深圳)有限公司 Data backtracking and reporting method and device, computer equipment and storage medium
CN109377275A (en) * 2018-10-15 2019-02-22 中国平安人寿保险股份有限公司 Data tracing method, device, computer equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANDREI Z. BRODER et al.: "Efficient URL caching for world wide web crawling", Proceedings of the 12th International Conference on World Wide Web *
LI CHUNSHAN: "Massive URL management technology based on a fused B*-tree and B+-tree index", China Master's Theses Full-text Database (Information Science and Technology) *

Also Published As

Publication number Publication date
CN112905866B (en) 2022-06-07
CN112905867B (en) 2022-06-07
CN112905867A (en) 2021-06-04
CN109992705B (en) 2021-03-05
CN109992705A (en) 2019-07-09

Similar Documents

Publication Publication Date Title
US20200272559A1 (en) Enhancing efficiency in regression testing of software applications
US20040083117A1 (en) Method for fast searching and analyzing inter-relations between patents from a patent database
CN110275799B (en) Method for snapshot balance of daily point-cut without shutdown of accounting system
CN106682017B (en) Database updating method and device
EP3299968A1 (en) Big data calculation method and system
CN113760476A (en) Task dependency processing method and related device
CN109992705B (en) Historical data tracing and crawling method and terminal
CN114942933A (en) Method for automatically updating database and related device
CN117112400A (en) Automatic test case generation platform
CN111222972A (en) Account checking and clearing method and device
CN108804239B (en) Platform integration method and device, computer equipment and storage medium
CN110674214B (en) Big data synchronization method, device, computer equipment and storage medium
CN111143316A (en) Version management system and method for BIM forward design
CN112860492B (en) Automatic regression testing method and system suitable for core system
CN103020464A (en) Method for correcting vehicle machine accumulated working time
CN117633024B (en) Database optimization method based on preprocessing optimization join
CN101055599A (en) Mould design alteration processing system and method
CN110618939A (en) Method and device for automatic test case management
CN102708179A (en) Method and device for automatic retrieval of patent data
CN109739479A (en) A kind of front end structure method for implanting and device
CN115186939B (en) Method for predicting carbon emission of processing equipment in full life cycle
CN116521212A (en) Batch processing method, device, electronic equipment and storage medium
CN114564296A (en) Batch processing task scheduling method and device and electronic equipment
CN117520329A (en) Power grid multisource data integration method and system
CN115809087A (en) Version updating method, version updating device, version updating equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant