CN112905866A - Historical data tracing and crawling method and terminal without manual participation - Google Patents

Historical data tracing and crawling method and terminal without manual participation

Info

Publication number
CN112905866A
CN112905866A (application number CN202110147690.3A)
Authority
CN
China
Prior art keywords
url
historical data
time
value
data
Prior art date
Legal status (the legal status is an assumption and is not a legal conclusion)
Granted
Application number
CN202110147690.3A
Other languages
Chinese (zh)
Other versions
CN112905866B (en)
Inventor
刘德建
林琛
Current Assignee (the listed assignees may be inaccurate)
Fujian Tianyi Network Technology Co ltd
Original Assignee
Fujian Tianyi Network Technology Co ltd
Priority date (the priority date is an assumption and is not a legal conclusion)
Filing date
Publication date
Application filed by Fujian Tianyi Network Technology Co ltd
Priority to CN202110147690.3A
Publication of CN112905866A
Application granted
Publication of CN112905866B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a historical data tracing and crawling method and terminal without manual participation. The method comprises the following steps. S1: set a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled in each pass. S2: according to the historical data tracing direction and the first threshold, acquire a plurality of first URLs (uniform resource locators), each corresponding to one pass of historical data to be crawled, and sort the first URLs to obtain a first sequence. S3: sequentially crawl the data on the web page corresponding to each first URL in the first sequence at preset time intervals. Because no manual participation is needed while tracing and crawling historical data, the efficiency of crawling historical data is improved.

Description

Historical data tracing and crawling method and terminal without manual participation
This application is a divisional application of the parent application entitled "A method and terminal for tracing and crawling historical data", application number 201910191973.0, filed on March 14, 2019.
Technical Field
The invention relates to the technical field of data processing, and in particular to a historical data tracing and crawling method and terminal without manual participation.
Background
Historical data is data closely tied to time: individual records may be unrelated in content, but the times at which they are generated are generally linear.
When developing an Internet system, the need to work with massive amounts of historical data is unavoidable. For example, a crawler project may need to obtain a target site's historical data for recent years. If each historical page link, once requested, triggers a large number of secondary link requests, or requires many intermediate processing steps, the task can consume a great deal of time; running the system from start to finish in one go may take days, weeks, or even months. Over such a long period, unexpected events such as a temporary host shutdown or an abnormal interruption of the task process are hard to avoid, which seriously harms the continuity and integrity of the task. Such tasks therefore generally have to be executed in segments, and segmentation requires manually reconfiguring the time request parameters of the target pages according to the time node reached in the previous run so that the segments link up. The whole process is cumbersome and inflexible; if the task must run all year round, it has to be reconfigured manually every day, at great labor cost.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a historical data tracing and crawling method and terminal that require no manual participation while tracing and crawling historical data, thereby improving the efficiency of crawling historical data.
To solve the above technical problem, the invention provides a historical data tracing and crawling method without manual participation, comprising the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled in each pass;
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs (uniform resource locators), each corresponding to one pass of historical data to be crawled, and sorting the first URLs to obtain a first sequence;
S3: sequentially crawling the data on the web page corresponding to each first URL in the first sequence at preset time intervals.
The invention further provides a historical data tracing and crawling terminal without manual participation, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled in each pass;
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs, each corresponding to one pass of historical data to be crawled, and sorting the first URLs to obtain a first sequence;
S3: sequentially crawling the data on the web page corresponding to each first URL in the first sequence at preset time intervals.
The beneficial effects of the invention are as follows:
With the above method and terminal, the first URLs corresponding to the multiple crawl passes can be obtained and sorted into a first sequence from nothing more than the historical data tracing direction and the first threshold, so only a single configuration is needed for the whole tracing and crawling task. The data on the web page corresponding to each first URL in the first sequence is then crawled in order at the preset times until all the historical data to be crawled has been obtained. No manual participation is required at any point, so the efficiency of tracing and crawling historical data is improved.
Drawings
Fig. 1 is a schematic diagram of the main steps of a historical data tracing and crawling method without manual participation according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a historical data tracing and crawling terminal without manual participation according to an embodiment of the present invention.
Description of reference numerals:
1: memory; 2: processor.
Detailed Description
To explain the technical content, objects, and effects of the present invention in detail, the following description is given with reference to the accompanying drawings and the embodiments.
The key concept of the invention is as follows: obtain the historical data tracing direction and the first threshold; from these, derive the first URLs corresponding to the multiple crawl passes; sort all the first URLs; and crawl the data on the corresponding web pages in order at preset time intervals.
Referring to Fig. 1, the invention provides a historical data tracing and crawling method without manual participation, comprising the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled in each pass;
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs, each corresponding to one pass of historical data to be crawled, and sorting the first URLs to obtain a first sequence;
S3: sequentially crawling the data on the web page corresponding to each first URL in the first sequence at preset time intervals.
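The patent does not give an implementation; as a rough Python sketch of step S2, the following builds the day-level links for each crawl pass (the URL scheme, function name, and parameters are all hypothetical):

```python
from datetime import date, timedelta

def build_first_urls(second_time, direction, first_threshold, n_passes,
                     base="https://example.com/archive"):
    """S2 sketch: build one "first URL" (a batch of day-level links) per
    crawl pass. direction is +1 (forward in time) or -1 (backward);
    first_threshold is the number of days covered by each pass."""
    passes = []
    day = second_time
    for _ in range(n_passes):
        batch = [f"{base}/{(day + timedelta(days=direction * i)).isoformat()}"
                 for i in range(first_threshold)]
        passes.append(batch)
        day += timedelta(days=direction * first_threshold)
    return passes

# Three passes of five days each, tracing forward from March 11, 2016.
first_sequence = build_first_urls(date(2016, 3, 11), +1, 5, 3)
```

Each element of `first_sequence` plays the role of one "first URL" containing `first_threshold` sub-URLs, as described later in the disclosure.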
With this method, the first URLs corresponding to the multiple crawl passes can be obtained and sorted into a first sequence from nothing more than the historical data tracing direction and the first threshold, so only a single configuration is needed for the whole tracing and crawling task. The data on the web page corresponding to each first URL in the first sequence is then crawled in order at the preset times until all the historical data to be crawled has been obtained. No manual participation is required at any point, so the efficiency of tracing and crawling historical data is improved.
Further, S3 specifically comprises:
S31: acquiring the first URL ranked first in the first sequence as the second URL whose data is to be crawled; presetting a variable r with an initial value of 1;
S32: crawling the data on the web page corresponding to the second URL;
S33: if the data on the web page corresponding to the second URL has been acquired, setting the preset r-th identification value to a preset first value and storing the r-th identification value and the second URL in a cache, the initial value of each identification value being a preset second value;
S34: letting r = r + 1;
S35: at a preset third time, acquiring the maximum value of r in the cache as a third value; the preset third time is the preset fourth time plus the preset time interval, and the preset fourth time is the time point at which the data on the web page corresponding to the second URL was crawled;
S36: adding one to the third value to obtain a fourth value;
S37: acquiring the first URL ranked at the position given by the fourth value in the first sequence as a third URL, and updating the second URL to the third URL;
S38: repeating steps S32 to S37 until an instruction to end crawling is received or all the historical data has been crawled.
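Steps S31 to S38 can be sketched as a loop that resumes from the largest batch index r recorded in the cache. A plain dict stands in here for the Redis cache the patent prefers, and all names are hypothetical:

```python
def crawl_batches(first_sequence, fetch, cache):
    """S31-S38 sketch: first_sequence is the sorted list of first URLs
    (each a list of day-level sub URLs), fetch crawls one URL, and cache
    is a dict standing in for the Redis cache the patent prefers."""
    # S35/S36: resume from the largest completed batch index r in the cache;
    # integer keys hold identification values, string keys hold sub URLs.
    done = [k for k, v in cache.items() if isinstance(k, int) and v == 1]
    r = (max(done) + 1) if done else 1          # batch indices are 1-based
    while r <= len(first_sequence):
        second_url = first_sequence[r - 1]      # S37: the r-th first URL
        for sub in second_url:                  # S32: crawl each sub URL
            fetch(sub)
            cache[sub] = True                   # remember finished sub URLs
        cache[r] = 1                            # S33: identification value = 1
        r += 1                                  # S34
    return r - 1                                # index of the last finished batch
```

Because the starting index is always recomputed from the cache, a restart after an interruption continues from the first unfinished batch without any manual reconfiguration.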
As can be seen from the above description, this method accurately determines the historical data to be crawled in each pass. Because the maximum value of r is read from the cache before each pass, the URL to be crawled next is determined automatically, which avoids the need, after an unexpected interruption, to manually inspect the breakpoint, make targeted adjustments, and reconfigure the historical data to be crawled.
Preferably, the cache is a Redis cache database; if the task is interrupted during execution, the data in the cache is not lost, which improves the stability of data crawling.
Further, sorting the plurality of first URLs to obtain the first sequence specifically comprises:
sorting all the first URLs according to the historical data tracing direction and the time of the historical data corresponding to each first URL.
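A minimal sketch of this sorting step, assuming each first URL is paired with the ISO date of its historical data (names hypothetical; the direction-to-order mapping follows the first embodiment below):

```python
def sort_first_urls(urls_with_dates, direction):
    """Build the first sequence from (first URL, ISO date) pairs:
    earliest-first when tracing forward (direction = +1),
    latest-first when tracing backward (direction = -1)."""
    ordered = sorted(urls_with_dates, key=lambda pair: pair[1],
                     reverse=(direction < 0))
    return [url for url, _day in ordered]
```

ISO date strings sort lexicographically in chronological order, so no date parsing is needed in this sketch.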
As can be seen from the above description, this sorts each first URL quickly and accurately.
Further, S1 specifically comprises:
acquiring the start time of the task that traces the historical data, as a first time;
acquiring the time origin of the historical data to be traced, as a second time;
acquiring the time direction in which the historical data is traced, as the historical data tracing direction;
acquiring the number of days of historical data traced in each pass, which is the first threshold.
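The four configuration items of S1 can be collected in one small structure; a hypothetical Python sketch (the field names are not from the patent):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class TraceConfig:
    first_time: date       # task start time (current or future)
    second_time: date      # time origin of the data to be traced
    direction: int         # +1 traces forward in time, -1 backward
    first_threshold: int   # days of history crawled per pass

# Example: trace forward from March 11, 2016, five days per pass.
cfg = TraceConfig(first_time=date(2021, 2, 1), second_time=date(2016, 3, 11),
                  direction=1, first_threshold=5)
```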
Further, acquiring, according to the historical data tracing direction and the first threshold, the first URLs corresponding to the multiple crawl passes specifically comprises:
acquiring the first URLs according to the second time, the tracing direction, and the first threshold;
wherein each first URL comprises a plurality of first sub-URLs, and the number of first sub-URLs equals the first threshold.
As can be seen from the above description, this configures the URL for each crawl pass accurately and without manual intervention, which improves the efficiency of tracing and crawling historical data. Moreover, each first URL comprises several first sub-URLs: for example, if each pass traces 5 days of historical data and each day corresponds to one sub-URL, each pass has 5 sub-URLs, which further improves the system's efficiency in tracing and crawling historical data.
Further, S32 specifically comprises:
obtaining a plurality of second sub-URLs from the second URL;
sequentially crawling the data on the web page corresponding to each second sub-URL according to the tracing direction and the time of the historical data corresponding to each second sub-URL.
S33 specifically comprises:
whenever the data on the web page corresponding to a second sub-URL has been acquired, storing that second sub-URL in the cache;
judging whether the data on the web pages corresponding to all the second sub-URLs has been crawled, and if so, setting the preset r-th identification value to the preset first value and storing it in the cache; the initial value of r is 1, and the initial value of each identification value is the preset second value.
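A minimal sketch of this refinement of S33, with a dict standing in for the cache and a hypothetical key scheme for the identification value:

```python
def mark_progress(cache, r, sub_url, all_sub_urls):
    """Record each finished second sub URL immediately; flip the r-th
    identification value from 0 (preset second value) to 1 (preset first
    value) only after every sub URL of the pass is in the cache."""
    cache[sub_url] = True
    if all(s in cache for s in all_sub_urls):
        cache[f"flag:{r}"] = 1   # the "flag:" key prefix is a hypothetical choice
```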
Further, before each crawl pass, it is judged whether the previous pass was interrupted;
if so, the first URL of the previous pass is acquired as a fourth URL;
a plurality of fourth sub-URLs is obtained from the fourth URL;
the fourth sub-URLs not yet stored in the cache are selected as one or more fifth sub-URLs;
a fifth URL is formed from the fifth sub-URLs, and the second URL is updated to the fifth URL;
then step S38 is executed.
As can be seen from the above description, because each sub-URL is stored in the cache as soon as the data on its web page has been obtained, an interruption in the middle of a pass does not force the already-crawled sub-URLs to be fetched again on the next run, avoiding the inefficiency of re-crawling.
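The recovery step, selecting the "fifth sub-URLs" that still need crawling, reduces to filtering the interrupted pass's sub-URLs against the cache; a hypothetical sketch:

```python
def resume_sub_urls(fourth_sub_urls, cache):
    """After an interruption, keep only the sub URLs of the interrupted pass
    that were never recorded in the cache (the "fifth sub URLs")."""
    return [s for s in fourth_sub_urls if s not in cache]
```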
Referring to Fig. 2, the invention provides a historical data tracing and crawling terminal without manual participation, comprising a memory 1, a processor 2, and a computer program stored in the memory 1 and executable on the processor 2, wherein the processor 2, when executing the computer program, implements the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled in each pass;
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs, each corresponding to one pass of historical data to be crawled, and sorting the first URLs to obtain a first sequence;
S3: sequentially crawling the data on the web page corresponding to each first URL in the first sequence at preset time intervals.
With this terminal, the first URLs corresponding to the multiple crawl passes can be obtained and sorted into a first sequence from nothing more than the historical data tracing direction and the first threshold, so only a single configuration is needed for the whole tracing and crawling task. The data on the web page corresponding to each first URL in the first sequence is then crawled in order at the preset times until all the historical data to be crawled has been obtained. No manual participation is required at any point, so the efficiency of tracing and crawling historical data is improved.
Further, in the above terminal, S3 specifically comprises:
S31: acquiring the first URL ranked first in the first sequence as the second URL whose data is to be crawled; presetting a variable r with an initial value of 1;
S32: crawling the data on the web page corresponding to the second URL;
S33: if the data on the web page corresponding to the second URL has been acquired, setting the preset r-th identification value to a preset first value and storing the r-th identification value and the second URL in a cache, the initial value of each identification value being a preset second value;
S34: letting r = r + 1;
S35: at a preset third time, acquiring the maximum value of r in the cache as a third value; the preset third time is the preset fourth time plus the preset time interval, and the preset fourth time is the time point at which the data on the web page corresponding to the second URL was crawled;
S36: adding one to the third value to obtain a fourth value;
S37: acquiring the first URL ranked at the position given by the fourth value in the first sequence as a third URL, and updating the second URL to the third URL;
S38: repeating steps S32 to S37 until an instruction to end crawling is received or all the historical data has been crawled.
As can be seen from the above description, the terminal accurately determines the historical data to be crawled in each pass; because the maximum value of r is read from the cache before each pass, the URL to be crawled next is determined automatically, which avoids the need, after an unexpected interruption, to manually inspect the breakpoint, make targeted adjustments, and reconfigure the tracing and crawling of historical data.
Preferably, the cache is a Redis cache database; if the task is interrupted during execution, the data in the cache is not lost, which improves the stability of data crawling.
Further, in the above terminal, sorting the plurality of first URLs to obtain the first sequence specifically comprises:
sorting all the first URLs according to the historical data tracing direction and the time of the historical data corresponding to each first URL.
As can be seen from the above description, the terminal sorts each first URL quickly and accurately.
Further, in the above terminal, S1 specifically comprises:
acquiring the start time of the task that traces the historical data, as a first time;
acquiring the time origin of the historical data to be traced, as a second time;
acquiring the time direction in which the historical data is traced, as the historical data tracing direction;
acquiring the number of days of historical data traced in each pass, which is the first threshold.
Further, acquiring, according to the historical data tracing direction and the first threshold, the first URLs corresponding to the multiple crawl passes specifically comprises:
acquiring the first URLs according to the second time, the tracing direction, and the first threshold;
wherein each first URL comprises a plurality of first sub-URLs, and the number of first sub-URLs equals the first threshold.
As can be seen from the above description, the terminal configures the URL for each crawl pass accurately and without manual intervention, which improves the efficiency of tracing and crawling historical data. Moreover, each first URL comprises several first sub-URLs: for example, if each pass traces 5 days of historical data and each day corresponds to one sub-URL, each pass has 5 sub-URLs, which further improves the system's efficiency in tracing and crawling historical data.
Further, in the above terminal, S32 specifically comprises:
obtaining a plurality of second sub-URLs from the second URL;
sequentially crawling the data on the web page corresponding to each second sub-URL according to the tracing direction and the time of the historical data corresponding to each second sub-URL.
S33 specifically comprises:
whenever the data on the web page corresponding to a second sub-URL has been acquired, storing that second sub-URL in the cache;
judging whether the data on the web pages corresponding to all the second sub-URLs has been crawled, and if so, setting the preset r-th identification value to the preset first value and storing it in the cache; the initial value of r is 1, and the initial value of each identification value is the preset second value.
Furthermore, before each crawl pass, the terminal judges whether the previous pass was interrupted;
if so, it acquires the first URL of the previous pass as a fourth URL;
obtains a plurality of fourth sub-URLs from the fourth URL;
selects, from all the fourth sub-URLs, those not yet stored in the cache as one or more fifth sub-URLs;
forms a fifth URL from the fifth sub-URLs and updates the second URL to the fifth URL;
and then executes step S38.
As can be seen from the above description, because each sub-URL is stored in the cache as soon as the data on its web page has been obtained, an interruption in the middle of a pass does not force the already-crawled sub-URLs to be fetched again on the next run, avoiding the inefficiency of re-crawling.
Referring to Fig. 1, the first embodiment of the present invention is as follows.
The invention provides a historical data tracing and crawling method without manual participation, comprising the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled in each pass.
S1 specifically comprises:
acquiring the start time of the task that traces the historical data, as a first time;
acquiring the time origin of the historical data to be traced, as a second time;
acquiring the time direction in which the historical data is traced, as the historical data tracing direction;
acquiring the number of days of historical data traced in each pass, which is the first threshold.
In a specific embodiment, the historical data tracing direction has two cases: positive or negative. If positive, the historical data is acquired forward in time from the second time; for example, if the second time is March 11, 2016, the historical data acquired runs from March 11, 2016 to the current date (or to a user-specified time). If negative, the historical data is acquired backward in time from the second time; for example, if the second time is March 11, 2016, the historical data acquired runs from a user-specified time earlier than March 11, 2016 up to March 11, 2016.
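The two tracing directions can be illustrated by generating the sequence of days visited from the second time (a hypothetical sketch; the patent specifies day granularity but gives no code):

```python
from datetime import date, timedelta

def traced_days(second_time, direction, n_days):
    """Days visited when tracing from second_time: +1 moves toward the
    present, -1 moves into the past."""
    return [second_time + timedelta(days=direction * i) for i in range(n_days)]

forward = traced_days(date(2016, 3, 11), +1, 3)   # Mar 11, 12, 13
backward = traced_days(date(2016, 3, 11), -1, 3)  # Mar 11, 10, 9
```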
In a specific embodiment, the first time is the time at which the task starts to execute, which may be the current time or a future time.
In a specific embodiment, the number of days of historical data traced in each pass is the first threshold; for example, if the user sets each pass to trace five days of historical data, the first threshold is 5.
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs, each corresponding to one pass of historical data to be crawled; sorting the first URLs to obtain a first sequence.
A URL is the address corresponding to the historical data.
S2 specifically comprises:
acquiring the first URLs according to the second time, the historical data tracing direction, and the first threshold;
wherein each first URL comprises a plurality of first sub-URLs, and the number of first sub-URLs equals the first threshold;
and sorting all the first URLs according to the tracing direction and the time of the historical data corresponding to each first URL, to obtain the first sequence.
During sorting, all the first URLs are ordered chronologically from earliest to latest when the tracing direction is positive, or from latest to earliest when it is negative.
S3: sequentially crawling the data on the web page corresponding to each first URL in the first sequence at preset time intervals.
S3 specifically comprises:
S31: acquiring the first URL ranked first in the first sequence as the second URL whose data is to be crawled; presetting a variable r with an initial value of 1;
S32: crawling the data on the web page corresponding to the second URL.
S32 specifically comprises:
obtaining a plurality of second sub-URLs from the second URL;
sequentially crawling the data on the web page corresponding to each second sub-URL according to the tracing direction and the time of the historical data corresponding to each second sub-URL.
S33: if the data on the web page corresponding to the second URL has been acquired, setting the preset r-th identification value to a preset first value and storing the r-th identification value and the second URL in a cache; the initial value of each identification value is a preset second value.
S33 specifically comprises:
whenever the data on the web page corresponding to a second sub-URL has been acquired, storing that second sub-URL in the cache;
judging whether the data on the web pages corresponding to all the second sub-URLs has been crawled, and if so, setting the preset r-th identification value to the preset first value and storing it in the cache; the initial value of r is 1, and the initial value of each identification value is the preset second value.
Preferably, the preset first value is 1 and the preset second value is 0; an identification value of 1 indicates that the data on the web pages corresponding to all the second sub-URLs has been crawled.
S34: letting r = r + 1.
S35: at a preset third time, acquiring the maximum value of r in the cache as a third value; the preset third time is the preset fourth time plus the preset time interval, and the preset fourth time is the time point at which the data on the web page corresponding to the second URL was crawled.
The preset third time is a time point; the preset time is a time interval, for example one day; and the preset fourth time is a time point.
S36: adding one to the third value to obtain a fourth value;
s37: according to the fourth value, acquiring a first URL corresponding to the fourth value sequenced in the first sequence to obtain a third URL, and updating the second URL to the third URL;
wherein, the S37 specifically is:
according to the fourth value, acquiring a first URL corresponding to the fourth value sequenced in the first sequence to obtain a third URL;
judging whether the historical data obtained last time has interruption or not;
if so, obtaining a fourth URL according to the third URL, wherein the third URL is the same as the fourth URL; obtaining a plurality of fourth sub-URLs according to the fourth URL; according to all the fourth sub-URLs, acquiring fourth sub-URLs which are not stored in a cache to obtain more than one fifth sub-URLs; obtaining a fifth URL according to more than one fifth sub-URL, and updating the second URL into the fifth URL; step S38 is executed;
if not, the second URL is updated to the third URL, and step S38 is executed.
S38: and repeatedly executing the steps S32-S37 until the crawling data ending instruction is received or all historical data are crawled.
Referring to fig. 2, the second embodiment of the present invention is:
the invention provides a historical data tracing and crawling terminal without manual participation, which comprises a memory 1, a processor 2 and a computer program which is stored in the memory 1 and can be operated on the processor 2, wherein the processor executes the computer program to realize the following steps:
s1: setting a historical data tracing direction and a first threshold corresponding to the amount of the historical data crawled each time;
Wherein, S1 specifically comprises:
acquiring the start time of the task that executes the historical data tracing, to obtain a first time;
acquiring the starting time point of the historical data to be traced, to obtain a second time;
acquiring the time direction in which the historical data is traced, to obtain the historical data tracing direction;
and acquiring the number of days of historical data traced in each run; this number of days is the first threshold.
In a specific embodiment, the historical data tracing direction has two cases, positive or negative. If positive, the historical data is acquired forward from the second time onward; for example, if the second time is March 11, 2016, the historical data acquired spans from March 11, 2016 to the current date (or a time specified by the user). If negative, the historical data is acquired backward from the second time; for example, if the second time is March 11, 2016, the historical data acquired spans from a user-specified time (earlier than March 11, 2016) to March 11, 2016.
In a specific embodiment, the first time is a time when the task starts to be executed, and the time may be a current time or a future time.
In a specific embodiment, the number of days of historical data traced in each run is obtained as the first threshold; for example, if the user sets the historical data to be traced back five days at a time, the first threshold is 5.
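The settings gathered in S1 correspond to the configuration items named in the third embodiment (task_begin_time, data_begin_time, trace_backward, time_num, plus the continuation threshold follow_threshold). A minimal Python sketch, with illustrative values only, might look like:

```python
from datetime import date

# Illustrative values only; the five names follow the third embodiment.
config = {
    "task_begin_time": date(2019, 1, 1),  # first time: reference start of the job
    "data_begin_time": date(2018, 1, 1),  # second time: starting point of the history
    "trace_backward": -1,                 # tracing direction: -1 backward, 1 forward
    "time_num": 5,                        # first threshold: days of history per run
    "follow_threshold": 3,                # splice threshold on breakpoint restart
}
```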
S2: according to the historical data tracing direction and a first threshold value, acquiring a plurality of first URLs (uniform resource locators) corresponding to the historical data to be crawled for multiple times; sequencing the plurality of first URLs to obtain a first sequence;
the URL is an address corresponding to the history data.
Wherein, S2 specifically comprises:
acquiring, according to the second time, the historical data tracing direction and the first threshold, the plurality of first URLs respectively corresponding to the historical data to be crawled in each run;
wherein each first URL comprises a plurality of first sub-URLs, and the number of first sub-URLs equals the first threshold;
and sequencing all the first URLs according to the historical data tracing direction and the time of the historical data corresponding to each first URL, to obtain the first sequence.
In the sorting process, all the first URLs are sorted in time order from far to near (when the historical data tracing direction is positive), or from near to far (when the historical data tracing direction is negative).
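As an illustration of S2, the batch generation and ordering described above can be sketched as follows; the URL template and site are hypothetical stand-ins for the target site's date-keyed history pages:

```python
from datetime import date, timedelta

def build_batches(data_begin, direction, time_num, n_batches,
                  url_tmpl="https://example.com/history?d={:%Y-%m-%d}"):
    """Generate the 'first URLs': one batch of time_num daily sub-URLs
    per run, already in first-sequence order (oldest-first batches when
    direction is positive, newest-first when negative).

    url_tmpl is a hypothetical stand-in for the target site's date field.
    """
    batches = []
    for k in range(n_batches):
        days = [data_begin + timedelta(days=direction * (k * time_num + i))
                for i in range(1, time_num + 1)]
        batches.append([url_tmpl.format(d) for d in days])
    return batches
```

With a negative direction the first batch covers the days immediately before data_begin, and each later batch reaches further into the past, matching the near-to-far ordering above.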
S3: sequentially crawling data on a webpage corresponding to each first URL in the first sequence at preset time intervals;
Wherein, S3 specifically comprises:
S31: acquiring the first URL ranked first in the first sequence as the second URL corresponding to the data to be crawled; presetting a variable r with an initial value of 1;
S32: crawling the data on the webpage corresponding to the second URL;
wherein, S32 specifically comprises:
obtaining a plurality of second sub-URLs according to the second URL;
sequentially crawling the data on the webpage corresponding to each second sub-URL, according to the historical data tracing direction and the time of the historical data corresponding to each second sub-URL;
S33: if the data on the webpage corresponding to the second URL has been completely acquired, setting a preset r-th identification value to a preset first value, and storing the r-th identification value and the second URL in a cache, wherein the initial value of each identification value is a preset second value;
Wherein, S33 specifically comprises:
when the data on the webpage corresponding to a second sub-URL has been acquired, storing that second sub-URL in the cache;
and judging whether the data on the webpages corresponding to all the second sub-URLs has been crawled; if so, setting a preset r-th identification value to a preset first value and storing the r-th identification value in the cache, wherein the initial value of r is 1 and the initial value of each identification value is a preset second value.
Preferably, the preset first value is 1 and the preset second value is 0; an identification value of 1 indicates that the data on the webpages corresponding to all the second sub-URLs has been crawled.
S34: let r = r + 1;
S35: acquiring the maximum r value in the cache at a preset third time to obtain a third value; the preset third time is the preset fourth time plus a preset time; the preset fourth time is the time point corresponding to the data on the webpage corresponding to the second URL;
Here the preset third time and the preset fourth time are time points, while the preset time is a time period, for example one day.
S36: adding one to the third value to obtain a fourth value;
S37: according to the fourth value, acquiring the first URL at the position given by the fourth value in the first sequence to obtain a third URL, and updating the second URL to the third URL;
Wherein, S37 specifically comprises:
according to the fourth value, acquiring the first URL at the position given by the fourth value in the first sequence to obtain a third URL;
judging whether the last acquisition of historical data was interrupted;
if so, obtaining a fourth URL according to the third URL, wherein the fourth URL is the same as the third URL; obtaining a plurality of fourth sub-URLs according to the fourth URL; from all the fourth sub-URLs, acquiring those not yet stored in the cache to obtain one or more fifth sub-URLs; obtaining a fifth URL according to the one or more fifth sub-URLs, and updating the second URL to the fifth URL; then executing step S38;
if not, updating the second URL to the third URL and executing step S38.
S38: repeatedly executing steps S32-S37 until an instruction to end crawling is received or all historical data has been crawled.
The third embodiment of the invention is as follows:
1. Five configuration items are created: a task execution reference time task_begin_time (the first time), which records the start time of the whole job, i.e. the execution time of the first segmented task; a historical data tracing start value data_begin_time (the second time), used as the time starting point for tracing historical data in the first segmented task, with subsequent segmented tasks recalculating their time starting points from it; a tracing direction trace_backward (the historical data tracing direction), which controls whether the traced dates run forward or backward; a per-run time unit amount time_num (the first threshold), which controls the data volume acquired by each segmented task; and a task continuation threshold follow_threshold, by which, when restarting after an interruption, it is decided whether the remaining part of the interrupted task takes an independent day to execute or can be spliced into the next task segment.
2. Fields are created in the cache (e.g. redis) to store the task execution state: a task completion flag r_finish_flag (the r-th identification value), used to judge whether the last task completed, where 0 means incomplete and 1 means complete; a task execution date r_task_time, which records the execution date of the most recent task; a completed-day count r_finish_num for the current task segment, which records how many days of historical data the segment has already acquired. For example, with time_num configured as 5 (meaning 5 days of historical data are to be acquired each day), r_finish_num = 3 during execution indicates that 3 days of historical data have been acquired today; only when r_finish_num equals time_num is r_finish_flag set to 1, indicating that today's task is fully complete. Finally, an execution date correction value r_offset_time is created, initialized to 0 and adjusted for breakpoint resumption when the task is abnormally interrupted.
In addition, a visited-URL set r_finished_urls is created; exploiting the uniqueness of redis's set type, every URL executed so far is registered in it, giving URL deduplication over the whole task cycle.
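A plain-Python stand-in for these redis fields (a dict and a built-in set in place of redis keys and the redis set type) might be initialized as:

```python
# Dict-and-set stand-in for the redis task-state fields; a real
# deployment would presumably use redis-py string and set keys.
task_state = {
    "r_finish_flag": 0,        # 1 once the day's whole batch is done
    "r_task_time": None,       # execution date of the most recent task
    "r_finish_num": 0,         # days of history completed in this batch
    "r_offset_time": 0,        # date-offset correction for breakpoint resume
    "r_finished_urls": set(),  # set uniqueness gives whole-cycle URL dedup
}
```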
3. The time-value characteristics of the URL request fields of the target site's historical data pages are analyzed, and the five configuration items are set accordingly. For example, assume the time field in the target site's URLs uses the date format YYYY-MM-DD, with the day as the basic time unit. If the task is planned to run from January 1, 2019, tracing the site's data back to before January 1, 2018, with 5 days of historical data retrieved each day, then task_begin_time is configured as 2019-01-01, data_begin_time as 2018-01-01 and time_num as 5. Since the data to be traced lies before the starting date, the dates are traced in the negative direction, so trace_backward is set to -1 (it would be 1 for forward tracing). If it is further required that, when an interruption occurs and the interrupted day has completed 3 or more days of historical data, the remaining 2 days of data may be retrieved together with the next task segment, then follow_threshold is set to 3.
4. When a task starts, it first checks r_finish_flag in redis. If the value is 1, the last task finished successfully and the daily mode is entered; if r_finish_flag is 0, the last task execution was abnormal and incomplete, and the breakpoint recovery mode must be entered.
5. The task first enters the URL generation phase; the daily mode and the breakpoint recovery mode differ mainly in how this phase generates the date list.
In the daily mode, the current time now is compared with the (last) task execution date r_task_time in redis. If now - r_task_time is more than 1 day, some intermediate days ran no task; to compensate for the effect of this blank period on locating the target data dates, r_offset_time in redis (initial value 0) is adjusted as
r_offset_time = r_offset_time + (now - r_task_time - 1);
For example, when a task first hits a 1-day blank period, r_offset_time becomes 1, indicating that dates must be corrected by a 1-day shift; when another 1-day blank period occurs, r_offset_time becomes 2, indicating a two-day shift. Then r_task_time in redis is immediately set to the current date, and the actual target data date offset value offset is calculated from now, task_begin_time and r_offset_time as
offset=(now-task_begin_time)+r_offset_time;
The date range of the historical data pages to be requested then runs from:
data_begin_time+trace_backward*(offset*time_num+1);
to:
data_begin_time+trace_backward*(offset*time_num+time_num);
After the date list is generated, r_finish_flag is set to 0.
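A sketch of the daily-mode date computation, combining the blank-period compensation with the offset and range formulas above, using the example configuration from step 3 (the helper name daily_range is ours):

```python
from datetime import date, timedelta

def daily_range(now, r_task_time, r_offset_time, cfg):
    """Daily-mode date computation following the formulas above.

    Returns the adjusted r_offset_time and the two endpoints of the
    requested date range. cfg holds the step-3 example configuration.
    """
    # Compensate for blank days on which no task ran at all.
    gap = (now - r_task_time).days - 1
    if gap > 0:
        r_offset_time += gap
    # offset = (now - task_begin_time) + r_offset_time
    offset = (now - cfg["task_begin_time"]).days + r_offset_time
    step = cfg["trace_backward"]
    first = cfg["data_begin_time"] + timedelta(days=step * (offset * cfg["time_num"] + 1))
    last = cfg["data_begin_time"] + timedelta(days=step * (offset * cfg["time_num"] + cfg["time_num"]))
    return r_offset_time, first, last

# Example configuration from step 3.
cfg = {"task_begin_time": date(2019, 1, 1), "data_begin_time": date(2018, 1, 1),
       "trace_backward": -1, "time_num": 5}
```

With now = 2019-01-03 and no blank period, offset is 2 and the requested range runs from 2017-12-21 back to 2017-12-17, i.e. the third 5-day slice before data_begin_time.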
In breakpoint recovery mode, it is first judged whether the breakpoint date is today (now). If so, execution continues in the daily mode, and URL deduplication in the next phase of the task filters out the URLs completed before the breakpoint, avoiding repeated work. If the breakpoint date is not today, r_offset_time in redis is first adjusted, the actual target data date offset value offset is calculated and r_task_time in redis is set to the current date, exactly as in the daily mode; then the task amount r_finish_num already completed by the last task (when r_finish_num is less than 0, the operation r_finish_num = 0 - r_finish_num is applied first, as explained in case 2 below) is compared with the threshold follow_threshold, and a different continuation strategy is adopted for each of the two cases:
case 1: r _ finish _ num < focus _ threshold, which indicates that the task completion degree is not high and more tasks are accumulated before the last task is interrupted, an independent working day is needed to execute the remaining tasks, and the joined task amount is the remaining task of the breakpoint day. Therefore, the subsequent operation is the same as the daily mode.
Case 2: r_finish_num >= follow_threshold. The last task's completion degree was high and little work remained before it was interrupted, so the next task is allowed to be taken on and executed together. Since the total task interval now spans 2 days, r_finish_num is changed to 0 - r_finish_num and stored as a negative number, expanding the decision interval of r_finish_num from 0..time_num to the actual interval. If another interruption occurs after this, then on the next startup, when r_finish_num is found to be negative, the completed task amount of the re-acquired breakpoint day can again be computed as r_finish_num = 0 - r_finish_num.
Then a task date list covering the 2 working days is generated, and the date range of the historical data pages to be requested runs from:
data_begin_time+trace_backward*(offset*time_num+1);
to:
data_begin_time+trace_backward*(offset*time_num+2*time_num);
Here, the breakpoint date includes an already-completed portion of dates, which is filtered out during the subsequent URL deduplication.
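The continuation decision and the negative-number storage of r_finish_num described in the two cases can be sketched as (the function name is ours):

```python
def splice_remaining(r_finish_num, follow_threshold):
    """Continuation decision on restart after an interruption.

    'independent': completion was low, the leftovers take their own day.
    'spliced': completion was high; the completed count is stored as a
    negative number so that progress over the two-day batch still ends
    at time_num (the 0..time_num decision interval is widened downward).
    """
    if r_finish_num < 0:
        # Interrupted again mid-splice: recover the true completed count.
        r_finish_num = 0 - r_finish_num
    if r_finish_num >= follow_threshold:
        return "spliced", 0 - r_finish_num
    return "independent", r_finish_num
```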
Finally, in both modes, the request URLs are spliced together from the acquired date parameters, in preparation for crawling data.
6. After the URLs are generated, they pass through redis for deduplication screening: each URL's validity is judged by whether it is already registered in the r_finished_urls set; invalid links are discarded, and valid URLs are placed in order into a queue to await requesting.
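The deduplication screen of step 6 amounts to set membership; a built-in set stands in here for redis's r_finished_urls (a real deployment would presumably use redis SADD/SISMEMBER):

```python
def dedup_urls(candidate_urls, finished_urls):
    """Keep only URLs not yet registered in r_finished_urls, preserving
    the original order; a built-in set stands in for the redis set."""
    return [u for u in candidate_urls if u not in finished_urls]

# Hypothetical example: "u2" was already completed before a breakpoint.
queue = dedup_urls(["u1", "u2", "u3"], {"u2"})
```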
7. Next is the data request and acquisition phase. Each time one URL task (i.e. one calendar day's volume of historical data) completes, the task state is registered and checked. First, the completed URL is registered in r_finished_urls and r_finish_num is increased by 1. Then, if r_finish_num equals time_num, the task is fully complete and the status flags are set (r_finish_flag to 1, r_finish_num to 0), ending the task. Otherwise, it is further judged whether r_finish_num is 0; if so, this was the last leftover task of the interrupted day in breakpoint mode, and the next batch of tasks is about to begin. In this special two-day-workload case, the offset value r_offset_time was determined with respect to the first day, so when entering the second batch the offset must be further corrected by r_offset_time = r_offset_time - 1. Operation then continues until r_finish_num equals time_num and the task completes.
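The per-URL bookkeeping of step 7 can be sketched as follows; the state dict stands in for the redis fields:

```python
def register_url_done(state, url, time_num):
    """Bookkeeping after one URL task (one day of history) completes.

    Mirrors step 7: register the URL, advance r_finish_num, set the
    completion flag at time_num, and correct r_offset_time when the
    leftover (negative-counted) breakpoint work has just been cleared.
    """
    state["r_finished_urls"].add(url)
    state["r_finish_num"] += 1
    if state["r_finish_num"] == time_num:
        state["r_finish_flag"] = 1   # today's batch fully done
        state["r_finish_num"] = 0
    elif state["r_finish_num"] == 0:
        # The breakpoint-day leftovers just finished; the offset was
        # anchored to the first day, so correct it for the second batch.
        state["r_offset_time"] -= 1
```

Starting from a spliced state (r_finish_num stored as a negative number), the count rises through 0, triggers the offset correction, and then proceeds to time_num as in a normal day.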
8. All acquired data is cleaned and organized in real time and then stored in a database (e.g. mysql).
9. Through timed configuration such as a task scheduler, the system is started at a fixed time every day, realizing automatic execution of the tasks.
10. If a particular segment of the historical data is required, this can be achieved flexibly by temporarily adjusting the above configuration.
For the description of the parameters in this example, please refer to tables 1 and 2:
Table 1: description of configuration parameters (reproduced as an image in the original publication)
Table 2: description of task state parameters (reproduced as an image in the original publication)
In summary, with the historical data tracing and crawling method and terminal without manual participation provided by the invention, during the tracing and crawling of historical data, the plurality of first URLs respectively corresponding to the historical data to be crawled in each run can be obtained solely from the historical data tracing direction and the first threshold, and sorted to obtain the first sequence. The method thus accurately acquires the historical data to be crawled each time; since each crawl first reads the maximum r value from the cache, the URL to be crawled next can be determined, removing the need, after an unexpected task interruption, to manually inspect the breakpoint, make targeted adjustments and reconfigure the historical data to be crawled. The URL corresponding to each run's historical data is therefore configured accurately, no manual intervention is needed during execution, and the efficiency of tracing and crawling historical data is improved. Moreover, each first URL comprises a plurality of first sub-URLs; for example, with 5 days of historical data traced per run and each day's data corresponding to one sub-URL, there are 5 sub-URLs per run, which further improves the system's efficiency in tracing and crawling historical data.
Furthermore, after the data on the webpage corresponding to each sub-URL is acquired, the sub-URL is stored in the cache. This avoids the inefficiency that would arise if an interruption occurred before the webpage data of all sub-URLs had been crawled and, on re-execution, the data of already-executed sub-URLs had to be acquired again.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to other related technical fields, are included in the scope of the present invention.

Claims (6)

1. A historical data tracing and crawling method without manual participation is characterized by comprising the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled each time;
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs (uniform resource locators) respectively corresponding to the historical data to be crawled in multiple runs; sequencing the plurality of first URLs to obtain a first sequence;
in the sorting process, if the historical data tracing direction is positive, sorting all the first URLs in time order from far to near to obtain the first sequence;
if the historical data tracing direction is negative, sorting all the first URLs in time order from near to far to obtain the first sequence;
S3: sequentially crawling the data on the webpage corresponding to each first URL in the first sequence at preset time intervals;
wherein S3 specifically comprises:
S31: acquiring the first URL ranked first in the first sequence as the second URL corresponding to the data to be crawled; presetting a variable r with an initial value of 1;
S32: crawling the data on the webpage corresponding to the second URL;
S33: if the data on the webpage corresponding to the second URL has been completely acquired, setting a preset r-th identification value to a preset first value, and storing the r-th identification value and the second URL in a cache, wherein the initial value of each identification value is a preset second value;
S34: let r = r + 1;
S35: acquiring the maximum r value in the cache at a preset third time to obtain a third value; the preset third time is the preset fourth time plus a preset time; the preset fourth time is the time point corresponding to the data on the webpage corresponding to the second URL;
S36: adding one to the third value to obtain a fourth value;
S37: according to the fourth value, acquiring the first URL at the position given by the fourth value in the first sequence to obtain a third URL, and updating the second URL to the third URL;
S38: repeatedly executing steps S32-S37 until an instruction to end crawling is received or all historical data has been crawled.
2. The historical data tracing and crawling method without manual participation according to claim 1, wherein S1 specifically comprises:
acquiring task starting time corresponding to execution tracing historical data to obtain first time;
acquiring a time starting point value of historical data to be traced to obtain second time;
obtaining the time direction of tracing the historical data to obtain the historical data tracing direction;
and obtaining the number of days for continuously tracing the historical data each time, namely the number of days is the first threshold value.
3. The historical data tracing and crawling method without manual participation according to claim 2, wherein the acquiring, according to the historical data tracing direction and the first threshold, of the plurality of first URLs respectively corresponding to the historical data to be crawled in multiple runs specifically comprises:
acquiring a plurality of first URLs corresponding to historical data to be crawled for multiple times according to second time, historical data tracing directions and a first threshold;
the first URL comprises a plurality of first sub-URLs, and the number of the first sub-URLs is equal to the first threshold.
4. The historical data tracing and crawling method without manual participation according to claim 3, wherein step S32 specifically comprises:
obtaining a plurality of second sub-URLs according to the second URL;
sequentially crawling data on the webpage corresponding to each second sub URL according to the historical data tracing direction and the time of the historical data corresponding to each second sub URL;
the S33 specifically includes:
when the data on the webpage corresponding to a second sub-URL has been acquired, storing that second sub-URL in a cache;
and judging whether the data on the webpages corresponding to all the second sub-URLs has been crawled; if so, setting a preset r-th identification value to a preset first value and storing the r-th identification value in the cache, wherein the initial value of r is 1 and the initial value of each identification value is a preset second value.
5. The historical data tracing and crawling method without manual participation according to claim 4, wherein, before each crawl of historical data, it is judged whether the historical data acquisition was interrupted;
if yes, acquiring a first URL corresponding to the last crawling history data to obtain a fourth URL;
obtaining a plurality of fourth sub-URLs according to the fourth URL;
according to all the fourth sub-URLs, acquiring fourth sub-URLs which are not stored in a cache to obtain more than one fifth sub-URLs;
obtaining a fifth URL according to more than one fifth sub-URL, and updating the second URL into the fifth URL;
step S38 is executed.
6. A historical data tracing and crawling terminal without manual participation, comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the computer program, realizes the following steps:
S1: setting a historical data tracing direction and a first threshold corresponding to the amount of historical data crawled each time;
S2: according to the historical data tracing direction and the first threshold, acquiring a plurality of first URLs (uniform resource locators) respectively corresponding to the historical data to be crawled in multiple runs; sequencing the plurality of first URLs to obtain a first sequence;
in the sorting process, if the historical data tracing direction is positive, sorting all the first URLs in time order from far to near to obtain the first sequence;
if the historical data tracing direction is negative, sorting all the first URLs in time order from near to far to obtain the first sequence;
S3: sequentially crawling the data on the webpage corresponding to each first URL in the first sequence at preset time intervals;
wherein S3 specifically comprises:
S31: acquiring the first URL ranked first in the first sequence as the second URL corresponding to the data to be crawled; presetting a variable r with an initial value of 1;
S32: crawling the data on the webpage corresponding to the second URL;
S33: if the data on the webpage corresponding to the second URL has been completely acquired, setting a preset r-th identification value to a preset first value, and storing the r-th identification value and the second URL in a cache, wherein the initial value of each identification value is a preset second value;
S34: let r = r + 1;
S35: acquiring the maximum r value in the cache at a preset third time to obtain a third value; the preset third time is the preset fourth time plus a preset time; the preset fourth time is the time point corresponding to the data on the webpage corresponding to the second URL;
S36: adding one to the third value to obtain a fourth value;
S37: according to the fourth value, acquiring the first URL at the position given by the fourth value in the first sequence to obtain a third URL, and updating the second URL to the third URL;
S38: repeatedly executing steps S32-S37 until an instruction to end crawling is received or all historical data has been crawled.
CN202110147690.3A 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal without manual participation Active CN112905866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110147690.3A CN112905866B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal without manual participation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910191973.0A CN109992705B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal
CN202110147690.3A CN112905866B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal without manual participation

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910191973.0A Division CN109992705B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal

Publications (2)

Publication Number Publication Date
CN112905866A true CN112905866A (en) 2021-06-04
CN112905866B CN112905866B (en) 2022-06-07

Family

ID=67130603

Family Applications (3)

Application Number Title Priority Date Filing Date
CN201910191973.0A Active CN109992705B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal
CN202110147690.3A Active CN112905866B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal without manual participation
CN202110147715.XA Active CN112905867B (en) 2019-03-14 2019-03-14 Efficient historical data tracing and crawling method and terminal

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201910191973.0A Active CN109992705B (en) 2019-03-14 2019-03-14 Historical data tracing and crawling method and terminal

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202110147715.XA Active CN112905867B (en) 2019-03-14 2019-03-14 Efficient historical data tracing and crawling method and terminal

Country Status (1)

Country Link
CN (3) CN109992705B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130332443A1 (en) * 2012-06-07 2013-12-12 Google Inc. Adapting content repositories for crawling and serving
CN103870465A (en) * 2012-12-07 2014-06-18 厦门雅迅网络股份有限公司 Non-invasion database crawler implementation method
CN106777043A (en) * 2016-12-09 2017-05-31 宁波大学 A kind of academic resources acquisition methods based on LDA
CN107247789A (en) * 2017-06-16 2017-10-13 成都布林特信息技术有限公司 user interest acquisition method based on internet
CN108415941A (en) * 2018-01-29 2018-08-17 湖北省楚天云有限公司 A kind of spiders method, apparatus and electronic equipment
CN108536691A (en) * 2017-03-01 2018-09-14 中兴通讯股份有限公司 Web page crawl method and apparatus
CN109033195A (en) * 2018-06-28 2018-12-18 上海盛付通电子支付服务有限公司 The acquisition methods of webpage information obtain equipment and computer-readable medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7085787B2 (en) * 2002-07-19 2006-08-01 International Business Machines Corporation Capturing data changes utilizing data-space tracking
US7769742B1 (en) * 2005-05-31 2010-08-03 Google Inc. Web crawler scheduler that utilizes sitemaps from websites
US9082126B2 (en) * 2009-09-25 2015-07-14 National Electronics Warranty, Llc Service plan web crawler
FR3004568A1 (en) * 2013-04-11 2014-10-17 Claude Rivoiron PROJECT MONITORING
CN104750694B (en) * 2013-12-26 2019-02-05 北京亿阳信通科技有限公司 A kind of mobile network information source tracing method and device
CN109284287B (en) * 2018-08-22 2024-02-02 平安科技(深圳)有限公司 Data backtracking and reporting method and device, computer equipment and storage medium
CN109377275A (en) * 2018-10-15 2019-02-22 中国平安人寿保险股份有限公司 Data tracing method, device, computer equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANDREI Z. BRODER et al.: "Efficient URL caching for world wide web crawling", Proceedings of the 12th International Conference on World Wide Web *
LI CHUNSHAN: "Massive URL management technology based on a fused B*-tree and B+-tree index", China Master's Theses Full-text Database (Information Science and Technology) *

Also Published As

Publication number Publication date
CN112905866B (en) 2022-06-07
CN112905867B (en) 2022-06-07
CN112905867A (en) 2021-06-04
CN109992705B (en) 2021-03-05
CN109992705A (en) 2019-07-09

Similar Documents

Publication Publication Date Title
US20200272559A1 (en) Enhancing efficiency in regression testing of software applications
US20040083117A1 (en) Method for fast searching and analyzing inter-relations between patents from a patent database
CN110275799B (en) Method for snapshot balance of daily point-cut without shutdown of accounting system
CN106682017B (en) Database updating method and device
EP3299968A1 (en) Big data calculation method and system
CN113760476A (en) Task dependency processing method and related device
CN109992705B (en) Historical data tracing and crawling method and terminal
CN114942933A (en) Method for automatically updating database and related device
CN117112400A (en) Automatic test case generation platform
CN111222972A (en) Account checking and clearing method and device
CN108804239B (en) Platform integration method and device, computer equipment and storage medium
CN110674214B (en) Big data synchronization method, device, computer equipment and storage medium
CN111143316A (en) Version management system and method for BIM forward design
CN112860492B (en) Automatic regression testing method and system suitable for core system
CN103020464A (en) Method for correcting vehicle machine accumulated working time
CN117633024B (en) Database optimization method based on preprocessing optimization join
CN101055599A (en) Mould design alteration processing system and method
CN110618939A (en) Method and device for automatic test case management
CN102708179A (en) Method and device for automatic retrieval of patent data
CN109739479A (en) A kind of front end structure method for implanting and device
CN115186939B (en) Method for predicting carbon emission of processing equipment in full life cycle
CN116521212A (en) Batch processing method, device, electronic equipment and storage medium
CN114564296A (en) Batch processing task scheduling method and device and electronic equipment
CN117520329A (en) Power grid multisource data integration method and system
CN115809087A (en) Version updating method, version updating device, version updating equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant