CN116132283A

CN116132283A - Distributed data acquisition method, system, equipment and storage medium

Info

Publication number: CN116132283A
Application number: CN202211626739.4A
Authority: CN
Inventors: 裴昌川; 禹麒; 刘虎
Original assignee: Qizhi Technology Co ltd
Current assignee: Qizhi Technology Co ltd
Priority date: 2022-12-17
Filing date: 2022-12-17
Publication date: 2023-05-16

Abstract

The application relates to the technical field of data acquisition and discloses a distributed data acquisition method, system, device and storage medium. The method comprises: acquiring historical update information of a target website, the historical update information comprising the time node of each data release of the target website; generating a plurality of periodic distribution diagrams based on the historical update information and a preset statistical period, and fitting them to generate a data release distribution diagram so as to update a corresponding data release distribution model; dividing the statistical period into a plurality of evaluation periods, calculating the data release probability of each evaluation period according to the data release distribution model, and marking each evaluation period whose data release probability is greater than a preset probability evaluation threshold as a high-frequency period; and determining the acquisition time for the target website based on the high-frequency periods to generate a data acquisition plan. The application has the effect of improving the efficiency of collecting data from the Internet.

Description

Distributed data acquisition method, system, equipment and storage medium

Technical Field

The present disclosure relates to the field of data acquisition, and in particular, to a distributed data acquisition method, system, device, and storage medium.

Background

In a distributed data acquisition platform, the update frequency of each site's data is not known in advance. To obtain incremental information from public sites quickly and in real time, such platforms therefore generally access a large number of sites periodically and cyclically. This approach consumes considerable network resources and server cost, and it also places load on the sites themselves, which can easily slow their response to normal visits.

Regarding the above related art, the inventors consider that existing distributed data acquisition methods occupy excessive resources.

Disclosure of Invention

In order to improve the efficiency of data collection from the Internet, the application provides a distributed data collection method, a system, equipment and a storage medium.

The first object of the present application is achieved by the following technical scheme:

a distributed data acquisition method comprising:

acquiring historical update information of a target website, wherein the historical update information comprises time nodes of each data release of the target website;

generating a plurality of periodic distribution diagrams based on historical updating information of a target website and a preset statistical period, and fitting to generate a data release distribution diagram so as to update a corresponding data release distribution model;

dividing the statistical period into a plurality of evaluation periods, calculating the data release probability of each evaluation period according to the data release distribution model, and marking each evaluation period whose data release probability is greater than a preset probability evaluation threshold as a high-frequency period;

and determining the acquisition time of the corresponding target website based on the high-frequency time period to generate a data acquisition plan.

By adopting this technical scheme, the historical release data of the target website and the time node of each release are obtained to generate the historical update information, from which the release pattern of the target website can be judged. A statistical period is set according to the statistical requirements; for each statistical period, the time nodes at which the target website released data are extracted from the historical update information, and a periodic distribution diagram is generated from their distribution. The periodic distribution diagrams are then fitted to generate a data release distribution diagram, and the data release distribution model is generated or updated from that diagram, which improves the accuracy of the model. Each statistical period is divided into a plurality of evaluation periods, and the probability that data is released in each evaluation period, defined as the data release probability, is calculated from the release time nodes recorded in the data release distribution model. The data release probability is compared with a preset probability evaluation threshold, and an evaluation period whose data release probability is greater than the threshold is marked as a high-frequency period, which facilitates the subsequent planning of acquisition time points for the target website. Finally, the acquisition time for the target website is determined based on the high-frequency periods to generate a data acquisition plan, which improves data acquisition efficiency and reduces the consumption of network resources and server cost.
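As a non-limiting illustration of the last step, the following Python sketch turns per-evaluation-period high-frequency flags into acquisition times. The scheme does not state how an acquisition time is derived from a high-frequency period; scheduling one acquisition at the end of each contiguous run of high-frequency periods is an assumption made here, and `build_plan` and its parameters are hypothetical names.

```python
from datetime import timedelta


def build_plan(high_freq_flags, evaluation_period=timedelta(hours=1)):
    """Return offsets from the start of the statistical period at which to crawl.

    Assumption (not taken from the source): one acquisition is scheduled at
    the end of every contiguous run of high-frequency evaluation periods.
    """
    plan = []
    for index, is_high in enumerate(high_freq_flags):
        has_next = index + 1 < len(high_freq_flags)
        next_is_high = has_next and high_freq_flags[index + 1]
        if is_high and not next_is_high:
            # The run of high-frequency periods ends after this slot.
            plan.append((index + 1) * evaluation_period)
    return plan
```

Under this assumption, a site with releases clustered in two separate windows would be visited twice per statistical period instead of once per polling interval.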

In a preferred example of the present application, after the step of determining the acquisition time of the corresponding target website based on the high-frequency period to generate the data acquisition plan, the method further comprises:

collecting data from the target website based on the data acquisition plan, generating unique identification information based on the uniform resource locator (URL) of each collected site, and persistently storing the unique identification information in an identification information base;

setting the identification information base as the database of a deduplication filter.

By adopting this technical scheme, data is collected from the target website according to the data acquisition plan, which improves data acquisition efficiency. After the data released by a site has been collected, unique identification information is generated from the URL of the collected site, which makes it easy to judge later whether the data of each site has already been collected. The unique identification information is added to the identification information base, which records the identification information of the collected sites, and the identification information base is set as the database of the deduplication filter, so that the deduplication filter performs deduplication on the collected data based on the identification information base. This reduces the network, computing and storage resources occupied by collecting data from the target website and further improves data acquisition efficiency.

searching the target website for data to be acquired based on the data acquisition plan, and acquiring attribute information of the data to be acquired, wherein the attribute information comprises the release time and the unique identification information of the data to be acquired;

inputting the unique identification information of the data to be acquired into the deduplication filter, and judging whether the data to be acquired is incremental data;

and if the data to be acquired is incremental data, downloading the data to be acquired into a data acquisition library, and adding the release time of the data to be acquired to the historical update information of the target website.

By adopting this technical scheme, the data to be acquired is searched for on the target website at the acquisition time given in the data acquisition plan, and its release time and unique identification information are obtained as attribute information, which makes it possible to judge whether the data has already been acquired. The unique identification information of the data to be acquired is input into the deduplication filter and matched against the identification information in the identification information base, thereby judging whether the data to be acquired is incremental data. If it is incremental data, meaning that it has not yet been acquired, it is downloaded and stored in the data acquisition library, and its release time node is added to the historical update information of the corresponding target website, so that the data release distribution model can be updated from the historical update information and its accuracy further improved.

In a preferred example of the present application, the step of inputting the unique identification information of the data to be acquired into the deduplication filter and judging whether the data to be acquired is incremental data further comprises:

if the data to be acquired is not incremental data, adding the release time of the data to be acquired to the repeated update information of the target website.

By adopting this technical scheme, if the data to be acquired is not incremental data, it has already been acquired and added to the data acquisition library. It is not downloaded again; instead, its release time is added to the repeated update information of the target website, so that repeated releases by the target website are recorded and the data acquisition plan of the target website can be adjusted accordingly.
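As a non-limiting illustration, the two branches described above (incremental data versus repeated data) can be sketched in Python as follows. The class `SiteState`, the function `process_candidate` and the `download` callback are hypothetical names introduced for this sketch, and a plain `set` stands in for the deduplication filter.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable


@dataclass
class SiteState:
    """Per-website bookkeeping; all field names are illustrative."""
    seen_ids: set = field(default_factory=set)             # stand-in for the deduplication filter
    history_updates: list = field(default_factory=list)    # release times of incremental data
    repeated_updates: list = field(default_factory=list)   # release times of repeated data
    collected: dict = field(default_factory=dict)          # stand-in for the data acquisition library


def process_candidate(state: SiteState, unique_id: str, release_time: datetime,
                      download: Callable[[str], str]) -> bool:
    """Return True if the item was incremental and has been downloaded."""
    if unique_id in state.seen_ids:
        # Already acquired: record the repeated release, do not download again.
        state.repeated_updates.append(release_time)
        return False
    state.collected[unique_id] = download(unique_id)
    state.seen_ids.add(unique_id)
    state.history_updates.append(release_time)
    return True
```

Only the incremental branch touches the network, which is where the saving in network and storage resources comes from.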

In a preferred example of the present application, the step of generating unique identification information based on the URL of the collected site and persistently storing the unique identification information in the identification information base comprises:

inputting the URL of the collected site into a Bloom filter, and converting the URL into unique identification information in the form of a bit vector through a segmentation mechanism;

storing the unique identification information in a Redis cache of the identification information base in a key-value data structure.

By adopting this technical scheme, the URL of the collected site is input into the Bloom filter (BloomFilter), whose segmentation mechanism converts it into a bit vector, a binary vector data structure, that serves as the unique identification information. Unique identification information produced by the Bloom filter has a 100% recall rate: each detection request returns either "possibly in the set" (which may be a false positive) or "definitely not in the set". Some accuracy is thus sacrificed to save storage space for the unique identification information and to increase detection speed. The unique identification information is stored in a Redis cache of the identification information base in a key-value data structure, so that whether other data to be acquired found later has already been acquired can be judged from the unique identification information stored in the identification information base.
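As a non-limiting illustration of the Bloom filter behaviour described above, the following Python sketch keeps the bit vector in memory. The source does not detail the segmentation mechanism; deriving the bit positions by double hashing over a SHA-256 digest is an assumption made here, as are the class name, the default sizes and the method names. Persisting the bit vector in Redis is not shown.

```python
import hashlib


class BloomFilter:
    """Minimal in-memory Bloom filter; sizes and hash scheme are illustrative."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # the bit vector

    def _positions(self, url: str):
        # Assumption: split one SHA-256 digest into two 64-bit values and
        # combine them (double hashing) to derive num_hashes bit positions.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url: str) -> bool:
        # False means "definitely not added"; True means "possibly added".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```

A URL that was added is always reported as possibly present, which corresponds to the 100% recall noted above; a URL that was never added may occasionally be reported as present, which is the accuracy traded for space.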

In a preferred example of the present application, the step of generating a plurality of periodic distribution diagrams based on the historical update information of the target website and a preset statistical period, and fitting them to generate a data release distribution diagram so as to update the corresponding data release distribution model, comprises:

generating a corresponding number of periodic distribution diagrams based on a preset period number requirement, and generating the time coordinate axis of the periodic distribution diagrams based on the preset statistical period;

marking the time node of each data release on the corresponding periodic distribution diagram based on the historical update information of the target website;

counting the data release quantity of each time period in each periodic distribution diagram, and calculating, across the periodic distribution diagrams, the weighted average of the data release quantity of each time period in the statistical period so as to generate the data release distribution diagram;

updating the data release distribution model based on the data release distribution diagram and each periodic distribution diagram.

By adopting this technical scheme, because the data release pattern of a given target website may change over time, the period number requirement can be set according to the actual statistical accuracy requirement, so that the number of statistical periods meeting this requirement is used to build the data release distribution model, which improves the accuracy of the model. A corresponding number of periodic distribution diagrams is generated based on the period number requirement, and their time coordinate axis is generated according to the statistical period, which makes it possible to judge the data release distribution pattern of the target website within one statistical period. The time node of each data release is marked on the corresponding periodic distribution diagram based on the historical update information of the target website, and the data release quantity of each time period in each periodic distribution diagram is counted. The weighted average of the data release quantity of each time period within one statistical period is then calculated over all periodic distribution diagrams to generate the data release distribution diagram, and the data release distribution model is generated or updated from the data release distribution diagram and each periodic distribution diagram. The data release pattern of the target website can subsequently be analyzed from the data release distribution model in order to generate the data acquisition plan.

In a preferred example of the present application, the step of dividing the statistical period into a plurality of evaluation periods, calculating the data release probability of each evaluation period according to the data release distribution model, and marking each evaluation period whose data release probability is greater than a preset probability evaluation threshold as a high-frequency period comprises:

dividing the statistical period into a plurality of evaluation periods, and counting the data release quantity and the data release probability in each evaluation period according to all the periodic distribution diagrams;

and comparing the data release probability with the preset probability evaluation threshold, and marking the evaluation period as a high-frequency period if its data release probability is greater than the probability evaluation threshold.

By adopting this technical scheme, the statistical period is divided into a plurality of evaluation periods, and the historical data release situation of the target website in each evaluation period, comprising the data release quantity and the data release probability of that evaluation period, is counted from all periodic distribution diagrams in the data release distribution model. The data release probability is compared with the preset probability evaluation threshold; if the data release probability of an evaluation period is greater than the threshold, the target website is considered likely to release data in that evaluation period, and the evaluation period is marked as a high-frequency period, which facilitates the subsequent planning of data acquisition time points for the target website.

The second object of the present application is achieved by the following technical scheme:

a distributed data acquisition system comprising:

the historical update information acquisition module is used for acquiring historical update information of the target website, and the historical update information comprises time nodes of each data release of the target website;

the data release distribution model updating module is used for generating a plurality of periodic distribution diagrams based on historical updating information of the target website and a preset statistical period, and fitting the periodic distribution diagrams to generate the data release distribution diagrams so as to update the corresponding data release distribution model;

the high-frequency time period marking module is used for dividing each statistical period into a plurality of evaluation time periods, calculating the data release probability of each evaluation time period according to the data release distribution model, and marking the evaluation time period with the data release probability larger than a preset probability evaluation threshold value as a high-frequency time period;

and the data acquisition plan generation module is used for determining the acquisition time of the corresponding target website based on the high-frequency time period so as to generate a data acquisition plan.

The third object of the present application is achieved by the following technical scheme:

a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above described distributed data acquisition method when the computer program is executed.

The fourth object of the present application is achieved by the following technical scheme:

a computer readable storage medium storing a computer program which when executed by a processor implements the steps of the distributed data acquisition method described above.

In summary, the present application includes at least one of the following beneficial technical effects:

1. The historical release data of the target website and the time node of each release are obtained to generate the historical update information, from which the release pattern of the target website can be judged. A statistical period is set according to the statistical requirements; for each statistical period, the time nodes at which the target website released data are extracted from the historical update information, and a periodic distribution diagram is generated from their distribution. The periodic distribution diagrams are then fitted to generate a data release distribution diagram, and the data release distribution model is generated or updated from that diagram, which improves the accuracy of the model. Each statistical period is divided into a plurality of evaluation periods, and the probability that data is released in each evaluation period, defined as the data release probability, is calculated from the release time nodes recorded in the data release distribution model. The data release probability is compared with a preset probability evaluation threshold, and an evaluation period whose data release probability is greater than the threshold is marked as a high-frequency period, which facilitates the subsequent planning of acquisition time points for the target website. The acquisition time for the target website is determined based on the high-frequency periods to generate a data acquisition plan, which improves data acquisition efficiency and reduces the consumption of network resources and server cost.

2. Data is collected from the target website according to the data acquisition plan, which improves data acquisition efficiency. After the data released by a site has been collected, unique identification information is generated from the URL of the collected site, which makes it easy to judge later whether the data of each site has already been collected. The unique identification information is added to the identification information base, which records the identification information of the collected sites, and the identification information base is set as the database of the deduplication filter, so that the deduplication filter performs deduplication on the collected data based on the identification information base. This reduces the network, computing and storage resources occupied by collecting data from the target website and further improves data acquisition efficiency.

3. The data to be acquired is searched for on the target website at the acquisition time given in the data acquisition plan, and its release time and unique identification information are obtained as attribute information, which makes it possible to judge whether the data has already been acquired. The unique identification information of the data to be acquired is input into the deduplication filter and matched against the identification information in the identification information base, thereby judging whether the data to be acquired is incremental data. If it is incremental data, meaning that it has not yet been acquired, it is downloaded and stored in the data acquisition library, and its release time node is added to the historical update information of the corresponding target website, so that the data release distribution model can be updated from the historical update information and its accuracy further improved.

Drawings

Fig. 1 is a flowchart of a distributed data acquisition method according to an embodiment of the present application.

Fig. 2 is a flowchart of step S20 in the distributed data acquisition method of the present application.

Fig. 3 is a flowchart of step S30 in the distributed data acquisition method of the present application.

Fig. 4 is another flow chart of the distributed data acquisition method of the present application.

Fig. 5 is a flowchart of step S50 in the distributed data acquisition method of the present application.

Fig. 6 is another flow chart of a distributed data acquisition method of the present application.

Fig. 7 is a schematic block diagram of a distributed data acquisition system according to a second embodiment of the present application.

Fig. 8 is a schematic view of an apparatus in a third embodiment of the present application.

Detailed Description

The present application is described in further detail below in conjunction with figures 1 to 8.

Example 1

The application discloses a distributed data acquisition method that can be used to acquire data from specific websites on the Internet, and in particular can be used in a data acquisition program that crawls data from the Internet to build a database; in this embodiment, the target website refers to a website from which data is collected by the distributed data acquisition method of this application.

As shown in fig. 1, the method specifically comprises the following steps:

S10: Acquiring historical update information of the target website, wherein the historical update information comprises the time node of each data release of the target website.

In this embodiment, the historical update information is information recording the time node and the data identifier of each past data release of a target website, and each target website has its own historical update information; a data release is an event in which the target website publishes new data to be acquired on the website.

Specifically, the historical update information of the target website is generated from the record of data previously crawled from the target website. It comprises the time node of each data release of the target website and the identifier of the released data, so that the data release pattern and characteristics of the target website can be judged from its historical update information.

Further, to save storage space, a retention time can be set for the historical update information. For example, only the historical update information of the target website from the past half year is retained, and historical update information older than half a year is deleted to reduce the occupation of storage space.
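As a non-limiting illustration of the retention rule, the following Python sketch drops release time nodes older than a retention window. The function name `prune_history` is hypothetical, and approximating half a year as 183 days is an assumption.

```python
from datetime import datetime, timedelta


def prune_history(release_times, now, retention=timedelta(days=183)):
    """Keep only release times that fall within the retention window.

    Assumption: "half a year" is approximated as 183 days.
    """
    cutoff = now - retention
    return [t for t in release_times if t >= cutoff]
```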

S20: generating a plurality of periodic distribution diagrams based on historical updating information of the target website and a preset statistical period, and fitting to generate a data release distribution diagram so as to update a corresponding data release distribution model.

In this embodiment, the statistical period is the period over which the data release pattern of the target website is statistically analyzed, and it can be set to one day, one week or one month according to actual needs; a periodic distribution diagram is a diagram showing the distribution of the data release time nodes of the target website within one statistical period; the data release distribution diagram is a diagram generated by weighted-average statistics over the data in a plurality of periodic distribution diagrams, so that the time distribution of the target website's data releases can be evaluated comprehensively across several statistical periods.

Specifically, a suitable statistical period is matched automatically according to the type, data update volume and historical update information of the target website. For example, if the target website is usually updated every day, the statistical period can be set to one day; if it is usually updated every week, the statistical period can be set to one week. The statistical period should be chosen so that it reflects the data release distribution pattern. A plurality of periodic distribution diagrams is generated according to the preset statistical period, the time node of each data release is obtained from the historical update information of the target website and filled into the periodic distribution diagrams, and the periodic distribution diagrams are fitted to generate the data release distribution diagram. Because the data release distribution diagram is generated by statistically analyzing the release patterns of several statistical periods, the accidental error present in the data of any single statistical period is reduced, and the accuracy of the statistical analysis of the target website's data release pattern is improved.

Specifically, the corresponding data release distribution model is generated or updated based on the data release distribution diagram. When the data release distribution diagram is obtained for the first time, the data release distribution model is generated from the data release distribution diagram and the plurality of periodic distribution diagrams. During subsequent data acquisition, the historical update information of the target website is updated continuously, and the data release distribution model is updated from the data release distribution diagram and the periodic distribution diagrams formed from the new historical update information, which improves the accuracy of the data release distribution model.

Referring to Fig. 2, step S20 comprises:

S21: Generating a corresponding number of periodic distribution diagrams based on a preset period number requirement, and generating the time coordinate axis of the periodic distribution diagrams based on the preset statistical period.

In this embodiment, the period number requirement is the required number of periodic distribution diagrams to be generated.

Specifically, the value of the period number requirement is set according to the number of periodic distribution diagrams needed to generate the data release distribution diagram, and it can be adjusted according to the accuracy required of the actual statistical analysis. Periodic distribution diagrams matching the period number requirement are generated, and their time coordinate axis is generated based on the statistical period. For example, when the statistical period is one week, the time coordinate axis of a periodic distribution diagram may run from 00:00 on Monday to 24:00 on Sunday. The minimum scale of the time coordinate axis can be set to one hour, one minute or another duration according to actual requirements; in this embodiment, the minimum scale of the time coordinate axis is one hour.
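As a non-limiting illustration of such a time coordinate axis, the following Python sketch maps a release time node to its slot on a one-week axis with a one-hour minimum scale, giving 168 slots per statistical period. The function names `period_start` and `bin_index` are hypothetical, and starting the week on Monday follows the example above.

```python
from datetime import datetime, timedelta


def period_start(ts: datetime) -> datetime:
    """Monday 00:00 of the week containing ts (one-week statistical period)."""
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return day - timedelta(days=ts.weekday())


def bin_index(ts: datetime, scale: timedelta = timedelta(hours=1)) -> int:
    """Index of the minimum-scale slot on the time axis that contains ts."""
    return (ts - period_start(ts)) // scale
```

For instance, a release at 10:30 on a Wednesday falls into slot 2 × 24 + 10 = 58.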

S22: Marking the time node of each data release on the corresponding periodic distribution diagram based on the historical update information of the target website.

Specifically, the time node of each data release of the target website is determined from the historical update information of the target website, and each time node is marked on the periodic distribution diagram of the statistical period in which it falls.
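As a non-limiting illustration of step S22, the following Python sketch assigns each release time node to the statistical period in which it falls and counts it in the matching slot. Representing a periodic distribution diagram as a list of per-slot counts, numbering periods from a fixed `origin`, and the function name `mark_time_nodes` are assumptions of this sketch.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mark_time_nodes(release_times, origin,
                    period=timedelta(days=7), scale=timedelta(hours=1)):
    """Return {period_number: [count per slot]} for release times at or after origin."""
    slots = period // scale
    diagrams = defaultdict(lambda: [0] * slots)
    for ts in release_times:
        offset = ts - origin
        if offset < timedelta(0):
            continue  # earlier than the first modelled period
        # offset // period selects the diagram, the remainder selects the slot.
        diagrams[offset // period][(offset % period) // scale] += 1
    return dict(diagrams)
```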

S23: Counting the data release quantity of each time period in each periodic distribution diagram, and calculating, across the periodic distribution diagrams, the weighted average of the data release quantity of each time period in the statistical period so as to generate the data release distribution diagram.

Specifically, a plurality of time periods is set based on the minimum scale of each periodic distribution diagram, and the data release quantity of each time period in the periodic distribution diagram is counted. The data release quantity is the number of items of data to be acquired that the target website released, where one article or one link usually counts as one item. The time coordinate axis of the data release distribution diagram is generated based on the statistical period and is identical to that of the periodic distribution diagrams. The weighted average of the data release quantities of the same time period across the periodic distribution diagrams is calculated, and the results are marked on the data release distribution diagram to complete it.
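As a non-limiting illustration of the weighted average in step S23, the following Python sketch combines per-period slot counts into one averaged diagram. The source does not state how the weights are chosen; passing them in as a list (for example, larger weights for more recent periods) and defaulting to equal weights are assumptions, as is the function name.

```python
def weighted_average_diagram(period_diagrams, weights=None):
    """Combine per-period slot counts into one averaged diagram.

    period_diagrams: list of equal-length lists of counts, one per period.
    weights: one weight per period; equal weights if omitted (assumption).
    """
    if not period_diagrams:
        raise ValueError("at least one periodic distribution diagram is required")
    if weights is None:
        weights = [1.0] * len(period_diagrams)
    total = sum(weights)
    slots = len(period_diagrams[0])
    return [
        sum(w * diagram[slot] for w, diagram in zip(weights, period_diagrams)) / total
        for slot in range(slots)
    ]
```

With two periods whose counts are `[0, 2]` and `[4, 2]` and weights `[1, 3]`, the first slot averages to (0·1 + 4·3) / 4 = 3 and the second to (2·1 + 2·3) / 4 = 2.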

Further, each time a new periodic distribution map is generated, the oldest of the current periodic distribution maps is removed, and the data distribution map is updated according to the remaining periodic distribution maps together with the new one, so that the accuracy of the data distribution map is improved.
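One way to realize this rolling update is a fixed-size window that discards the oldest map whenever a new one arrives. The sketch below uses `collections.deque` with `maxlen` for that purpose; the class name `RollingPeriodMaps` and the use of an unweighted mean are simplifying assumptions of this example.

```python
from collections import deque


class RollingPeriodMaps:
    """Keep only the most recent periodic distribution maps (hypothetical helper)."""

    def __init__(self, max_periods):
        # A deque with maxlen drops the oldest entry automatically on append.
        self._maps = deque(maxlen=max_periods)

    def add(self, period_map):
        self._maps.append(list(period_map))

    def __len__(self):
        return len(self._maps)

    def average_map(self):
        """Unweighted mean per time slot over the maps currently in the window."""
        if not self._maps:
            return []
        count = len(self._maps)
        return [sum(slot) / count for slot in zip(*self._maps)]


window = RollingPeriodMaps(max_periods=2)
for weekly_map in ([1, 0], [3, 0], [5, 2]):
    window.add(weekly_map)
```

After the third `add`, the map `[1, 0]` has been evicted and the average is computed over the two newest maps only.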

S24: the data distribution model is updated based on the data distribution profile and each period profile.

Specifically, a data distribution model is generated or updated based on the data distribution map and each periodic distribution map, and in this embodiment, the data distribution model stores raw data such as the data distribution map, each periodic distribution map, history update information, and the like, so that more types of data statistical analysis can be performed by using the data stored in the data distribution model later.

S30: dividing the statistical period into a plurality of evaluation periods, calculating the data release probability of each evaluation period according to the data release distribution model, and marking the evaluation period with the data release probability larger than a preset probability evaluation threshold value as a high-frequency period.

In the present embodiment, the evaluation period refers to a period set for statistically analyzing the distribution rule of data distribution of the target website; the data release probability refers to the probability of releasing data in an evaluation period according to each periodic distribution diagram; the probability evaluation threshold is a threshold used for comparing with the data release probability to judge whether the probability of releasing the data in each evaluation period of the target website is high or low, and the probability evaluation threshold is set to be 50% by default, and can be adjusted according to actual demands.

Specifically, the statistical period is divided into a plurality of evaluation periods, and, according to the data stored in the data release distribution model, the probability that a data release behavior occurs in each evaluation period is calculated and taken as the data release probability; the data release probability is compared with the corresponding preset probability evaluation threshold, and each evaluation period whose data release probability is larger than the probability evaluation threshold is marked as a high-frequency period, so that the data acquisition time for the target website can be conveniently determined according to the time corresponding to the high-frequency period.

Further, because the data release frequency of some target websites is low, the probability evaluation threshold of a target website can be determined according to its historical data release quantity: a probability evaluation threshold generation algorithm is built on the historical data release quantity such that the probability evaluation threshold is positively correlated with the historical data release quantity, which makes the choice of the probability evaluation threshold more reasonable.
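The application only requires that the threshold be positively correlated with the historical data release quantity and does not give a formula. The sketch below is one possible choice, a clamped linear mapping; the function name `probability_threshold` and the parameters `low`, `high` and `saturation_count` are hypothetical values chosen for illustration.

```python
def probability_threshold(history_count, low=0.2, high=0.5, saturation_count=100):
    """Map a site's historical release count to a threshold between low and high.

    Sites with few historical releases get a lower threshold, so that their
    rare release slots can still be marked as high-frequency periods.
    """
    if saturation_count <= 0:
        raise ValueError("saturation_count must be positive")
    ratio = min(max(history_count, 0) / saturation_count, 1.0)
    return low + (high - low) * ratio


sparse_site = probability_threshold(10)    # lower threshold for a quiet site
busy_site = probability_threshold(500)     # capped at the default upper bound
```

Any other monotonically increasing mapping would satisfy the same requirement; the linear form is used here only because it is easy to inspect.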

Referring to fig. 3, in step S30, the method includes:

S31: dividing the statistical period into a plurality of evaluation periods, and counting the data release quantity and the data release probability in each evaluation period according to all the period distribution diagrams.

In this embodiment, the number of data publications in a certain evaluation period refers to an average or weighted average of the number of data publications in the evaluation period for each period distribution diagram; the data distribution probability in a certain evaluation period refers to the probability that each periodic distribution chart has data distribution behaviors in the evaluation period.

Specifically, the statistical period is divided into a plurality of evaluation periods; in this embodiment, the evaluation period may be set to one hour, consistent with the time period corresponding to the minimum scale in the data distribution map. The data release quantity and the data release probability in each evaluation period are calculated according to the data release quantity of each time period in all the periodic distribution maps, so that the data release quantity and probability of each target website in each period are obtained and the data release rule of the target website can be analyzed.
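Following the definitions given for step S31, the release quantity of an evaluation period can be taken as the mean count over the periodic distribution maps, and the release probability as the share of maps in which at least one release fell into that evaluation period. The sketch below assumes the evaluation periods coincide with the map's time slots; the function name `evaluate_periods` is hypothetical.

```python
def evaluate_periods(period_maps):
    """Return (average release quantity, release probability) for each evaluation period."""
    if not period_maps:
        raise ValueError("at least one periodic distribution map is required")
    num_periods = len(period_maps)
    stats = []
    for counts in zip(*period_maps):  # counts of the same slot across all maps
        average = sum(counts) / num_periods
        probability = sum(1 for c in counts if c > 0) / num_periods
        stats.append((average, probability))
    return stats


# Four weekly maps, two evaluation periods each.
stats = evaluate_periods([[2, 0], [1, 0], [0, 0], [3, 1]])
```

For this input the first evaluation period has an average of 1.5 releases with probability 0.75, and the second has an average of 0.25 with probability 0.25.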

S32: comparing the data release probability with a preset probability evaluation threshold, and marking the evaluation period as a high-frequency period if the data release probability of the evaluation period is greater than the probability evaluation threshold.

Specifically, the data release probability is compared with the preset probability evaluation threshold; if the data release probability of a certain evaluation period is greater than the probability evaluation threshold, the possibility that the target website releases data in the evaluation period is considered to be high, so the evaluation period is marked as a high-frequency period, and the data acquisition time of the target website is conveniently determined according to the time corresponding to the high-frequency period.
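The comparison in step S32 is a strict "greater than" test against the threshold. A minimal sketch, in which the function name `mark_high_frequency` and the representation of evaluation periods by their index are assumptions of this example:

```python
def mark_high_frequency(probabilities, threshold=0.5):
    """Return the indices of evaluation periods whose probability exceeds the threshold."""
    return [i for i, p in enumerate(probabilities) if p > threshold]


high_frequency = mark_high_frequency([0.5, 0.75, 0.1, 1.0])
```

With the default 50% threshold, an evaluation period whose probability is exactly 0.5 is not marked, so the result here is `[1, 3]`.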

S40: determining the acquisition time of the corresponding target website based on the high-frequency period to generate a data acquisition plan.

In this embodiment, the data collection plan refers to a plan for recording data collection time nodes for a target website.

Specifically, after the high-frequency period of the target website is acquired, the data acquisition time node for the target website is set at the ending time of the high-frequency period of the target website, which increases the possibility of acquiring new data in each data acquisition and thus improves the data acquisition efficiency.
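Placing an acquisition time node at the end of each high-frequency period can be sketched as below, assuming high-frequency periods are identified by their index on an hourly axis. The function name `acquisition_times` is hypothetical.

```python
from datetime import datetime, timedelta


def acquisition_times(period_start, high_frequency_indices, scale=timedelta(hours=1)):
    """Return one acquisition time per high-frequency period, at that period's end."""
    return [period_start + (i + 1) * scale for i in sorted(set(high_frequency_indices))]


# Slots 9 and 33 are Monday 09:00-10:00 and Tuesday 09:00-10:00.
plan = acquisition_times(datetime(2022, 12, 12), [33, 9])
```

The resulting plan schedules acquisition at 10:00 on Monday and 10:00 on Tuesday, that is, immediately after each window in which the site usually publishes.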

Further, since the data release frequency of a part of target websites is low and timeliness requirements exist for data acquisition of such target websites, in the data acquisition plan of such target websites, the interval of acquisition time can be set to be the maximum time value of the timeliness requirements, so as to improve the timeliness of the data acquired from such target websites.
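One possible reading of this timeliness rule is that no two consecutive acquisitions within a statistical period may be further apart than the maximum delay the timeliness requirement allows. The sketch below pads an existing plan with extra acquisition times until that holds; the function name `enforce_max_interval` and the padding strategy are assumptions of this example, not details given in the application.

```python
from datetime import datetime, timedelta


def enforce_max_interval(times, period_start, period_end, max_interval):
    """Insert extra acquisition times so that no gap exceeds max_interval."""
    if max_interval <= timedelta(0):
        raise ValueError("max_interval must be positive")
    plan = []
    previous = period_start
    for t in sorted(set(times)):
        while t - previous > max_interval:
            previous += max_interval
            plan.append(previous)  # filler acquisition to bound the gap
        plan.append(t)
        previous = t
    while period_end - previous > max_interval:
        previous += max_interval
        plan.append(previous)
    return plan


day_start = datetime(2022, 12, 12)
plan = enforce_max_interval(
    [datetime(2022, 12, 12, 10)], day_start, day_start + timedelta(days=1),
    max_interval=timedelta(hours=6),
)
```

For a site with no high-frequency period at all, passing an empty list yields acquisitions spaced exactly `max_interval` apart.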

Wherein, referring to fig. 4, after step S40, the distributed data acquisition method further includes:

S50: acquiring data from the target website based on the data acquisition plan, generating unique identification information based on the uniform resource locator (URL) of the acquired site, and persistently storing the unique identification information in an identification information base.

In this embodiment, the unique identification information refers to information for identifying a data collection site, and is used to identify whether data has been collected at the site; the identification information base refers to a database for storing unique identification information of collected data sites.

Specifically, data are acquired from the target website based on the data acquisition plan. Before data acquisition is carried out on a site, the uniform resource locator (URL) of the site is read and the corresponding unique identification information is determined from the URL; when data acquisition is carried out, the URL of the site is obtained, the corresponding unique identification information is generated from the URL, and the unique identification information is persistently stored in the identification information base; before a resource is acquired subsequently, the unique identification information of the site to be acquired is compared with the unique identification information stored in the identification information base, so as to judge whether the data of the site to be acquired has already been acquired.

Referring to fig. 5, in step S50, the method further includes:

S51: inputting the uniform resource locator (URL) of the acquired site into a bloom filter, and converting the URL into unique identification information in the form of a bit vector through a segmentation mechanism.

Specifically, the URL of the acquired site is input into the bloom filter and converted into unique identification information in the form of a bit vector through the segmentation mechanism of the bloom filter. The unique identification information converted by the bloom filter has a 100% recall rate: each detection request returns one of two results, "in the set (possibly wrong)" or "not in the set (definitely not in the set)", so that storage space for the unique identification information is saved and detection speed is improved at the cost of some accuracy.
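The behaviour described for the bloom filter (no false negatives, possible false positives) can be illustrated with a small in-memory implementation. The application does not detail its segmentation mechanism, so this sketch substitutes salted SHA-256 hashes to pick bit positions; the class name `BloomFilter` and the default sizes are assumptions of this example.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter over URL strings, backed by a byte array."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Derive num_hashes bit positions by salting the URL with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))


seen = BloomFilter()
seen.add("https://example.com/news/1")
already_acquired = "https://example.com/news/1" in seen
```

Every URL that was added is always reported as present, which corresponds to the 100% recall rate; the false-positive rate depends on `num_bits`, `num_hashes` and the number of stored URLs.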

S52: the unique identification information is stored in a Redis cache of the identification information base in a key-value form data structure.

Specifically, the unique identification information is stored in a Redis cache of the identification information base in a key-value form data structure, so that whether other searched data to be acquired are acquired or not can be conveniently judged based on the unique identification information of the acquired data to be acquired stored in the identification information base.
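A sketch of how the bit vector could be kept under a single Redis key is shown below. It relies only on the `setbit(name, offset, value)` and `getbit(name, offset)` operations, which the redis-py client provides; to keep the example runnable without a server, an in-memory stand-in with the same two methods is included. The key name `crawler:seen_urls`, the class names and the hash scheme are assumptions of this example.

```python
import hashlib


class InMemoryBitStore:
    """Stand-in exposing the setbit/getbit subset of a Redis client, for local runs."""

    def __init__(self):
        self._bits = {}

    def setbit(self, name, offset, value):
        previous = self._bits.get((name, offset), 0)
        self._bits[(name, offset)] = value
        return previous

    def getbit(self, name, offset):
        return self._bits.get((name, offset), 0)


class PersistentUrlFilter:
    """Bloom-filter bits stored as the value of one key in a key-value store."""

    def __init__(self, client, key="crawler:seen_urls", num_bits=1 << 20, num_hashes=4):
        self.client = client
        self.key = key
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _offsets(self, url):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        for offset in self._offsets(url):
            self.client.setbit(self.key, offset, 1)

    def seen(self, url):
        return all(self.client.getbit(self.key, offset) for offset in self._offsets(url))


store = InMemoryBitStore()
PersistentUrlFilter(store).add("https://example.com/news/1")
# A second filter instance over the same store sees the same state.
known = PersistentUrlFilter(store).seen("https://example.com/news/1")
```

In a deployment, `store` would be replaced by a real client object such as `redis.Redis(...)`, so that several acquisition workers share one identification information base.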

S60: the identification information base is set as a database of deduplication filters.

In this embodiment, the deduplication filter is a judging program for judging the data to be collected to determine whether the data to be collected has been collected.

Specifically, the identification information base is set as a database of the duplicate removal filter, so that unique identification information of a site where the data to be collected is located is conveniently input into the duplicate removal filter to be judged before the data to be collected is collected, whether the data is collected or not is determined, and the possibility of collecting the duplicate data is conveniently reduced.

Referring to fig. 6, after step S40, the distributed data acquisition method further includes:

S71: searching data to be acquired from the target website based on the data acquisition plan, and acquiring attribute information of the data to be acquired, wherein the attribute information comprises the release time and the unique identification information of the data to be acquired.

In this embodiment, the data to be acquired refers to data required to be acquired by the data acquisition program of the present application, and the attribute information includes the release time of the data to be acquired and the corresponding unique identification information.

Specifically, the data to be acquired is searched from the target website based on the data acquisition plan, and the attribute information of the data to be acquired is obtained so as to get its release time and unique identification information, so that whether the data to be acquired has already been acquired can later be analyzed according to its attribute information, and new historical update information can be generated after the data to be acquired is acquired.

S72: inputting the unique identification information of the data to be acquired into the deduplication filter, and judging whether the data to be acquired is incremental data.

In this embodiment, the incremental data refers to data published on the target website that has not yet been acquired by the data acquisition program of the present application.

Specifically, the unique identification information of the data to be acquired is input into the deduplication filter, and whether the data to be acquired is incremental data is judged according to whether the corresponding unique identification information is recorded in the identification information base: if it is not recorded, the data to be acquired is incremental data; if it is recorded, the data to be acquired is not incremental data.

S73: if the data to be acquired is incremental data, downloading the data to be acquired into a data acquisition library, and adding the release time of the data to be acquired into the history updating information of the target website.

Specifically, if the data to be acquired is incremental data, downloading the data to be acquired into a data acquisition library so as to complete the acquisition work of the data to be acquired; the publishing time corresponding to the data to be acquired is added to the historical updating information of the target website, so that the data publishing distribution model can be continuously updated in a rolling mode in the data acquisition process, and the accuracy of the data publishing distribution model is improved.

Wherein, after step S72, further comprising:

S721: if the data to be acquired is not incremental data, adding the release time of the data to be acquired to the repeated update information of the target website.

Specifically, if the data to be acquired is not incremental data, the data has already been acquired and added to the data acquisition library and does not need to be downloaded again; the release time of the data to be acquired is added to the repeated update information of the target website, so that repeated releases of data by the target website are recorded and the data acquisition plan of the target website can be adjusted accordingly.
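The branch between incremental and repeated data (steps S72, S73 and S721) can be sketched as follows. A plain Python `set` stands in for the deduplication filter and identification information base, and the names `SiteRecord`, `process_item` and the `download` callback are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class SiteRecord:
    seen_ids: set = field(default_factory=set)            # stand-in for the deduplication filter
    history_updates: list = field(default_factory=list)   # release times of incremental data
    repeat_updates: list = field(default_factory=list)    # release times of repeated data
    library: dict = field(default_factory=dict)           # stand-in for the data acquisition library


def process_item(site, item_id, published_at, download):
    """Download incremental data, or only record the release time of repeated data."""
    if item_id not in site.seen_ids:
        site.library[item_id] = download(item_id)
        site.seen_ids.add(item_id)
        site.history_updates.append(published_at)
        return True
    site.repeat_updates.append(published_at)
    return False


site = SiteRecord()
published = datetime(2022, 12, 12, 9, 30)
first = process_item(site, "https://example.com/news/1", published, lambda url: "<html>...</html>")
second = process_item(site, "https://example.com/news/1", published, lambda url: "<html>...</html>")
```

The first call downloads the item and extends the historical update information; the second call skips the download and only records a repeated release.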

It should be understood that the sequence numbers of the steps in the foregoing embodiment do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiment of the present application in any way.

Example two

As shown in fig. 7, the present application discloses a distributed data acquisition system for performing the steps of the distributed data acquisition method described above, where the distributed data acquisition system corresponds to the distributed data acquisition method in the above embodiment.

The distributed data acquisition system comprises a historical update information acquisition module, a data release distribution model update module, a high-frequency period marking module and a data acquisition plan generation module. The detailed description of each functional module is as follows:

The data release distribution model updating module comprises:

the periodic distribution map generation submodule is used for generating a periodic distribution map with a corresponding number based on a preset periodic number requirement and generating a time coordinate axis of the periodic distribution map based on a preset statistical period;

the periodic distribution map labeling sub-module is used for labeling the time node of each data release on each periodic distribution map based on the historical update information of the target website;

the data distribution map generation sub-module is used for counting the data distribution quantity of each time period in each periodic distribution map, and carrying out weighted average calculation on the data distribution quantity of each time period in the counting period based on each periodic distribution map so as to generate a data distribution map;

and the data distribution updating sub-module is used for updating the data distribution model based on the data distribution map and each period distribution map.

Wherein the high frequency period marking module comprises:

the data distribution statistics sub-module is used for dividing the statistics period into a plurality of evaluation periods and counting the data distribution quantity and the data distribution probability in each evaluation period according to all the period distribution diagrams;

the data release probability evaluation sub-module is used for comparing the data release probability with a preset probability evaluation threshold value, and marking the evaluation period as a high-frequency period if the data release probability of the evaluation period is greater than the probability evaluation threshold value.

Wherein the distributed data acquisition system further includes:

the identification information storage module is used for acquiring data from the target website based on the data acquisition plan, generating unique identification information based on the uniform resource locator (URL) of the acquired site, and persistently storing the unique identification information in the identification information base;

and the deduplication database creation module is used for setting the identification information base as a database of the deduplication filter.

For specific limitations of the distributed data acquisition system, reference may be made to the above limitations of the distributed data acquisition method, which are not repeated here; all or part of the modules in the distributed data acquisition system can be implemented by software, hardware or a combination thereof; the above modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.

Example three

A computer device, which may be a server, may have an internal structure as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer equipment is used for storing data such as historical update information, statistical period, periodic distribution diagram, data distribution model, evaluation period, data distribution probability, probability evaluation threshold, data acquisition plan and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a distributed data acquisition method.

In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

S10: acquiring historical update information of a target website, wherein the historical update information comprises time nodes of each data release of the target website;

S20: generating a plurality of periodic distribution diagrams based on historical updating information of a target website and a preset statistical period, and fitting to generate a data release distribution diagram so as to update a corresponding data release distribution model;

S30: dividing each statistical period into a plurality of evaluation periods, calculating the data release probability of each evaluation period according to a data release distribution model, and marking the evaluation period with the data release probability larger than a preset probability evaluation threshold value as a high-frequency period;

In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.

The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of protection of the present application.

Claims

1. A distributed data acquisition method, comprising:

2. A distributed data acquisition method according to claim 1, wherein: after the step of determining the acquisition time of the corresponding target website based on the high-frequency period to generate the data acquisition plan, the method comprises the following steps:

3. A distributed data acquisition method according to claim 1, wherein: after the step of determining the acquisition time of the corresponding target website based on the high-frequency period to generate the data acquisition plan, the method comprises the following steps:

4. A distributed data acquisition method according to claim 3, wherein: the step of inputting the unique identification information of the data to be sampled into the deduplication filter and judging whether the data to be sampled is incremental data further comprises the following steps:

5. A distributed data acquisition method according to claim 2, wherein: the step of generating unique identification information based on the uniform resource locator (URL) of the acquired site and persistently storing the unique identification information in an identification information base comprises the following steps:

6. A distributed data acquisition method according to claim 1, wherein: generating a plurality of periodic distribution diagrams based on historical updating information of a target website and a preset statistical period, and fitting to generate a data release distribution diagram so as to update a corresponding data release distribution model, wherein the step of updating the corresponding data release distribution model comprises the following steps:

7. A distributed data acquisition method according to claim 1, wherein: dividing the statistical period into a plurality of evaluation periods, calculating the data release probability of each evaluation period according to a data release distribution model, and marking the evaluation period with the data release probability larger than a preset probability evaluation threshold value as a high-frequency period, wherein the step comprises the following steps:

8. A distributed data acquisition system, comprising:

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the distributed data acquisition method according to any one of claims 1 to 7 when the computer program is executed.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the distributed data acquisition method according to any one of claims 1 to 7.