CN111259025A - Self-adaptive frequency conversion increment updating method for multi-source heterogeneous data

Info

Publication number: CN111259025A
Application number: CN202010036197.XA
Authority: CN
Inventors: 朱跃龙; 丁昱凯; 冯钧; 陆佳民
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2020-01-14
Filing date: 2020-01-14
Publication date: 2020-06-09
Anticipated expiration: 2040-01-14
Also published as: CN111259025B

Abstract

The invention discloses a self-adaptive frequency conversion increment updating method for multi-source heterogeneous data, which comprises the following steps: determining the data sources and a core database cluster; constructing a data updating model; deploying and initializing the data updating model; acquiring data from each data source through the data updating model; comparing the timestamps of the acquired data and judging whether an update is needed; loading the updated data into the core database cluster; and refreshing the frequency configuration table and the timestamp record table according to the updated data. The invention can update data dynamically according to the data source and the data structure, can adaptively adjust the update frequency of different data sources, and offers good flexibility, convenient configuration, high update speed and strong expandability.

Description

Self-adaptive frequency conversion increment updating method for multi-source heterogeneous data

Technical Field

The invention belongs to the field of data mining and application, and particularly relates to a self-adaptive frequency conversion increment updating method for multi-source heterogeneous data.

Background

With the development of the social economy and of data acquisition technology, every industry generates a large amount of data. This includes strongly structured data and semi-structured data as well as a large amount of unstructured data such as text, images and video. As data acquisition technology has improved, data storage and processing technology has also continued to evolve. Multi-source heterogeneous data means that the data has multiple sources and that even data from the same source often differs in structure. Two situations are common: "one datum, multiple sources" and "one source, multiple data". On the one hand, because data collection and data management are divided among different parties, data of one kind, such as precipitation data, may be collected by the equipment of several units, which causes redundancy. On the other hand, because different services place different requirements on the data, the frequency of data processing and updating differs, so the same data source may provide several data items with different frequencies. Because current data storage relies mainly on structured databases, storing unstructured data such as text, images, audio and video is difficult. Meanwhile, different data sources, such as network data sources, database data sources and manually filled data sources, refresh at different frequencies, yet most multi-source data updating methods use a fixed update frequency, so updating efficiency is low and the updating structure is inflexible. The storage, processing and migration of multi-source heterogeneous data therefore remain difficult.

Disclosure of Invention

Purpose of the invention: to solve the problems in the prior art that multi-source heterogeneous data is difficult to process and that its update frequency is variable and difficult to determine, a self-adaptive frequency conversion increment updating method for multi-source heterogeneous data is provided, which has high updating efficiency, stable performance, convenient deployment and good expandability.

The technical scheme is as follows: in order to achieve the above object, the present invention provides a self-adaptive frequency conversion increment updating method for multi-source heterogeneous data, comprising the following steps:

S1: determining a data source and a core database cluster;

S2: constructing a data updating model;

S3: deploying and initializing the data updating model constructed in step S2;

S4: acquiring data at each data source through the data updating model;

S5: comparing the timestamps of the acquired data and judging whether an update is needed; if so, continuing with the update, and if not, repeating step S4;

S6: loading the updated data to the core database cluster;

S7: refreshing the frequency configuration table and the timestamp record table according to the updated data.

Further, the step S1 is specifically:

s1-1: determining the data source type Data_Source_Type, which comprises: a manually filled data source, a network data source and an entire database source;

s1-2: determining a data source access method according to the type of the data source;

s1-3: determining the type of a core database cluster and access, reading and writing methods;

s1-4: creating a data source basic information table SIT, wherein the fields comprise: a data source name snm, a data source IP address sip, a port number spt, a data source type stp, a target database IP address tip, a target database port number tpt, a target database user name tusnm, a target database name tnm, a target database schema name tpnm, and a target database connection password tkw.
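As an illustration of step S1-4, the data source basic information table SIT could be sketched as a relational table. The field names below come from the text; the use of SQLite, the column types, the composite key, the type labels in SOURCE_TYPES and the sample row are all assumptions made for this sketch and are not specified by the invention.

```python
import sqlite3

# Field names follow the SIT definition in step S1-4. Column types, the
# primary key and the type labels are assumptions made for this sketch.
SIT_DDL = """
CREATE TABLE IF NOT EXISTS SIT (
    snm   TEXT NOT NULL,   -- data source name
    sip   TEXT NOT NULL,   -- data source IP address
    spt   INTEGER,         -- data source port number
    stp   TEXT NOT NULL,   -- data source type
    tip   TEXT,            -- target database IP address
    tpt   INTEGER,         -- target database port number
    tusnm TEXT,            -- target database user name
    tnm   TEXT,            -- target database name
    tpnm  TEXT,            -- target database schema name
    tkw   TEXT,            -- target database connection password
    PRIMARY KEY (snm, sip)
)
"""

# Hypothetical labels for the three source types named in step S1-1.
SOURCE_TYPES = ("manual", "network", "database")


def create_sit(conn):
    """Create the SIT table if it does not exist yet."""
    conn.execute(SIT_DDL)


def register_source(conn, row):
    """Insert one data source description after checking its type label."""
    if row["stp"] not in SOURCE_TYPES:
        raise ValueError(f"unknown data source type: {row['stp']}")
    conn.execute(
        "INSERT INTO SIT VALUES "
        "(:snm, :sip, :spt, :stp, :tip, :tpt, :tusnm, :tnm, :tpnm, :tkw)",
        row,
    )


# Sample row with made-up values (documentation IP range).
sample = {
    "snm": "rain_gauge_db", "sip": "192.0.2.10", "spt": 5432, "stp": "database",
    "tip": "192.0.2.20", "tpt": 5432, "tusnm": "loader", "tnm": "core",
    "tpnm": "public", "tkw": "change-me",
}

conn = sqlite3.connect(":memory:")
create_sit(conn)
register_source(conn, sample)
```

In practice the connection password would not be stored in plain text; the sketch keeps it only because the field is listed in the table definition.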

Further, the data update model in step S2 includes a network resource acquisition unit NAU, a manual filling data acquisition unit HAU, a general database data extraction unit GDEU, an update frequency control unit FCU, a general data specification unit GDTU, and a general data loading unit GDLU.

The network resource acquisition unit NAU comprises the following construction steps:

s2 a-1: constructing an IP address resolution and access module, which accesses the specified network resource address according to the IP address of the network resource entry;

s2 a-2: constructing a network resource downloading module, which downloads the data that a link points to onto the local computer;

s2 a-3: constructing a data dump module, which names and organizes the network resources in a simple way and stores them on the designated disk of the computer where the NAU is located;

s2 a-4: constructing a termination condition judgment module, which terminates the NAU program according to the input termination condition C;

the construction steps of the manual filling data acquisition unit HAU are as follows:

s2 b-1: constructing a path index module, which checks whether a new data file has appeared under the specified file path;

s2 b-2: constructing a file type judgment module, which determines the data type of the newly added file;

s2 b-3: constructing a data storage module, which chooses the storage form according to the type of the data file and stores the data on the designated disk of the computer where the HAU is located;

the manual filling data refers to a structured data file or an unstructured data file which is collected or filled manually. The manually collected or filled structured data refers to data files with clear and standard data organization structures, such as xls, csv, xlsx and the like, and the structures and the contents of the data files are not changed during storage; the unstructured data collected or filled manually refers to data files without clear standard data structures, such as texts, images, audios and the like, only the file name FileName, the file size FileSize and the file position FileLoca are stored during storage, and the information of all unstructured data is uniformly stored in a file named datainfo.
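The three HAU modules described above (path index, file type judgment, data storage) can be sketched as follows. This is only an illustration under assumptions: the function names, the use of file extensions to tell structured from unstructured files, and the in-memory `seen` set are hypothetical; the extensions and the FileName, FileSize and FileLoca fields come from the text.

```python
import os
import tempfile

# Extensions the text names as structured data files.
STRUCTURED_EXTS = {".xls", ".csv", ".xlsx"}


def scan_new_files(path, seen):
    """Path index: return files under `path` that are not yet in `seen`."""
    new_files = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full) and full not in seen:
            new_files.append(full)
    return new_files


def classify(file_path):
    """File type judgment, here based on the extension only (an assumption)."""
    ext = os.path.splitext(file_path)[1].lower()
    return "structured" if ext in STRUCTURED_EXTS else "unstructured"


def describe_unstructured(file_path):
    """Build the metadata record kept for an unstructured file."""
    return {
        "FileName": os.path.basename(file_path),
        "FileSize": os.path.getsize(file_path),
        "FileLoca": os.path.abspath(file_path),
    }


def process_folder(path, seen):
    """One HAU pass: structured file paths plus info rows for the rest."""
    structured, info_rows = [], []
    for full in scan_new_files(path, seen):
        seen.add(full)
        if classify(full) == "structured":
            structured.append(full)      # stored unchanged
        else:
            info_rows.append(describe_unstructured(full))
    return structured, info_rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        with open(os.path.join(folder, "rain.csv"), "w") as fh:
            fh.write("time,value\n")
        print(process_folder(folder, set()))
```

A second call with the same `seen` set returns nothing new, which is the behaviour the path index module needs in order to detect only newly added files.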

The general database data extraction unit GDEU is constructed by the following steps:

s2 c-1: creating a database basic information table DBIT, wherein the fields comprise: a database IP address dbip, a port number dbpt, a user name usnm, a database name dbnm, a schema name pnm, a database connection password dbkw and a database type dbtp;

s2 c-2: obtaining a connection driver, or writing one manually, according to the type of the source database;

s2 c-3: extracting test cases and testing the database connection;

the update frequency control unit FCU is constructed by the following steps:

s2 d-1: creating an update timestamp record table TRT, the fields comprising: data source name snm, data source ip address sip, update timestamp uts;

s2 d-2: creating a data source update frequency configuration table FCT, wherein the fields comprise: data source name snm, data source ip address sip, update frequency suf;

s2 d-3: constructing an update timestamp record table reading module;

s2 d-4: constructing a calling module for the network resource acquisition unit NAU, the manual filling data acquisition unit HAU and the general database data extraction unit GDEU;

s2 d-5: constructing an update frequency calculation module, which calculates the updated frequency of each data source;

s2 d-6: constructing a data source update frequency configuration table refreshing module, which writes the latest frequency into the configuration table;
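The two tables maintained by the FCU, together with the reading and refreshing modules, might look like the following sketch. The field names snm, sip, uts and suf are from the text, and the configuration table is named FCT as in steps S3 and S4; SQLite, the numeric column types, the history-keeping TRT and the helper function names are assumptions.

```python
import sqlite3


def init_fcu_tables(conn):
    """Create the timestamp record table TRT and the frequency table FCT."""
    # TRT keeps every timestamp, so the previous one stays available.
    conn.execute("CREATE TABLE TRT (snm TEXT, sip TEXT, uts REAL)")
    # FCT holds one current frequency per data source.
    conn.execute(
        "CREATE TABLE FCT (snm TEXT, sip TEXT, suf REAL, PRIMARY KEY (snm, sip))"
    )


def record_timestamp(conn, snm, sip, uts):
    """Append a new update timestamp for a data source."""
    conn.execute("INSERT INTO TRT VALUES (?, ?, ?)", (snm, sip, uts))


def last_two_timestamps(conn, snm, sip):
    """Reading module: newest timestamp first, then the one before it."""
    rows = conn.execute(
        "SELECT uts FROM TRT WHERE snm = ? AND sip = ? "
        "ORDER BY uts DESC LIMIT 2",
        (snm, sip),
    ).fetchall()
    return [row[0] for row in rows]


def write_frequency(conn, snm, sip, suf):
    """Refreshing module: overwrite the stored frequency of a data source."""
    conn.execute("INSERT OR REPLACE INTO FCT VALUES (?, ?, ?)", (snm, sip, suf))
```

Keeping the timestamp history in TRT while overwriting FCT is one possible reading of the text, which asks that the original timestamp be retained but that only the latest frequency be written to the configuration table.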

the general data specification unit GDTU is constructed by the following steps:

s2 e-1: constructing a data reading module;

s2 e-2: constructing a data merging, editing and sorting module;

s2 e-3: constructing a data writing module;

the general data loading unit GDLU is constructed by the following steps:

s2 f-1: constructing a core database cluster access module;

s2 f-2: constructing a standard data reading module;

s2 f-3: constructing a core database data loading module.

Further, the specific process of step S3 is as follows:

s3-1: the method for deploying the data updating model comprises the following specific steps:

s3 a-1: deploying the network resource acquisition unit NAU, the manual filling data acquisition unit HAU, the general database data extraction unit GDEU and the general data specification unit GDTU on a single computer according to the data sources in use, and testing them;

s3 a-2: deploying and testing the update frequency control unit FCU;

s3 a-3: deploying and testing the general data loading unit GDLU;

s3-2: initializing parameters of a data update model, and specifically comprising the following steps:

s3 b-1: initializing data source basic information SIT, and the fields comprise: a data source name snm, a data source IP address sip, a port number spt, a data source type stp, a target database IP address tip, a target database port number tpt, a target database user name tusnm, a target database name tnm, a target database schema name tpnm and a target database connection password tkw;

s3 b-2: initializing an update timestamp record table TRT, the fields including: data source name snm, data source ip address sip, update timestamp uts;

s3 b-3: initializing the update frequency configuration table FCT, the fields comprising: data source name snm, data source ip address sip, update frequency suf;

s3 b-4: initializing the network resource entry IP address and the termination condition C of the network resource acquisition unit NAU, and the storage location NSL for downloaded network resources;

s3 b-5: initializing the datainfo.xls file of the manual filling data acquisition unit HAU and its storage location HSL;

s3 b-6: initializing the database basic information table DBIT, wherein the fields comprise a database IP address dbip, a port number dbpt, a user name usnm, a database name dbnm, a schema name pnm, a database connection password dbkw and a database type dbtp.

Further, the specific process of step S4 is as follows: the update frequency control unit FCU queries the initialized update frequency configuration table FCT, calls the NAU, the HAU and the GDEU at the corresponding update frequency, and acquires the data of each data source, including network data resources, manual filling data and database data.
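The way the FCU turns the configured frequencies into calls to the acquisition units can be sketched as a simple polling cycle. This is an assumption-laden illustration: the text does not state the unit of the update frequency, so it is taken here as updates per second, and `due_sources`, `run_cycle`, the type labels and the callables standing in for NAU, HAU and GDEU are all hypothetical.

```python
def due_sources(fct, last_run, now):
    """Return the sources whose update interval has elapsed.

    fct:      {source name: update frequency in updates per second (assumed)}
    last_run: {source name: time of the previous call, in seconds}
    """
    due = []
    for source, freq in fct.items():
        if freq <= 0:
            continue                      # a non-positive frequency disables polling
        interval = 1.0 / freq
        if now - last_run.get(source, float("-inf")) >= interval:
            due.append(source)
    return due


def run_cycle(fct, source_types, units, last_run, now):
    """Call the acquisition unit of every due source and note the call time.

    units maps a source type label to a callable standing in for NAU, HAU
    or GDEU; each callable takes the source name and returns its data.
    """
    results = {}
    for source in due_sources(fct, last_run, now):
        unit = units[source_types[source]]
        results[source] = unit(source)
        last_run[source] = now
    return results
```

For example, with frequencies of 0.5 and 0.1 updates per second, the first source is polled every 2 seconds and the second every 10 seconds, so a cycle run 5 seconds after the last call would invoke only the first.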

The specific steps for acquiring the network data resources are as follows:

s4 a-1: inputting the entrance IP address and the termination condition C into a network resource acquisition unit NAU; wherein, the termination condition refers to a time interval T or a link hop count H or a termination IP address A;

s4 a-2: the network resource acquisition unit NAU follows resource links starting from the entry IP address and downloads the required resources to the designated disk location NSL of the computer where the NAU is located;
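The termination condition C of the NAU (a time interval T, a link hop count H, or a termination address A) can be illustrated with a breadth-first traversal. The sketch below makes several assumptions: `crawl` and its parameters are hypothetical names, `get_links` stands in for the real download-and-parse step, and visiting an address replaces the actual download to NSL.

```python
import time
from collections import deque


def crawl(entry, get_links, max_seconds=None, max_hops=None, stop_address=None):
    """Follow links from `entry` until one termination condition is met.

    get_links(address) returns the addresses linked from `address`.
    max_seconds, max_hops and stop_address play the roles of T, H and A.
    Returns the addresses visited, in visiting order.
    """
    start = time.monotonic()
    visited = []
    seen = {entry}
    queue = deque([(entry, 0)])
    while queue:
        address, hops = queue.popleft()
        if max_seconds is not None and time.monotonic() - start > max_seconds:
            break                         # time interval T exceeded
        visited.append(address)           # the real unit would download here
        if address == stop_address:
            break                         # termination address A reached
        if max_hops is not None and hops >= max_hops:
            continue                      # hop count H reached on this branch
        for nxt in get_links(address):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return visited
```

Passing the link lookup as a function keeps the traversal logic testable without network access; a real unit would supply a function that fetches the resource and extracts its links.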

the specific steps for acquiring the manual filling data are as follows:

s4 b-1: inputting a specified manual data file storage path;

s4 b-2: judging whether the data to be updated exists in the specified path, if so, further judging the type of the data;

s4 b-3: storing the data to a designated disk position HSL of a computer where the HAU is located according to the data type;

the specific steps for acquiring the database data are as follows:

s4 c-1: reading the database basic information table DBIT to obtain the database connection information, and establishing the connection;

s4 c-2: and acquiring data according to the query condition.

The specific process of step S5 is as follows:

s5-1: comparing the time of the latest data in the acquired data with the recorded update timestamp, and judging whether the data needs to be updated;

s5-2: if so, continuing with the update; otherwise, repeating step S4.
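The timestamp comparison of step S5 and the resulting increment can be sketched as below. Representing each record as a (time, value) pair, comparing against the stored timestamp with a strict "newer than" test, and the function names are assumptions made for the illustration.

```python
from datetime import datetime


def latest_time(records):
    """Time of the newest record; records are (time, value) pairs."""
    return max(t for t, _ in records)


def needs_update(records, recorded_ts):
    """True when the acquired data is newer than the recorded timestamp."""
    if not records:
        return False
    if recorded_ts is None:               # nothing recorded yet: first load
        return True
    return latest_time(records) > recorded_ts


def select_increment(records, recorded_ts):
    """Keep only the records newer than the recorded timestamp."""
    if recorded_ts is None:
        return list(records)
    return [rec for rec in records if rec[0] > recorded_ts]
```

Selecting only the newer records is what makes the update incremental: data already loaded into the core database cluster is not transferred again.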

Further, the specific process of step S6 is as follows:

s6-1: calling the general data specification unit GDTU to normalize the network data source data and the manual filling data;

s6-2: calling the general data loading unit GDLU to load the database data obtained in step S4, together with the normalized manual filling data and network data resources, into the core database cluster.
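A minimal sketch of step S6 follows, with the GDTU reduced to merging, de-duplicating and sorting, and the GDLU reduced to a bulk insert. The record layout (source, time, value), the rule that later batches overwrite earlier duplicates, the table name core_data and the use of a single SQLite database in place of the core database cluster are all assumptions.

```python
import sqlite3


def normalize(batches):
    """GDTU sketch: merge batches, drop duplicates, sort by time then source."""
    merged = {}
    for batch in batches:
        for rec in batch:
            # Later batches overwrite earlier records with the same key.
            merged[(rec["source"], rec["time"])] = rec
    return sorted(merged.values(), key=lambda r: (r["time"], r["source"]))


def load(conn, records):
    """GDLU sketch: write normalized records and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS core_data "
        "(source TEXT, time TEXT, value REAL, PRIMARY KEY (source, time))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO core_data VALUES (:source, :time, :value)",
        records,
    )
    return conn.execute("SELECT COUNT(*) FROM core_data").fetchone()[0]
```

Because the load uses the same (source, time) key as the normalization, re-running a load with overlapping data replaces rows instead of duplicating them.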

Further, the specific process of step S7 is as follows:

s7-1: updating the data update timestamp of each data source, taking the time of the latest data in the acquired data as the new timestamp and retaining the original timestamp;

s7-2: refreshing the update frequency configuration table: according to the current update frequency f_t of a data source, the current timestamp TS_t and the previous timestamp TS_{t-1}, calculating the new update frequency f_{t+1} of the data source and writing it into the update frequency configuration table.

f_{t+1} is calculated as follows:

[formula not reproduced in the source text]

where α is the refresh rate, with range [0,1]; a larger α means the update frequency changes faster.
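The formula for the new update frequency is not reproduced in this text, so the sketch below is not the invention's formula. It shows one plausible form that matches the stated inputs and the stated role of α: an exponentially weighted blend of the current frequency f_t and the rate observed between the last two timestamps, f_{t+1} = (1 - α)·f_t + α / (TS_t - TS_{t-1}). The blend itself, the unit (updates per second) and the function name are assumptions.

```python
def next_frequency(f_t, ts_t, ts_prev, alpha):
    """Assumed smoothing rule for the update frequency (not the patented formula).

    f_t:     current update frequency, in updates per second (assumed unit)
    ts_t:    current update timestamp, in seconds
    ts_prev: previous update timestamp, in seconds
    alpha:   refresh rate in [0, 1]; larger values follow new data faster
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    gap = ts_t - ts_prev
    if gap <= 0:
        return f_t                        # no usable interval: keep the frequency
    observed = 1.0 / gap                  # rate implied by the last two updates
    return (1.0 - alpha) * f_t + alpha * observed
```

Under this assumed rule, α = 0 leaves the frequency unchanged and α = 1 replaces it with the observed rate, which agrees with the statement that a larger α makes the update frequency change faster.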

Beneficial effects: compared with the prior art, the invention has the following advantages:

1. To address the high difficulty and complicated steps of processing multi-source heterogeneous data, the invention provides a unified solution for automatically updating multi-source (network, manual and database) heterogeneous (numerical, text, image, audio and video) data; the solution has good expandability and can be deployed on a single machine.

2. To address the problems that the update frequency of multi-source heterogeneous data is not fixed and that fixed-frequency methods process it inefficiently, the invention provides a self-adaptive variable frequency increment updating method. Because different data sources update at different frequencies, an update frequency control unit dynamically calculates the future update frequency of each data source from its current and historical data update timestamps, and the system maintains an update frequency configuration table and a timestamp record table. An incremental updating mode is also adopted, which reduces the amount of data transmitted and the communication overhead and further improves the speed and efficiency of updating multi-source heterogeneous data. In addition, the invention stores the updated data in a database cluster, which gives higher data security and more stable system performance.

Drawings

FIG. 1 is a flow chart of the algorithm of the present invention;

FIG. 2 is a diagram of a specific update framework of the method of the present invention.

Detailed Description

The invention is further elucidated with reference to the drawings and the embodiments.

Referring to fig. 1, the invention provides a self-adaptive frequency conversion increment updating method for multi-source heterogeneous data, which comprises the following steps:

s1: determining a data source and core database cluster:

S2: constructing a data updating model:

as shown in fig. 2, the data update model includes a network resource acquisition unit NAU, a manual filling data acquisition unit HAU, a general database data extraction unit GDEU, an update frequency control unit FCU, a general data specification unit GDTU, and a general data loading unit GDLU.

s2 c-3: extracting test cases and testing database connection;

the update frequency control unit FCU is constructed by the following steps:

s2 d-3: constructing an update timestamp record table reading module;

the construction steps of the GDTU are as follows:

s2 e-1: constructing a data reading module;

s2 e-2: constructing a data merging, editing and sequencing module;

s2 e-3: constructing a data writing module;

the steps of constructing the GDLU are as follows:

s2 f-1: constructing a core database cluster access module;

s2 f-2: constructing a standard data reading module;

s2 f-3: and constructing a loading data loading module of the core database.

S3: deploying and initializing a data update model:

s3 a-2: deploying and testing the update frequency control unit FCU;

s3 a-3: deploying and testing the general data loading unit GDLU;

s3 b-5: initializing the datainfo.xls file of the manual filling data acquisition unit HAU and its storage location HSL;

s3 b-6: initializing the database basic information table DBIT, wherein the fields comprise a database IP address dbip, a port number dbpt, a user name usnm, a database name dbnm, a schema name pnm, a database connection password dbkw and a database type dbtp.

S4: acquiring data at each data source through a data updating model:

The update frequency control unit FCU queries the initialized update frequency configuration table FCT, calls the NAU, the HAU and the GDEU at the corresponding update frequency, and acquires the data of each data source, including network data resources, manual filling data and database data.

The specific steps for acquiring the network data resources in this embodiment are as follows:

the specific steps for acquiring the manual filling data are as follows:

s4 b-1: inputting a specified manual data file storage path;

the specific steps for acquiring the database data are as follows:

s4 c-2: and acquiring data according to the query condition.

S5: comparing the time of the latest data in the acquired data, judging whether the data needs to be updated, if so, continuing to update, otherwise, repeating the step S4;

s6: loading the updated data to the core database cluster:

S7: refreshing a frequency configuration table and a time stamp record table according to the updated data:

Wherein f_{t+1} is calculated as follows: [formula not reproduced in the source text]

As can be seen from fig. 2, the specific update framework obtained in this embodiment can be deployed on a single computer and includes a network resource acquisition unit NAU, a manual filling data acquisition unit HAU, a general database data extraction unit GDEU, a general data specification unit GDTU, a general data loading unit GDLU, and an update frequency control unit FCU.

The network resource acquisition unit NAU, the manual filling data acquisition unit HAU and the general database data extraction unit GDEU are mainly used to acquire the data of the corresponding data sources. Data of different types is processed in different ways: structured data keeps its original structure, and unstructured data is converted into structured data after processing.

The general data specification unit GDTU further normalizes and processes the data that needs to be updated from the manual filling data source and the network resource data source, converting it into a form that the general data loading unit GDLU can load into the database; the data is finally loaded through the general data loading unit GDLU.

The update frequency control unit FCU is mainly used to query the timestamp record table and the update frequency configuration table and, according to the query result, to call the network resource acquisition unit NAU, the manual filling data acquisition unit HAU and the general database data extraction unit GDEU at the corresponding frequency. It also calculates the new update frequency f_{t+1} of a data source from its current update frequency f_t, the current timestamp TS_t and the previous timestamp TS_{t-1}, and writes it into the update frequency configuration table to complete the refresh operation.

Claims

1. A self-adaptive frequency conversion increment updating method for multi-source heterogeneous data, characterized by comprising the following steps:

s1: determining a data source and a core database cluster;

s2: constructing a data updating model;

s4: acquiring data at each data source through a data updating model;

s6: loading the updated data to the core database cluster;

2. The adaptive frequency conversion increment updating method for the multi-source heterogeneous data according to claim 1, characterized in that: the step S1 specifically includes:

s1-1: determining a data source type, comprising: manually filling a data source, a network data source and an entire database source;

s1-4: and creating a data source basic information table.

3. The adaptive frequency conversion increment updating method for the multi-source heterogeneous data according to claim 1, characterized in that: the data update model in step S2 includes a network resource acquisition unit NAU, a manual filling data acquisition unit HAU, a general database data extraction unit GDEU, an update frequency control unit FCU, a general data specification unit GDTU, and a general data loading unit GDLU.

4. The adaptive frequency conversion increment updating method for the multi-source heterogeneous data according to claim 3, characterized in that: the network resource obtaining unit NAU in step S2 is constructed as follows:

s2 c-1: creating a database basic information table;

s2 c-3: extracting test cases and testing database connection;

the update frequency control unit FCU is constructed by the following steps:

s2 d-1: creating an update timestamp record table;

s2 d-2: creating a data source updating frequency configuration table;

s2 d-3: constructing an update timestamp record table reading module;

the construction steps of the GDTU are as follows:

s2 e-1: constructing a data reading module;

s2 e-2: constructing a data merging, editing and sequencing module;

s2 e-3: constructing a data writing module;

the construction steps of the GDLU are as follows:

s2 f-1: constructing a core database cluster access module;

s2 f-2: constructing a standard data reading module;

s2 f-3: and constructing a loading data loading module of the core database.

5. The adaptive frequency conversion increment updating method for the multi-source heterogeneous data according to claim 1, characterized in that: the specific process of step S3 is as follows:

s3 a-2: deploying and testing the update frequency control unit FCU;

s3 a-3: deploying and testing the general data loading unit GDLU;

s3 b-1: initializing basic information of a data source;

s3 b-2: initializing and updating a timestamp record table;

s3 b-3: initializing and updating a frequency configuration table;

s3 b-6: and initializing a database basic information table DBIT.

6. The adaptive frequency conversion increment updating method for the multi-source heterogeneous data according to claim 1, characterized in that: the specific process of step S4 is as follows: the update frequency control unit FCU queries the initialized update frequency configuration table FCT, calls the NAU, the HAU and the GDEU at the corresponding update frequency, and acquires the data of each data source, including network data resources, manual filling data and database data.

7. The adaptive frequency conversion increment updating method for the multi-source heterogeneous data according to claim 6, characterized in that: the specific steps of acquiring the network data resource in step S4 are as follows:

the specific steps for acquiring the manual filling data are as follows:

s4 b-1: inputting a specified manual data file storage path;

the specific steps for acquiring the database data are as follows:

s4 c-2: and acquiring data according to the query condition.

8. The adaptive frequency conversion increment updating method for the multi-source heterogeneous data according to claim 6, characterized in that: the specific process of step S6 is as follows:

9. The adaptive frequency conversion increment updating method for the multi-source heterogeneous data according to claim 1, characterized in that: the specific process of step S7 is as follows:

10. The adaptive frequency conversion increment updating method for the multi-source heterogeneous data according to claim 9, characterized in that: f_{t+1} in step S7 is calculated as follows:

[formula not reproduced in the source text]

wherein α is the refresh rate, with range [0,1].