CN116303834A

CN116303834A - Data warehouse historical data storage and processing method, system and device

Info

Publication number: CN116303834A
Application number: CN202310586364.1A
Authority: CN
Inventors: 马志奇; 王梦丽; 王雪梅; 路宏达; 刘炳伟
Original assignee: Beijing Hongwei Big Data Technology Co ltd
Current assignee: Beijing Hongwei Big Data Technology Co ltd
Priority date: 2023-05-19
Filing date: 2023-05-19
Publication date: 2023-06-23
Anticipated expiration: 2043-05-19
Also published as: CN116303834B

Abstract

The invention relates to the technical field of data processing, in particular to a method, a system and a device for storing and processing historical data of a data warehouse.

Description

Data warehouse historical data storage and processing method, system and device

Technical Field

Background

Data processing (data processing) is the collection, storage, retrieval, processing, transformation, and transmission of data. The basic purpose of data processing is to extract and derive data that is valuable and meaningful to some particular person from a large, unorganized, unintelligible amount of data.

In data warehouse, the collection, storage and query of historical data are the most basic data processing, and the data processing is very heavy. The historical data in the data warehouse is mostly stored by adopting a zipper table model, and each piece of data contains a valid starting date and a valid ending date of the data, so that the current latest data and the historical snapshot data at any time point can be conveniently queried.

In the prior art, referring to fig. 1, fig. 1 is a flowchart illustrating a process of acquiring data from a database and performing data processing in the prior art according to an exemplary embodiment, in order to acquire historical data from the database, in some cases, in a manner of timestamp identification, log extraction, trigger triggering or snapshot comparison, etc., data information such as an operation type, an operation time, an operation content, etc. of modifying and deleting the data is acquired and stored in a source layer. Then, the data information is subjected to model conversion, the effective start and stop date of the data is calculated, and a pull chain table is formed and then stored in a data warehouse. Finally, from the pull chain list, the current latest data and the historical snapshot data at any time point are queried through operations such as filtering and the like, and various data analysis and application are carried out.

In the technology of historical data collection, storage and query, the historical data obtained by data collection is different from the historical data in a data warehouse in terms of organization model, the historical data obtained by data collection adopts a similar log data model, and the historical data stored in the data warehouse adopts a zipper table model, so that model conversion is required, and the similar log data model is converted into the zipper table model. The model conversion needs to calculate the date of validity of each piece of data, and after the latest data change is obtained, the time of validity of the latest change record of the data is updated. It can be seen that model conversion of historical data requires an update (update) operation on the data in the data warehouse.

However, for some products in data warehouse (Hive, MPP, etc.), it is characterized by very high data query efficiency, but extremely inefficient data update, often requiring several seconds or even longer to update a piece of data in data sets above the million level. Therefore, frequent and repeated updating operations consume a lot of time when the model of the mass data is converted, and the timeliness of data processing is seriously affected.

Disclosure of Invention

In view of the above, the present invention is directed to a method, a system and a device for storing and processing historical data of a data warehouse, so as to solve the problem that in the prior art, frequent and repeated updating operations consume a great deal of time history when a model of massive data is converted, and seriously affect the timeliness of data processing.

According to a first aspect of an embodiment of the present invention, there is provided a data warehouse history data storage and processing method, including:

acquiring historical data in a data source, and preprocessing the historical data to obtain preprocessed historical data;

storing the preprocessed historical data into a data warehouse according to a class log data model;

and inquiring the current latest data and the historical snapshot data at any time point in the data warehouse according to the window function in the data warehouse platform.

Preferably, the acquiring the historical data in the data source includes:

based on the log-like data model, acquiring historical data for processing data in a data source by adopting a timestamp identification, log extraction, trigger or snapshot comparison mode;

wherein the history data includes: the operation type, operation time and operation content of the data;

the processing of the data comprises: new addition, update, deletion.

Preferably, the preprocessing is performed on the history data to obtain preprocessed history data, which specifically includes:

and adding a corresponding unique data identifier into each piece of historical data to obtain preprocessed historical data.

Preferably, the unique data identifier is a primary key or a joint primary key in the data;

if the data identifier is a single primary key, converting the data identifier into a character string, and performing MD5 hash value operation to obtain a single data identifier;

if the data identifier is a joint primary key, converting the data identifier into a character string, splicing the character string into the character string, and performing MD5 hash value operation to obtain a single data identifier.

Preferably, the log-like data model specifically comprises:

Record={t_id，t_optype，t_optime，c_data1，c_data2，...c_dataN}。

preferably, after the preprocessed historical data is stored in the data warehouse according to the class log data model, the method further comprises:

and (5) performing operation performance optimization on the data in the data warehouse in a bucket dividing mode.

Preferably, the window function is an OLAP function.

Preferably, the method is applied to Hive, MPP data warehouse products.

According to a second aspect of an embodiment of the present invention, there is provided a data warehouse historical data storage and processing system, comprising:

the acquisition module is used for acquiring historical data in the data source, preprocessing the historical data and obtaining preprocessed historical data;

the storage module is used for storing the preprocessed historical data into the data warehouse according to the class log data model;

and the query module is used for querying the current latest data and the historical snapshot data at any time point in the data warehouse according to the window function in the data warehouse platform.

According to a third aspect of an embodiment of the present invention, there is provided a data warehouse history data storage and processing apparatus, including: at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the data warehouse historical data storage and processing methods described above.

The technical scheme provided by the embodiment of the invention can comprise the following beneficial effects:

according to the technical scheme, the data in the data warehouse is stored in a mode similar to a log data model without model conversion by combining the characteristics of a data warehouse product, so that the working efficiency is greatly improved, the current latest data and the historical snapshot data at any time point are directly queried in the data warehouse in a window function mode, the data warehouse is more convenient and quick, the user experience is good, and the satisfaction is high.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow chart illustrating the prior art process of retrieving data from a database and performing data processing in accordance with an exemplary embodiment;

FIG. 2 is a flowchart illustrating a method of seed data warehouse historical data storage and processing according to an exemplary embodiment;

FIG. 3 is a process flow diagram illustrating a method of data warehouse historical data storage and processing according to an example embodiment;

FIG. 4 is a functional block diagram of a data warehouse historian storage and processing system, according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.

Example 1

In one embodiment, referring to fig. 2, fig. 2 is a flow chart illustrating a method of data warehouse historical data storage and processing, according to an exemplary embodiment, as shown in fig. 2, the method comprising:

step S01, acquiring historical data in a data source, and preprocessing the historical data to obtain preprocessed historical data;

step S02, storing the preprocessed historical data into a data warehouse according to a class log data model;

and S03, inquiring current latest data and historical snapshot data at any time point in the data warehouse according to a window function in the data warehouse platform.

It should be noted that, the application scenario applicable to the technical solution provided in this embodiment is an occasion of querying, processing, and using data in the data warehouse, for example: schools, banks, government halls, etc.

It should be noted that, the preprocessed historical data is stored in the data warehouse according to the class log data model, so that model conversion from data acquisition to the data warehouse is eliminated, and inefficient update operation is avoided.

It can be appreciated that, according to the technical scheme provided by the embodiment, by acquiring the historical data in the data source and preprocessing the historical data to obtain the preprocessed historical data, storing the preprocessed historical data into the data warehouse according to the log-like data model, and inquiring the current latest data and the historical snapshot data at any time point in the data warehouse according to the window function in the data warehouse platform, the technical scheme of the embodiment combines the characteristics of the data warehouse product, stores the data in the data warehouse in the log-like data model mode without model conversion, greatly improves the working efficiency, directly inquires the current latest data and the historical snapshot data at any time point in the data warehouse in the window function mode, and has the advantages of more convenience, rapidness, good user experience and high satisfaction.

In a specific application, referring to fig. 3, fig. 3 is a process flow diagram of a method for storing and processing historical data in a data warehouse, where obtaining historical data in a data source includes:

the processing of the data comprises: new addition, update, deletion.

In a specific application, the preprocessing is performed on the history data to obtain preprocessed history data, which specifically includes:

In a specific application, the unique data identifier is a primary key or a joint primary key in the data;

In the related data sources such as Oracle and Mysql or other non-related data sources, based on a log-like data model, information such as operation types, operation time and operation content for adding and deleting data in the data sources is collected by adopting modes such as timestamp identification, log extraction, triggers and snapshot comparison, and history data of the data sources is obtained.

In addition to storing a log-like data model, a data identifier needs to be added to each piece of data in the data warehouse, wherein the data identifier refers to the uniqueness of identifying data on a service, namely a primary key or a joint primary key in the database. Multiple pieces of data identified by the same data constitute a history of the data. In a general processing manner, a hash operation (MD 5) is performed on the primary key or the joint primary key to form a single data identifier.

It should be noted that there are mainly 3 types of operation: the value range of the new addition, update and deletion is { I, D, U }. Adding an insertion (Insert) operation of a corresponding database, wherein the data content is the data after the insertion operation, and the data is marked by an I; updating (Update) operation of the corresponding database, wherein the data content is the data after the updating operation, and the data is marked by U; delete (Delete) operation of the corresponding database, the data content is the data before the Delete operation, identified by "D".

The operation time, that is, the time for performing operations such as adding, updating, deleting, etc., on the data, is specifically adjusted according to the time requirement, and may be generally accurate to milliseconds.

It should be noted that, for the newly added operation, the data content is the data after the insertion operation; for the update operation, the data content is the data after the update operation; for the delete operation, the data content is the data before the delete operation.

In a specific application, the log-like data model specifically comprises:

Record={t_id，t_optype，t_optime，c_data1，c_data2，...c_dataN}。

wherein t_id represents a data identifier; t_optype represents an operation type; t_optimal represents an operation time; c_data1, c_data2,..c_datan: representing various attributes or fields of the operation content.

Specifically, the handling of an identity card service is taken as an example, where the identity card service includes: the operation types such as issuing, information changing, loss supplementing and cancellation can generate historical data such as data modification and deletion in the operation process. Assume that a certain identity card comprises: information such as ID (identity card), name, address, sex, birthday, etc., the identity card undergoes the following business processes:

(1) The identity card is registered and issued in 1 st 1981;

(2) 5.5.1998, transacting renaming business, changing Name from Name to Name1;

(3) Address change business is transacted in 4 th and 8 th 2011, and the Address is changed from Address to Address1;

(4) 2022, 3 and 4, the cancellation service is handled by the immigration.

The history data generated by the identity card is stored as follows according to a log-like data model:

Record1={t_id，’I’，’1981-01-01’，ID，Name，Address，Sex，Birthday}

Record2={t_id，’U’，’1998-05-05’，ID，Name1，Address，Sex， Birthday}

Record3={t_id，’U’，’2011-04-08’，ID，Name1，Address1，Sex，Birthday}

Record4={t_id，’D’，’2022-03-04’，ID，Name1，Address1，Sex，Birthday}

it should be noted that, in the relational data sources such as Oracle and Mysql, the data acquisition method is as follows:

1. the timestamp mark is suitable for the situation that the data source contains information such as an update timestamp, a deletion mark and the like, and the information is directly converted into t_optype and t_optimal in the class log model during data acquisition;

2. the log extraction is suitable for the conditions that a data source starts a log function and has a proper log extraction function, and all information in a class log model is extracted from a log file:

3. the trigger is suitable for the condition of abundant data source calculation/storage resources, and stores the related information of the log-like model into the history table when the data source creates the trigger and the history table and performs the addition, modification and deletion operation on the data source;

4. the snapshot comparison is suitable for the condition that the data volume is not large, the last snapshot of the source data is established in the data warehouse, the source data and the snapshot data are compared regularly, and relevant information of the log-like model is judged and generated.

In a specific application, after the preprocessed historical data is stored in the data warehouse according to the class log data model, the method further comprises:

It should be noted that, in this embodiment, the data warehouse stores the history data by using the log-like data model same as the data collection, which means that the collected data can directly enter the data warehouse without buffering the source layer. Moreover, each piece of data in the log-like data model is only related to the current attribute, operation and the like of the data, has no direct relation with the previous data, is not like a traditional zipper table, and is required to update the expiration date of the validity period of the previous data when the data is updated or deleted each time, meanwhile, the model conversion from data acquisition to a data warehouse is eliminated, the low-efficiency updating operation is avoided, and finally the query performance is improved by adopting technologies such as dividing and classifying barrels, so that the query efficiency can be greatly improved, and the user experience is improved.

It should be noted that the partitioning may be performed according to t_optimal (operation time) or according to granularity of the number of months, years, etc.; and (3) carrying out barrel separation according to t_id (data identification), and putting the data with the same data identification together for later users to carry out data query or call.

In a specific application, the window function is an OLAP function.

It should be noted that the OLAP function is a standard SQL function added to implement OLAP, and provides support in the latest versions of databases such as oracle, SQL Server, DB2, hive2, mpp, and the like.

Note that the syntax of OLAP function is:

< Window function > OVER ([ parameter BY < column List > ] ORDER BY < column List for ordering >)

Window functions include both aggregate functions (max, min, sum, etc.) that can be used for window functions, and dedicated window functions (RANK, function_rank, row_number, etc.); OVER, component BY, ORDER BY as keywords; [] The content of (c) may be omitted.

The SQL statement for inquiring the current latest data by using the window function is as follows:

SELECT

t1.t_id，

t1.t_optype，

t1.t_optime，

t1.c_data1，

t1.c_data2，

...t1.c_dataN

FROM

(

SELECT

ROW_NUMBER（）OVER (PARTITION BY t_idORDER BY t_optimeDESC)r1 ,

t_id，

t_optype，

t_optime，

c_data1，

c_data2，

...c_dataN

FROM table name) t1

Where t1.r1 = 1 and t1.t_optype<>’D’；

Querying historical snapshot data at a point (assumed to be p_time)

SELECT

t1.t_id，

t1.t_optype，

t1.t_optime，

t1.c_data1，

t1.c_data2，

...t1.c_dataN

FROM

(

SELECT

ROW_NUMBER（）OVER (PARTITION BY t_idORDER BY t_optimeDESC)r1 ,

t_id，

t_optype，

t_optime，

c_data1，

c_data2，

...c_dataN

FROM table name) t1

Where t1.r1 = 1 and t1.t_optype<>’D’ and t1.t_optime>=p_time；

Based on functions such as ordering and partitioning of window functions and a conventional query function, the latest data is screened out from the historical data, the finally deleted data is filtered out, and the current latest data or a data snapshot of any historical time point is obtained.

In a specific application, the scheme of the embodiment is applied to data warehouse products such as Hive, MPP and the like.

It should be noted that, the technical scheme provided by the embodiment combines the characteristics of the data warehouse product, stores the data in the data warehouse in a log-like data model mode, does not need to perform model conversion, greatly improves the working efficiency, stores the data by dividing the barrels, directly inquires the current latest data and the historical snapshot data at any time point in the data warehouse in a window function mode, and is more convenient and quick, good in user experience and high in satisfaction.

Example two

In one embodiment, referring to FIG. 4, FIG. 4 is a functional block diagram of a data warehouse historical data storage and processing system 400, according to an exemplary embodiment, comprising:

the acquiring module 401 is configured to acquire historical data in a data source, and perform preprocessing on the historical data to obtain preprocessed historical data;

a storage module 402, configured to store the preprocessed historical data into a data warehouse according to a class log data model;

and a query module 403, configured to query the data warehouse for current latest data and historical snapshot data at any time point according to a window function in the data warehouse platform.

It may be appreciated that, in the technical solution provided in this embodiment, the obtaining module 401 is configured to obtain the history data in the data source, and perform preprocessing on the history data to obtain the preprocessed history data, the storage module 402 is configured to store the preprocessed history data in the data warehouse according to the log-like data model, and the query module 403 is configured to query the current latest data and the history snapshot data at any time point in the data warehouse according to the window function in the data warehouse platform.

Example III

A data warehouse historical data storage and processing apparatus comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.

It should be noted that, since the implementation manner and the beneficial effects of each module in the embodiment can be referred to the related description in the first embodiment, the description of the embodiment is omitted.

The computer-readable storage medium disclosed in the present embodiment includes, but is not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

It is to be understood that the same or similar parts in the above embodiments may be referred to each other, and that in some embodiments, the same or similar parts in other embodiments may be referred to.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.

Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.

The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims

1. A method for storing and processing historical data of a data warehouse, comprising:

2. The method of claim 1, wherein the obtaining historical data in the data source comprises:

the processing of the data comprises: new addition, update, deletion.

3. The method according to claim 2, wherein the preprocessing the history data is performed to obtain preprocessed history data, specifically:

4. A method according to claim 3, wherein the unique data identifier is a primary key or a joint primary key in the data;

5. The method according to claim 4, wherein the log-like data model is specifically:

Record={t_id，t_optype，t_optime，c_data1，c_data2，...c_dataN}。

6. the method of claim 1, wherein after storing the preprocessed historical data in the data warehouse in the log-like data model, further comprising:

7. The method of claim 1, wherein the window function is an OLAP function.

8. The method of any one of claims 1-7, wherein the method is applied to Hive, MPP data warehouse products.

9. A data warehouse historical data storage and processing system, comprising:

10. A data warehouse history data storage and processing apparatus, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.