CN116303834A - Data warehouse historical data storage and processing method, system and device - Google Patents

Data warehouse historical data storage and processing method, system and device Download PDF

Info

Publication number
CN116303834A
CN116303834A CN202310586364.1A CN202310586364A CN116303834A CN 116303834 A CN116303834 A CN 116303834A CN 202310586364 A CN202310586364 A CN 202310586364A CN 116303834 A CN116303834 A CN 116303834A
Authority
CN
China
Prior art keywords
data
historical
warehouse
log
identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310586364.1A
Other languages
Chinese (zh)
Other versions
CN116303834B (en
Inventor
马志奇
王梦丽
王雪梅
路宏达
刘炳伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Hongwei Big Data Technology Co ltd
Original Assignee
Beijing Hongwei Big Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Hongwei Big Data Technology Co ltd filed Critical Beijing Hongwei Big Data Technology Co ltd
Priority to CN202310586364.1A priority Critical patent/CN116303834B/en
Publication of CN116303834A publication Critical patent/CN116303834A/en
Application granted granted Critical
Publication of CN116303834B publication Critical patent/CN116303834B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data processing, in particular to a method, a system and a device for storing and processing historical data of a data warehouse.

Description

Data warehouse historical data storage and processing method, system and device
Technical Field
The invention relates to the technical field of data processing, in particular to a method, a system and a device for storing and processing historical data of a data warehouse.
Background
Data processing (data processing) is the collection, storage, retrieval, processing, transformation, and transmission of data. The basic purpose of data processing is to extract and derive data that is valuable and meaningful to some particular person from a large, unorganized, unintelligible amount of data.
In data warehouse, the collection, storage and query of historical data are the most basic data processing, and the data processing is very heavy. The historical data in the data warehouse is mostly stored by adopting a zipper table model, and each piece of data contains a valid starting date and a valid ending date of the data, so that the current latest data and the historical snapshot data at any time point can be conveniently queried.
In the prior art, referring to fig. 1, fig. 1 is a flowchart illustrating a process of acquiring data from a database and performing data processing in the prior art according to an exemplary embodiment, in order to acquire historical data from the database, in some cases, in a manner of timestamp identification, log extraction, trigger triggering or snapshot comparison, etc., data information such as an operation type, an operation time, an operation content, etc. of modifying and deleting the data is acquired and stored in a source layer. Then, the data information is subjected to model conversion, the effective start and stop date of the data is calculated, and a pull chain table is formed and then stored in a data warehouse. Finally, from the pull chain list, the current latest data and the historical snapshot data at any time point are queried through operations such as filtering and the like, and various data analysis and application are carried out.
In the technology of historical data collection, storage and query, the historical data obtained by data collection is different from the historical data in a data warehouse in terms of organization model, the historical data obtained by data collection adopts a similar log data model, and the historical data stored in the data warehouse adopts a zipper table model, so that model conversion is required, and the similar log data model is converted into the zipper table model. The model conversion needs to calculate the date of validity of each piece of data, and after the latest data change is obtained, the time of validity of the latest change record of the data is updated. It can be seen that model conversion of historical data requires an update (update) operation on the data in the data warehouse.
However, for some products in data warehouse (Hive, MPP, etc.), it is characterized by very high data query efficiency, but extremely inefficient data update, often requiring several seconds or even longer to update a piece of data in data sets above the million level. Therefore, frequent and repeated updating operations consume a lot of time when the model of the mass data is converted, and the timeliness of data processing is seriously affected.
Disclosure of Invention
In view of the above, the present invention is directed to a method, a system and a device for storing and processing historical data of a data warehouse, so as to solve the problem that in the prior art, frequent and repeated updating operations consume a great deal of time history when a model of massive data is converted, and seriously affect the timeliness of data processing.
According to a first aspect of an embodiment of the present invention, there is provided a data warehouse history data storage and processing method, including:
acquiring historical data in a data source, and preprocessing the historical data to obtain preprocessed historical data;
storing the preprocessed historical data into a data warehouse according to a class log data model;
and inquiring the current latest data and the historical snapshot data at any time point in the data warehouse according to the window function in the data warehouse platform.
Preferably, the acquiring the historical data in the data source includes:
based on the log-like data model, acquiring historical data for processing data in a data source by adopting a timestamp identification, log extraction, trigger or snapshot comparison mode;
wherein the history data includes: the operation type, operation time and operation content of the data;
the processing of the data comprises: new addition, update, deletion.
Preferably, the preprocessing is performed on the history data to obtain preprocessed history data, which specifically includes:
and adding a corresponding unique data identifier into each piece of historical data to obtain preprocessed historical data.
Preferably, the unique data identifier is a primary key or a joint primary key in the data;
if the data identifier is a single primary key, converting the data identifier into a character string, and performing MD5 hash value operation to obtain a single data identifier;
if the data identifier is a joint primary key, converting the data identifier into a character string, splicing the character string into the character string, and performing MD5 hash value operation to obtain a single data identifier.
Preferably, the log-like data model specifically comprises:
Record={t_id,t_optype,t_optime,c_data1,c_data2,...c_dataN}。
preferably, after the preprocessed historical data is stored in the data warehouse according to the class log data model, the method further comprises:
and (5) performing operation performance optimization on the data in the data warehouse in a bucket dividing mode.
Preferably, the window function is an OLAP function.
Preferably, the method is applied to Hive, MPP data warehouse products.
According to a second aspect of an embodiment of the present invention, there is provided a data warehouse historical data storage and processing system, comprising:
the acquisition module is used for acquiring historical data in the data source, preprocessing the historical data and obtaining preprocessed historical data;
the storage module is used for storing the preprocessed historical data into the data warehouse according to the class log data model;
and the query module is used for querying the current latest data and the historical snapshot data at any time point in the data warehouse according to the window function in the data warehouse platform.
According to a third aspect of an embodiment of the present invention, there is provided a data warehouse history data storage and processing apparatus, including: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the data warehouse historical data storage and processing methods described above.
The technical scheme provided by the embodiment of the invention can comprise the following beneficial effects:
according to the technical scheme, the data in the data warehouse is stored in a mode similar to a log data model without model conversion by combining the characteristics of a data warehouse product, so that the working efficiency is greatly improved, the current latest data and the historical snapshot data at any time point are directly queried in the data warehouse in a window function mode, the data warehouse is more convenient and quick, the user experience is good, and the satisfaction is high.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart illustrating the prior art process of retrieving data from a database and performing data processing in accordance with an exemplary embodiment;
FIG. 2 is a flowchart illustrating a method of seed data warehouse historical data storage and processing according to an exemplary embodiment;
FIG. 3 is a process flow diagram illustrating a method of data warehouse historical data storage and processing according to an example embodiment;
FIG. 4 is a functional block diagram of a data warehouse historian storage and processing system, according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
Example 1
In one embodiment, referring to fig. 2, fig. 2 is a flow chart illustrating a method of data warehouse historical data storage and processing, according to an exemplary embodiment, as shown in fig. 2, the method comprising:
step S01, acquiring historical data in a data source, and preprocessing the historical data to obtain preprocessed historical data;
step S02, storing the preprocessed historical data into a data warehouse according to a class log data model;
and S03, inquiring current latest data and historical snapshot data at any time point in the data warehouse according to a window function in the data warehouse platform.
It should be noted that, the application scenario applicable to the technical solution provided in this embodiment is an occasion of querying, processing, and using data in the data warehouse, for example: schools, banks, government halls, etc.
It should be noted that, the preprocessed historical data is stored in the data warehouse according to the class log data model, so that model conversion from data acquisition to the data warehouse is eliminated, and inefficient update operation is avoided.
It can be appreciated that, according to the technical scheme provided by the embodiment, by acquiring the historical data in the data source and preprocessing the historical data to obtain the preprocessed historical data, storing the preprocessed historical data into the data warehouse according to the log-like data model, and inquiring the current latest data and the historical snapshot data at any time point in the data warehouse according to the window function in the data warehouse platform, the technical scheme of the embodiment combines the characteristics of the data warehouse product, stores the data in the data warehouse in the log-like data model mode without model conversion, greatly improves the working efficiency, directly inquires the current latest data and the historical snapshot data at any time point in the data warehouse in the window function mode, and has the advantages of more convenience, rapidness, good user experience and high satisfaction.
In a specific application, referring to fig. 3, fig. 3 is a process flow diagram of a method for storing and processing historical data in a data warehouse, where obtaining historical data in a data source includes:
based on the log-like data model, acquiring historical data for processing data in a data source by adopting a timestamp identification, log extraction, trigger or snapshot comparison mode;
wherein the history data includes: the operation type, operation time and operation content of the data;
the processing of the data comprises: new addition, update, deletion.
In a specific application, the preprocessing is performed on the history data to obtain preprocessed history data, which specifically includes:
and adding a corresponding unique data identifier into each piece of historical data to obtain preprocessed historical data.
In a specific application, the unique data identifier is a primary key or a joint primary key in the data;
if the data identifier is a single primary key, converting the data identifier into a character string, and performing MD5 hash value operation to obtain a single data identifier;
if the data identifier is a joint primary key, converting the data identifier into a character string, splicing the character string into the character string, and performing MD5 hash value operation to obtain a single data identifier.
In the related data sources such as Oracle and Mysql or other non-related data sources, based on a log-like data model, information such as operation types, operation time and operation content for adding and deleting data in the data sources is collected by adopting modes such as timestamp identification, log extraction, triggers and snapshot comparison, and history data of the data sources is obtained.
In addition to storing a log-like data model, a data identifier needs to be added to each piece of data in the data warehouse, wherein the data identifier refers to the uniqueness of identifying data on a service, namely a primary key or a joint primary key in the database. Multiple pieces of data identified by the same data constitute a history of the data. In a general processing manner, a hash operation (MD 5) is performed on the primary key or the joint primary key to form a single data identifier.
It should be noted that there are mainly 3 types of operation: the value range of the new addition, update and deletion is { I, D, U }. Adding an insertion (Insert) operation of a corresponding database, wherein the data content is the data after the insertion operation, and the data is marked by an I; updating (Update) operation of the corresponding database, wherein the data content is the data after the updating operation, and the data is marked by U; delete (Delete) operation of the corresponding database, the data content is the data before the Delete operation, identified by "D".
The operation time, that is, the time for performing operations such as adding, updating, deleting, etc., on the data, is specifically adjusted according to the time requirement, and may be generally accurate to milliseconds.
It should be noted that, for the newly added operation, the data content is the data after the insertion operation; for the update operation, the data content is the data after the update operation; for the delete operation, the data content is the data before the delete operation.
In a specific application, the log-like data model specifically comprises:
Record={t_id,t_optype,t_optime,c_data1,c_data2,...c_dataN}。
wherein t_id represents a data identifier; t_optype represents an operation type; t_optimal represents an operation time; c_data1, c_data2,..c_datan: representing various attributes or fields of the operation content.
Specifically, the handling of an identity card service is taken as an example, where the identity card service includes: the operation types such as issuing, information changing, loss supplementing and cancellation can generate historical data such as data modification and deletion in the operation process. Assume that a certain identity card comprises: information such as ID (identity card), name, address, sex, birthday, etc., the identity card undergoes the following business processes:
(1) The identity card is registered and issued in 1 st 1981;
(2) 5.5.1998, transacting renaming business, changing Name from Name to Name1;
(3) Address change business is transacted in 4 th and 8 th 2011, and the Address is changed from Address to Address1;
(4) 2022, 3 and 4, the cancellation service is handled by the immigration.
The history data generated by the identity card is stored as follows according to a log-like data model:
Record1={t_id,’I’,’1981-01-01’,ID,Name,Address,Sex,Birthday}
Record2={t_id,’U’,’1998-05-05’,ID,Name1,Address,Sex, Birthday}
Record3={t_id,’U’,’2011-04-08’,ID,Name1,Address1,Sex,Birthday}
Record4={t_id,’D’,’2022-03-04’,ID,Name1,Address1,Sex,Birthday}
it should be noted that, in the relational data sources such as Oracle and Mysql, the data acquisition method is as follows:
1. the timestamp mark is suitable for the situation that the data source contains information such as an update timestamp, a deletion mark and the like, and the information is directly converted into t_optype and t_optimal in the class log model during data acquisition;
2. the log extraction is suitable for the conditions that a data source starts a log function and has a proper log extraction function, and all information in a class log model is extracted from a log file:
3. the trigger is suitable for the condition of abundant data source calculation/storage resources, and stores the related information of the log-like model into the history table when the data source creates the trigger and the history table and performs the addition, modification and deletion operation on the data source;
4. the snapshot comparison is suitable for the condition that the data volume is not large, the last snapshot of the source data is established in the data warehouse, the source data and the snapshot data are compared regularly, and relevant information of the log-like model is judged and generated.
In a specific application, after the preprocessed historical data is stored in the data warehouse according to the class log data model, the method further comprises:
and (5) performing operation performance optimization on the data in the data warehouse in a bucket dividing mode.
It should be noted that, in this embodiment, the data warehouse stores the history data by using the log-like data model same as the data collection, which means that the collected data can directly enter the data warehouse without buffering the source layer. Moreover, each piece of data in the log-like data model is only related to the current attribute, operation and the like of the data, has no direct relation with the previous data, is not like a traditional zipper table, and is required to update the expiration date of the validity period of the previous data when the data is updated or deleted each time, meanwhile, the model conversion from data acquisition to a data warehouse is eliminated, the low-efficiency updating operation is avoided, and finally the query performance is improved by adopting technologies such as dividing and classifying barrels, so that the query efficiency can be greatly improved, and the user experience is improved.
It should be noted that the partitioning may be performed according to t_optimal (operation time) or according to granularity of the number of months, years, etc.; and (3) carrying out barrel separation according to t_id (data identification), and putting the data with the same data identification together for later users to carry out data query or call.
In a specific application, the window function is an OLAP function.
It should be noted that the OLAP function is a standard SQL function added to implement OLAP, and provides support in the latest versions of databases such as oracle, SQL Server, DB2, hive2, mpp, and the like.
Note that the syntax of OLAP function is:
< Window function > OVER ([ parameter BY < column List > ] ORDER BY < column List for ordering >)
Window functions include both aggregate functions (max, min, sum, etc.) that can be used for window functions, and dedicated window functions (RANK, function_rank, row_number, etc.); OVER, component BY, ORDER BY as keywords; [] The content of (c) may be omitted.
The SQL statement for inquiring the current latest data by using the window function is as follows:
SELECT
t1.t_id,
t1.t_optype,
t1.t_optime,
t1.c_data1,
t1.c_data2,
...t1.c_dataN
FROM
(
SELECT
ROW_NUMBER()OVER (PARTITION BY t_idORDER BY t_optimeDESC)r1 ,
t_id,
t_optype,
t_optime,
c_data1,
c_data2,
...c_dataN
FROM table name) t1
Where t1.r1 = 1 and t1.t_optype<>’D’;
Querying historical snapshot data at a point (assumed to be p_time)
SELECT
t1.t_id,
t1.t_optype,
t1.t_optime,
t1.c_data1,
t1.c_data2,
...t1.c_dataN
FROM
(
SELECT
ROW_NUMBER()OVER (PARTITION BY t_idORDER BY t_optimeDESC)r1 ,
t_id,
t_optype,
t_optime,
c_data1,
c_data2,
...c_dataN
FROM table name) t1
Where t1.r1 = 1 and t1.t_optype<>’D’ and t1.t_optime>=p_time;
Based on functions such as ordering and partitioning of window functions and a conventional query function, the latest data is screened out from the historical data, the finally deleted data is filtered out, and the current latest data or a data snapshot of any historical time point is obtained.
In a specific application, the scheme of the embodiment is applied to data warehouse products such as Hive, MPP and the like.
It should be noted that, the technical scheme provided by the embodiment combines the characteristics of the data warehouse product, stores the data in the data warehouse in a log-like data model mode, does not need to perform model conversion, greatly improves the working efficiency, stores the data by dividing the barrels, directly inquires the current latest data and the historical snapshot data at any time point in the data warehouse in a window function mode, and is more convenient and quick, good in user experience and high in satisfaction.
Example two
In one embodiment, referring to FIG. 4, FIG. 4 is a functional block diagram of a data warehouse historical data storage and processing system 400, according to an exemplary embodiment, comprising:
the acquiring module 401 is configured to acquire historical data in a data source, and perform preprocessing on the historical data to obtain preprocessed historical data;
a storage module 402, configured to store the preprocessed historical data into a data warehouse according to a class log data model;
and a query module 403, configured to query the data warehouse for current latest data and historical snapshot data at any time point according to a window function in the data warehouse platform.
It should be noted that, the application scenario applicable to the technical solution provided in this embodiment is an occasion of querying, processing, and using data in the data warehouse, for example: schools, banks, government halls, etc.
It may be appreciated that, in the technical solution provided in this embodiment, the obtaining module 401 is configured to obtain the history data in the data source, and perform preprocessing on the history data to obtain the preprocessed history data, the storage module 402 is configured to store the preprocessed history data in the data warehouse according to the log-like data model, and the query module 403 is configured to query the current latest data and the history snapshot data at any time point in the data warehouse according to the window function in the data warehouse platform.
Example III
A data warehouse historical data storage and processing apparatus comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
It should be noted that, the application scenario applicable to the technical solution provided in this embodiment is an occasion of querying, processing, and using data in the data warehouse, for example: schools, banks, government halls, etc.
It should be noted that, since the implementation manner and the beneficial effects of each module in the embodiment can be referred to the related description in the first embodiment, the description of the embodiment is omitted.
The computer-readable storage medium disclosed in the present embodiment includes, but is not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
It is to be understood that the same or similar parts in the above embodiments may be referred to each other, and that in some embodiments, the same or similar parts in other embodiments may be referred to.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (10)

1. A method for storing and processing historical data of a data warehouse, comprising:
acquiring historical data in a data source, and preprocessing the historical data to obtain preprocessed historical data;
storing the preprocessed historical data into a data warehouse according to a class log data model;
and inquiring the current latest data and the historical snapshot data at any time point in the data warehouse according to the window function in the data warehouse platform.
2. The method of claim 1, wherein the obtaining historical data in the data source comprises:
based on the log-like data model, acquiring historical data for processing data in a data source by adopting a timestamp identification, log extraction, trigger or snapshot comparison mode;
wherein the history data includes: the operation type, operation time and operation content of the data;
the processing of the data comprises: new addition, update, deletion.
3. The method according to claim 2, wherein the preprocessing the history data is performed to obtain preprocessed history data, specifically:
and adding a corresponding unique data identifier into each piece of historical data to obtain preprocessed historical data.
4. A method according to claim 3, wherein the unique data identifier is a primary key or a joint primary key in the data;
if the data identifier is a single primary key, converting the data identifier into a character string, and performing MD5 hash value operation to obtain a single data identifier;
if the data identifier is a joint primary key, converting the data identifier into a character string, splicing the character string into the character string, and performing MD5 hash value operation to obtain a single data identifier.
5. The method according to claim 4, wherein the log-like data model is specifically:
Record={t_id,t_optype,t_optime,c_data1,c_data2,...c_dataN}。
6. the method of claim 1, wherein after storing the preprocessed historical data in the data warehouse in the log-like data model, further comprising:
and (5) performing operation performance optimization on the data in the data warehouse in a bucket dividing mode.
7. The method of claim 1, wherein the window function is an OLAP function.
8. The method of any one of claims 1-7, wherein the method is applied to Hive, MPP data warehouse products.
9. A data warehouse historical data storage and processing system, comprising:
the acquisition module is used for acquiring historical data in the data source, preprocessing the historical data and obtaining preprocessed historical data;
the storage module is used for storing the preprocessed historical data into the data warehouse according to the class log data model;
and the query module is used for querying the current latest data and the historical snapshot data at any time point in the data warehouse according to the window function in the data warehouse platform.
10. A data warehouse history data storage and processing apparatus, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
CN202310586364.1A 2023-05-19 2023-05-19 Data warehouse historical data storage and processing method, system and device Active CN116303834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310586364.1A CN116303834B (en) 2023-05-19 2023-05-19 Data warehouse historical data storage and processing method, system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310586364.1A CN116303834B (en) 2023-05-19 2023-05-19 Data warehouse historical data storage and processing method, system and device

Publications (2)

Publication Number Publication Date
CN116303834A true CN116303834A (en) 2023-06-23
CN116303834B CN116303834B (en) 2024-03-08

Family

ID=86820795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310586364.1A Active CN116303834B (en) 2023-05-19 2023-05-19 Data warehouse historical data storage and processing method, system and device

Country Status (1)

Country Link
CN (1) CN116303834B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116483873A (en) * 2023-06-25 2023-07-25 云筑信息科技(成都)有限公司 Method for realizing historical snapshot data query based on change data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224635A (en) * 2015-09-24 2016-01-06 广州精点计算机科技有限公司 A kind of parallel OLAP construction device based on mixture model and construction method
US20160098450A1 (en) * 2014-10-03 2016-04-07 The Regents Of The University Of Michigan Querying input data
CN106708965A (en) * 2016-12-01 2017-05-24 北京奇虎科技有限公司 Data processing method and apparatus
US20190227503A1 (en) * 2018-01-19 2019-07-25 Siemens Aktiengesellschaft Method and system for collecting data from a data-source into a mom data warehouse
CN112650743A (en) * 2020-12-30 2021-04-13 咪咕文化科技有限公司 Funnel data analysis method and system, electronic device and storage medium
CN113886404A (en) * 2020-07-03 2022-01-04 北京达佳互联信息技术有限公司 Method and device for updating data, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160098450A1 (en) * 2014-10-03 2016-04-07 The Regents Of The University Of Michigan Querying input data
CN105224635A (en) * 2015-09-24 2016-01-06 广州精点计算机科技有限公司 A kind of parallel OLAP construction device based on mixture model and construction method
CN106708965A (en) * 2016-12-01 2017-05-24 北京奇虎科技有限公司 Data processing method and apparatus
US20190227503A1 (en) * 2018-01-19 2019-07-25 Siemens Aktiengesellschaft Method and system for collecting data from a data-source into a mom data warehouse
CN113886404A (en) * 2020-07-03 2022-01-04 北京达佳互联信息技术有限公司 Method and device for updating data, electronic equipment and storage medium
CN112650743A (en) * 2020-12-30 2021-04-13 咪咕文化科技有限公司 Funnel data analysis method and system, electronic device and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116483873A (en) * 2023-06-25 2023-07-25 云筑信息科技(成都)有限公司 Method for realizing historical snapshot data query based on change data

Also Published As

Publication number Publication date
CN116303834B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
US8140495B2 (en) Asynchronous database index maintenance
WO2017096939A1 (en) Method for establishing index on hdfs-based spark-sql big-data processing system
US20080270382A1 (en) System and Method of Personalizing Information Object Searches
US7831614B2 (en) System and method for generating SQL using templates
US8661022B2 (en) Database management method and system
CN101158958B (en) Fusion enquire method based on MySQL storage engines
US20060288045A1 (en) Method for aggregate operations on streaming data
CN116303834B (en) Data warehouse historical data storage and processing method, system and device
JP2001014329A (en) Database processing method and implementation device, and medium stored with the processing program
CN109086382B (en) Data synchronization method, device, equipment and storage medium
US10664459B2 (en) Database managing method, database managing system, and database tree structure
CN112231321A (en) Oracle secondary index and index real-time synchronization method
US7159171B2 (en) Structured document management system, structured document management method, search device and search method
EP2463785A1 (en) Database and search-engine query system
WO2018059430A1 (en) Database searching
CN115757629A (en) Multi-source heterogeneous data increment synchronization method and system, storage medium and electronic equipment
CN111177303A (en) Phoenix-based Hbase secondary full-text indexing method and system
US8407242B2 (en) Temporal binding for semantic queries
US20180341709A1 (en) Unstructured search query generation from a set of structured data terms
JP2015176407A (en) Search device, search method, search program and search data structure
US7039646B2 (en) Method and system for compressing varying-length columns during index high key generation
WO2011099082A1 (en) Database management system
CN112269807A (en) Configurable credit file retrieval method and system based on Elasticissearch technology
US11537568B1 (en) Efficient data processing for schema changes
US8666972B2 (en) System and method for content management and determination of search conditions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant