CN116431654A - Data storage method, device, medium and computing equipment based on integration of lake and warehouse - Google Patents
Data storage method, device, medium and computing equipment based on integration of lake and warehouse
- Publication number
- CN116431654A (application CN202310670792.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- real
- offline
- time
- partition table
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/2379—Updates performed during online database operations; commit processing
- G06F16/2228—Indexing structures
- G06F16/2264—Multidimensional index structures
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the invention provide a data storage method, device, medium and computing equipment based on lake and warehouse integration. The method comprises the following steps: storing the real-time data in a real-time topic into a real-time data source table, wherein the real-time topic is a topic created in Kafka and the real-time data source table is a data source table in Kafka; storing the offline data in an offline topic into an offline data source table, wherein the offline topic is a topic created in Kafka and the offline data source table is a data source table in Kafka; performing incremental updating on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table to obtain a target source layer partition table, the target source layer partition table comprising incremental update data; and writing the incremental update data in the target source layer partition table into the dimension partition table, the fact partition table and the partition wide table. The invention reduces the resources used by data storage and thereby improves the efficiency of data storage.
Description
Technical Field
The embodiment of the invention relates to the technical field of data storage, in particular to a lake and warehouse integrated data storage method, device, medium and computing equipment.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
At present, data storage mostly relies on a columnar database together with Hive or Kafka. The data must be stored in two sets of storage frameworks, two sets of identical code must be developed for the same requirement, and business data with offline requirements and real-time requirements cannot be managed in a unified way. As a result, the existing data storage approach uses more resources and its storage efficiency is low.
Disclosure of Invention
In this context, the embodiments of the invention are expected to provide a data storage method, device, medium and computing equipment based on lake and warehouse integration, which can improve the efficiency of data storage.
In a first aspect of the embodiment of the invention, a data storage method based on integration of a lake and a warehouse is provided, which comprises the following steps:
storing the real-time data in the real-time topic into a real-time data source table; wherein the real-time topic is a topic created in Kafka; the real-time data source table is a data source table in the Kafka;
storing the offline data in the offline topic into an offline data source table; wherein the offline topic is a topic created in the Kafka; the offline data source table is a data source table in the Kafka;
performing incremental update on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table to obtain a target source layer partition table; the target source layer partition table comprises incremental update data;
writing incremental update data in the target source layer partition table into a dimension partition table, a fact partition table and a partition wide table; wherein the dimension partition table and the fact partition table are located at a dimension modeling layer of the data warehouse; the partition wide table is located at a data service layer of the data warehouse.
In one example of this embodiment, before storing the real-time data in the real-time topic into the real-time data source table, the method further includes:
storing a current real-time state of the real-time topic and storing a current offline state of the offline topic;
monitoring the storage states of the real-time data and the offline data to obtain a monitoring result;
if the monitoring result indicates that the real-time data storage fails, restoring the real-time topic according to the current real-time state;
and if the monitoring result indicates that the offline data storage fails, restoring the offline topic according to the current offline state.
In one example of this embodiment, the incremental updating of the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table to obtain a target source layer partition table includes:
acquiring a data format of a source layer partition table in a data warehouse;
adjusting the format of the real-time data and the format of the offline data according to the data format to obtain target real-time data and target offline data;
comparing the target real-time data with the data in the source layer partition table to obtain real-time newly-added data and/or real-time change data in the target real-time data;
performing incremental update on the source layer partition table based on the real-time newly-added data and/or the real-time change data to obtain an initially updated source layer partition table;
comparing the target offline data with the data in the initially updated source layer partition table to obtain offline newly-added data and/or offline change data in the target offline data; wherein the real-time newly-added data and/or the real-time change data and the offline newly-added data and/or the offline change data are incremental update data;
and performing incremental update on the initially updated source layer partition table based on the offline newly-added data and/or the offline change data to obtain a target source layer partition table.
In one example of this embodiment, writing the incremental update data in the target source layer partition table into the dimension partition table, the fact partition table, and the partition wide table includes:
correcting the increment updating data to obtain corrected data;
determining dimension data and fact data from the correction data; wherein the correction data consists of the dimension data and the fact data;
writing the dimension data into a dimension partition table in a dimension modeling layer of the data warehouse;
writing the fact data into a fact partition table in the dimension modeling layer;
the dimension data in the dimension partition table and the fact data in the fact partition table are aggregated to obtain aggregated data;
and writing the aggregate data into a partition wide table in a data service layer of the data warehouse.
In one example of this embodiment, after writing the aggregate data into the partition wide table in the data service layer of the data warehouse, the method further includes:
acquiring a writing position of data writing;
acquiring an original record corresponding to the writing position and a record keyword of writing information;
if the original record is empty, storing the writing position and the record keyword in an index table in an associated manner;
and if the original record is not empty and the original record is not matched with the record keyword, updating the index information matched with the writing position in the index table according to the record keyword to obtain an updated index table.
In a second aspect of the embodiments of the present invention, there is provided a data storage device based on lake and warehouse integration, comprising:
the first storage unit is used for storing the real-time data in the real-time topic into the real-time data source table; wherein the real-time topic is a topic created in Kafka; the real-time data source table is a data source table in the Kafka;
the second storage unit is used for storing the offline data in the offline topic into an offline data source table; wherein the offline topic is a topic created in the Kafka; the offline data source table is a data source table in the Kafka;
the updating unit is used for carrying out incremental updating on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table to obtain a target source layer partition table; the target source layer partition table comprises incremental update data;
the writing unit is used for writing the incremental update data in the target source layer partition table into a dimension partition table, a fact partition table and a partition wide table; wherein the dimension partition table and the fact partition table are located at a dimension modeling layer of the data warehouse; the partition wide table is located at a data service layer of the data warehouse.
In an example of this embodiment, the first storage unit is further configured to:
storing a current real-time state of the real-time topic and a current offline state of the offline topic before storing real-time data in the real-time topic into a real-time data source table;
monitoring the storage states of the real-time data and the offline data to obtain a monitoring result;
if the monitoring result indicates that the real-time data storage fails, restoring the real-time topic according to the current real-time state;
and if the monitoring result indicates that the offline data storage fails, restoring the offline topic according to the current offline state.
In one example of this embodiment, the manner in which the updating unit performs incremental updating on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table to obtain the target source layer partition table is specifically:
acquiring a data format of a source layer partition table in a data warehouse;
adjusting the format of the real-time data and the format of the offline data according to the data format to obtain target real-time data and target offline data;
comparing the target real-time data with the data in the source layer partition table to obtain real-time newly-added data and/or real-time change data in the target real-time data;
performing incremental update on the source layer partition table based on the real-time newly-added data and/or the real-time change data to obtain an initially updated source layer partition table;
comparing the target offline data with the data in the initially updated source layer partition table to obtain offline newly-added data and/or offline change data in the target offline data; wherein the real-time newly-added data and/or the real-time change data and the offline newly-added data and/or the offline change data are incremental update data;
and performing incremental update on the initially updated source layer partition table based on the offline newly-added data and/or the offline change data to obtain a target source layer partition table.
In a third aspect of embodiments of the present invention, there is provided a computing device comprising: at least one processor, memory, and input output unit; wherein the memory is for storing a computer program and the processor is for invoking the computer program stored in the memory to perform the method of any of the first aspects.
In a fourth aspect of the embodiments of the present invention, there is provided a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any of the first aspects.
According to the lake and warehouse integrated data storage method, device, medium and computing equipment, real-time data and offline data can be stored in a single storage framework, and both can be stored in the same source layer partition table, dimension partition table, fact partition table and partition wide table, so that the resources used for data storage are reduced and the data storage efficiency is improved.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 is a schematic flow chart of a data storage method based on integration of a lake and a warehouse according to an embodiment of the invention;
FIG. 2 is a schematic structural diagram of a data storage device based on integration of a lake and a warehouse according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a medium according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computing device according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable those skilled in the art to better understand and practice the invention and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the invention may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms, namely: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a data storage method, device, medium and computing equipment based on integration of a lake and a warehouse are provided.
It should be noted that any number of elements in the figures are for illustration and not limitation, and that any naming is used for distinction only and not for limitation.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments thereof.
Exemplary method
Referring to fig. 1, fig. 1 is a flow chart of a data storage method based on integration of a lake and a warehouse according to an embodiment of the invention. It should be noted that embodiments of the present invention may be applied to any scenario where applicable.
The flow of the data storage method based on integration of the lake and the warehouse, which is provided by the embodiment of the invention and is shown in fig. 1, comprises the following steps:
Step S101, storing the real-time data in the real-time topic into a real-time data source table.
In the embodiment of the invention, the real-time topic is a topic created in Kafka; the real-time data source table is the data source table in Kafka.
For example, a real-time topic topic_online is created in Kafka, and a real-time data source table topic_online1 may also be created in Kafka.
As an alternative embodiment, before performing step S101, the following steps may also be performed:
storing a current real-time state of the real-time topic and storing a current offline state of the offline topic;
monitoring the storage states of the real-time data and the offline data to obtain a monitoring result;
if the monitoring result indicates that the real-time data storage fails, restoring the real-time topic according to the current real-time state;
and if the monitoring result indicates that the offline data storage fails, restoring the offline topic according to the current offline state.
By implementing this embodiment, the current real-time state of the real-time topic and the current offline state of the offline topic are stored before the real-time data and the offline data are stored. This avoids the situation in which the topics cannot be restored to their pre-storage state when storing the real-time data or the offline data fails, and thus preserves the integrity of the topic states when data transmission fails.
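The save, monitor and restore sequence described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Topic` class is a hypothetical in-memory stand-in for a Kafka topic, and `fail_at` is a hypothetical hook used only to simulate a storage failure.

```python
import copy


class Topic:
    """Hypothetical in-memory stand-in for the state of a Kafka topic."""

    def __init__(self, name, records):
        self.name = name
        self.records = list(records)
        self.offset = 0  # index of the next record to store


def store_with_restore(topic, sink, fail_at=None):
    """Copy pending records from topic into sink; roll the topic back on failure.

    fail_at is a hypothetical test hook: the offset at which storage fails.
    Returns True if storage succeeded, False if the saved state was restored.
    """
    saved_state = copy.deepcopy(topic.__dict__)  # store the current state first
    written = []
    try:
        while topic.offset < len(topic.records):
            if fail_at is not None and topic.offset == fail_at:
                raise IOError("simulated storage failure")
            written.append(topic.records[topic.offset])
            topic.offset += 1
    except IOError:
        # The monitoring result indicates a failure: restore the saved state.
        topic.__dict__.update(saved_state)
        return False
    sink.extend(written)  # publish only after every record was handled
    return True
```

The same function would be applied once to the real-time topic and once to the offline topic, so that a failure on one side restores only that side.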
Step S102, storing the offline data in the offline topic into an offline data source table.
In the embodiment of the invention, the offline topic is a topic created in Kafka; the offline data source table is a data source table in Kafka.
For example, an offline topic topic_offline is created in Kafka, and an offline data source table topic_offline1 may also be created in Kafka.
Optionally, the Flink computing engine may call the FlinkKafkaConsumer function to store the real-time data in topic_online into topic_online1 in Kafka, and may likewise call the FlinkKafkaConsumer function to push the offline data in topic_offline into topic_offline1 in Kafka. Neither the real-time data nor the offline data is processed, so the data keeps its original form. During offline data storage and real-time data storage, the enableCheckpointing function of Flink must be called to enable checkpoints, and the setStateBackend function must be called to set the state backend that works together with the checkpoints, so that the checkpoints maintain the offsets of the Kafka topic consumer threads.
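The role of the checkpoint in maintaining the consumer offset can be sketched as follows. This is a toy model in plain Python, not the Flink API: the class `CheckpointedForwarder`, its fields and the `checkpoint_every` parameter are hypothetical names chosen for illustration, and Python lists stand in for topic_online and topic_online1.

```python
class CheckpointedForwarder:
    """Forwards records unchanged and periodically checkpoints the read offset."""

    def __init__(self, source, sink, checkpoint_every=2):
        self.source = source              # list standing in for topic_online
        self.sink = sink                  # list standing in for topic_online1
        self.checkpoint_every = checkpoint_every
        self.offset = 0                   # next record to read
        self.checkpoint = {"offset": 0, "sink_len": len(sink)}

    def run(self, max_records=None):
        """Forward records until the source is exhausted or max_records is hit."""
        processed = 0
        while self.offset < len(self.source):
            if max_records is not None and processed == max_records:
                break
            self.sink.append(self.source[self.offset])  # no transformation
            self.offset += 1
            processed += 1
            if self.offset % self.checkpoint_every == 0:
                self.checkpoint = {"offset": self.offset,
                                   "sink_len": len(self.sink)}

    def recover(self):
        """Rewind to the last checkpoint and discard uncheckpointed output."""
        self.offset = self.checkpoint["offset"]
        del self.sink[self.checkpoint["sink_len"]:]
```

After a failure, calling `recover()` and then `run()` again resumes from the checkpointed offset, so no record is lost or written twice in this simplified model.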
Step S103, performing incremental update on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table to obtain a target source layer partition table.
In the embodiment of the present invention, the target source layer partition table includes incremental update data.
As an optional implementation manner, in step S103, incremental updating is performed on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table, so as to obtain the target source layer partition table, which may specifically be:
acquiring a data format of a source layer partition table in a data warehouse;
adjusting the format of the real-time data and the format of the offline data according to the data format to obtain target real-time data and target offline data;
comparing the target real-time data with the data in the source layer partition table to obtain real-time newly-added data and/or real-time change data in the target real-time data;
performing incremental update on the source layer partition table based on the real-time newly-added data and/or the real-time change data to obtain an initially updated source layer partition table;
comparing the target offline data with the data in the initially updated source layer partition table to obtain offline newly-added data and/or offline change data in the target offline data; wherein the real-time newly-added data and/or the real-time change data and the offline newly-added data and/or the offline change data are incremental update data;
and performing incremental update on the initially updated source layer partition table based on the offline newly-added data and/or the offline change data to obtain a target source layer partition table.
By implementing this embodiment, the real-time data and the offline data can be updated incrementally, which avoids storing already-stored data again and improves the data storage efficiency.
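The two-pass incremental update described above (real-time data first, then offline data) can be sketched in Python as follows. The sketch assumes that each record is a dictionary with an `id` key and that the source layer partition table is a dictionary keyed by that id; the function names and the `columns` parameter are hypothetical.

```python
def normalize(record, columns):
    """Adjust a raw record to the column layout of the source layer table."""
    return {col: record.get(col) for col in columns}


def incremental_update(ods_table, incoming, columns, key="id"):
    """Upsert incoming records into ods_table; return (added, changed) rows."""
    added, changed = [], []
    for raw in incoming:
        row = normalize(raw, columns)
        existing = ods_table.get(row[key])
        if existing is None:
            added.append(row)        # newly-added data
        elif existing != row:
            changed.append(row)      # change data
        else:
            continue                 # identical row: nothing to write
        ods_table[row[key]] = row
    return added, changed


def update_ods(ods_table, realtime, offline, columns):
    """Apply real-time data, then offline data, and collect all increments."""
    rt_added, rt_changed = incremental_update(ods_table, realtime, columns)
    # ods_table is now the initially updated table; offline data is compared
    # against it, so rows already delivered in real time are not written again.
    off_added, off_changed = incremental_update(ods_table, offline, columns)
    return rt_added + rt_changed + off_added + off_changed
```

Because the offline pass is compared against the initially updated table, a row that arrives through both channels is written only once.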
In the embodiment of the invention, a data warehouse source layer (Operational Data Store, ODS) partition table hudi_ods can be constructed in Hudi. The table type is MOR (Merge On Read), the underlying storage is based on the Hadoop Distributed File System (HDFS), and incremental updates use the delta_commit function in Hudi. FlinkSQL is used to configure the fields, storage format, partition information and serialization of the real-time data source table topic_online1 and the offline data source table topic_offline1 in Kafka and of the target table hudi_ods. The hudi_ods table updates duplicate data and appends non-duplicate data, which realizes incremental writing into the hudi_ods table. Storage in Hudi takes the form of a log, and the data set is modeled as key-value pairs.
Step S104, writing the incremental update data in the target source layer partition table into a dimension partition table, a fact partition table and a partition wide table.
In the embodiment of the invention, the dimension partition table and the fact partition table are positioned in a dimension modeling layer of the data warehouse; the partition wide table is located at a data service layer of the data warehouse.
As an optional implementation manner, the manner of writing the incremental update data in the target source layer partition table into the dimension partition table, the fact partition table and the partition wide table in step S104 may specifically be:
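The log-style, key-value storage model of a Merge On Read table can be illustrated with a small Python sketch. It is a simplified model of the general idea (writes append to a log, reads merge the log over a base snapshot), not Hudi's actual file layout; the class and method names are hypothetical.

```python
class MergeOnReadTable:
    """Toy Merge On Read table: a base snapshot plus an append-only delta log."""

    def __init__(self):
        self.base = {}   # record key -> row, the compacted snapshot
        self.log = []    # (record key, row) entries not yet compacted

    def upsert(self, key, row):
        """Writes only append to the log; nothing is rewritten in place."""
        self.log.append((key, row))

    def read(self):
        """Merge the log over the base at read time; later entries win."""
        merged = dict(self.base)
        for key, row in self.log:
            merged[key] = row
        return merged

    def compact(self):
        """Fold the log into the base snapshot and clear the log."""
        self.base = self.read()
        self.log = []
```

In this model a duplicate key is resolved at read time, which mirrors the statement that duplicate data is updated while non-duplicate data is appended.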
correcting the increment updating data to obtain corrected data;
determining dimension data and fact data from the correction data; wherein the correction data consists of the dimension data and the fact data;
writing the dimension data into a dimension partition table in a dimension modeling layer of the data warehouse;
writing the fact data into a fact partition table in the dimension modeling layer;
the dimension data in the dimension partition table and the fact data in the fact partition table are aggregated to obtain aggregated data;
and writing the aggregate data into a partition wide table in a data service layer of the data warehouse.
By implementing this embodiment, the real-time data and the offline data can be stored in multiple formats, so that when different applications need data in different formats, data in the required format can be obtained quickly.
In the embodiment of the invention, a dimension partition table hudi_dwd_dim and a fact partition table hudi_dwd_face can be constructed in the data warehouse dimension modeling layer (Data Warehouse Detail, DWD) in Hudi, and a partition wide table hudi_dws can be constructed in the data warehouse data service layer (Data Warehouse Service, DWS) in Hudi; all of these tables may use the MOR type.
In addition, FlinkSQL may be used to correct the data in hudi_ods: ETL (Extract, Transform, Load) processing is applied to the data in hudi_ods, dirty data is filtered out, and the other fields are standardized for consistency. The resulting fact data and dimension data are written into the hudi_dwd_face and hudi_dwd_dim tables respectively, and the DWS layer aggregates the fact data and dimension data of the DWD layer. When data is written into the DWD layer and the DWS layer, the index information of the data is synchronously written into an index table (HBase).
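The correction, splitting and aggregation flow can be sketched in Python as follows. The field names (`order_id`, `user_id`, `user_name`, `amount`), the dirty-data rules and the aggregation by user are all assumptions made for illustration; the embodiment does not specify them.

```python
def correct(rows):
    """Drop dirty rows and standardize string fields (assumed rules)."""
    cleaned = []
    for row in rows:
        if row.get("order_id") is None or row.get("user_id") is None:
            continue                          # missing key: dirty data
        amount = row.get("amount")
        if amount is None or amount < 0:
            continue                          # invalid measure: dirty data
        name = str(row.get("user_name", "")).strip()
        cleaned.append({**row, "user_name": name})
    return cleaned


def split(rows):
    """Separate corrected rows into dimension data and fact data."""
    dims = {r["user_id"]: {"user_id": r["user_id"], "user_name": r["user_name"]}
            for r in rows}
    facts = [{"order_id": r["order_id"], "user_id": r["user_id"],
              "amount": r["amount"]} for r in rows]
    return dims, facts


def build_wide(dims, facts):
    """Aggregate the facts per user and join in the dimension attributes."""
    wide = {}
    for fact in facts:
        uid = fact["user_id"]
        entry = wide.setdefault(
            uid, {**dims[uid], "order_count": 0, "total_amount": 0})
        entry["order_count"] += 1
        entry["total_amount"] += fact["amount"]
    return wide
```

Here `dims` plays the role of the dimension partition table, `facts` the fact partition table, and the result of `build_wide` the partition wide table.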
Optionally, after writing the aggregate data into the partition wide table in the data service layer of the data warehouse, the following steps may be further performed:
acquiring a writing position of data writing;
acquiring an original record corresponding to the writing position and a record keyword of writing information;
if the original record is empty, storing the writing position and the record keyword in an index table in an associated manner;
and if the original record is not empty and the original record is not matched with the record keyword, updating the index information matched with the writing position in the index table according to the record keyword to obtain an updated index table.
In this embodiment, the writing position of the data and the record keyword of the data can be recorded in the index table at the same time as the data is written, and when the data at any writing position changes, the record keyword corresponding to that writing position can be updated in the index table, which keeps the index table up to date.
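The index maintenance rule above can be written as a short Python sketch. The index table is modeled as a dictionary from writing position to record keyword, which is an assumption for illustration; the function name and return values are hypothetical.

```python
def update_index(index_table, write_position, record_key):
    """Keep index_table (writing position -> record keyword) in step with a write.

    Returns "inserted", "updated" or "unchanged" to describe what happened.
    """
    original = index_table.get(write_position)
    if original is None:
        # The original record is empty: store position and keyword together.
        index_table[write_position] = record_key
        return "inserted"
    if original != record_key:
        # A record exists but does not match the keyword: update the entry.
        index_table[write_position] = record_key
        return "updated"
    return "unchanged"
```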
In the embodiment of the invention, when an online analytical processing (Online Analytical Processing, OLAP) query platform queries the DWD layer and DWS layer data in Hudi through the query optimization module, the position information stored in HBase is queried first to obtain the corresponding position in Hudi, so that the required data is located quickly.
The query optimization module performs index optimization based on HBase. It is implemented mainly by modifying the Hudi source code to add a tagLocation function and an updateLocation function; the implementation class is HoodieIndex.
In the process of writing data, the tagLocation function is called to mark the position of each input record. It mainly uses locationTagFunction to process the original records: inside locationTagFunction, all records on the specified Hudi partition are traversed, recordKeys (record keys) are generated in batches, the corresponding information is fetched from the HBase index table, and the position information is then generated. After the data is written, the updateLocation function is called to update the recorded position information. It connects to HBase, traverses the write status information to obtain the new position information, and decides from the position of each HoodieRecord in WriteStatus whether the position information needs to be updated: updated records need no position update, newly inserted records have their position information updated, and for delete operations the position information stored in HBase is deleted synchronously.
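The tagging and updating behaviour described above can be approximated by the following Python sketch. It is not Hudi source code: a dictionary stands in for the HBase index table, and the field names `record_key`, `current_location` and `new_location` are hypothetical. The treatment of updated records as needing no index change is an interpretation of the description.

```python
def tag_location(records, index):
    """Attach the known location (or None) from the index to each record."""
    return [{**rec, "location": index.get(rec["record_key"])}
            for rec in records]


def update_location(write_statuses, index):
    """Apply write results to the index after the data has been written."""
    for status in write_statuses:
        key = status["record_key"]
        if status["new_location"] is None:
            index.pop(key, None)                  # delete: remove the entry
        elif status["current_location"] is None:
            index[key] = status["new_location"]   # insert: record the location
        # Otherwise the record was updated in place and its index entry stays.
    return index
```

A query would then look up the record key in the index first and read only the file at the returned location, instead of scanning the whole partition.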
The invention reduces the resources used by data storage and thereby improves the efficiency of data storage. It also preserves the integrity of the states of the real-time topic and the offline topic when data transmission fails, allows data in the required format to be obtained quickly, and keeps the index table up to date.
Exemplary apparatus
Having described the method of an exemplary embodiment of the present invention, a data storage device based on integration of a lake and a warehouse according to an exemplary embodiment of the present invention is described next with reference to FIG. 2. The device includes:
a first storage unit 201, configured to store real-time data in a real-time topic into a real-time data source table; wherein the real-time topic is a topic created in Kafka; the real-time data source table is a data source table in the Kafka;
a second storage unit 202, configured to store offline data in an offline topic into an offline data source table; wherein the offline topic is a topic created in the Kafka; the offline data source table is a data source table in the Kafka;
an updating unit 203, configured to perform incremental updating on the source layer partition table in the data warehouse according to the real-time data source table in the first storage unit 201 and the offline data source table in the second storage unit 202, so as to obtain a target source layer partition table; the target source layer partition table comprises incremental update data;
a writing unit 204, configured to write the incremental update data in the target source layer partition table obtained by the updating unit 203 into a dimension partition table, a fact partition table, and a partition wide table; wherein the dimension partition table and the fact partition table are located at a dimension modeling layer of the data warehouse; the partition wide table is located at a data service layer of the data warehouse.
As an alternative embodiment, the first storage unit 201 is further configured to:
storing a current real-time state of the real-time topic and a current offline state of the offline topic before storing the real-time data in the real-time topic into the real-time data source table;
monitoring the storage states of the real-time data and the offline data to obtain a monitoring result;
if the monitoring result indicates that the real-time data storage fails, restoring the real-time topic according to the current real-time state;
and if the monitoring result indicates that the offline data storage fails, restoring the offline topic according to the current offline state.
By implementing this embodiment, the current real-time state of the real-time topic and the current offline state of the offline topic are saved before the real-time data and the offline data are stored. If storing the real-time data or the offline data fails, the topics can therefore be restored to the state they had before the store was attempted, which preserves the integrity of the states of the real-time topic and the offline topic when data transmission fails.
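The save, monitor and restore steps above can be sketched as follows. The patent does not specify what a topic's "state" contains; this sketch assumes it is the committed offset per partition, and the `TopicState` class, the `store_with_restore` function and the message layout are all hypothetical. No Kafka client is used.

```python
from typing import Callable, Dict, List


class TopicState:
    """Hypothetical stand-in for a topic's state: committed offset per partition."""

    def __init__(self, offsets: Dict[int, int]) -> None:
        self.offsets = dict(offsets)


def store_with_restore(
    topic: TopicState,
    batch: List[dict],
    sink: Callable[[dict], None],
) -> bool:
    """Save the topic state, store the batch, and restore the state on failure."""
    saved = dict(topic.offsets)  # state saved before any data is stored
    try:
        for msg in batch:
            sink(msg)            # write one message into the data source table
            topic.offsets[msg["partition"]] = msg["offset"] + 1
    except Exception:
        # Monitoring result: storage failed. Roll the topic back to the saved state.
        topic.offsets = saved
        return False
    return True
```

The same function would be called once for the real-time topic and once for the offline topic, so a failure on one side leaves the other side's state untouched.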
As an optional implementation manner, the updating unit 203 may perform the incremental update on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table, so as to obtain the target source layer partition table, specifically by:
acquiring a data format of the source layer partition table in the data warehouse;
adjusting the format of the real-time data and the format of the offline data according to the data format to obtain target real-time data and target offline data;
comparing the target real-time data with the data in the source layer partition table to obtain real-time newly added data and/or real-time changed data in the target real-time data;
performing incremental update on the source layer partition table based on the real-time newly added data and/or the real-time changed data to obtain an initially updated source layer partition table;
comparing the target offline data with the data in the initially updated source layer partition table to obtain offline newly added data and/or offline changed data in the target offline data; wherein the real-time newly added data and/or the real-time changed data and the offline newly added data and/or the offline changed data are the incremental update data;
and performing incremental update on the initially updated source layer partition table based on the offline newly added data and/or the offline changed data to obtain the target source layer partition table.
By implementing this embodiment, the real-time data and the offline data are applied as incremental updates, so that data already stored is not stored again, which improves the efficiency of data storage.
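The two-pass comparison above can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the patented implementation: a table is modeled as a mapping from primary key to row, "adjusting the format" is assumed to mean projecting each row onto the source layer table's columns, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Table = Dict[str, Row]  # primary key -> row


def conform(rows: Table, columns: List[str]) -> Table:
    """Adjust incoming rows to the source layer table's column layout."""
    return {k: {c: r.get(c) for c in columns} for k, r in rows.items()}


def diff_against(table: Table, incoming: Table) -> Tuple[Table, Table]:
    """Return (newly added rows, changed rows) of incoming relative to table."""
    added = {k: v for k, v in incoming.items() if k not in table}
    changed = {k: v for k, v in incoming.items() if k in table and table[k] != v}
    return added, changed


def incremental_update(
    source_layer: Table, realtime: Table, offline: Table, columns: List[str]
) -> Tuple[Table, Table]:
    """Apply real-time data first, then offline data; return (target, increment)."""
    rt_added, rt_changed = diff_against(source_layer, conform(realtime, columns))
    initial = {**source_layer, **rt_added, **rt_changed}  # initially updated table
    off_added, off_changed = diff_against(initial, conform(offline, columns))
    target = {**initial, **off_added, **off_changed}      # target table
    increment = {**rt_added, **rt_changed, **off_added, **off_changed}
    return target, increment
```

Only the rows in `increment` need to be propagated to downstream tables; rows that are identical to what is already stored are never rewritten.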
As an optional implementation manner, the writing unit 204 may write the incremental update data in the target source layer partition table into the dimension partition table, the fact partition table and the partition wide table specifically by:
correcting the incremental update data to obtain corrected data;
determining dimension data and fact data from the corrected data; wherein the corrected data consists of the dimension data and the fact data;
writing the dimension data into a dimension partition table in a dimension modeling layer of the data warehouse;
writing the fact data into a fact partition table in the dimension modeling layer;
aggregating the dimension data in the dimension partition table and the fact data in the fact partition table to obtain aggregated data;
and writing the aggregated data into a partition wide table in a data service layer of the data warehouse.
By implementing this embodiment, the real-time data and the offline data are stored in several forms (dimension tables, fact tables and wide tables), so that when different applications need data in different formats, data in the required format can be obtained quickly.
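The correct, split and aggregate steps above can be sketched as follows. The schema is invented for illustration (order rows carrying `order_id`, `customer_id`, `customer_name`, `dt` and `amount`), the "correction" step is assumed to mean dropping rows without keys and trimming strings, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def clean(rows: List[Row]) -> List[Row]:
    """Correction step (assumed): drop rows lacking keys, trim string fields."""
    out = []
    for r in rows:
        if r.get("order_id") is None or r.get("customer_id") is None:
            continue
        out.append({k: v.strip() if isinstance(v, str) else v for k, v in r.items()})
    return out


def split(rows: List[Row]) -> Tuple[Dict[str, Row], List[Row]]:
    """Separate corrected rows into dimension data and fact data."""
    dims: Dict[str, Row] = {}
    facts: List[Row] = []
    for r in rows:
        dims[r["customer_id"]] = {
            "customer_id": r["customer_id"],
            "customer_name": r.get("customer_name"),
        }
        facts.append({
            "order_id": r["order_id"],
            "customer_id": r["customer_id"],
            "dt": r["dt"],
            "amount": r["amount"],
        })
    return dims, facts


def build_wide(dims: Dict[str, Row], facts: List[Row]) -> List[Row]:
    """Join facts to dimensions and aggregate per (partition date, customer)."""
    wide: Dict[Tuple[object, object], Row] = {}
    for f in facts:
        key = (f["dt"], f["customer_id"])
        row = wide.setdefault(key, {
            "dt": f["dt"],
            "customer_id": f["customer_id"],
            "customer_name": dims[f["customer_id"]]["customer_name"],
            "order_count": 0,
            "total_amount": 0,
        })
        row["order_count"] += 1
        row["total_amount"] += f["amount"]
    return list(wide.values())
```

In this sketch the `dt` field plays the role of the partition column, so each wide-table row belongs to exactly one partition.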
As an alternative embodiment, the writing unit 204 is further configured to:
after the aggregated data is written into the partition wide table in the data service layer of the data warehouse, acquiring the writing position at which the data was written;
acquiring the original record corresponding to the writing position and the record key of the written information;
if the original record is empty, storing the writing position and the record key in an index table in association with each other;
and if the original record is not empty and the original record does not match the record key, updating the index information matching the writing position in the index table according to the record key to obtain an updated index table.
In this embodiment, the writing position and the record key of the data are recorded in the index table when the data is written, and whenever the data at a writing position changes, the record key associated with that writing position is updated in the index table, which keeps the index table up to date.
The invention can reduce the resources used for data storage and thereby improve the efficiency of data storage. In addition, the invention can preserve the integrity of the states of the real-time topic and the offline topic when data transmission fails, can quickly obtain data in the required format, and can keep the index table up to date.
Exemplary Medium
Having described the method and apparatus of the exemplary embodiments of the present invention, a computer-readable storage medium of an exemplary embodiment of the present invention is described next with reference to FIG. 3. FIG. 3 shows the computer-readable storage medium as an optical disk 30 on which a computer program (i.e., a program product) is stored. When executed by a processor, the computer program performs the steps described in the above method embodiments, for example: storing real-time data in a real-time topic into a real-time data source table, wherein the real-time topic is a topic created in Kafka and the real-time data source table is a data source table in Kafka; storing offline data in an offline topic into an offline data source table, wherein the offline topic is a topic created in Kafka and the offline data source table is a data source table in Kafka; performing incremental update on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table to obtain a target source layer partition table, wherein the target source layer partition table comprises incremental update data; and writing the incremental update data in the target source layer partition table into a dimension partition table, a fact partition table and a partition wide table. The specific implementation of each step is not repeated here.
It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
Exemplary computing device
Having described the methods, apparatus and media of exemplary embodiments of the present invention, a computing device for data storage based on integration of a lake and a warehouse according to an exemplary embodiment of the present invention is described next with reference to FIG. 4.
FIG. 4 illustrates a block diagram of an exemplary computing device 40 suitable for implementing embodiments of the invention. The computing device 40 may be a computer system or a server. The computing device 40 shown in FIG. 4 is merely an example and should not be taken as limiting the functionality and scope of use of embodiments of the present invention.
As shown in FIG. 4, components of computing device 40 may include, but are not limited to: one or more processors or processing units 401, a system memory 402, and a bus 403 that connects the various system components (including the system memory 402 and the processing units 401).
The system memory 402 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 4021 and/or cache memory 4022. Computing device 40 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, ROM4023 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4 and commonly referred to as a "hard disk drive"). Although not shown in fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media), may be provided. In such cases, each drive may be coupled to bus 403 through one or more data medium interfaces. The system memory 402 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the invention.
A program/utility 4025 having a set (at least one) of program modules 4024 may be stored, for example, in system memory 402, and such program modules 4024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 4024 generally perform the functions and/or methodologies of the described embodiments of the present invention.
The processing unit 401 executes various functional applications and data processing by running a program stored in the system memory 402, for example: storing real-time data in a real-time topic into a real-time data source table, wherein the real-time topic is a topic created in Kafka and the real-time data source table is a data source table in Kafka; storing offline data in an offline topic into an offline data source table, wherein the offline topic is a topic created in Kafka and the offline data source table is a data source table in Kafka; performing incremental update on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table to obtain a target source layer partition table, wherein the target source layer partition table comprises incremental update data; and writing the incremental update data in the target source layer partition table into the dimension partition table, the fact partition table and the partition wide table. The specific implementation of each step is not repeated here.
It should be noted that although several units/modules or sub-units/sub-modules of the data storage device based on integration of a lake and a warehouse are mentioned in the above detailed description, this division is merely exemplary and not mandatory. Indeed, according to embodiments of the present invention, the features and functions of two or more units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.
In the description of the present invention, it should be noted that the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
Finally, it should be noted that the above examples are only specific embodiments of the present invention, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing examples, those skilled in the art should understand that anyone familiar with the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Furthermore, although the operations of the methods of the present invention are depicted in the drawings in a particular order, this neither requires nor implies that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
Claims (10)
1. A data storage method based on integration of a lake and a warehouse, comprising the following steps:
storing real-time data in a real-time topic into a real-time data source table; wherein the real-time topic is a topic created in Kafka; the real-time data source table is a data source table in the Kafka;
storing offline data in an offline topic into an offline data source table; wherein the offline topic is a topic created in the Kafka; the offline data source table is a data source table in the Kafka;
performing incremental update on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table to obtain a target source layer partition table; the target source layer partition table comprises incremental update data;
writing the incremental update data in the target source layer partition table into a dimension partition table, a fact partition table and a partition wide table; wherein the dimension partition table and the fact partition table are located at a dimension modeling layer of the data warehouse; the partition wide table is located at a data service layer of the data warehouse.
2. The data storage method based on integration of a lake and a warehouse of claim 1, wherein before storing the real-time data in the real-time topic into the real-time data source table, the method further comprises:
storing a current real-time state of the real-time topic and storing a current offline state of the offline topic;
monitoring the storage states of the real-time data and the offline data to obtain a monitoring result;
if the monitoring result indicates that the real-time data storage fails, restoring the real-time topic according to the current real-time state;
and if the monitoring result indicates that the offline data storage fails, restoring the offline topic according to the current offline state.
3. The data storage method based on integration of a lake and a warehouse of claim 1, wherein performing incremental update on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table to obtain a target source layer partition table comprises the following steps:
acquiring a data format of the source layer partition table in the data warehouse;
adjusting the format of the real-time data and the format of the offline data according to the data format to obtain target real-time data and target offline data;
comparing the target real-time data with the data in the source layer partition table to obtain real-time newly added data and/or real-time changed data in the target real-time data;
performing incremental update on the source layer partition table based on the real-time newly added data and/or the real-time changed data to obtain an initially updated source layer partition table;
comparing the target offline data with the data in the initially updated source layer partition table to obtain offline newly added data and/or offline changed data in the target offline data; wherein the real-time newly added data and/or the real-time changed data and the offline newly added data and/or the offline changed data are the incremental update data;
and performing incremental update on the initially updated source layer partition table based on the offline newly added data and/or the offline changed data to obtain the target source layer partition table.
4. The data storage method based on integration of a lake and a warehouse according to any one of claims 1-3, wherein writing the incremental update data in the target source layer partition table into a dimension partition table, a fact partition table and a partition wide table comprises:
correcting the incremental update data to obtain corrected data;
determining dimension data and fact data from the corrected data; wherein the corrected data consists of the dimension data and the fact data;
writing the dimension data into a dimension partition table in a dimension modeling layer of the data warehouse;
writing the fact data into a fact partition table in the dimension modeling layer;
aggregating the dimension data in the dimension partition table and the fact data in the fact partition table to obtain aggregated data;
and writing the aggregated data into a partition wide table in a data service layer of the data warehouse.
5. The data storage method based on integration of a lake and a warehouse of claim 4, wherein after writing the aggregated data into the partition wide table in the data service layer of the data warehouse, the method further comprises:
acquiring the writing position at which the data was written;
acquiring the original record corresponding to the writing position and the record key of the written information;
if the original record is empty, storing the writing position and the record key in an index table in association with each other;
and if the original record is not empty and the original record does not match the record key, updating the index information matching the writing position in the index table according to the record key to obtain an updated index table.
6. A data storage device based on integration of a lake and a warehouse, comprising:
a first storage unit, configured to store real-time data in a real-time topic into a real-time data source table; wherein the real-time topic is a topic created in Kafka; the real-time data source table is a data source table in the Kafka;
a second storage unit, configured to store offline data in an offline topic into an offline data source table; wherein the offline topic is a topic created in the Kafka; the offline data source table is a data source table in the Kafka;
an updating unit, configured to perform incremental update on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table to obtain a target source layer partition table; the target source layer partition table comprises incremental update data;
a writing unit, configured to write the incremental update data in the target source layer partition table into a dimension partition table, a fact partition table and a partition wide table; wherein the dimension partition table and the fact partition table are located at a dimension modeling layer of the data warehouse; the partition wide table is located at a data service layer of the data warehouse.
7. The data storage device based on integration of a lake and a warehouse of claim 6, wherein the first storage unit is further configured to:
store a current real-time state of the real-time topic and a current offline state of the offline topic before storing the real-time data in the real-time topic into the real-time data source table;
monitor the storage states of the real-time data and the offline data to obtain a monitoring result;
if the monitoring result indicates that the real-time data storage fails, restore the real-time topic according to the current real-time state;
and if the monitoring result indicates that the offline data storage fails, restore the offline topic according to the current offline state.
8. The data storage device based on integration of a lake and a warehouse according to claim 6, wherein the updating unit performs incremental update on the source layer partition table in the data warehouse according to the real-time data source table and the offline data source table to obtain the target source layer partition table specifically by:
acquiring a data format of the source layer partition table in the data warehouse;
adjusting the format of the real-time data and the format of the offline data according to the data format to obtain target real-time data and target offline data;
comparing the target real-time data with the data in the source layer partition table to obtain real-time newly added data and/or real-time changed data in the target real-time data;
performing incremental update on the source layer partition table based on the real-time newly added data and/or the real-time changed data to obtain an initially updated source layer partition table;
comparing the target offline data with the data in the initially updated source layer partition table to obtain offline newly added data and/or offline changed data in the target offline data; wherein the real-time newly added data and/or the real-time changed data and the offline newly added data and/or the offline changed data are the incremental update data;
and performing incremental update on the initially updated source layer partition table based on the offline newly added data and/or the offline changed data to obtain the target source layer partition table.
9. A computing device, the computing device comprising:
at least one processor, memory, and input output unit;
wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program stored in the memory to perform the method of any of claims 1-5.
10. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310670792.2A CN116431654B (en) | 2023-06-08 | 2023-06-08 | Data storage method, device, medium and computing equipment based on integration of lake and warehouse |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310670792.2A CN116431654B (en) | 2023-06-08 | 2023-06-08 | Data storage method, device, medium and computing equipment based on integration of lake and warehouse |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116431654A (en) | 2023-07-14 |
CN116431654B (en) | 2023-09-08 |
Family
ID=87089344
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310670792.2A (Active, CN116431654B) | 2023-06-08 | 2023-06-08 | Data storage method, device, medium and computing equipment based on integration of lake and warehouse |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116431654B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507029A (en) * | 2020-12-18 | 2021-03-16 | 上海哔哩哔哩科技有限公司 | Data processing system and data real-time processing method |
CN113051275A (en) * | 2021-03-31 | 2021-06-29 | 银盛支付服务股份有限公司 | Storage architecture method compatible with real-time and offline data processing |
CN113901037A (en) * | 2021-09-29 | 2022-01-07 | 紫金诚征信有限公司 | Data management method, device and storage medium |
US20220129483A1 (en) * | 2020-10-28 | 2022-04-28 | Beijing Zhongxiangying Technology Co., Ltd. | Data processing method and device, computing device and medium |
CN114579614A (en) * | 2022-02-11 | 2022-06-03 | 武汉物易云通网络科技有限公司 | Real-time data full-scale acquisition method and device and computer equipment |
CN115114266A (en) * | 2022-06-29 | 2022-09-27 | 徐工汉云技术股份有限公司 | Flow and batch integrated warehouse counting integration method and system |
US20220398254A1 (en) * | 2020-12-25 | 2022-12-15 | Boe Technology Group Co., Ltd. | Data processing method, platform, computer-readable storage medium and electronic device |
CN115599871A (en) * | 2022-10-19 | 2023-01-13 | 上海哔哩哔哩科技有限公司(Cn) | Lake and bin integrated data processing system and method |
CN115796914A (en) * | 2022-11-17 | 2023-03-14 | 唯品会(广州)软件有限公司 | Operation data analysis method, system, computer device and storage medium |
CN116089545A (en) * | 2023-04-07 | 2023-05-09 | 云筑信息科技(成都)有限公司 | Method for collecting storage medium change data into data warehouse |
Non-Patent Citations (1)
Title |
---|
花海洋; 李一凡; 赵怀慈: "Research and Application of an ETL System Based on Distributed Data Warehouse Technology", Microcomputer Information (微计算机信息), no. 30, pages 151-153 * |
Also Published As
Publication number | Publication date |
---|---|
CN116431654B (en) | 2023-09-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||