CN116756155A - Zipper table processing method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN116756155A
Authority
CN
China
Prior art keywords
data
incremental
increment
state
chain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310515943.7A
Other languages
Chinese (zh)
Inventor
卞璐璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Application filed by New H3C Big Data Technologies Co Ltd
Priority to CN202310515943.7A
Publication of CN116756155A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2379 Updates performed during online database operations; commit processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data storage and discloses a zipper table processing method, a zipper table processing device, computer equipment and a storage medium. The zipper table processing method comprises the following steps: acquiring incremental data and creating an increment table by using the incremental data, the increment table being used for storing the incremental data; updating, according to the incremental data in the increment table, the state of target data stored in the current zipper table, so that target data in an open-chain state is updated to a closed-chain state; and writing the incremental data in the increment table into the zipper table by row-level incremental pull and insert, with the incremental data newly added to the zipper table placed in the open-chain state, so that the incremental data is stored through the zipper table. The invention reduces the amount of data processed during zipper table processing, improves the utilization of computing resources, improves execution efficiency, and reduces implementation cost.

Description

Zipper table processing method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of data storage, in particular to a zipper table processing method, a zipper table processing device, computer equipment and a storage medium.
Background
A data warehouse is a data storage architecture for storing large amounts of data accumulated over time. In the related art, incremental updates in big data scenarios can be handled by a zipper table scheme implemented in the data warehouse architecture. Specifically, the related art uses a zipper table and an increment table: the increment table is first associated with the zipper table, and the original old data in the zipper table is then merged with the incremental data in the increment table and written back into the zipper table by overwriting. During overwriting, all of the original old data in the zipper table must be deleted, and the old data and the incremental data are then rewritten into the zipper table.
However, the deletion and rewriting steps process an excessive amount of data, which wastes computing resources and lowers execution efficiency.
Disclosure of Invention
In view of the above, the present invention provides a zipper table processing method, apparatus, computer device and storage medium, so as to solve the problems of the zipper table processing scheme in the related art, such as waste of computing resources and low execution efficiency.
In a first aspect, the present invention provides a zipper table processing method, which is applied to a data warehouse, the method comprising:
acquiring incremental data and creating an incremental table by using the incremental data; the increment table is used for storing increment data;
updating, according to the incremental data in the increment table, the state of target data stored in the current zipper table, so that target data in an open-chain state is updated to a closed-chain state;
writing the incremental data in the increment table into the zipper table by row-level incremental pull and insert, with the incremental data newly added to the zipper table placed in the open-chain state, so that the incremental data is stored through the zipper table.
Compared with the prior art, the method avoids deleting and rewriting large amounts of data, which greatly reduces the amount of data processed during zipper table processing, reduces the consumption of IO resources, improves the utilization of computing resources, improves execution efficiency, and reduces implementation cost.
In an alternative embodiment, updating the state of the target data stored in the current zipper table according to the incremental data in the increment table includes:
acquiring a first item code from the incremental data in the increment table, and acquiring a second item code from the data in the open-chain state stored in the zipper table;
matching the first item code with the second item code, and determining, according to the matching result, the data corresponding to a second item code that matches the first item code as the target data;
updating the open-chain state of the target data to the closed-chain state.
By matching the item codes of the incremental data against those of the original data in the zipper table, the invention can accurately identify the target data, so that the open-chain state of the target data can be accurately changed to the closed-chain state.
In an alternative embodiment, updating the open-chain state of the target data to the closed-chain state includes:
acquiring a target date from incremental data corresponding to the target data; the target date is the creation date or the modification date of the incremental data;
and updating the expiration date in the target data to be the target date so as to update the open-chain state of the target data to be the closed-chain state.
By updating the expiration date in the target data to the creation date or the modification date of the incremental data, the invention can ensure reliable closed-chain operation on the target data.
In an alternative embodiment, obtaining incremental data includes:
and acquiring incremental data from the data source according to the preset frequency.
By acquiring data at a preset frequency, the invention can effectively capture historical change data, so that the changes are recorded in the zipper table in subsequent steps and the historical state of a given piece of data, or of a given period of time, can be obtained from the zipper table.
In an alternative embodiment, writing the incremental data in the increment table into the zipper table by row-level incremental pull and insert includes:
reading the incremental data from the increment table row by row;
inserting the incremental data from the increment table, row by row, into the zipper table after the state update.
By reading and inserting row by row, the invention can further save computing resources and improve the update efficiency of the zipper table.
In an alternative embodiment, before obtaining the incremental data, the method further includes:
acquiring first data and creating a full data table by using the first data; the full data table is used for storing first data;
initializing a blank zipper table by using second data stored in the full data table to store the second data through the blank zipper table; the first data includes second data.
In the first execution, the embodiment of the invention can initialize the blank zipper table with part of the data in the full data table, creating a zipper table with a reliable data structure so that the zipper state can be better maintained.
In an alternative embodiment, the full data table, the zipper table, and the increment table are all stored in the Hudi data format.
Storing the zipper table, the full data table and the increment table in the Hudi data format makes the invention better suited to inserting, deleting and modifying records in large data sets, which further improves the server's processing performance on large data sets.
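As a rough illustration of how a zipper table might be written in the Hudi format from PySpark, the write options could look like the sketch below. The table name, the field names, the composite record key and the partition field are all assumptions made for illustration and are not taken from the patent; only the option keys themselves are standard Hudi datasource write options.

```python
# Hypothetical Hudi write options for a zipper table (all values are illustrative).
hudi_options = {
    "hoodie.table.name": "sales_detail_zipper",  # assumed table name
    # One item code has several versions over time, so the record key is
    # assumed to combine the item code with the effective date of a version.
    "hoodie.datasource.write.recordkey.field": "project_id,effective_date",
    "hoodie.datasource.write.partitionpath.field": "effective_date",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    # When two writes carry the same key, keep the most recently modified one.
    "hoodie.datasource.write.precombine.field": "modification_time",
    # Row-level insert or update instead of overwriting the table.
    "hoodie.datasource.write.operation": "upsert",
}

# With a Spark session and a DataFrame `df` holding the rows to write,
# usage would be along the lines of:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

With a key of this shape, closing a chain rewrites only the one existing version of an item, and a new open-chain version arrives as a fresh key.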
In a second aspect, the present invention provides a zipper table processing apparatus comprising:
the first creation module is used for acquiring the incremental data and creating an increment table by using the incremental data; the increment table is used for storing the incremental data;
the state updating module is used for updating, according to the incremental data in the increment table, the state of target data stored in the current zipper table, so that target data in the open-chain state is updated to the closed-chain state;
and the data writing module is used for writing the incremental data in the increment table into the zipper table by row-level incremental pull and insert, with the incremental data newly added to the zipper table placed in the open-chain state, so that the incremental data is stored through the zipper table.
In a third aspect, the present invention provides a computer device comprising a memory and a processor that are communicatively connected to each other, wherein the memory stores computer instructions and the processor executes the computer instructions so as to perform the zipper table processing method of the first aspect or any corresponding embodiment thereof.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the zipper table processing method of the first aspect or any one of the corresponding embodiments thereof.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a zipper table processing method according to an embodiment of the invention;
FIG. 2 is a flow chart of another zipper table processing method according to an embodiment of the present invention;
FIG. 3 is a flow chart of yet another zipper table processing method according to an embodiment of the invention;
FIG. 4 is a flow chart of a further zipper table processing method according to an embodiment of the invention;
FIG. 5 is a block diagram of a zipper table processing device according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In data storage business scenarios, the basic business tables on the server that stores the data hold a large amount of data, of which often only a small number of key fields change slowly over time. In data warehouse design this is called a slowly changing dimension. How to handle slowly changing dimensions so that historical states can be correctly stored and queried while saving as much storage as possible is a key consideration for those skilled in the art when designing data warehouse tables.
A zipper table is built in data warehouse design to record the historical changes of key fields; it is a table that maintains both the historical states and the latest state of the data. The zipper table scheme is used when a historical snapshot at a certain point in time or over a certain period needs to be viewed, so it meets the query requirements for historical data while saving storage resources.
Hive (a data warehouse management system) is a data warehouse tool in the Hadoop (a framework for distributed data storage and computation) ecosystem, used for analysis and statistics over massive structured data. Traditional big data schemes cannot meet the requirement of incremental update, so the related art adopts a Hive-based zipper table to solve incremental update in big data scenarios such as e-commerce transaction data. The incremental update mechanism uses an increment table and a zipper table containing the basic information, and is implemented as an association operation followed by an overwrite operation. The association operation associates the increment table with the zipper table; the contents of the zipper table or of a table partition must then be overwritten. Specifically, overwriting consists of deleting the original old data in the zipper table, merging that old data with the incremental data, and rewriting the merged data in full into the zipper table. The related art therefore inevitably deletes the old data and rewrites the merged old and incremental data whenever the zipper table is processed, so the amount of data handled across the whole process is huge. For large batch operations, for example when a whole day of events must be reprocessed or the entire upstream data source must be reloaded, this leads to wasted computing resources, low execution efficiency and high cost.
Even if the related art rewrites only the data of a specified table partition, a large accumulation of historical data still leads to excessive use of computing resources and slow writes.
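To make the cost of the overwrite approach concrete, the following minimal sketch imitates it with in-memory rows and counts how many rows end up being rewritten. The field names (project_id, creation_time, effective_date, expiration_date) and the "9999-12-31" sentinel are assumptions chosen for illustration; a real implementation would run as a Hive or Spark job rather than in plain Python.

```python
OPEN = "9999-12-31"  # assumed sentinel marking an open-chain row


def overwrite_merge(zipper_rows, increment_rows):
    """Related-art style update: rebuild and rewrite the whole zipper table."""
    changed = {row["project_id"]: row for row in increment_rows}
    rebuilt = []
    for row in zipper_rows:
        row = dict(row)
        if row["expiration_date"] == OPEN and row["project_id"] in changed:
            # Close the old version of a changed record.
            row["expiration_date"] = changed[row["project_id"]]["creation_time"]
        rebuilt.append(row)  # every old row is written again, changed or not
    for row in increment_rows:
        rebuilt.append({**row,
                        "effective_date": row["creation_time"],
                        "expiration_date": OPEN})
    return rebuilt, len(rebuilt)  # rows written equals the whole table


history = [
    {"project_id": f"P{i}", "creation_time": "2022-01-01",
     "effective_date": "2022-01-01", "expiration_date": OPEN}
    for i in range(1000)
]
increment = [{"project_id": "P7", "creation_time": "2022-01-10"}]

rebuilt, rows_written = overwrite_merge(history, increment)
print(rows_written)  # 1001
```

One changed record forces 1001 rows to be written here, whereas a row-level update would touch only the closed row and the newly inserted row.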
In accordance with an embodiment of the present invention, there is provided a zipper table processing method embodiment, it being noted that the steps shown in the flowchart of the figures may be performed in a computer system, such as a set of computer executable instructions, and, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order other than that shown or described herein.
This embodiment provides a zipper table processing method applied to a data warehouse, specifically the data warehouse of a distributed storage server. FIG. 1 is a flowchart of the zipper table processing method according to an embodiment of the present invention. As shown in FIG. 1, the flow includes the following steps:
Step S101, acquiring incremental data and creating an increment table by using the incremental data; the increment table is used for storing the incremental data.
The incremental data specifically includes update data, which is data that updates existing historical data, and newly added data.
In this embodiment, the incremental data is structured data, and the increment table may be, for example, a data warehouse table. Taking a product sales scenario as an example, the increment table may be a sales detail information table updated at a preset period; the preset period may be one day, in which case the sales detail information table is updated daily, although the embodiment is not limited thereto.
For example, the sales detail information table may use a specified field as the basis for partitioning the increment table; the specified field may be, for example, the creation date or the modification date, but is not limited thereto.
Step S102, updating, according to the incremental data in the increment table, the state of target data stored in the current zipper table, so that target data in the open-chain state is updated to the closed-chain state.
The target data is the data in the zipper table that corresponds to the incremental data. For example, the target data and the incremental data have the same code, which may be, for example, an item code, but is not limited thereto.
In this embodiment, updating the state of the target data from the open-chain state to the closed-chain state includes updating the content of a specified target field in the target data from a first content to a second content. Taking the case where the increment table is the sales detail information table as an example, the specified target field may be the expiration date, the first content may be an infinite date, for example "9999-12-31", and the second content may be a specific date, for example "2022-01-10"; the embodiment is not limited to these examples.
Step S103, writing the incremental data in the increment table into the zipper table by row-level incremental pull and insert, with the incremental data newly added to the zipper table placed in the open-chain state, so that the incremental data is stored through the zipper table.
In this embodiment, the incremental data added to the zipper table is placed in the open-chain state by setting the content of the specified target field of each row of incremental data to the first content; taking the case where the increment table is the sales detail information table as an example, the content of the specified target field is set to "9999-12-31". Placing the incremental data added to the zipper table in the open-chain state may further include setting the content of a preset target field of each row of incremental data to a third content; in the same example, the preset target field is the effective date and the third content is a date, for example a date less than or equal to the current date.
Specifically, this embodiment writes the incremental data in the increment table into the zipper table by row-level incremental pull and insert; for example, the data newly added each day is written into the zipper table.
According to this embodiment of the invention, there is no need to delete all of the old data or to rewrite the merged data. By updating the specified target field, the corresponding target data in the zipper table is updated to the closed-chain state according to the incremental data; the zipper table is then updated by writing only the incremental data into it in the open-chain state. The zipper table is thus updated through a row-level data update mechanism of the data warehouse table.
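Steps S102 and S103 can be sketched as two SQL statements. The example below uses an in-memory SQLite database purely as a stand-in for the data warehouse; the table and column names, the sample rows, and the use of the creation date as both the closing date and the effective date are assumptions for illustration.

```python
import sqlite3

OPEN = "9999-12-31"  # assumed first content marking the open-chain state

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE zipper (project_id TEXT, status TEXT,
                     effective_date TEXT, expiration_date TEXT);
CREATE TABLE increment (project_id TEXT, status TEXT, creation_time TEXT);
INSERT INTO zipper VALUES ('P1', 'bidding', '2022-01-01', '9999-12-31');
INSERT INTO zipper VALUES ('P2', 'signed',  '2022-01-01', '9999-12-31');
INSERT INTO increment VALUES ('P1', 'signed',  '2022-01-10');
INSERT INTO increment VALUES ('P3', 'bidding', '2022-01-10');
""")

# S102: close only the open-chain rows whose code appears in the increment table.
conn.execute("""
UPDATE zipper
SET expiration_date = (SELECT i.creation_time FROM increment i
                       WHERE i.project_id = zipper.project_id)
WHERE expiration_date = ?
  AND project_id IN (SELECT project_id FROM increment)
""", (OPEN,))

# S103: insert the increment rows as new open-chain rows.
conn.execute("""
INSERT INTO zipper (project_id, status, effective_date, expiration_date)
SELECT project_id, status, creation_time, ? FROM increment
""", (OPEN,))
conn.commit()

rows = conn.execute(
    "SELECT * FROM zipper ORDER BY project_id, effective_date").fetchall()
for row in rows:
    print(row)
```

The unchanged row for P2 is never rewritten; only the old P1 row is updated and two rows are inserted.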
This embodiment provides a zipper table processing method applied to a data warehouse, specifically the data warehouse of a distributed storage server. FIG. 2 is a flowchart of the zipper table processing method according to an embodiment of the present invention. As shown in FIG. 2, the flow includes the following steps:
Step S201, acquiring incremental data and creating an increment table by using the incremental data; the increment table is used for storing the incremental data. For details, refer to step S101 in the embodiment shown in FIG. 1, which are not repeated here.
Additionally, in an alternative embodiment, obtaining incremental data includes: and acquiring incremental data from the data source according to the preset frequency.
The preset frequency may be, for example, once per day, and the data source may be a database, although neither is limited thereto.
By acquiring data at a preset frequency, the invention can effectively capture historical change data and, on the basis of timed trigger scheduling, record it in the zipper table, so that the historical state of a given piece of data at a given time, or over a given period, can be obtained from the zipper table.
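Reading a historical state back from a zipper table can be sketched as a filter on the life cycle of each row. The sketch assumes each row carries an effective_date and an expiration_date as ISO date strings and treats the interval as half-open (the effective date is included, the expiration date is excluded); that interval convention and the field names are assumptions, not something the patent states.

```python
OPEN = "9999-12-31"  # assumed sentinel for a row that is still open-chain


def snapshot(zipper_rows, as_of):
    """Return the rows whose life cycle covers the ISO date string `as_of`."""
    return [row for row in zipper_rows
            if row["effective_date"] <= as_of < row["expiration_date"]]


zipper = [
    {"project_id": "P1", "status": "bidding",
     "effective_date": "2022-01-01", "expiration_date": "2022-01-10"},
    {"project_id": "P1", "status": "signed",
     "effective_date": "2022-01-10", "expiration_date": OPEN},
]

before = snapshot(zipper, "2022-01-05")
after = snapshot(zipper, "2022-01-10")
print(before[0]["status"], after[0]["status"])  # bidding signed
```

ISO-formatted date strings compare correctly as plain strings, which is why no date parsing is needed in this sketch.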
Step S202, updating, according to the incremental data in the increment table, the state of target data stored in the current zipper table, so that target data in the open-chain state is updated to the closed-chain state.
Specifically, step S202 includes:
Step S2021, acquiring a first item code from the incremental data in the increment table, and acquiring a second item code from the data in the open-chain state stored in the zipper table.
In this embodiment of the invention, the data in both the increment table and the zipper table has an item code field, and the item codes are recorded in the column of that field.
In this embodiment, the first item code is the item code obtained from the incremental data, and the second item code is the item code obtained from the data in the zipper table.
Step S2022, matching the first item code with the second item code, and determining data corresponding to the second item code matched with the first item code as target data according to the result of the matching.
The matching results in this embodiment may include: the first item code is the same as the second item code, and the first item code is different from the second item code.
In this embodiment, in particular, when the first item code is the same as the second item code, the data corresponding to the second item code that matches the first item code is determined as the target data.
Step S2023 updates the open-chain state of the target data to the closed-chain state.
This embodiment updates the expiration date in the target data to a specific date, such as the creation date of the corresponding incremental data, to perform the closed-chain operation on the target data.
By obtaining the item codes from the increment table and the zipper table, the invention can accurately identify the target data as the data whose item code is the same in the incremental data and in the original data of the zipper table, so that the open-chain state of the target data can be accurately changed to the closed-chain state.
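The matching in steps S2021 and S2022 can be sketched as a set lookup. The sketch assumes the item code lives in a field named project_id and that the open-chain state is marked by an expiration date of "9999-12-31"; both are illustrative choices.

```python
OPEN = "9999-12-31"  # assumed marker of the open-chain state


def find_target_rows(increment_rows, zipper_rows):
    """Return the open-chain zipper rows whose item code matches an increment row."""
    # First item codes: taken from the incremental data.
    first_codes = {row["project_id"] for row in increment_rows}
    # Second item codes: taken only from zipper rows still in the open-chain state.
    return [row for row in zipper_rows
            if row["expiration_date"] == OPEN
            and row["project_id"] in first_codes]


increment = [{"project_id": "P1"}, {"project_id": "P3"}]
zipper = [
    {"project_id": "P1", "expiration_date": "2021-12-01"},  # already closed
    {"project_id": "P1", "expiration_date": OPEN},          # match: target data
    {"project_id": "P2", "expiration_date": OPEN},          # no matching code
]

targets = find_target_rows(increment, zipper)
print(targets)
```

P3 has no open-chain row in the zipper table, so it yields no target data and is simply inserted later as newly added data.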
In some alternative embodiments, step S2023 described above comprises:
Step a1, acquiring a target date from the incremental data corresponding to the target data; the target date is the creation date or the modification date of the incremental data.
In this embodiment of the invention, the data in the increment table has a creation date field and a modification date field; the creation date is recorded in the column of the creation date field, and the modification date is recorded in the column of the modification date field.
Step a2, updating the expiration date in the target data to the target date, so as to update the open-chain state of the target data to the closed-chain state.
For example, the expiration date in the target data is updated to the creation date, or the expiration date in the target data is updated to the modification date.
By using the target date in the incremental data, specifically by updating the expiration date in the target data to the creation date or modification date of the incremental data, the invention can perform the closed-chain operation on the target data reliably.
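Steps a1 and a2 can be sketched as follows. The text allows either the creation date or the modification date as the target date; preferring the modification date when it is present is an assumption of this sketch, as are the field names.

```python
def pick_target_date(incremental_row):
    """Step a1: choose the target date from the incremental data.

    Assumption: use the modification date when it is set, otherwise the
    creation date. The source only says either date may be used.
    """
    return incremental_row.get("modification_time") or incremental_row["creation_time"]


def close_chain(target_row, incremental_row):
    """Step a2: return a copy of the target row with its chain closed."""
    closed = dict(target_row)
    closed["expiration_date"] = pick_target_date(incremental_row)
    return closed


target = {"project_id": "P1", "effective_date": "2022-01-01",
          "expiration_date": "9999-12-31"}
new_record = {"project_id": "P1", "creation_time": "2022-01-10"}
updated_record = {"project_id": "P1", "creation_time": "2022-01-01",
                  "modification_time": "2022-01-12"}

closed_by_creation = close_chain(target, new_record)
closed_by_modification = close_chain(target, updated_record)
print(closed_by_creation["expiration_date"],
      closed_by_modification["expiration_date"])  # 2022-01-10 2022-01-12
```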
Step S203, writing the incremental data in the increment table into the zipper table by row-level incremental pull and insert, with the incremental data newly added to the zipper table placed in the open-chain state, so that the incremental data is stored through the zipper table. For details, refer to step S103 in the embodiment shown in FIG. 1, which are not repeated here.
This embodiment provides a zipper table processing method applied to a data warehouse, specifically the data warehouse of a distributed storage server. FIG. 3 is a flowchart of the zipper table processing method according to an embodiment of the present invention. As shown in FIG. 3, the flow includes the following steps:
Step S301, acquiring incremental data and creating an increment table by using the incremental data; the increment table is used for storing the incremental data. For details, refer to step S101 in the embodiment shown in FIG. 1, which are not repeated here.
Step S302, updating, according to the incremental data in the increment table, the state of target data stored in the current zipper table, so that target data in the open-chain state is updated to the closed-chain state. For details, refer to step S202 in the embodiment shown in FIG. 2, which are not repeated here.
Step S303, writing the incremental data in the increment table into the zipper table by row-level incremental pull and insert, with the incremental data newly added to the zipper table placed in the open-chain state, so that the incremental data is stored through the zipper table.
Specifically, step S303 includes:
Step S3031, reading the incremental data from the increment table row by row.
Each column of the increment table corresponds to one field, and each row of the increment table records one piece of data.
For row-by-row reading, this embodiment can read each piece of data recorded in the increment table in a set order, either top to bottom or bottom to top, to improve execution efficiency.
Step S3032, inserting the incremental data from the increment table, row by row, into the zipper table after the state update.
In this embodiment, inserting the incremental data from the increment table into the zipper table after the state update includes either inserting each row into the zipper table as soon as it has been read, or inserting the rows into the zipper table one by one after all rows have been read.
Therefore, on the basis of avoiding overwriting the whole zipper table, the invention updates the zipper table in the data writing stage through a row-level data update mechanism of the data warehouse table, which significantly reduces the consumption of IO (input/output) resources, further saves computing resources and improves execution efficiency.
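The two insertion orders described for step S3032 can be sketched with a generator that yields one row at a time. Plain Python lists stand in for the increment table and the zipper table, and the field names and the open-chain sentinel are assumptions for illustration.

```python
OPEN = "9999-12-31"  # assumed open-chain sentinel


def read_rows(increment_table):
    """Step S3031: yield the increment table one row at a time, top to bottom."""
    for row in increment_table:
        yield row


def as_open_row(row):
    """Attach the two zipper date fields so the row enters in the open-chain state."""
    return {**row, "effective_date": row["creation_time"], "expiration_date": OPEN}


def insert_as_read(increment_table, zipper_table):
    """Variant 1: insert each row as soon as it has been read."""
    for row in read_rows(increment_table):
        zipper_table.append(as_open_row(row))


def insert_after_reading_all(increment_table, zipper_table):
    """Variant 2: read every row first, then insert the rows one by one."""
    buffered = list(read_rows(increment_table))
    for row in buffered:
        zipper_table.append(as_open_row(row))


increment = [
    {"project_id": "P1", "creation_time": "2022-01-10"},
    {"project_id": "P3", "creation_time": "2022-01-10"},
]
zipper_a, zipper_b = [], []
insert_as_read(increment, zipper_a)
insert_after_reading_all(increment, zipper_b)
print(len(zipper_a), len(zipper_b))  # 2 2
```

Both variants write exactly the rows in the increment table and nothing else; they differ only in whether rows are buffered before insertion.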
This embodiment provides a zipper table processing method that can be applied to a data warehouse. FIG. 4 is a flowchart of the zipper table processing method according to an embodiment of the invention. As shown in FIG. 4, the flow includes the following steps:
Step S410, acquiring first data and creating a full data table by using the first data; the full data table is used for storing the first data.
In this embodiment, the first data may be historical data, for example structured historical data. Acquiring the first data may include collecting historical data from the data source, such as historical sales data, in which case the full data table may be a sales detail information table.
Taking the sales detail information table as an example, the sales detail information table of this embodiment may include the following fields: project_id, project_name, salesman (salesman), order number (order_number), transaction amount (W), project_status_tracking, creation date (creation_time), and modification date (modification_time). In the example full data table, the contents under each column are specific field values or field contents.
Step S400, initializing a blank zipper table by using second data stored in the full data table to store the second data through the blank zipper table; the first data includes second data.
The second data is part of the data in the first data, for example, the second data is data in a preset time period in the first data, the preset time period may be one day, and the second data in this embodiment is historical data under the creation date specified in the first data.
In an alternative embodiment, the specified creation date is the day before the current date, i.e. the second data may be historical data of the day before the current date.
In this embodiment, the historical data of the day before the current date is pulled from the full data table by using the SparkSQL API (the structured query interface of Spark, a fast general-purpose computing engine designed for large-scale data processing).
The full data table in this embodiment is used to initialize the zipper table. The initialization adds two target fields to the existing fields of the full data table in order to maintain the zipper state.
Taking the sales detail information table as an example, the two target fields are an effective date and an expiration date, and the zipper state is maintained through these two added date columns. In this embodiment, when the expiration date is infinity, the record is in the open-chain state; when the expiration date is a specific date, the record is in the closed-chain state. The effective date is the start time of a record's life cycle, and the expiration date is the end time of that life cycle.
Here a record is the row of the table in which the target fields are located. Infinity may be set to, for example, "9999-12-31", and a specific date is, for example, "2022-01-10"; the settings are not limited to these values.
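The initialization described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the SparkSQL/Hudi implementation: the field name `ods_start_date` and the helper names are hypothetical (the text names only `ods_end_date`, later, for the expiration date), and setting the effective date to the creation date at initialization is an assumption.

```python
# Sentinel used as "infinity"; the text gives "9999-12-31" as one possible value.
OPEN_END_DATE = "9999-12-31"


def init_zipper_rows(full_rows, creation_date):
    """Copy the rows created on `creation_date` and add the two zipper fields."""
    zipper = []
    for row in full_rows:
        if row["creation_time"] != creation_date:
            continue  # only the "second data" (one day of history) is loaded
        new_row = dict(row)  # leave the full data table row unchanged
        new_row["ods_start_date"] = creation_date  # hypothetical field name
        new_row["ods_end_date"] = OPEN_END_DATE    # open-chain state
        zipper.append(new_row)
    return zipper


def is_open(row):
    """A row is in the open-chain state while its expiration date is the sentinel."""
    return row["ods_end_date"] == OPEN_END_DATE


# Illustrative sample rows, not taken from the patent's table.
full_table = [
    {"project_id": "P001", "salesman": "A", "creation_time": "2022-01-10"},
    {"project_id": "P002", "salesman": "B", "creation_time": "2022-01-09"},
]
zipper_table = init_zipper_rows(full_table, "2022-01-10")
```

With this representation, the open-chain or closed-chain state never needs its own column; it is derived from the expiration date alone.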
Taking the case where the increment table is a sales detail information table as an example, the increment table in this embodiment may include the following fields: project code (project_id), project name (project_name), salesman (salesman), order number (order_number), transaction amount (W), project status tracking (project_status_tracking), creation date (create_time) of the newly added data, and modification date (modification_time).
Step S401, obtaining incremental data and creating an incremental table by utilizing the incremental data; the increment table is used for storing increment data.
The detailed implementation process of step S401 is shown in step S101 in the embodiment of fig. 1, and will not be described herein.
This embodiment gives the following increment table example; compared with the full data table example above, the increment table contains 2 updated records and 2 newly added records for 2022-01-11.
Step S402, performing a state update on the target data stored in the current zipper table according to the incremental data in the increment table, so as to change the target data from the open-chain state to the closed-chain state. For details, refer to step S202 in the embodiment shown in fig. 2, which are not repeated here.
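As a rough sketch of this state update, the matching and closing could look like the following in plain Python. The field name `ods_start_date`, the helper name, and the rule of preferring the modification date over the creation date as the target date are assumptions made for illustration; the sample rows are invented.

```python
OPEN_END_DATE = "9999-12-31"


def close_matching_rows(zipper_rows, increment_rows):
    """Close open rows whose project code also appears in the increment table."""
    # Map each incoming project code to its target date: the modification date
    # if present, otherwise the creation date (an assumed preference).
    target_dates = {
        inc["project_id"]: inc.get("modification_time") or inc["creation_time"]
        for inc in increment_rows
    }
    closed = 0
    for row in zipper_rows:
        if row["ods_end_date"] != OPEN_END_DATE:
            continue  # already closed; history is left untouched
        target = target_dates.get(row["project_id"])
        if target is not None:
            row["ods_end_date"] = target  # closed-chain state
            closed += 1
    return closed


zipper = [
    {"project_id": "P001", "ods_start_date": "2022-01-10", "ods_end_date": OPEN_END_DATE},
    {"project_id": "P002", "ods_start_date": "2022-01-10", "ods_end_date": OPEN_END_DATE},
]
increment = [
    {"project_id": "P001", "creation_time": "2022-01-10", "modification_time": "2022-01-11"},
    {"project_id": "P003", "creation_time": "2022-01-11", "modification_time": None},
]
closed_count = close_matching_rows(zipper, increment)
```

Only the row for `P001` is closed here: `P002` has no incoming change, and `P003` has no open row yet.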
Step S403, writing the incremental data in the increment table into the zipper table by means of row-level incremental pull and insert, and setting the state of the incremental data added to the zipper table to the open-chain state, so that the incremental data is stored through the zipper table. For details, refer to step S103 in the embodiment shown in fig. 1, which are not repeated here.
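The write step can be illustrated with a small plain-Python sketch in which each incremental row is appended as a new open-chain version and existing rows are not rewritten. The field name `ods_start_date`, the helper name, and the sample status values are hypothetical; this does not model Hudi's actual write path.

```python
OPEN_END_DATE = "9999-12-31"


def append_open_rows(zipper_rows, increment_rows, load_date):
    """Append each incremental row to the zipper table as a new open-chain version."""
    for inc in increment_rows:  # row by row; existing rows are not rewritten
        new_row = dict(inc)
        new_row["ods_start_date"] = load_date    # effective date: the load date
        new_row["ods_end_date"] = OPEN_END_DATE  # open-chain state
        zipper_rows.append(new_row)
    return zipper_rows


zipper = [
    {"project_id": "P001", "project_status_tracking": "negotiating",
     "ods_start_date": "2022-01-10", "ods_end_date": "2022-01-11"},
]
increment = [
    {"project_id": "P001", "project_status_tracking": "signed"},
    {"project_id": "P003", "project_status_tracking": "new"},
]
append_open_rows(zipper, increment, "2022-01-11")
open_ids = [r["project_id"] for r in zipper if r["ods_end_date"] == OPEN_END_DATE]
```

After the call, `P001` has one closed historical version and one open current version, which is the property that lets the zipper table answer both "current state" and "state at a past date" queries.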
In some alternative embodiments, after the incremental data in the increment table is written into the zipper table, the method returns to the step of obtaining incremental data; that is, after step S403 is performed, the flow returns to step S401. The interval between successive executions is the execution period of the data update task, which implements scheduled updating of the zipper table.
The zipper table of this embodiment adds an effective date field and an expiration date field to the fields of the full data table, and its content is the data of the full data table at a certain time or within a certain period.
In some alternative embodiments, the full data table, the pull chain table, and the increment table are all stored in Hudi data format.
Hudi, as used in the invention, is short for Hadoop Upserts Deletes and Incrementals, and is an open-source data management framework.
The zipper table in this embodiment is a Hudi-based zipper table, which is an optimization of the Hive-based zipper table. Hudi is a file organization and management layer that provides row-level insert/update and incremental pull for large datasets stored on, for example, a Hadoop file system. When add, delete, and modify operations are executed on a large dataset, only the changed records are processed and only the updated or deleted portions of the table are rewritten, instead of overwriting an entire table partition or even the entire table as in the related art, which greatly improves the processing performance of the server. The invention makes full use of the update capability of Hudi and avoids deleting the table data and re-importing it in full, which greatly reduces the amount of data involved in each update, saves computing resources, and improves the efficiency of updating data in the zipper table.
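The difference between a full overwrite and a row-level update can be made concrete with a toy row-count model. The sketch below is plain Python over an in-memory dictionary and makes no claim about Hudi's actual file layout or write amplification; the function names and sample data are hypothetical.

```python
def overwrite_table(table, changes):
    """Related-art style: rebuild every row; return (new_table, rows_written)."""
    merged = {**table, **changes}
    return merged, len(merged)


def upsert_rows(table, changes):
    """Row-level style: touch only the changed keys; return rows_written."""
    table.update(changes)
    return len(changes)


# 1000 existing rows keyed by project code, then one update and one new row.
table = {f"P{i:04d}": {"status": "open"} for i in range(1000)}
changes = {"P0001": {"status": "closed"}, "P1000": {"status": "open"}}

_, rows_written_overwrite = overwrite_table(table, changes)
rows_written_upsert = upsert_rows(dict(table), changes)
```

In this model the overwrite path writes every row of the merged table, while the row-level path writes only the two changed rows, which is the intuition behind the reduced IO described above.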
Taking the product sales scenario as an example, in the related art the data volume of the zipper table keeps growing over time, and, as the business scale expands and the order volume rises, the data in the daily increment table grows from a few million records initially to a daily update volume of tens of millions of records. As a result, the execution speed of the scheduled task drops noticeably, and the task often fails, or disrupts the normal execution of other scheduled tasks, because it occupies too much CPU (central processing unit) and memory. Next, embodiments of the present invention are further described with reference to the implementation of a zipper table for sales details in a product sales scenario.
Historical data is collected once from the sales detail data source, a full data table is created from the historical data, and the zipper table is initialized from the full data table. An increment table is created from the daily incremental data. In this embodiment, the full data table, the zipper table, and the increment table are created in SparkSQL (the structured query interface of Spark, a fast general-purpose computing engine designed for large-scale data processing). When Hudi performs an update operation in this embodiment, it may deduplicate records by primary key (e.g., project code or creation date); for records with the same primary key, the record to keep is chosen according to a specified field. According to this embodiment, existing records in the zipper table can be changed from the open-chain state to the closed-chain state according to the records in the daily increment table, and the daily incremental data can be written into the zipper table. This process can be executed periodically, for example once a day, so that a storage optimization scheme for full data is realized on the basis of the daily update of the zipper table.
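The deduplication mentioned above can be sketched as follows. The text only says that a choice is made according to a specified field; keeping the record with the largest value of that field (similar in spirit to a Hudi precombine field) is an assumption here, and the function name and sample rows are hypothetical.

```python
def deduplicate(rows, key_field, ordering_field):
    """Keep one row per key: the row with the largest ordering_field value."""
    latest = {}
    for row in rows:
        key = row[key_field]
        kept = latest.get(key)
        # On ties, the later row in the batch replaces the earlier one.
        if kept is None or row[ordering_field] >= kept[ordering_field]:
            latest[key] = row
    return list(latest.values())


batch = [
    {"project_id": "P001", "salesman": "A", "modification_time": "2022-01-11 09:00:00"},
    {"project_id": "P001", "salesman": "B", "modification_time": "2022-01-11 17:30:00"},
    {"project_id": "P002", "salesman": "C", "modification_time": "2022-01-11 10:00:00"},
]
deduped = deduplicate(batch, "project_id", "modification_time")
```

Comparing the timestamps as strings works in this sketch only because they share one fixed, zero-padded format; a real pipeline would more likely compare typed timestamps.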
More specifically, the daily incremental data in this embodiment may be cleaned according to ETL (Extract-Transform-Load) rules and then sent through a message queue to a directory created by date in HDFS (Hadoop Distributed File System), which triggers a SparkSQL task. The SparkSQL task in this embodiment consists of two tasks executed in series: the first task changes existing records in the zipper table from the open-chain state to the closed-chain state according to the records in the daily increment table, and the second task writes the daily incremental data into the zipper table. In the first task, the closed-chain operation is a modification of the zipper table that changes the expiration date ods_end_date of data that already exists in the zipper table and also appears in the daily increment table. In the second task, all newly generated data (both newly added data and data that has changed relative to the zipper table) is written into the zipper table and put in the open-chain state, for example by setting the effective date to the current date and the expiration date to '9999-12-31'. The tables in this embodiment may be named as follows: the full data table may be ods_ods_samples_deltas, the daily increment table may be default_ods_samples_update, and the zipper table may be default_ods_samples_history.
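To show how the two serial tasks might look as SQL statements, the sketch below runs them against an in-memory SQLite database from Python. SQLite is only a stand-in for SparkSQL on Hudi; the table names and `ods_end_date` come from the text, while the column `ods_start_date`, the reduced column set, the sample rows, and the use of the run date as the closing date are assumptions.

```python
import sqlite3

OPEN_END_DATE = "9999-12-31"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE default_ods_samples_history (
    project_id TEXT, project_status_tracking TEXT,
    ods_start_date TEXT, ods_end_date TEXT
);
CREATE TABLE default_ods_samples_update (
    project_id TEXT, project_status_tracking TEXT
);
INSERT INTO default_ods_samples_history VALUES
    ('P001', 'negotiating', '2022-01-10', '9999-12-31'),
    ('P002', 'negotiating', '2022-01-10', '9999-12-31');
INSERT INTO default_ods_samples_update VALUES
    ('P001', 'signed'),
    ('P003', 'new');
""")


def run_daily_update(conn, run_date):
    """Run the close-chain task, then the write task, inside one transaction."""
    with conn:  # commits on success, rolls back on error
        # Task 1: close open rows that also appear in the daily increment table.
        conn.execute(
            """UPDATE default_ods_samples_history
               SET ods_end_date = ?
               WHERE ods_end_date = ?
                 AND project_id IN (SELECT project_id FROM default_ods_samples_update)""",
            (run_date, OPEN_END_DATE),
        )
        # Task 2: write every incremental row as a new open-chain row.
        conn.execute(
            """INSERT INTO default_ods_samples_history
               SELECT project_id, project_status_tracking, ?, ?
               FROM default_ods_samples_update""",
            (run_date, OPEN_END_DATE),
        )


run_daily_update(conn, "2022-01-11")
rows = conn.execute(
    "SELECT project_id, ods_start_date, ods_end_date "
    "FROM default_ods_samples_history ORDER BY project_id, ods_start_date"
).fetchall()
```

The order of the two tasks matters: if the insert ran first, the update would also close the rows that had just been written.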
Compared with the traditional Hive-based zipper table used to address big data storage, the Hudi-based zipper table processing method significantly reduces IO resource consumption. In a verification on sales detail information storage, the IO optimization rate of the sales detail zipper task was 10.25%, and the optimization rate of CPU and memory was about 15.34%.
This embodiment also provides a zipper table processing device, which is used to implement the above embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a zipper table processing device, as shown in fig. 5, including:
A first creation module 501, configured to obtain incremental data and create an increment table using the incremental data; the increment table is used for storing the incremental data.
A state update module 502, configured to perform a state update on the target data stored in the current zipper table according to the incremental data in the increment table, so as to change the target data from the open-chain state to the closed-chain state.
A data writing module 503, configured to write the incremental data in the increment table into the zipper table by means of row-level incremental pull and insert, and to set the state of the incremental data added to the zipper table to the open-chain state, so as to store the incremental data through the zipper table.
In some alternative embodiments, the status update module 502 includes:
An acquisition unit, configured to acquire the first item code from the incremental data in the increment table and to acquire the second item code from the data in the open-chain state stored in the zipper table.
A matching unit, configured to match the first item code with the second item code and, according to the matching result, to determine the data corresponding to the second item code that matches the first item code as the target data.
An updating unit, configured to update the open-chain state of the target data to the closed-chain state.
In some alternative embodiments, the updating unit comprises:
An acquisition subunit, configured to acquire a target date from the incremental data corresponding to the target data, wherein the target date is the creation date or the modification date of the incremental data.
An updating subunit, configured to update the expiration date in the target data to the target date, so as to update the open-chain state of the target data to the closed-chain state.
In some alternative embodiments, the first creation module 501 is configured to obtain incremental data from a data source at a preset frequency.
In some alternative embodiments, the data writing module 503 includes a reading unit and an inserting unit.
The reading unit is configured to read the incremental data from the increment table row by row.
The inserting unit is configured to insert the incremental data in the increment table, row by row, into the zipper table after the state update.
In some alternative embodiments, the zipper table processing device further comprises:
A second creation module, configured to acquire the first data and create a full data table by using the first data; the full data table is used to store the first data.
A table initializing module, configured to initialize a blank zipper table by using the second data stored in the full data table, so as to store the second data through the blank zipper table; the first data includes the second data.
In some alternative embodiments, the full data table, the pull chain table, and the increment table are all stored in Hudi data format.
The zipper table processing device in this embodiment is presented in the form of functional units, where the units refer to ASIC circuits, processors and memories executing one or more software or firmware programs, and/or other devices that can provide the functionality described above.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The embodiment of the invention also provides a computer device, which is provided with the zipper table processing device shown in fig. 5.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in fig. 6, the computer device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 6.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field-programmable gate array, generic array logic, or any combination thereof.
The memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the methods shown in the above embodiments.
The memory 20 may include a storage program area that may store an operating system, at least one application program required for functions, and a storage data area; the storage data area may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device also includes a communication interface 30 for the computer device to communicate with other devices or communication networks.
The embodiments of the present invention also provide a computer-readable storage medium. The method according to the embodiments of the present invention described above may be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code that is originally stored in a remote storage medium or a non-transitory machine-readable storage medium, downloaded through a network, and stored in a local storage medium, so that the method described herein can be processed by such software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or special-purpose hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk, or the like; further, the storage medium may also comprise a combination of memories of the kinds described above. It will be appreciated that a computer, processor, microprocessor controller, or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the methods illustrated by the above embodiments.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (10)

1. A zipper table processing method, characterized by being applied to a data warehouse, comprising:
acquiring incremental data and creating an incremental table by using the incremental data; the increment table is used for storing the increment data;
performing state updating on the target data stored in the current pull chain table according to the increment data in the increment table so as to update the state of the target data in an open chain state into a closed chain state;
based on the row-level incremental pulling and inserting mode, the incremental data in the incremental table are written into the pull chain table, and the state of the incremental data added in the zipper table is an open-chain state, so that the incremental data is stored through the zipper table.
2. The method of claim 1, wherein the updating the state of the target data stored in the current pull chain table according to the incremental data in the incremental table comprises:
acquiring a first item code from incremental data in the incremental table, and acquiring a second item code from data in an open chain state stored in the pull chain table;
matching the first item code with the second item code, and determining data corresponding to the second item code matched with the first item code as target data according to a matching result;
and updating the open-chain state of the target data into a closed-chain state.
3. The method of claim 2, wherein the updating the open-chain state of the target data to a closed-chain state comprises:
acquiring a target date from incremental data corresponding to the target data; wherein the target date is the creation date or the modification date of the incremental data;
and updating the expiration date in the target data into the target date so as to update the open-chain state of the target data into the closed-chain state.
4. The method of claim 1, wherein the obtaining incremental data comprises:
and acquiring the incremental data from the data source according to a preset frequency.
5. The method of claim 1, wherein the writing incremental data in the increment table into the pull chain table based on the row-level incremental pull and insert comprises:
reading the incremental data from the incremental table in a row-by-row reading manner;
and inserting the increment data in the increment table into a pull chain table after the state update by adopting a line-by-line insertion mode.
6. The method of any one of claims 1 to 5, further comprising, prior to the obtaining incremental data:
acquiring first data and creating a full data table by using the first data; the full data table is used for storing the first data;
initializing a blank zipper table by using second data stored in the full data table to store the second data through the blank zipper table; the first data includes the second data.
7. The method of claim 6, wherein
the full data table, the zipper table and the increment table are all stored in Hudi data format.
8. A zipper table processing device, the device comprising:
the first creation module is used for obtaining incremental data and creating an incremental table by utilizing the incremental data; the increment table is used for storing the increment data;
the state updating module is used for carrying out state updating on the target data stored in the current pull chain table according to the increment data in the increment table so as to update the state of the target data in the open chain state into the closed chain state;
and the data writing module is used for writing the incremental data in the incremental table into the pull chain table based on the row-level incremental pulling and inserting mode, and enabling the state of the incremental data added in the zipper table to be an open-chain state so as to store the incremental data through the zipper table.
9. A computer device, comprising:
a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the zipper table processing method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the zipper table processing method of any one of claims 1 to 7.
CN202310515943.7A 2023-05-06 2023-05-06 Pull chain table processing method and device, computer equipment and storage medium Pending CN116756155A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310515943.7A CN116756155A (en) 2023-05-06 2023-05-06 Pull chain table processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116756155A true CN116756155A (en) 2023-09-15

Family

ID=87959685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310515943.7A Pending CN116756155A (en) 2023-05-06 2023-05-06 Pull chain table processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116756155A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination