CN116821146B

CN116821146B - Apache Iceberg-based data list updating method and system

Info

Publication number: CN116821146B
Application number: CN202311113956.8A
Authority: CN
Inventors: 吕宴全; 陈吉平; 徐进挺
Original assignee: Hangzhou Daishu Technology Co ltd
Current assignee: Hangzhou Daishu Technology Co ltd
Priority date: 2023-08-31
Filing date: 2023-08-31
Publication date: 2023-12-08
Anticipated expiration: 2043-08-31
Also published as: CN116821146A

Abstract

The application discloses a data table column updating method and system based on Apache Iceberg, relating to the technical field of data processing. The method comprises: constructing a column update file, and adding a first field representing updated column information to the metadata of all data files; acquiring each data file corresponding to a target data table, and screening a plurality of target data files from the data files according to a query condition; determining a data set to be updated according to the plurality of target data files, updating the data in the data set to be updated, and generating an update record; and extracting the primary key information and the first field information of the updated data from the update record and writing them into the column update file. By adding a new file type to Apache Iceberg, the complete updated content can be recorded by writing only the primary key and the updated column information, which avoids writing unmodified column information and effectively reduces the storage cost in whole-column update scenarios.

Description

Apache Iceberg-based data list updating method and system

Technical Field

The application relates to the technical field of data processing, and in particular to a data table column updating method and system based on Apache Iceberg.

Background

In a data warehouse, a metric is an important means of measuring service performance and monitoring service data. When there are new service requirements, data source changes, data quality problems, or query performance optimizations, a metric column generally needs to be added or updated so as to better reflect the service requirements and the data changes.

In a traditional Hive offline data warehouse, a column is generally updated as follows: create a new table, copy the data of the original table into the new table, update the values of the specified column in the new table, delete the original table, and rename the new table to the name of the original table. However, this method is suitable only for offline scenarios and is time-consuming when the data volume is large. In addition, it requires that the original table have no partitions, because partition information cannot be copied into the new table. If the original table is partitioned, the partition data must first be exported to other files, the partitions deleted, the table updated, and the partition data finally imported again, which is even more time-consuming.

Disclosure of Invention

The application provides a data table column updating method based on Apache Iceberg, which aims to solve the prior-art problems of long update times and high data storage overhead caused by rewriting all of the information when a data table column is updated.

In order to achieve the above purpose, the present application adopts the following technical scheme:

The application discloses an Apache Iceberg-based data table column updating method, which comprises the following steps:

constructing column update files, and adding a first field representing updated column information in metadata of all data files, wherein all data files comprise the column update files;

acquiring each data file corresponding to the target data table, and screening a plurality of target data files from the data files according to the query condition;

determining a data set to be updated according to the target data files, updating the data in the data set to be updated and generating an update record;

and extracting the primary key information and the first field information of the updated data from the update record, and writing the primary key information and the first field information into the column update file.

Preferably, the method further comprises:

and generating a first serial number according to the data updating time, writing the first serial number into all the first files generated by the data updating, and recording the maximum value and the minimum value of each column in each corresponding first file and bloom filter auxiliary information in metadata of each first file.

Preferably, the screening a plurality of target data files from the data files according to the query condition includes:

and comparing the maximum value and the minimum value of each column in each data file and the auxiliary information of the bloom filter with the query condition one by one, and removing the data which do not accord with the query condition in each data file to obtain a plurality of target data files.
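A minimal Python sketch of this screening step, under stated assumptions: the `FileStats` class, the `build_stats`, `may_match_equals`, and `select_target_files` helpers, the 256-bit filter size, and the SHA-256-based hashing are hypothetical choices for illustration and are not Iceberg's actual statistics format. The sketch also reads the step as excluding whole files, and shows how min/max bounds and a bloom filter can rule a file out for an equality predicate without reading its rows.

```python
import hashlib
from dataclasses import dataclass

BLOOM_BITS = 256  # hypothetical filter size, chosen only for illustration


def bloom_positions(value, num_hashes=3):
    """Map a value to a few bit positions (illustrative hashing scheme)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % BLOOM_BITS
            for i in range(num_hashes)]


@dataclass
class FileStats:
    path: str
    sequence_number: int
    col_min: dict    # column name -> minimum value in the file
    col_max: dict    # column name -> maximum value in the file
    col_bloom: dict  # column name -> bloom filter stored as an int bitmask


def build_stats(path, sequence_number, rows):
    """Record per-column min/max and bloom filter bits when a file is written."""
    col_min, col_max, col_bloom = {}, {}, {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        col_min[col], col_max[col] = min(values), max(values)
        bits = 0
        for value in values:
            for pos in bloom_positions(value):
                bits |= 1 << pos
        col_bloom[col] = bits
    return FileStats(path, sequence_number, col_min, col_max, col_bloom)


def may_match_equals(stats, column, value):
    """Return False only when the file provably has no row with column == value."""
    if not (stats.col_min[column] <= value <= stats.col_max[column]):
        return False  # outside the recorded value range
    bits = stats.col_bloom[column]
    return all((bits >> pos) & 1 for pos in bloom_positions(value))


def select_target_files(all_stats, column, value):
    """Keep only the files that might contain a matching row."""
    return [s for s in all_stats if may_match_equals(s, column, value)]
```

A bloom filter can report false positives but never false negatives, so a file that survives this check may still contain no matching row, while a file that is excluded is safe to skip.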

Preferably, the metadata of all the data files further comprises a serial number field;

the all data files also include general data files and delete data files.

Preferably, the determining the data set to be updated according to the target data files includes:

determining the minimum sequence number in the target data files, and recording the mapping relation between each second sequence number in the target data files and the corresponding target data files, wherein the second sequence number is the sequence number larger than the minimum sequence number in the target data files;

loading all target data files corresponding to the minimum serial numbers to generate a first result set, and determining all target data files associated with each second serial number according to the mapping relation;

and processing the first result set according to the type of each target data file associated with each second serial number to obtain a data set to be updated.
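The grouping described above can be sketched as follows. This is an illustration under assumptions, not the patented implementation: `group_by_sequence` and the `(sequence_number, path)` tuple layout are hypothetical, and the file names are invented.

```python
from collections import defaultdict


def group_by_sequence(target_files):
    """Split target files into the base set and the later, per-sequence sets.

    target_files: list of (sequence_number, path) tuples (hypothetical layout).
    Returns (min_seq, base_paths, later), where `later` maps each second
    sequence number (greater than min_seq) to its paths, in ascending order.
    """
    min_seq = min(seq for seq, _ in target_files)
    base_paths = [path for seq, path in target_files if seq == min_seq]

    later = defaultdict(list)
    for seq, path in target_files:
        if seq > min_seq:
            later[seq].append(path)

    # Sort so that later files are replayed in the order they were written.
    return min_seq, base_paths, dict(sorted(later.items()))


files = [
    (3, "data-3.parquet"),
    (1, "base-a.parquet"),
    (2, "delete-2.parquet"),
    (1, "base-b.parquet"),
    (3, "update-3.parquet"),
]
min_seq, base_paths, later = group_by_sequence(files)
```

Here the files with the minimum sequence number form the input of the first result set, and `later` plays the role of the mapping relation between each second sequence number and its target data files.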

Preferably, the processing the first result set according to the type of each target data file associated with each second serial number to obtain a data set to be updated includes:

when the target data file associated with the second serial number is a general data file, adding all data in the target data file into the first result set to obtain a second result set;

when the target data file associated with the second serial number is a deleted data file, deleting from the second result set all data that matches the target data file to obtain a third result set;

when the target data file associated with the second serial number is a column update file, covering the information corresponding to the original field in the third result set with the first field information in the target data file to obtain a fourth result set, and replacing the first result set with the fourth result set;

repeating the steps until all target data files associated with all the second serial numbers are processed, and taking a fourth result set corresponding to the finally processed second serial numbers as a data set to be updated.
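The replay over file types can be sketched as below. The sketch makes several assumptions that the text does not state: rows are plain dictionaries, the primary key is a single column named `id`, a general data file overwrites an existing row with the same key, and the files of one sequence number are applied in the order general data, delete, column update. The names `apply_file` and `replay` are hypothetical.

```python
# Assumed processing order for files that share one sequence number.
ORDER = {"data": 0, "delete": 1, "column_update": 2}


def apply_file(result, file_type, rows, key="id"):
    """Apply one file to the running result set (dict: primary key -> row)."""
    if file_type == "data":
        for row in rows:                      # add all rows of the file
            result[row[key]] = dict(row)
    elif file_type == "delete":
        for row in rows:                      # drop rows matched by primary key
            result.pop(row[key], None)
    elif file_type == "column_update":
        for row in rows:                      # overwrite only the updated columns
            if row[key] in result:
                result[row[key]].update(row)
    else:
        raise ValueError(f"unknown file type: {file_type}")
    return result


def replay(base_rows, later_files, key="id"):
    """Build the data set to be updated.

    base_rows: rows loaded from the files with the minimum sequence number.
    later_files: dict mapping sequence number -> list of (file_type, rows).
    """
    result = {row[key]: dict(row) for row in base_rows}   # first result set
    for seq in sorted(later_files):
        for file_type, rows in sorted(later_files[seq], key=lambda f: ORDER[f[0]]):
            apply_file(result, file_type, rows, key)
        # `result` now serves as the first result set for the next sequence number.
    return result


base = [{"id": 1, "name": "a", "score": 10}, {"id": 2, "name": "b", "score": 20}]
later = {
    2: [("delete", [{"id": 2}]), ("data", [{"id": 3, "name": "c", "score": 30}])],
    3: [("column_update", [{"id": 1, "score": 99}, {"id": 7, "score": 1}])],
}
to_update = replay(base, later)
```

In this example the column update row for `id` 7 has no match in the result set and is therefore ignored, while the row for `id` 1 keeps its `name` and receives only the new `score`.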

An Apache Iceberg-based data table column update system, comprising:

the creation module is used for constructing column update files and adding a first field representing updated column information into metadata of all data files, wherein all data files comprise the column update files;

the selecting module is used for acquiring each data file corresponding to the target data table and screening a plurality of target data files from the data files according to the query condition;

the updating module is used for determining a data set to be updated according to the plurality of target data files, updating the data in the data set to be updated and generating an updating record;

and the extraction module is used for extracting the primary key information and the first field information of the updated data from the update record and writing the primary key information and the first field information into the column update file.

Preferably, the system further comprises:

and the recording module is used for generating a first serial number according to the data updating time, writing the first serial number into all the first files generated by the data updating, and recording the maximum value and the minimum value of each column in each corresponding first file and bloom filter auxiliary information in the metadata of each first file.

An electronic device comprising a memory and a processor, the memory to store one or more computer instructions, wherein the one or more computer instructions are executable by the processor to implement an Apache Iceberg-based data table column update method of any of the above.

A computer readable storage medium storing a computer program which, when executed by a computer, causes the computer to implement an Apache Iceberg-based data table column updating method as set forth in any one of the preceding claims.

The application has the following beneficial effects:

the application executes the column update operation based on Apache Iceberg, avoids searching the data to be updated by scanning the whole content of all data files, effectively reduces the writing delay, quickens the writing completion time, and simultaneously, can record the complete update content by only writing the main key and updating the column information through newly adding a file type in Apache Iceberg, thereby avoiding writing unmodified column information and effectively reducing the storage cost under the whole column update scene.

Drawings

In order to illustrate the embodiments of the application or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the application, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a data table column updating method based on Apache Iceberg provided by the application;

FIG. 2 is a flow chart of the construction of a data set to be updated in the present application;

FIG. 3 is a schematic diagram of a data table column updating system based on Apache Iceberg provided by the application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The terms "first," "second," and the like in the claims and the description of the application are used to distinguish between similar objects and not necessarily to describe a particular sequential or chronological order. It is to be understood that the terms so used may be interchanged where appropriate; they merely describe how objects of the same nature are distinguished in the embodiments of the application. Furthermore, the terms "comprise" and "have" and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements but may include other elements not expressly listed or inherent to such process, method, system, article, or apparatus.

As shown in FIG. 1, the present application provides a data table column updating method based on Apache Iceberg, which includes the following steps:

S110, constructing column update files, and adding a first field representing updated column information in metadata of all data files, wherein all data files comprise the column update files;

S120, acquiring each data file corresponding to the target data table, and screening a plurality of target data files from the data files according to the query condition;

S130, determining a data set to be updated according to the target data files, updating data in the data set to be updated and generating an update record;

S140, extracting the primary key information and the first field information of the updated data from the update record, and writing the primary key information and the first field information into the column update file.

Apache Iceberg is a widely used open-source data lake solution that offers fast writes and queries, and it generates a sequence number corresponding to the operation time for each add or delete operation. Apache Iceberg defines two types of data files: general data files and delete data files. The metadata of these files includes an equalityIds field, which marks the primary key information, and a sequence number field, which marks the order in which the files were created. In addition, at certain moments Apache Iceberg merges all files into complete data, writes the result into a new general data file, and cleans up the previous files to reduce the number of operations required for replay. At that point, the same sequence number is written into the files corresponding to the data table, and this sequence number is recorded as the reference sequence number.
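To make the file types and metadata fields concrete, the following Python sketch models them as plain data classes. It is only an illustration: `FileKind`, `FileEntry`, and `validate` are hypothetical names, the field names are adapted from the fields mentioned in this description, and the consistency rule in `validate` is an assumption rather than something the application specifies.

```python
from dataclasses import dataclass
from enum import Enum


class FileKind(Enum):
    DATA = "data"                    # general data file
    DELETE = "delete"                # delete data file
    COLUMN_UPDATE = "column_update"  # the file type added by the application


@dataclass(frozen=True)
class FileEntry:
    path: str
    kind: FileKind
    sequence_number: int      # order in which the file was created
    equality_ids: tuple = ()  # field IDs of the primary key columns
    partial_ids: tuple = ()   # field IDs of the updated columns (the "first field")


def validate(entry):
    """Assumed rule: a column update file must name both its key and updated columns."""
    if entry.kind is FileKind.COLUMN_UPDATE and not (entry.equality_ids and entry.partial_ids):
        raise ValueError("column update file needs equality_ids and partial_ids")
    return entry


update_file = validate(
    FileEntry("update-5.parquet", FileKind.COLUMN_UPDATE, 5,
              equality_ids=(1,), partial_ids=(4,))
)
```

Keeping `partial_ids` in the metadata of every file, and leaving it empty for general and delete data files, mirrors the statement that the first field is added to the metadata of all data files.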

Further, determining the minimum sequence number in the plurality of target data files, and recording the mapping relation between each second sequence number in the plurality of target data files and the corresponding target data files, wherein the second sequence number is the sequence number larger than the minimum sequence number in the plurality of target data files;

In an exemplary embodiment, as shown in FIG. 2, all data files corresponding to the target data table at the time the task starts are acquired first. The maximum value and minimum value of each column in each data file, together with the bloom filter auxiliary information, are compared with the query condition, and data in the data files that does not satisfy the query condition is removed to reduce unnecessary reads and accelerate the query, yielding a plurality of target data files corresponding to the target data table. The minimum sequence number contained in the target data files is then determined and denoted sequence number 0; it represents either the initial sequence number of the first write or the reference sequence number generated by a merge. Every sequence number in the target data files that is greater than the minimum sequence number is recorded as a second sequence number, and the mapping relation between each second sequence number and its corresponding target data files is recorded. All target data files corresponding to sequence number 0 are loaded to generate the first result set res0, and all target data files corresponding to each second sequence number are then found in turn according to the mapping relation.

Further, when the target data file associated with the second serial number is a general data file, adding all data in the target data file into the first result set to obtain a second result set;

Then, the first result set is processed according to the type of each target data file associated with each second sequence number to obtain the data set to be updated. For any second sequence number: when the corresponding target data file is a general data file, all data in the target data file is added to the first result set res0 to obtain a second result set res1; when the corresponding target data file is a delete data file, the equalityIds field information in the target data file is matched against the data in the second result set, and the matching data is deleted from the second result set to obtain a third result set res2; when the corresponding target data file is a column update file, the column update file is loaded to obtain the update data, the equalityIds field information in the update data is matched against the third result set, and, for the matched data, the information of the original fields in the third result set is overwritten with the partialIds field information in the column update file to obtain a fourth result set res3. After all target data files corresponding to a given second sequence number have been processed, the first result set is replaced with the fourth result set, so that the fourth result set becomes the new first result set. If unprocessed second sequence numbers remain, the above steps are repeated. Finally, the fourth result set corresponding to the last processed second sequence number is output as the data set to be updated.

Finally, the data in the data set to be updated is modified in memory to obtain an update record, and the primary key information and the first field information are extracted from the update record and written into a column update file; columns that were not modified do not need to be written into the column update file.
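As a rough illustration of this extraction step, the Python sketch below projects each update record onto its primary key columns and updated columns. The function name `to_column_update_rows` and the dictionary-based record layout are assumptions made for the example; the actual on-disk format of the column update file is not described here.

```python
def to_column_update_rows(update_records, key_columns, updated_columns):
    """Keep only the primary key columns and the updated columns of each record."""
    keep = list(key_columns) + [c for c in updated_columns if c not in key_columns]
    return [{col: record[col] for col in keep} for record in update_records]


# Update records as they might look after the in-memory modification
# (all column names and values are invented for the example).
records = [
    {"id": 1, "name": "a", "city": "x", "score": 99},
    {"id": 3, "name": "c", "city": "y", "score": 75},
]
column_update_rows = to_column_update_rows(records, ["id"], ["score"])
```

Only `id` and `score` reach the column update file in this example; `name` and `city` were not modified and are left out.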

According to this embodiment, the column update operation is performed based on Apache Iceberg, which avoids searching for the data to be updated by scanning the entire content of all data files, effectively reduces write latency, and shortens the time needed to complete the write. At the same time, by adding a new file type to Apache Iceberg, the complete updated content can be recorded by writing only the primary key and the updated column information, which avoids writing unmodified column information and effectively reduces the storage cost in whole-column update scenarios.
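The storage effect can be illustrated with a back-of-the-envelope calculation. Every number below is hypothetical (a 50-column table with fixed 8-byte values and one million updated rows), and the estimate ignores compression, encoding, and file metadata; it only compares writing complete rows with writing the primary key plus one updated column.

```python
def estimate_bytes(num_rows, column_widths, columns):
    """Uncompressed size of writing `columns` for `num_rows` rows."""
    return num_rows * sum(column_widths[c] for c in columns)


# Hypothetical table: one primary key column plus 49 other 8-byte columns.
widths = {"id": 8, **{f"c{i}": 8 for i in range(1, 50)}}
rows_updated = 1_000_000

full_rows = estimate_bytes(rows_updated, widths, widths.keys())     # complete rows
column_update = estimate_bytes(rows_updated, widths, ["id", "c7"])  # key + 1 column
ratio = column_update / full_rows
```

Under these assumptions the column update file holds 16,000,000 bytes against 400,000,000 bytes for complete rows, that is, 4% of the volume. Real savings depend on column widths, encodings, and how many columns an update touches.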

As shown in FIG. 3, the present application further provides a data table column updating system based on Apache Iceberg, which includes a creation module, a selection module, an updating module, and an extraction module.

One embodiment of the above system may be as follows: the creation module constructs a column update file and adds a first field representing updated column information to the metadata of all data files, wherein all data files comprise the column update file; the selection module acquires each data file corresponding to the target data table and screens a plurality of target data files from the data files according to the query condition; the updating module determines a data set to be updated according to the plurality of target data files, updates the data in the data set to be updated, and generates an update record; and the extraction module extracts the primary key information and the first field information of the updated data from the update record and writes them into the column update file.
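The division into four modules can be sketched as a small pipeline. This is a structural illustration only: `ColumnUpdateSystem`, its callable fields, and the in-memory stand-ins used in the usage example are all hypothetical, and real modules would read and write Iceberg files instead of Python lists.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ColumnUpdateSystem:
    create: Callable    # creation module: () -> empty column update file
    select: Callable    # selection module: (predicate) -> target rows
    update: Callable    # updating module: (rows, new_values) -> update records
    extract: Callable   # extraction module: (records, columns, file) -> written file

    def run(self, predicate, new_values):
        column_update_file = self.create()
        targets = self.select(predicate)
        records = self.update(targets, new_values)
        return self.extract(records, list(new_values), column_update_file)


# Usage with in-memory stand-ins for each module (illustrative only).
table = [{"id": 1, "name": "a", "score": 1}, {"id": 2, "name": "b", "score": 2}]
system = ColumnUpdateSystem(
    create=lambda: {"partial_ids": [], "rows": []},
    select=lambda predicate: [row for row in table if predicate(row)],
    update=lambda rows, new: [{**row, **new} for row in rows],
    extract=lambda records, cols, f: {
        **f,
        "partial_ids": cols,
        "rows": [{"id": r["id"], **{c: r[c] for c in cols}} for r in records],
    },
)
result = system.run(lambda row: row["id"] == 2, {"score": 50})
```

Passing the modules in as callables keeps the sketch short and makes each stage replaceable, which loosely matches the way the description separates the responsibilities of the four modules.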

The application also provides an electronic device comprising a memory and a processor, wherein the memory is used for storing one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the Apache Iceberg-based data table column updating method described above.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the electronic device described above may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

The present application also provides a computer-readable storage medium storing a computer program which, when executed by a computer, implements an Apache Iceberg-based data table column updating method as described above.

By way of example, a computer program may be divided into one or more modules/units, which are stored in a memory and executed by a processor, with data I/O transmission carried out through an input interface and an output interface, to implement the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in a computer device.

The computer device may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The computer device may include, but is not limited to, a memory and a processor. It will be appreciated by those skilled in the art that the present embodiment is merely an example of a computer device and does not limit it; a computer device may include more or fewer components, combine certain components, or use different components. For example, a computer device may also include an input device, a network access device, a bus, etc.

The processor may be a central processing unit (Central Processing Unit, CPU), but may also be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor.

The memory may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The memory may also be an external storage device of the computer device, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like provided on the computer device. Further, the memory may include both an internal storage unit and an external storage device of the computer device. The memory may be used to store a computer program and other programs and data required by the computer device, and may also be used to temporarily store the program code in an output device. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

The foregoing is merely illustrative of specific embodiments of the present application, and the scope of the present application is not limited thereto, but any changes or substitutions within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. The data list updating method based on Apache Iceberg is characterized by comprising the following steps:

determining a data set to be updated according to the target data files, updating the data in the data set to be updated and generating an update record, wherein the method comprises the following steps:

processing the first result set according to the type of each target data file associated with each second serial number to obtain a data set to be updated;

the processing the first result set according to the type of each target data file associated with each second serial number to obtain a data set to be updated includes:

repeating the steps until all target data files associated with all second serial numbers are processed, and taking a fourth result set corresponding to the finally processed second serial numbers as a data set to be updated;

2. The method for updating a data list based on Apache Iceberg of claim 1, further comprising:

3. The method for updating a data list based on Apache Iceberg of claim 1, wherein the screening a plurality of target data files from the data files according to query conditions comprises:

4. The method for updating a data list based on Apache Iceberg of claim 1, wherein metadata of all data files further includes a sequence number field;

the all data files also include general data files and delete data files.

5. An Apache Iceberg-based data table column update system, comprising:

the updating module is used for determining a data set to be updated according to the plurality of target data files, updating the data in the data set to be updated and generating an updating record, and comprises the following steps:

6. The Apache Iceberg-based data table column update system of claim 5, further comprising:

7. An electronic device comprising a memory and a processor, the memory to store one or more computer instructions, wherein the one or more computer instructions are executable by the processor to implement an Apache Iceberg-based data table column update method of any one of claims 1-4.

8. A computer-readable storage medium storing a computer program, wherein the computer program when executed causes a computer to implement an Apache Iceberg-based data table column updating method according to any one of claims 1 to 4.