CN116521641A - Data lake-based data reading and writing method, data reading and writing device and storage medium - Google Patents
Data lake-based data reading and writing method, data reading and writing device and storage medium
- Publication number
- CN116521641A (application CN202310096998.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- primary key
- file
- log file
- reading
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/1805—Append-only file systems, e.g. using logs or journals to store data
- G06F16/1815—Journaling file systems
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a data lake-based data reading and writing method, a data reading and writing device and a computer storage medium. The data reading and writing method comprises the following steps: acquiring record data and a primary key of the record data; reading a first primary key list of the base files of all file groups in the data lake; judging whether the primary key of the record data exists in the first primary key list of any base file; if not, reading a second primary key list of the log files of all file groups in the data lake; judging whether the primary key of the record data exists in the second primary key list of any log file; and if so, writing the record data into the log file whose second primary key list contains the primary key of the record data. By changing the log file format, the data reading and writing method makes the primary keys recorded in the log file indexable, so that a newly added record can be written directly into a log file, which reduces the IO (input/output) overhead of data ingestion and accelerates the ingestion of new data into the data lake.
Description
Technical Field
The present disclosure relates to the field of data query technologies, and in particular, to a data reading and writing method, a data reading and writing device, and a computer storage medium based on a data lake.
Background
Data lakes play an important role in modern big data storage systems. A data lake can store vast amounts of data of any type, including structured, semi-structured, and unstructured data. These data are continuously written from different data sources into a DFS (Distributed File System) through the data lake. Hudi, as a data lake storage system, provides fast data ingestion and deletion capabilities on top of a DFS.
However, at present the index information can only be stored in the base file. When the table uses the MOR format, a newly added record cannot be appended to a log file; instead, the whole base file has to be rewritten. This brings high IO overhead and write amplification, slows down data ingestion, and produces more files that occupy more storage space.
Disclosure of Invention
The application provides a data reading and writing method, a data reading and writing device and a computer storage medium based on a data lake.
A technical solution adopted by the present application is to provide a data lake-based data reading and writing method, which comprises the following steps:
acquiring record data and a primary key of the record data;
reading a first primary key list of the base files of all file groups in the data lake;
judging whether the primary key of the record data exists in the first primary key list of any base file;
if not, reading a second primary key list of the log files of all file groups in the data lake;
judging whether the primary key of the record data exists in the second primary key list of any log file;
if so, writing the record data into the log file whose second primary key list contains the primary key of the record data.
The data reading and writing method further comprises the following steps:
when the primary key of the record data exists in the first primary key list of a base file, writing the record data into the log file of the file group corresponding to that base file;
or, when the primary key of the record data exists neither in the first primary key list of any base file nor in the second primary key list of any log file, writing the record data into the log file of the file group whose base file is the smallest in the data lake.
Wherein the determining whether the primary key of the record data exists in the second primary key list of any log file includes:
acquiring a primary key value range of each log file;
acquiring first log files that meet a primary key value condition based on the primary key value of the record data, wherein the primary key value condition is that the primary key value of the record data falls within the primary key value range of the first log file;
and judging whether the primary key of the record data exists in the second primary key list of any one of the first log files.
Wherein the determining whether the primary key of the record data exists in the second primary key list of any one of the first log files includes:
acquiring a preset bloom filter of each first log file;
acquiring second log files that meet a filtering condition based on the primary key of the record data, wherein the filtering condition is that the bloom filter of the second log file returns "possibly present" for the primary key of the record data;
and judging whether the primary key of the record data exists in the second primary key list of any one of the second log files.
Wherein writing the record data into the log file whose second primary key list contains the primary key of the record data comprises:
writing a current data block at the end of the log file whose second primary key list contains the primary key of the record data, wherein the current data block is used for storing the record data;
and setting the footer information of the current data block according to the primary key of the record data, wherein the footer information comprises a bloom filter and a primary key list of the current log file.
The setting the footer information of the current data block according to the primary key of the record data includes:
acquiring the footer information of the previous data block of the current data block in the current log file;
and updating the primary key list in the footer information of the previous data block with the primary key of the current data block to generate the footer information of the current data block.
The updating the primary key list in the footer information of the previous data block with the primary key of the current data block to generate the footer information of the current data block includes:
acquiring the footer information of the previous data block;
when the footer information of the previous data block does not exist or is empty, acquiring all the data blocks of the current log file, and extracting the primary keys of all the data blocks;
generating a primary key list of the current log file according to the primary keys of all the data blocks and the primary key of the current data block, and generating a bloom filter according to the primary key list of the current log file;
and generating the footer information of the current data block according to the primary key list and the bloom filter of the current log file.
The data reading and writing method further comprises the following steps:
and compacting the record data of the log files of all the file groups into the base files to which the record data belong at a preset period.
Another technical solution adopted by the present application is to provide a data read-write device, which comprises a memory and a processor coupled to the memory;
the memory is used for storing program data, and the processor is used for executing the program data to realize the data reading and writing method.
Another technical solution adopted in the present application is to provide a computer storage medium for storing program data, which when executed by a computer, is used to implement the data reading and writing method as described above.
The beneficial effects of the present application are as follows: the data read-write device acquires record data and the primary key of the record data; reads a first primary key list of the base files of all file groups in the data lake; judges whether the primary key of the record data exists in the first primary key list of any base file; if not, reads a second primary key list of the log files of all file groups in the data lake; judges whether the primary key of the record data exists in the second primary key list of any log file; and if so, writes the record data into the log file whose second primary key list contains the primary key of the record data. By changing the log file format, the data reading and writing method makes the primary keys recorded in the log file indexable, so that the base file does not need to be rewritten when a record is newly added, which reduces the storage space occupied during data ingestion; a newly added record can be written directly into a log file, which reduces the IO overhead of data ingestion and accelerates the ingestion of new data into the data lake.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an embodiment of a data read/write method provided in the present application;
FIG. 2 is a schematic diagram of the overall flow of the data read-write method provided in the present application;
FIG. 3 is a schematic diagram of a file group in a data lake provided herein;
FIG. 4 is a flowchart illustrating specific sub-steps of the data read/write method step S15 shown in FIG. 1;
FIG. 5 is a flowchart illustrating specific sub-steps of the data read/write method step S16 shown in FIG. 1;
FIG. 6 is a schematic diagram of an embodiment of a data read/write device according to the present application;
fig. 7 is a schematic structural diagram of an embodiment of a computer storage medium provided in the present application.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In current data lake implementations, the MOR table format is employed in order to obtain near real-time data ingestion capability. This table format appends record modification and deletion operations to log files and writes newly added records to new base files. The log file uses a row-based storage format such as Avro, which benefits from appendable writes and therefore offers higher write performance but poorer query performance. The base file uses a columnar storage format such as Parquet; for newly added records, this file format does not support appending data to an existing file, so adding a record requires rewriting the whole file.
The disadvantage of rewriting the entire file is poor write latency and severe write amplification. At present, the reason why newly added data cannot be appended directly to the log file is that whether a record already exists can only be determined by looking up the primary key index information stored in the footer of the base file; the primary key index information of the log file is not recorded, so it cannot be determined whether a record is a new insertion or a modification. As a result, for a newly arrived record, the data lake examines all base files to determine whether the record exists; upon finding that it does not, the data lake selects the base file with the smallest file size, reads its data into memory, and writes a new base file that also contains the newly added data.
Compared with appending to a file directly, this process suffers from higher IO overhead and write amplification, so data ingestion is slower, and the additional files occupy more storage space.
In this regard, the present application provides a data read-write method that is implemented as follows: the index information of records is added to the footer of the log file, so that a new record can be appended to the log file without rewriting the whole base file. This improves the data ingestion speed of the data lake and provides near real-time ingestion capability on MOR tables; at the same time, the design gives better control over file size and file count, thereby reducing the pressure that a large number of small files places on the distributed storage system.
The technical terms mentioned in the present application are first described below:
1. COW: copy-on-write, hudi (a data lake storage system), has high data intake delay and low query delay, and uses a columnar file format to store data.
2. MOR: and merging when reading, wherein the data is stored in a column file format and a row file format in a table format in Hudi, and has low data ingestion delay and high query delay.
3. Hudi: a data lake implementation.
4. Parque: a columnar file format is typically stored as a base file in a data lake.
5. Avro: a line file format is typically used as a log file storage format in a data lake.
6. DFS: a distributed file system.
7. Record: record a piece of data in Hudi.
8. Index: a mapping of recorded primary keys and file paths.
9. File group: a base file and a plurality of log files.
10. Bloom filter: the return false representative value must not exist and the return true representative value may exist.
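As an illustration of the bloom filter semantics in item 10 above, the following minimal sketch uses Google Guava's BloomFilter. This is only an illustrative assumption; the present application does not prescribe a particular bloom filter implementation, and the sizing parameters shown are hypothetical.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterSemantics {
    public static void main(String[] args) {
        // Hypothetical sizing parameters; real values would depend on the log file.
        int expectedKeys = 10_000;
        double falsePositiveRate = 0.01;

        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), expectedKeys, falsePositiveRate);

        filter.put("key-001"); // a primary key written to the log file

        // true  => the key may exist (a false positive is possible), so an exact check follows
        // false => the key definitely does not exist in this log file
        System.out.println(filter.mightContain("key-001")); // true
        System.out.println(filter.mightContain("key-999")); // almost certainly false
    }
}
```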
Referring to fig. 1 and fig. 2 in conjunction with the above description, fig. 1 is a schematic flow chart of an embodiment of the data reading and writing method provided in the present application, and fig. 2 is a schematic diagram of the overall flow of the data reading and writing method provided in the present application.
The data reading and writing method is applied to a data read-write device, wherein the data read-write device may be a server, or a system formed by a server and other devices cooperating with each other. Accordingly, the parts included in the data read-write device, for example its units, sub-units, modules, and sub-modules, may all be disposed in the server, or may be disposed separately in the server and the other devices.
Further, the server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules, for example, software or software modules for providing a distributed server, or may be implemented as a single software or software module, which is not specifically limited herein. In some possible implementations, the data read-write method of the embodiments of the present application may be implemented by a processor invoking computer readable instructions stored in a memory.
Specifically, as shown in fig. 1, the data read-write method in the embodiment of the present application specifically includes the following steps:
step S11: recording data is acquired, and a primary key of the recording data is acquired.
In this embodiment of the present application, as shown in fig. 2, when the data lake performs data ingestion, for a new record, the primary key and the primary key index information stored in the footers of all the base files are extracted.
Step S12: reading a first primary key list of the base files of all file groups in the data lake.
Step S13: judging whether the primary key of the record data exists in the first primary key list of any base file.
In this embodiment of the present application, the data read-write device compares the primary key of the record data with the primary key lists in the footers of all base files in the data lake, determines whether the primary key of the record data exists in the primary key list of one or more base file footers, and thereby determines whether the record data modifies an existing record or adds a new record.
Specifically, as shown in fig. 2, if the primary key of the record data exists in the primary key list of a certain base file, the data read-write device writes the record data into the file group of that base file, that is, appends the modification operation of the record data to the log file of that file group. If the primary key of the record data does not exist in the primary key list of any base file, the process proceeds to step S14.
Step S14: reading a second primary key list of the log files of all file groups in the data lake.
In this embodiment of the present application, when the primary key of the record data does not exist in the primary key list of any base file, the data read-write device continues to extract the primary key index information, that is, the primary key list, stored in the footer of the last Avro data block of every log file in the data lake.
Step S15: judging whether the primary key of the record data exists in the second primary key list of any log file.
In this embodiment of the present application, as shown in fig. 2, if the primary key of the record data exists neither in the first primary key list of any base file nor in the second primary key list of any log file, the data read-write device treats the record data as a newly added record. In this case, the data read-write device selects the file group whose base file is the smallest in the data lake, and then appends the write operation of the record data to the log file of that file group.
If the primary key of the record data exists in the primary key list of a certain log file, the data read-write device still treats the record data as a modification of an existing record, and proceeds to step S16.
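The routing decision of steps S11 to S16, together with the two branches described above, can be summarized by the sketch below. It is a minimal illustration over simplified in-memory types; FileGroup, BaseFile, LogFile, and their accessors are hypothetical stand-ins and not part of Hudi's actual API. The sketch only shows where a record is routed; the actual append of the data block and footer update is described under steps S161 and S162 below.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified model of a file group: one base file plus one log file.
record BaseFile(List<String> primaryKeys, long sizeBytes) {}
record LogFile(List<String> primaryKeys) {}
record FileGroup(BaseFile baseFile, LogFile logFile) {}

class IngestionRouter {
    /** Decide which log file a newly arrived record should be appended to. */
    static LogFile route(String recordKey, List<FileGroup> fileGroups) {
        // Steps S12/S13: check the first primary key lists (base file footers).
        for (FileGroup group : fileGroups) {
            if (group.baseFile().primaryKeys().contains(recordKey)) {
                // Existing record: append the modification to this group's log file.
                return group.logFile();
            }
        }
        // Steps S14/S15: check the second primary key lists (log file footers).
        for (FileGroup group : fileGroups) {
            if (group.logFile().primaryKeys().contains(recordKey)) {
                // Record previously written to a log file: append to the same log file.
                return group.logFile();
            }
        }
        // Otherwise: brand-new record, pick the group with the smallest base file.
        return fileGroups.stream()
                .min(Comparator.comparingLong(g -> g.baseFile().sizeBytes()))
                .orElseThrow()
                .logFile();
    }
}
```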
Further, in the log file of the present application, an ordered primary key list is added to the content part of each Avro data block to record the primary key information of the current data block. In addition, a bloom filter and the minimum and maximum primary key values within the current log file are added to the footer part, so as to improve primary key query efficiency.
With continued reference to fig. 3 and fig. 4, fig. 3 is a schematic structural diagram of a file group in a data lake provided in the present application; fig. 4 is a flowchart illustrating specific sub-steps of the data read/write method step S15 shown in fig. 1.
As shown in fig. 3, the present application improves and optimizes the Avro data block of the log file: the Avro data block additionally records the primary keys of the record data it contains, and its footer additionally carries a bloom filter together with the minimum and maximum primary key values of the log file.
The primary key index information stored in the footers of the base file and of the log file is of the same type, namely a bloom filter and the maximum and minimum primary key values within the file.
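A minimal sketch of this footer index information follows, assuming a simplified in-memory representation: an ordered primary key list, the minimum and maximum primary key values, and a bloom filter (Guava is used here purely for illustration). The class and field names are hypothetical.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical footer carried by the last Avro data block of a log file. */
class DataBlockFooter {
    final List<String> sortedPrimaryKeys; // ordered primary key list of the log file so far
    final String minKey;                  // smallest primary key in the current log file
    final String maxKey;                  // largest primary key in the current log file
    final BloomFilter<String> bloomFilter;

    /** Assumes a non-empty, ordered set of primary keys. */
    DataBlockFooter(TreeSet<String> keys) {
        this.sortedPrimaryKeys = List.copyOf(keys);
        this.minKey = keys.first();
        this.maxKey = keys.last();
        this.bloomFilter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), keys.size(), 0.01);
        keys.forEach(this.bloomFilter::put);
    }

    /** Range check used before consulting the bloom filter. */
    boolean keyInRange(String key) {
        return key.compareTo(minKey) >= 0 && key.compareTo(maxKey) <= 0;
    }
}
```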
Specifically, as shown in fig. 4, the data read-write method in the embodiment of the present application specifically includes the following steps:
step S151: and acquiring a primary key value range of each log file.
In this embodiment of the present application, as shown in fig. 3, the data read-write device may obtain a primary key value range of the log file at a footer of a last Avro data block of the log file, where the primary key value range is determined by a numerical value range between a primary key maximum value and a primary key minimum value.
Step S152: and acquiring a first log file conforming to a primary key value condition based on the value of the primary key of the recorded data, wherein the primary key value condition is that the value of the primary key of the recorded data is in a primary key value range of the first log file.
In this embodiment of the present application, the data read-write device screens out the first log file that meets the primary key value condition from all log files according to the primary key value of the recorded data. The data read-write device can filter out the log files of which the part of the primary key value range does not comprise the numerical value of the primary key of the recorded data through the primary key numerical value condition, so that the workload of traversing the log files is reduced, and the recording and inquiring efficiency is improved.
Step S153: judging whether the primary key of the record data exists in the second primary key list of any one of the first log files.
In this embodiment of the present application, the data read-write device then queries, among the log files that passed the primary key value condition screening, whether the primary key of the record data exists in the primary key list of one of these log files.
Furthermore, the data read-write device can use the bloom filter in the Avro data block footer to further screen the log files that passed the primary key value condition screening, which further improves primary key query efficiency.
Specifically, the data read-write device inputs the primary key of the record data into the bloom filter of each log file and obtains the filter's output. If the bloom filter returns false, the primary key of the record data definitely does not exist in the corresponding log file, and that log file is filtered out. If the bloom filter returns true, the primary key of the record data may exist in the log file, and the primary key query is then performed on that log file.
In the embodiment of the present application, for a record, by extracting its primary key value and comparing it against the maximum value, minimum value, and bloom filter stored in the footers, the base files and log files that cannot contain the record can be eliminated directly, which reduces the number of records that must be compared one by one.
To handle possible bloom filter false positives, if the bloom filter returns true, all the Avro data blocks in the log file are further read and the record's primary key is compared one by one against the primary key information in those blocks.
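Putting steps S151 to S153 and the bloom filter screening together, the pruning of candidate log files could look like the following sketch. LogFileIndex and its accessors are hypothetical stand-ins for the footer information described above, not Hudi's real classes.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical view of the index information in a log file's last data block footer. */
interface LogFileIndex {
    String minKey();
    String maxKey();
    Predicate<String> bloomFilter();    // returns false => key definitely absent
    List<String> sortedPrimaryKeys();   // exact, ordered primary key list
}

class LogFilePruner {
    /** Returns true if the record's primary key is found in one of the log files. */
    static boolean existsInAnyLogFile(String recordKey, List<LogFileIndex> logFiles) {
        return logFiles.stream()
                // Steps S151/S152: drop files whose [min, max] range cannot contain the key.
                .filter(f -> recordKey.compareTo(f.minKey()) >= 0
                          && recordKey.compareTo(f.maxKey()) <= 0)
                // Bloom filter screening: drop files where the key definitely does not exist.
                .filter(f -> f.bloomFilter().test(recordKey))
                // Step S153: exact check against the ordered primary key list
                // (handles bloom filter false positives).
                .anyMatch(f -> f.sortedPrimaryKeys().contains(recordKey));
    }
}
```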
Step S16: writing the record data into the log file whose second primary key list contains the primary key of the record data.
In this embodiment of the present application, if the primary key of the record data exists in the primary key list of a certain log file, the data read-write device selects the file group to which that log file belongs and appends the record data to the log file of that file group.
Thus, throughout the data ingestion process, the data read-write device no longer needs to rewrite the whole base file when processing a newly added record; an append to the log file replaces that operation.
Specifically, although indexing record information in a log file is slower than in a base file, the data read-write device can periodically compact the record information of the log files into the base files through the compaction service of the data lake, so the time overhead of this part can be effectively controlled (a simple scheduling sketch follows below).
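For illustration only, such periodic compaction could be driven by a simple scheduler as sketched below; this is a generic assumption, not the data lake's actual compaction service, and compactLogFilesIntoBaseFiles is a hypothetical callback.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class CompactionScheduler {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** Periodically merge log file records into their base files (hypothetical callback). */
    void start(Runnable compactLogFilesIntoBaseFiles, long periodMinutes) {
        scheduler.scheduleAtFixedRate(
                compactLogFilesIntoBaseFiles, periodMinutes, periodMinutes, TimeUnit.MINUTES);
    }

    void stop() {
        scheduler.shutdown();
    }
}
```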
Further, the data read-write device stores the primary key of the record data by appending Avro data blocks in the new read-write format to the end of the log file. Specifically, the Avro data block format of the log file is redesigned: an ordered primary key list is added to the content of the new data block to record the primary key information of the current data block, and a bloom filter together with the minimum and maximum primary key values within the current log file is added to the footer part.
With continued reference to fig. 5, fig. 5 is a flowchart illustrating a specific sub-step of the data read/write method step S16 shown in fig. 1.
Specifically, as shown in fig. 5, the data read-write method in the embodiment of the present application specifically includes the following steps:
step S161: and writing the current data block at the end of the log file of the second main key list corresponding to the main key with the recorded data for storing the recorded data.
Step S162: and setting the footer information of the current data block according to the main key of the recorded data, wherein the footer information comprises a bloom filter and a main key list of the current log file.
In the embodiment of the application, when the Avro data block is added, the footer information of the previous Avro data block is checked, the maximum value and the minimum value of the footer information and the ordered primary key list are extracted, the updated maximum value and the minimum value are written into the footer in combination with the ordered primary key list of the current Avro data block, and a bloom filter generated according to the ordered primary key list is written into the footer.
The Avro data block design of the new version log file is compatible with the old version. When the new version of Avro data block is added into the old version log file and the page footer data of the previous Avro data block is empty, all the Avro data blocks in the file are traversed, the primary key information in the Avro data block is extracted, and after the new version of Avro data block is ordered with the primary key information of the newly added Avro data block, the maximum value, the minimum value and a bloom filter generated according to the primary key information are written into the page footers of the Avro data block. When the page footer data of the previous Avro data block is read to be not empty, the subsequently added Avro data block can calculate the maximum value and minimum value information of the bloom filter and the primary key which are required to be set according to the information of the previous data block, and the record in the whole file is not required to be traversed.
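The footer update on append, including the fallback for old-version log files whose previous block carries no footer, might look like the sketch below. AvroBlock, buildFooterKeys, and their shapes are hypothetical simplifications of the redesigned block format described above.

```java
import java.util.List;
import java.util.Optional;
import java.util.TreeSet;

/** Hypothetical, simplified view of an Avro data block in the redesigned log file format. */
record AvroBlock(List<String> blockPrimaryKeys) {}

class FooterUpdater {
    /**
     * Build the ordered primary key set to store in the footer of a newly appended block.
     * previousFooter comes from the last existing block; allExistingBlocks is only scanned
     * when that footer is missing or empty (old-version log file).
     */
    static TreeSet<String> buildFooterKeys(Optional<TreeSet<String>> previousFooter,
                                           List<AvroBlock> allExistingBlocks,
                                           List<String> newBlockKeys) {
        TreeSet<String> keys = new TreeSet<>();
        if (previousFooter.isPresent() && !previousFooter.get().isEmpty()) {
            // Normal case: reuse the previous block's footer, no full-file scan needed.
            keys.addAll(previousFooter.get());
        } else {
            // Compatibility case: old-version file, scan every block once to collect keys.
            for (AvroBlock block : allExistingBlocks) {
                keys.addAll(block.blockPrimaryKeys());
            }
        }
        keys.addAll(newBlockKeys);
        // From this ordered set, the min/max values and a bloom filter would be derived
        // and written into the new block's footer (see the DataBlockFooter sketch above).
        return keys;
    }
}
```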
It should be noted that when a log file contains no new-version Avro data block, all Avro data blocks in the log file must be traversed to determine whether a record exists; the extra time overhead incurred here can be eliminated by running the data lake compaction service once. In addition, under the compatibility design for old-version log files, this extra time cost occurs only the first time a new-version Avro data block is appended to an old-version log file and does not occur on subsequent appends; that is, the extra cost is one-time and essentially negligible.
In the design of the new-version Avro data block, the primary keys are stored separately as additional content, so during the one-by-one comparison the data read-write device does not need to read the whole Avro data block; it can directly read the primary key portion of the block, which reduces the IO overhead of the comparison.
In summary, the present application provides a data lake-based data ingestion optimization method that solves the above problems, provides near real-time data ingestion capability, alleviates write amplification to a certain extent, and keeps file size under control. In this data ingestion optimization method, an ordered primary key list is added to the content part of the Avro data block of the log file to record the primary key information of the current data block, and a bloom filter together with the minimum and maximum primary key values within the current log file is added to the footer part.
The purpose of this is to give the log file the ability to index records, so that the data lake can quickly determine from the log file footer whether an inserted record is newly added or modified. When the data lake ingests data, an inserted record can be written directly into the log file without rewriting the whole base file; the data of the log file is later merged into the base file by the data lake's compaction operation, which further enhances the real-time performance of data ingestion.
In the embodiment of the present application, the data read-write device acquires record data and the primary key of the record data; reads a first primary key list of the base files of all file groups in the data lake; judges whether the primary key of the record data exists in the first primary key list of any base file; if not, reads a second primary key list of the log files of all file groups in the data lake; judges whether the primary key of the record data exists in the second primary key list of any log file; and if so, writes the record data into the log file whose second primary key list contains the primary key of the record data. By changing the log file format, the data reading and writing method makes the primary keys recorded in the log file indexable, so that the base file does not need to be rewritten when a record is newly added, which reduces the storage space occupied during data ingestion; a newly added record can be written directly into a log file, which reduces the IO overhead of data ingestion and accelerates the ingestion of new data into the data lake.
The data reading and writing method solves the problem of slow ingestion of new data into the data lake by appending the write operations of newly added records to log files, thereby improving the ingestion speed. At the same time, it avoids rewriting base files for newly added records, gives better control over file size and file count, and reduces storage overhead.
In addition, by redesigning the log file format of the existing data lake, the data reading and writing method of the present application makes the primary keys recorded in the log file indexable. The change to the log file format only adds the required data to the Avro data block and remains compatible with the old-version log file format, so users do not need to be aware of the modification details and a smooth transition is possible; no external third-party component is introduced, which avoids consistency problems; newly added records do not require the base file to be read in and rewritten, which reduces storage space usage; and newly added records are appended to log files, which reduces the IO overhead of data ingestion and improves the real-time performance of data ingestion in the data lake.
The above embodiments are only one common case of the present application, and do not limit the technical scope of the present application, so any minor modifications, equivalent changes or modifications made to the above matters according to the scheme of the present application still fall within the scope of the technical scheme of the present application.
With continued reference to fig. 6, fig. 6 is a schematic structural diagram of an embodiment of a data read-write device provided in the present application. The data read/write apparatus 500 of the embodiment of the present application includes a processor 51, a memory 52, an input/output device 53, and a bus 54.
The processor 51, the memory 52, and the input/output device 53 are respectively connected to the bus 54, and the memory 52 stores program data, and the processor 51 is configured to execute the program data to implement the data read/write method described in the above embodiment.
In the present embodiment, the processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capabilities. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The general-purpose processor may be a microprocessor, or the processor 51 may be any conventional processor or the like.
Still further, referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of the computer storage medium provided in the present application, in which the program data 61 is stored in the computer storage medium 600, and the program data 61 is used to implement the data read/write method of the above embodiment when being executed by a processor.
Embodiments of the present application are implemented in the form of software functional units and sold or used as a stand-alone product, which may be stored on a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution, in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely an embodiment of the present application, and the patent scope of the present application is not limited thereto, but the equivalent structures or equivalent flow changes made in the present application and the contents of the drawings are utilized, or directly or indirectly applied to other related technical fields, which are all included in the patent protection scope of the present application.
Claims (10)
1. A data lake-based data reading and writing method, characterized by comprising the following steps:
acquiring record data and a primary key of the record data;
reading a first primary key list of the base files of all file groups in the data lake;
judging whether the primary key of the record data exists in the first primary key list of any base file;
if not, reading a second primary key list of the log files of all file groups in the data lake;
judging whether the primary key of the record data exists in the second primary key list of any log file;
if so, writing the record data into the log file whose second primary key list contains the primary key of the record data.
2. The method for reading and writing data according to claim 1, wherein,
the data reading and writing method further comprises the following steps:
when the primary key of the record data exists in the first primary key list of a base file, writing the record data into the log file of the file group corresponding to that base file;
or, when the primary key of the record data exists neither in the first primary key list of any base file nor in the second primary key list of any log file, writing the record data into the log file of the file group whose base file is the smallest in the data lake.
3. The method for reading and writing data according to claim 1, wherein,
the determining whether the primary key of the record data exists in the second primary key list of any log file includes:
acquiring a primary key value range of each log file;
acquiring first log files that meet a primary key value condition based on the primary key value of the record data, wherein the primary key value condition is that the primary key value of the record data falls within the primary key value range of the first log file;
and judging whether the primary key of the record data exists in the second primary key list of any one of the first log files.
4. A data reading and writing method according to claim 3, wherein,
the determining whether the primary key of the record data exists in the second primary key list of any one of the first log files includes:
acquiring a preset bloom filter of each first log file;
acquiring second log files that meet a filtering condition based on the primary key of the record data, wherein the filtering condition is that the bloom filter of the second log file returns "possibly present" for the primary key of the record data;
and judging whether the primary key of the record data exists in the second primary key list of any one of the second log files.
5. The method for reading and writing data according to claim 1, wherein,
the writing the record data into the log file whose second primary key list contains the primary key of the record data comprises:
writing a current data block at the end of the log file whose second primary key list contains the primary key of the record data, wherein the current data block is used for storing the record data;
and setting the footer information of the current data block according to the primary key of the record data, wherein the footer information comprises a bloom filter and a primary key list of the current log file.
6. The method for reading and writing data according to claim 5, wherein,
the setting the footer information of the current data block according to the primary key of the record data includes:
acquiring the footer information of the previous data block of the current data block in the current log file;
and updating the primary key list in the footer information of the previous data block with the primary key of the current data block to generate the footer information of the current data block.
7. The method for reading and writing data according to claim 6, wherein,
the updating the primary key list in the footer information of the previous data block with the primary key of the current data block to generate the footer information of the current data block comprises:
acquiring the footer information of the previous data block;
when the footer information of the previous data block does not exist or is empty, acquiring all the data blocks of the current log file, and extracting the primary keys of all the data blocks;
generating a primary key list of the current log file according to the primary keys of all the data blocks and the primary key of the current data block, and generating a bloom filter according to the primary key list of the current log file;
and generating the footer information of the current data block according to the primary key list and the bloom filter of the current log file.
8. The method for reading and writing data according to claim 1, wherein,
the data reading and writing method further comprises the following steps:
and compacting the record data of the log files of all the file groups into the base files to which the record data belong at a preset period.
9. A data read-write device, wherein the data read-write device comprises a memory and a processor coupled with the memory;
wherein the memory is for storing program data and the processor is for executing the program data to implement the data read-write method according to any one of claims 1 to 8.
10. A computer storage medium for storing program data which, when executed by a computer, is adapted to carry out the data read-write method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310096998.9A CN116521641A (en) | 2023-01-18 | 2023-01-18 | Data lake-based data reading and writing method, data reading and writing device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116521641A true CN116521641A (en) | 2023-08-01 |
Family
ID=87394619
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310096998.9A Pending CN116521641A (en) | 2023-01-18 | 2023-01-18 | Data lake-based data reading and writing method, data reading and writing device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116521641A (en) |
-
2023
- 2023-01-18 CN CN202310096998.9A patent/CN116521641A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117057142A (en) * | 2023-08-15 | 2023-11-14 | 中交一公局集团有限公司 | Digital twinning-based vehicle test data processing method and system |
CN116821146A (en) * | 2023-08-31 | 2023-09-29 | 杭州玳数科技有限公司 | Apache Iceberg-based data list updating method and system |
CN116821146B (en) * | 2023-08-31 | 2023-12-08 | 杭州玳数科技有限公司 | Apache Iceberg-based data list updating method and system |
CN118642665A (en) * | 2024-08-15 | 2024-09-13 | 苏州元脑智能科技有限公司 | Data lake locking-free writing method, device, equipment, medium and product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116521641A (en) | Data lake-based data reading and writing method, data reading and writing device and storage medium | |
CN110019218B (en) | Data storage and query method and equipment | |
US8949189B2 (en) | Managing storage of individually accessible data units | |
WO2019109732A1 (en) | Distributed storage method and architecture for gene variation data | |
US20180113767A1 (en) | Systems and methods for data backup using data binning and deduplication | |
US5999936A (en) | Method and apparatus for compressing and decompressing sequential records in a computer system | |
CN116450656B (en) | Data processing method, device, equipment and storage medium | |
CN111324665A (en) | Log playback method and device | |
WO2022037015A1 (en) | Column-based storage method, apparatus and device based on persistent memory | |
US9104726B2 (en) | Columnar databases | |
CN110647423B (en) | Method, device and readable medium for creating storage volume mirror image based on application | |
CN115114232A (en) | Method, device and medium for enumerating historical version objects | |
CN113641681B (en) | Space self-adaptive mass data query method | |
CN114297196A (en) | Metadata storage method and device, electronic equipment and storage medium | |
US20240220470A1 (en) | Data storage device and storage control method based on log-structured merge tree | |
CN116975006A (en) | Data deduplication method, system and medium based on disk cache and B-tree index | |
CN116821139A (en) | Mixed load method and system for partition table design based on distributed database | |
CN115576947A (en) | Data management method and device, combined library, electronic equipment and storage medium | |
CN114861003A (en) | Object enumeration method, device and medium under specified directory | |
CN114443629A (en) | Cluster bloom filter data duplication removing method, terminal equipment and storage medium | |
CN113722623A (en) | Data processing method and device, electronic equipment and storage medium | |
CN117762335B (en) | Writing method, device, equipment and storage medium for Ceph object | |
CN116821146B (en) | Apache Iceberg-based data list updating method and system | |
CN118363999B (en) | Data query method, device, storage medium, and program | |
CN112199596B (en) | Log filtering processing method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||