CN116881261B - Flink-based real-time whole-database lake ingestion method - Google Patents

Flink-based real-time whole-database lake ingestion method

Info

Publication number
CN116881261B
Authority
CN
China
Prior art keywords
operator
data
change
real
statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311133058.9A
Other languages
Chinese (zh)
Other versions
CN116881261A (en)
Inventor
张赵中
唐金鑫
吴小前
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Deepexi Technology Co Ltd
Original Assignee
Beijing Deepexi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Deepexi Technology Co Ltd filed Critical Beijing Deepexi Technology Co Ltd
Priority to CN202311133058.9A
Publication of CN116881261A
Application granted
Publication of CN116881261B
Legal status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 — Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 — Indexing; Data structures therefor; Storage structures
    • G06F16/2282 — Tablespace storage structures; Management thereof
    • G06F16/27 — Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/28 — Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 — Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to the technical field of data processing, and in particular to a Flink-based real-time whole-database lake ingestion method. The method comprises: connecting, through a set of cooperating operators in Flink, to a database supporting CDC; acquiring the real-time change data of the database; and, using a distributed file system as an intermediate store, synchronizing the data-change DML statements and table-structure-change DDL statements in the real-time change data into Iceberg tables at each checkpoint. On the basis of guaranteeing the performance of the Flink job, the technical scheme of the application realizes real-time synchronization of the entire source database into the Iceberg data lake.

Description

Flink-based real-time whole-database lake ingestion method
Technical Field
The application relates to the technical field of data processing, and in particular to a Flink-based real-time whole-database lake ingestion method.
Background
Apache Flink is currently the most popular big-data real-time computing framework. Flink is a distributed, open-source computing framework oriented to both data stream processing and batch data processing; it supports the two application types, stream processing and batch processing, on the same Flink streaming execution model. It also has many connector sub-projects covering most mainstream databases (e.g. MySQL, Oracle) and distributed data stream platforms (e.g. Kafka).
A data lake is a centralized repository that allows users to store all structured and unstructured data at any scale. Users can store data as-is (without first structuring it) and run different types of analysis on it, from dashboards and visualization to big-data processing, real-time analytics and machine learning, to guide better decisions.
Apache Iceberg is an advanced data lake table-format storage technology. An Iceberg table is an open table format designed for huge, petabyte-scale tables. The job of the table format is to determine how the user manages, organizes and tracks all the files that make up the table. It forms an abstraction layer between the physical data files (written in formats such as Parquet or ORC) and the way they are organized into tables.
CDC (Change Data Capture) is a technology for capturing data and data-structure changes from a source database.
In the prior art, open-source technology based on the Flink framework collects the tables of a MySQL database into an Iceberg data lake through CDC, but the structure of each table is determined at the start, and that structure is fixed when the table is written into the Iceberg data lake. In practice, however, the structure of the tables in a database often changes, and the data itself is subject to changes such as deletion, yet these changes cannot be synchronized to the Iceberg data lake.
Disclosure of Invention
The application provides a Flink-based real-time whole-database lake ingestion method, which aims to solve, at least to a certain extent, the problem in the related art that, after the tables and data of a database have been collected into a data lake, subsequent structure changes and data changes of those tables cannot be synchronized into the data lake.
The scheme of the application is as follows:
a method for real-time whole-reservoir lake entering based on Flink comprises the following steps:
step S1, a database supporting CDC is docked through a Source operator, and real-time change data of the database are obtained; the real-time change data includes: data change DML (Data Manipulation Language ) statements and table structure change DDL (Data Definition Language ) statements;
s2, performing keyBy on the real-time change data according to table names, and distributing the real-time change data with different table names to different StreamDdlHandler operators to process table structure change DDL sentences; the StreamDdlHandler operator is an operator for processing DDL sentences;
step S3, a writer commit mark is sent to a writer instance through a StreamDdlHandler operator, so that the writer instance commits the current table data corresponding to the table structure change DDL statement, and data updated in a table are cached in real time until a preset condition is met, and the table structure change DDL statement is executed;
s4, the data change DML statement is sent to a DynamicStreamWriter operator through a StreamDdlHandler operator; the DynamicStreamWriter operator is an operator for writing data to a distributed file system;
s5, writing the table data corresponding to the data change DML statement into a distributed file system through the dynamic stream writer operator, acquiring or creating a writer instance in a cache map according to the table name of each table data, completing the table data in the writer instance in each checkpoint, and sending a manifest to a compiler;
step S6, sending the manifest to the dynamicFilesCommitter operator through the dynamicStreamWriter operator; the dynamicfilesCommitter operator is an operator for submitting metadata;
step S7, caching the manifest through the DynamicFilesCommitter operator, starting a new checkpoint, and storing the manifest of all the cached tables in the checkpoint State; after the current checkpoint is completed, submitting the manifest of each current checkpoint to an Iceberg table;
wherein, the Source operator, the streamdlhandler operator, the dynamicstreamWriter operator, and the dynamicFilesCommitter operator are all operators in the Flink.
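By way of illustration only, the topology of such a job can be sketched as follows. This is a minimal sketch in Java under assumed names: CdcSource, ChangeEvent, TableManifest, the three custom operator classes and CheckCommitSink are hypothetical stand-ins for the operators named above, not a published API.

    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class WholeDatabaseIngestionJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000L); // manifests are committed once per checkpoint

            env.addSource(new CdcSource())                          // step S1: CDC change stream
               .keyBy(ChangeEvent::getTableName)                    // step S2: keyBy on table name
               .process(new StreamDdlHandler())                     // steps S2-S4: handle DDL, forward DML
               .keyBy(ChangeEvent::getHashKey)                      // per-table fan-out via hashKey (see embodiments)
               .transform("DynamicStreamWriter",                    // step S5: write files, emit manifests
                          TypeInformation.of(TableManifest.class), new DynamicStreamWriter())
               .keyBy(m -> m.getTableName().hashCode())             // step S6: one committer per table
               .transform("DynamicFilesCommitter",                  // step S7: commit manifests to Iceberg
                          TypeInformation.of(String.class), new DynamicFilesCommitter())
               .addSink(new CheckCommitSink()).setParallelism(1);   // commit-failure sink, parallelism one

            env.execute("flink-whole-database-lake-ingestion");
        }
    }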
Preferably, the method further comprises:
based on the Source operator in the Flink, interfacing with the Kafka of the external system with the CDC function, acquiring real-time change data in the Kafka of the external system.
Preferably, the method further comprises:
judging, through the StreamDdlHandler, whether there is a table-structure-change DDL statement that creates or deletes a table, or whether the table to which the table-structure-change DDL statement belongs has received a data-change DML statement;
if so, executing the table-structure-change DDL statement; if not, executing step S3;
judging, through the StreamDdlHandler, whether repeated table-structure-change DDL statements exist;
if so, discarding the repeated table-structure-change DDL statements; if not, performing step S3.
Preferably, the preset condition is:
monitoring whether a preset number of success markers has been generated by the writer instances in the distributed file system; or whether a preset cache duration has been exceeded; the cache duration being set according to the interval of checkpoint events.
Preferably, after executing the table structure change DDL statement, the method further comprises:
notifying the operator that handles the cache of the aggregated commit result.
Preferably, the method further comprises:
taking the data-change DML statement modulo the single-table parallelism to obtain a hashKey, hashing the data-change DML statement by the hashKey, and performing step S4 on the hashed data-change DML statements.
Preferably, there are a plurality of each of the Source operator, the StreamDdlHandler operator, the DynamicStreamWriter operator and the DynamicFilesCommitter operator, and each operator processes tasks in parallel;
the method further comprises the steps of:
establishing a plurality of task slots, and placing each DynamicStreamWriter operator in a different task slot to process different table data.
Preferably, the method further comprises:
hashing the table name of the Iceberg table;
performing step S6 according to the hash of the Iceberg table name.
Preferably, the method further comprises:
judging, through the DynamicFilesCommitter operator, whether the table name of the Iceberg table is received for the first time;
if so, initializing a cache map, and, when the current job is restarted from a snapshot, acquiring the maximum committed ckpID from the table properties of the Iceberg table;
if the maximum committed ckpID in the table properties of the Iceberg table is smaller than the maximum ckpID stored in the checkpoint state, committing the manifests corresponding to the uncommitted ckpIDs in the checkpoint state to the Iceberg table.
Preferably, the method further comprises:
if the execution of step S7 fails, sending the failed table to a checkCommitSink operator with a parallelism of one;
printing a log through the checkCommitSink operator, throwing an exception, ending the job and waiting for a restart.
The technical scheme provided by the application can have the following beneficial effects. The application discloses a Flink-based real-time whole-database lake ingestion method, comprising: connecting, through a Source operator in Flink, to a database supporting CDC and acquiring the real-time change data of the database, the real-time change data including data-change DML statements and table-structure-change DDL statements; performing keyBy on the real-time change data according to table name, and distributing real-time change data with different table names to different StreamDdlHandler operators (operators that process DDL statements) to process the table-structure-change DDL statements; sending a writer-commit marker to the writer instances through the StreamDdlHandler operator, so that each writer instance commits the current table data corresponding to the table-structure-change DDL statement, caching updated data in the table in real time until a preset condition is met, and then executing the table-structure-change DDL statement; sending the data-change DML statements to a DynamicStreamWriter operator (an operator that writes data to a distributed file system) through the StreamDdlHandler operator; writing the table data corresponding to the data-change DML statements into the distributed file system through the DynamicStreamWriter operator, acquiring or creating a writer instance in a cache map according to the table name of each piece of table data, completing the table data in the writer instances at each checkpoint, and sending a manifest to the committer; sending the manifest to a DynamicFilesCommitter operator (an operator that commits metadata) through the DynamicStreamWriter operator; caching the manifests through the DynamicFilesCommitter operator, storing the manifests of all currently cached tables in the checkpoint state when a new checkpoint starts, and committing the manifests of each table for the current checkpoint to the Iceberg tables after the current checkpoint is completed. The Source operator, the StreamDdlHandler operator, the DynamicStreamWriter operator and the DynamicFilesCommitter operator are all operators in Flink. On the basis of guaranteeing the performance of the Flink job, the technical scheme of the application realizes real-time synchronization of the entire source database into the Iceberg data lake.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic flow chart of a Flink-based real-time whole-database lake ingestion method according to an embodiment of the application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
Fig. 1 is a flow chart of a Flink-based real-time whole-database lake ingestion method according to an embodiment of the present application. Referring to Fig. 1, the Flink-based real-time whole-database lake ingestion method includes:
Step S1: connecting, through a Source operator, to a database supporting CDC, and acquiring the real-time change data of the database; the real-time change data includes data-change DML statements and table-structure-change DDL statements.
In this embodiment, the Source operator, the StreamDdlHandler operator, the DynamicStreamWriter operator and the DynamicFilesCommitter operator are all operators in Flink; there are a plurality of each of them, and each operator processes tasks in parallel.
The Source operator is used to connect to a database supporting CDC and acquire the real-time change data of that database; it can also interface with the Kafka of an external system having the CDC function and acquire the real-time change data in that Kafka.
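By way of illustration only, one way to obtain such a change stream for a whole MySQL database is the open-source Flink CDC connector; this is a sketch, the connection values are placeholders, and the patent does not mandate this particular connector:

    import com.ververica.cdc.connectors.mysql.source.MySqlSource;
    import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
            .hostname("localhost")                 // placeholder connection values
            .port(3306)
            .databaseList("appdb")                 // capture the whole database
            .tableList("appdb.*")                  // every table in the database
            .username("flink")
            .password("***")
            .includeSchemaChanges(true)            // emit DDL events alongside DML
            .deserializer(new JsonDebeziumDeserializationSchema())
            .build();

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "mysql-cdc-source");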
A data-change DML statement belongs to the data-level language of a table: operations on the data in the table, such as writing data, change the data.
A table-structure-change DDL statement belongs to the structure-level language of a table: it covers operations on the table structure, such as adding fields or modifying the table.
Step S2: performing keyBy on the real-time change data according to table name, and distributing real-time change data with different table names to different StreamDdlHandler operators to process the table-structure-change DDL statements; the StreamDdlHandler operator is an operator that processes DDL statements.
Performing keyBy on the real-time change data according to table name means hashing the table name, so that the real-time change data is distributed by table name to different StreamDdlHandler operators for processing.
In specific practice, different tables can also be processed in the same StreamDdlHandler, improving resource utilization through multiplexing.
Step S3: sending a writer-commit marker to the writer instances through the StreamDdlHandler operator, so that each writer instance commits the current table data corresponding to the table-structure-change DDL statement, and caching data updated in the table in real time until a preset condition is met, whereupon the table-structure-change DDL statement is executed.
Before step S3 is executed, the StreamDdlHandler judges whether there is a table-structure-change DDL statement that creates or deletes a table, or whether the table to which the table-structure-change DDL statement belongs has received a data-change DML statement;
if so, the table-structure-change DDL statement is executed; if not, step S3 is executed;
the StreamDdlHandler also judges whether repeated table-structure-change DDL statements exist;
if so, the repeated table-structure-change DDL statements are discarded; if not, step S3 is performed.
This logic ensures that each DDL statement is executed only once, preventing the repeated creation or deletion of tables.
It should be noted that the writer instances in step S3 are the pre-established relevant writer instances. In step S3, a writer-commit marker is sent to all relevant writer instances, so that each writer instance commits the current table data corresponding to the table-structure-change DDL statement.
In this step, the table-structure-change DDL statement is executed first so that the table structure is up to date before data is written, covering, at the column level, operations such as deleting, adding and modifying columns, and, at the table level, operations such as creating, deleting and modifying tables.
It should be noted that having a writer instance commit the current table data corresponding to the table-structure-change DDL statement means writing out the files that have not yet been written to the file system, and then synchronizing the metadata to the metadata center.
The preset condition in step S3 is as follows: a preset number of success markers has been generated by the writer instances in the distributed file system; or a preset cache duration has been exceeded; the cache duration is set according to the interval of checkpoint events.
In this embodiment, while the writer instances commit the current table data corresponding to the table-structure-change DDL statement, the number of success markers generated by the writer instances in the distributed file system is monitored, and updated data in the table is cached in real time at the same time.
The table-structure-change DDL statement is executed when the writer instances have generated a sufficient number of success markers, or when the pre-configured cache duration is exceeded.
It should be noted that the cache duration is set according to the interval of checkpoint events; preferably, the cache duration is three times the checkpoint interval.
It should be noted that, after a checkpoint event fires, the database writing process is triggered to write the dirty data blocks in the data buffer out to the data files.
It should be noted that, after executing the table-structure-change DDL statement, the method further includes:
notifying the operator that handles the cache of the aggregated commit result.
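By way of illustration only, the DDL-handling logic of steps S2 to S4 can be sketched as follows. This is a minimal sketch under assumed names: ChangeEvent and the helper methods are hypothetical, and the patent does not publish this class.

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    public class StreamDdlHandler extends KeyedProcessFunction<String, ChangeEvent, ChangeEvent> {
        private final Set<String> executedDdl = new HashSet<>(); // dedupe repeated DDL statements

        @Override
        public void processElement(ChangeEvent e, Context ctx, Collector<ChangeEvent> out) {
            if (!e.isDdl()) {
                out.collect(e);                      // step S4: forward DML downstream
                return;
            }
            if (!executedDdl.add(e.getStatement())) {
                return;                              // repeated DDL statement: discard
            }
            if (e.isCreateOrDropTable() || !hasReceivedDml(e.getTableName())) {
                executeDdl(e);                       // no in-flight data: execute immediately
            } else {
                // step S3: tell all relevant writer instances to commit their current data,
                // cache incoming DML, and execute the DDL once enough success markers appear
                // in the file system or the cache timeout (~3x checkpoint interval) expires.
                out.collect(ChangeEvent.writerCommitMarker(e.getTableName()));
            }
        }

        private boolean hasReceivedDml(String table) { return true; } // hypothetical helper
        private void executeDdl(ChangeEvent e) { /* apply the schema change to the table */ }
    }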
Step S4: the data-change DML statements are sent to the DynamicStreamWriter operator through the StreamDdlHandler operator; the DynamicStreamWriter operator is an operator that writes data to the distributed file system.
In this embodiment, the data-change DML statement is taken modulo the single-table parallelism to obtain a hashKey, the data-change DML statement is hashed by the hashKey, and step S4 is performed on the hashed data-change DML statements.
In this embodiment, the modulo of the single-table parallelism is taken over the primary key or other fields of the data-change DML statement to obtain the hashKey; the primary key uniquely identifies a single row of data. Taking the modulo of the single-table parallelism means that, on each commit, a single table writes a number of different files corresponding to its parallelism.
It should be noted that the hashKey is generated by taking the primary key or other fields modulo the single-table parallelism, so that a single table is written through several different operator instances, speeding up writing.
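A minimal sketch of this hashKey derivation follows; the ChangeEvent accessor names are assumptions:

    // Sketch: derive the hashKey for a DML event (assumed ChangeEvent accessors).
    static int hashKey(ChangeEvent event, int singleTableParallelism) {
        // Primary key hashed modulo the per-table writer parallelism: rows of one table
        // fan out over several writer subtasks, while rows sharing a primary key always
        // map to the same subtask, preserving per-key ordering.
        return Math.floorMod(event.getPrimaryKey().hashCode(), singleTableParallelism);
    }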
Step S5: the table data corresponding to the data-change DML statements are written into the distributed file system through the DynamicStreamWriter operator, a writer instance is acquired or created in a cache map according to the table name of each piece of table data, the table data in the writer instances are completed at each checkpoint, and a manifest is sent to the committer.
It should be noted that "complete" in this embodiment means writing all the data cached during each checkpoint period to the external storage system and returning the write statistics, including: how many files were written, the specific path of each file, and the size and type of each file. This information is used when updating the metadata of the Iceberg table.
It should be noted that the committer refers to the data submitter, that is, the DynamicFilesCommitter operator described below.
It should be noted that, in this embodiment, a plurality of task slots are established, and each DynamicStreamWriter operator is placed in a different task slot to process different table data, increasing the processing speed.
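An illustrative skeleton of such a writer operator is given below. It is a sketch under assumed names (ChangeEvent, TableManifest and TableWriter are hypothetical) that shows only the cache-map and checkpoint-flush behavior described above:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
    import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
    import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

    public class DynamicStreamWriter extends AbstractStreamOperator<TableManifest>
            implements OneInputStreamOperator<ChangeEvent, TableManifest> {

        private final Map<String, TableWriter> writers = new HashMap<>(); // cache map: table -> writer

        @Override
        public void processElement(StreamRecord<ChangeEvent> record) {
            ChangeEvent e = record.getValue();
            // acquire or create the writer instance for this table and append the row
            writers.computeIfAbsent(e.getTableName(), TableWriter::new).write(e);
        }

        @Override
        public void prepareSnapshotPreBarrier(long checkpointId) {
            // at every checkpoint, "complete" each writer: flush its data files to the
            // distributed file system and emit the resulting manifest to the committer
            for (TableWriter w : writers.values()) {
                output.collect(new StreamRecord<>(w.complete(checkpointId)));
            }
        }
    }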
Step S6: the manifest is sent to the DynamicFilesCommitter operator through the DynamicStreamWriter operator; the DynamicFilesCommitter operator is an operator that commits metadata.
It should be noted that the method further includes: hashing the table name of the Iceberg table;
step S6 is performed according to the hash of the Iceberg table name.
It should be noted that the Iceberg table is a table in the Iceberg data lake.
It should be noted that the DynamicStreamWriter operator sends the manifest to the DynamicFilesCommitter operator according to the Iceberg table name hash; a given table can only be processed by one DynamicFilesCommitter operator, otherwise a lock-contention problem occurs.
It should be noted that the manifest in this embodiment is the file metadata obtained by flushing the current data to the underlying storage.
Step S7: the manifests are cached through the DynamicFilesCommitter operator; when a new checkpoint starts, the manifests of all currently cached tables are stored in the checkpoint state; after the current checkpoint is completed, the manifests of each table for the current checkpoint are committed to the Iceberg tables.
After receiving the upstream data (the Iceberg table name hash and the processing result of the upstream DynamicStreamWriter operator), the DynamicFilesCommitter caches the manifest.
It should be noted that, when a new checkpoint snapshot starts, a thread pool can be used to write the manifest files of each table, improving write performance when the number of tables is large.
It should be noted that, before the manifests of all currently cached tables are stored in the checkpoint state, the checkpoint state needs to be cleared.
After all operators have completed the current checkpoint, the manifests of each table for the current checkpoint are committed to the Iceberg tables using a thread pool.
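By way of illustration, this committer behavior can be sketched as follows. Assumed names again: TableManifest and commitToIceberg are hypothetical; the structure mirrors the cache / snapshot / commit-on-checkpoint-complete cycle described in this step.

    import java.util.*;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.flink.runtime.state.StateSnapshotContext;
    import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
    import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
    import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

    public class DynamicFilesCommitter extends AbstractStreamOperator<String>
            implements OneInputStreamOperator<TableManifest, String> {

        // manifests received since the last checkpoint, grouped by table
        private final Map<String, List<TableManifest>> cache = new HashMap<>();
        // checkpoint id -> manifests snapshotted for that checkpoint (held in checkpointed state)
        private final NavigableMap<Long, Map<String, List<TableManifest>>> checkpointState = new TreeMap<>();
        private transient ExecutorService commitPool;

        @Override
        public void open() {
            commitPool = Executors.newFixedThreadPool(8); // parallel commits across tables
        }

        @Override
        public void processElement(StreamRecord<TableManifest> record) {
            TableManifest m = record.getValue();
            cache.computeIfAbsent(m.getTableName(), k -> new ArrayList<>()).add(m);
        }

        @Override
        public void snapshotState(StateSnapshotContext context) throws Exception {
            super.snapshotState(context);
            // store this checkpoint's manifests for all cached tables, then reset the cache
            checkpointState.put(context.getCheckpointId(), new HashMap<>(cache));
            cache.clear();
        }

        @Override
        public void notifyCheckpointComplete(long checkpointId) {
            // once the checkpoint has completed everywhere, commit each table's manifests
            Map<String, List<TableManifest>> manifests = checkpointState.get(checkpointId);
            if (manifests != null) {
                manifests.forEach((table, list) ->
                        commitPool.submit(() -> commitToIceberg(table, list, checkpointId)));
            }
        }

        private void commitToIceberg(String table, List<TableManifest> list, long ckpId) {
            // append the data files to the Iceberg table and record ckpId in its properties;
            // on failure, route the failed table name downstream to the checkCommitSink
        }
    }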
It should be noted that, in this embodiment, there is also a fault-tolerance mechanism for commit failures, including:
judging, through the DynamicFilesCommitter operator, whether the table name of the Iceberg table is received for the first time;
if so, initializing a cache map, and, when the current job is restarted from a snapshot, acquiring the maximum committed ckpID from the table properties of the Iceberg table;
if the maximum committed ckpID in the table properties of the Iceberg table is smaller than the maximum ckpID stored in the checkpoint state, committing the manifests corresponding to the uncommitted ckpIDs in the checkpoint state to the Iceberg table.
It should be noted that the snapshot in this step may be a checkpoint or a savepoint.
It should be noted that the ckpID refers to the ID used when the Flink system performs periodic checkpoints to save the state of the data flowing in real time; it can be used to automatically resume a job from a saved state, rather than from the initial state, when the job fails. Each checkpoint is marked by a unique, monotonically increasing ID; when an operator in the dataflow graph receives a checkpoint message, it can obtain the ID of the current checkpoint from that message, which facilitates logical decisions in the operators.
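By way of illustration, the restart check can be sketched as follows. This is a sketch only: the table property key "flink.max-committed-checkpoint-id", the checkpointState map and commitToIceberg are assumptions carried over from the sketches above, not details published by the patent.

    // On the first manifest seen for a table after a snapshot restart, compare the max
    // committed checkpoint id recorded in the Iceberg table's properties with the max
    // ckpID restored into checkpointState, and replay whatever was never committed.
    long committedInTable = Long.parseLong(
            icebergTable.properties().getOrDefault("flink.max-committed-checkpoint-id", "-1"));
    long restoredInState = checkpointState.isEmpty() ? -1L : checkpointState.lastKey();
    if (committedInTable < restoredInState) {
        checkpointState.tailMap(committedInTable, /* inclusive = */ false)
                .forEach((ckpId, byTable) -> commitToIceberg(tableName, byTable.get(tableName), ckpId));
    }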
It should be noted that the method further includes:
if the execution of step S7 fails, sending the failed table to a checkCommitSink operator with a parallelism of one;
printing a log through the checkCommitSink operator, throwing an exception, ending the job and waiting for a restart.
It should be noted that, if the execution of step S7 succeeds, the current whole-database lake ingestion procedure ends successfully.
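An illustrative sketch of such a failure sink follows (assumed names; the real operator may carry more context than the table name):

    import org.apache.flink.streaming.api.functions.sink.SinkFunction;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class CheckCommitSink implements SinkFunction<String> {
        private static final Logger LOG = LoggerFactory.getLogger(CheckCommitSink.class);

        @Override
        public void invoke(String failedTable, Context context) {
            // print the failure log, then throw so the whole job fails and is restarted
            LOG.error("Commit to Iceberg failed for table {}", failedTable);
            throw new RuntimeException("Iceberg commit failed for table: " + failedTable);
        }
    }
    // wired with parallelism one: failures.addSink(new CheckCommitSink()).setParallelism(1);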
It is to be understood that the same or similar parts in the above embodiments may be referred to each other, and that in some embodiments, the same or similar parts in other embodiments may be referred to.
It should be noted that in the description of the present application, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present application, unless otherwise indicated, the meaning of "plurality" means at least two.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (10)

1. A Flink-based real-time whole-database lake ingestion method, characterized by comprising the following steps:
step S1: connecting, through a Source operator, to a database supporting CDC, and acquiring the real-time change data of the database; the real-time change data includes data-change DML statements and table-structure-change DDL statements;
step S2: performing keyBy on the real-time change data according to table name, and distributing real-time change data with different table names to different StreamDdlHandler operators to process the table-structure-change DDL statements; the StreamDdlHandler operator is an operator that processes DDL statements;
step S3: sending a writer-commit marker to the writer instances through the StreamDdlHandler operator, so that each writer instance commits the current table data corresponding to the table-structure-change DDL statement, and caching data updated in the table in real time until a preset condition is met, whereupon the table-structure-change DDL statement is executed;
step S4: sending the data-change DML statements to a DynamicStreamWriter operator through the StreamDdlHandler operator; the DynamicStreamWriter operator is an operator that writes data to a distributed file system;
step S5: writing the table data corresponding to the data-change DML statements into the distributed file system through the DynamicStreamWriter operator, acquiring or creating a writer instance in a cache map according to the table name of each piece of table data, completing the table data in the writer instances at each checkpoint, and sending a manifest to the committer;
step S6: sending the manifest to a DynamicFilesCommitter operator through the DynamicStreamWriter operator; the DynamicFilesCommitter operator is an operator that commits metadata;
step S7: caching the manifests through the DynamicFilesCommitter operator; when a new checkpoint starts, storing the manifests of all currently cached tables in the checkpoint state; and, after the current checkpoint is completed, committing the manifests of each table for the current checkpoint to the corresponding Iceberg table;
wherein the Source operator, the StreamDdlHandler operator, the DynamicStreamWriter operator and the DynamicFilesCommitter operator are all operators in Flink.
2. The method according to claim 1, wherein the method further comprises:
based on the Source operator in the Flink, interfacing with the Kafka of the external system with the CDC function, acquiring real-time change data in the Kafka of the external system.
3. The method according to claim 1, wherein the method further comprises:
judging, through the StreamDdlHandler, whether there is a table-structure-change DDL statement that creates or deletes a table, or whether the table to which the table-structure-change DDL statement belongs has received a data-change DML statement;
if so, executing the table-structure-change DDL statement; if not, executing step S3;
judging, through the StreamDdlHandler, whether repeated table-structure-change DDL statements exist;
if so, discarding the repeated table-structure-change DDL statements; if not, performing step S3.
4. The method according to claim 1, wherein the preset conditions are:
monitoring whether a preset number of success markers has been generated by the writer instances in the distributed file system; or whether a preset cache duration has been exceeded; the cache duration being set according to the interval of checkpoint events.
5. The method of claim 1, wherein after executing the table structure change DDL statement, the method further comprises:
notifying the operator that handles the cache of the aggregated commit result.
6. The method according to claim 1, wherein the method further comprises:
taking the data-change DML statement modulo the single-table parallelism to obtain a hashKey, hashing the data-change DML statement by the hashKey, and performing step S4 on the hashed data-change DML statements.
7. The method of claim 1, wherein there are a plurality of each of the Source operator, the StreamDdlHandler operator, the DynamicStreamWriter operator and the DynamicFilesCommitter operator, each operator processing tasks in parallel;
the method further comprising:
establishing a plurality of task slots, and placing each DynamicStreamWriter operator in a different task slot to process different table data.
8. The method according to claim 1, wherein the method further comprises:
hashing the table name of the Iceberg table;
performing step S6 according to the hash of the Iceberg table name.
9. The method of claim 8, wherein the method further comprises:
judging, through the DynamicFilesCommitter operator, whether the table name of the Iceberg table is received for the first time;
if so, initializing a cache map, and, when the current job is restarted from a snapshot, acquiring the maximum committed ckpID from the table properties of the Iceberg table;
if the maximum committed ckpID in the table properties of the Iceberg table is smaller than the maximum ckpID stored in the checkpoint state, committing the manifests corresponding to the uncommitted ckpIDs in the checkpoint state to the Iceberg table.
10. The method according to claim 1, wherein the method further comprises:
if the execution of step S7 fails, sending the failed table to a checkCommitSink operator with a parallelism of one;
printing a log through the checkCommitSink operator, throwing an exception, ending the job and waiting for a restart.
CN202311133058.9A 2023-09-05 2023-09-05 Flink-based real-time whole-database lake ingestion method Active CN116881261B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311133058.9A CN116881261B (en) Flink-based real-time whole-database lake ingestion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311133058.9A CN116881261B (en) Flink-based real-time whole-database lake ingestion method

Publications (2)

Publication Number Publication Date
CN116881261A CN116881261A (en) 2023-10-13
CN116881261B true CN116881261B (en) 2023-11-07

Family

ID=88262433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311133058.9A Active CN116881261B (en) Flink-based real-time whole-database lake ingestion method

Country Status (1)

Country Link
CN (1) CN116881261B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114860654A (en) * 2022-04-20 2022-08-05 Beijing Deepexi Technology Co., Ltd. Method and system for dynamically changing Iceberg table Schema based on Flink data stream
CN115098505A (en) * 2022-06-28 2022-09-23 Ping An Bank Co., Ltd. Method and device for changing table structure of database and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10585873B2 (en) * 2017-05-08 2020-03-10 Sap Se Atomic processing of compound database transactions that modify a metadata entity

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114860654A (en) * 2022-04-20 2022-08-05 Beijing Deepexi Technology Co., Ltd. Method and system for dynamically changing Iceberg table Schema based on Flink data stream
CN115098505A (en) * 2022-06-28 2022-09-23 Ping An Bank Co., Ltd. Method and device for changing table structure of database and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Flink-based Network Content Analysis System; Zhao Jiaxin et al.; Computer Era, No. 12, pp. 24-17 *

Also Published As

Publication number Publication date
CN116881261A (en) 2023-10-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant