CN116881261B - Flink-based real-time whole-database lake ingestion method - Google Patents

Flink-based real-time whole-database lake ingestion method

Info

Publication number
CN116881261B
Authority
CN
China
Prior art keywords
operator
data
change
real
statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311133058.9A
Other languages
Chinese (zh)
Other versions
CN116881261A (en)
Inventor
张赵中
唐金鑫
吴小前
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Deepexi Technology Co Ltd
Original Assignee
Beijing Deepexi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Deepexi Technology Co Ltd filed Critical Beijing Deepexi Technology Co Ltd
Priority to CN202311133058.9A
Publication of CN116881261A
Application granted
Publication of CN116881261B
Legal status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 — Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 — Indexing; Data structures therefor; Storage structures
    • G06F16/2282 — Tablespace storage structures; Management thereof
    • G06F16/27 — Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/28 — Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 — Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to the technical field of data processing, and in particular to a Flink-based real-time whole-database lake ingestion method. The method comprises: connecting, through a set of cooperating operators in Flink, to a database supporting CDC; acquiring the real-time change data of the database; and, using a distributed file system as an intermediate store, synchronizing the data-change DML statements and table-structure-change DDL statements in the real-time change data into Iceberg tables at each checkpoint. On the basis of guaranteeing the performance of the Flink job, the technical scheme of the application realizes real-time synchronization of the entire source database into the Iceberg data lake.

Description

Flink-based real-time whole-database lake ingestion method
Technical Field
The application relates to the technical field of data processing, and in particular to a Flink-based real-time whole-database lake ingestion method.
Background
Apache Flink is currently the most popular big-data real-time computing framework. Flink is a distributed, open-source computing framework oriented to both data stream processing and batch data processing; it supports the two application types, stream processing and batch processing, on the same Flink streaming execution model. It also has many connector sub-projects covering most mainstream databases (e.g. MySQL, Oracle) and distributed data stream platforms (e.g. Kafka).
A data lake is a centralized repository that allows users to store all structured and unstructured data at any scale. Users can store data as-is (without first structuring it) and run different types of analysis on it, from dashboards and visualization to big-data processing, real-time analytics and machine learning, to guide better decisions.
Apache Iceberg is an advanced data lake table-format storage technology. An Iceberg table is an open table format designed for huge, petabyte-scale tables. The job of the table format is to determine how the user manages, organizes and tracks all the files that make up the table. It forms an abstraction layer between the physical data files (written in formats such as Parquet or ORC) and the way they are organized into tables.
CDC (Change Data Capture) is a technology for capturing data and data-structure changes from a source database.
In the prior art, open-source technology based on the Flink framework collects the tables of a MySQL database into an Iceberg data lake through CDC, but the structure of each table is determined at the start, and that structure is fixed when the table is written into the Iceberg data lake. In practice, however, the structure of the tables in a database often changes, and the data itself is subject to changes such as deletion, yet these changes cannot be synchronized to the Iceberg data lake.
Disclosure of Invention
The application provides a Flink-based real-time whole-database lake ingestion method, which aims to solve, at least to a certain extent, the problem in the related art that, after the tables and data of a database have been collected into a data lake, subsequent structure changes and data changes of those tables cannot be synchronized into the data lake.
The scheme of the application is as follows:
a method for real-time whole-reservoir lake entering based on Flink comprises the following steps:
step S1, a database supporting CDC is docked through a Source operator, and real-time change data of the database are obtained; the real-time change data includes: data change DML (Data Manipulation Language ) statements and table structure change DDL (Data Definition Language ) statements;
s2, performing keyBy on the real-time change data according to table names, and distributing the real-time change data with different table names to different StreamDdlHandler operators to process table structure change DDL sentences; the StreamDdlHandler operator is an operator for processing DDL sentences;
step S3, a writer commit mark is sent to a writer instance through a StreamDdlHandler operator, so that the writer instance commits the current table data corresponding to the table structure change DDL statement, and data updated in a table are cached in real time until a preset condition is met, and the table structure change DDL statement is executed;
s4, the data change DML statement is sent to a DynamicStreamWriter operator through a StreamDdlHandler operator; the DynamicStreamWriter operator is an operator for writing data to a distributed file system;
s5, writing the table data corresponding to the data change DML statement into a distributed file system through the dynamic stream writer operator, acquiring or creating a writer instance in a cache map according to the table name of each table data, completing the table data in the writer instance in each checkpoint, and sending a manifest to a compiler;
step S6, sending the manifest to the dynamicFilesCommitter operator through the dynamicStreamWriter operator; the dynamicfilesCommitter operator is an operator for submitting metadata;
step S7, caching the manifest through the DynamicFilesCommitter operator, starting a new checkpoint, and storing the manifest of all the cached tables in the checkpoint State; after the current checkpoint is completed, submitting the manifest of each current checkpoint to an Iceberg table;
wherein, the Source operator, the streamdlhandler operator, the dynamicstreamWriter operator, and the dynamicFilesCommitter operator are all operators in the Flink.
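By way of illustration only, the topology of such a job can be sketched as follows. This is a minimal sketch in Java under assumed names: CdcSource, ChangeEvent, TableManifest, the three custom operator classes and CheckCommitSink are hypothetical stand-ins for the operators named above, not a published API.

    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class WholeDatabaseIngestionJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000L); // manifests are committed once per checkpoint

            env.addSource(new CdcSource())                          // step S1: CDC change stream
               .keyBy(ChangeEvent::getTableName)                    // step S2: keyBy on table name
               .process(new StreamDdlHandler())                     // steps S2-S4: handle DDL, forward DML
               .keyBy(ChangeEvent::getHashKey)                      // per-table fan-out via hashKey (see embodiments)
               .transform("DynamicStreamWriter",                    // step S5: write files, emit manifests
                          TypeInformation.of(TableManifest.class), new DynamicStreamWriter())
               .keyBy(m -> m.getTableName().hashCode())             // step S6: one committer per table
               .transform("DynamicFilesCommitter",                  // step S7: commit manifests to Iceberg
                          TypeInformation.of(String.class), new DynamicFilesCommitter())
               .addSink(new CheckCommitSink()).setParallelism(1);   // commit-failure sink, parallelism one

            env.execute("flink-whole-database-lake-ingestion");
        }
    }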
Preferably, the method further comprises:
based on the Source operator in the Flink, interfacing with the Kafka of the external system with the CDC function, acquiring real-time change data in the Kafka of the external system.
Preferably, the method further comprises:
judging, through the StreamDdlHandler, whether there is a table-structure-change DDL statement that creates or deletes a table, or whether the table to which the table-structure-change DDL statement belongs has received a data-change DML statement;
if so, executing the table-structure-change DDL statement; if not, executing step S3;
judging, through the StreamDdlHandler, whether repeated table-structure-change DDL statements exist;
if so, discarding the repeated table-structure-change DDL statements; if not, performing step S3.
Preferably, the preset condition is:
monitoring whether a preset number of success markers has been generated by the writer instances in the distributed file system; or whether a preset cache duration has been exceeded; the cache duration being set according to the interval of checkpoint events.
Preferably, after executing the table structure change DDL statement, the method further comprises:
notifying the operator that handles the cache of the aggregated commit result.
Preferably, the method further comprises:
taking the data-change DML statement modulo the single-table parallelism to obtain a hashKey, hashing the data-change DML statement by the hashKey, and performing step S4 on the hashed data-change DML statements.
Preferably, there are a plurality of each of the Source operator, the StreamDdlHandler operator, the DynamicStreamWriter operator and the DynamicFilesCommitter operator, and each operator processes tasks in parallel;
the method further comprises the steps of:
establishing a plurality of task slots, and placing each DynamicStreamWriter operator in a different task slot to process different table data.
Preferably, the method further comprises:
hashing the table name of the Iceberg table;
performing step S6 according to the hash of the Iceberg table name.
Preferably, the method further comprises:
judging, through the DynamicFilesCommitter operator, whether the table name of the Iceberg table is received for the first time;
if so, initializing a cache map, and, when the current job is restarted from a snapshot, acquiring the maximum committed ckpID from the table properties of the Iceberg table;
if the maximum committed ckpID in the table properties of the Iceberg table is smaller than the maximum ckpID stored in the checkpoint state, committing the manifests corresponding to the uncommitted ckpIDs in the checkpoint state to the Iceberg table.
Preferably, the method further comprises:
if the execution of step S7 fails, sending the failed table to a checkCommitSink operator with a parallelism of one;
printing a log through the checkCommitSink operator, throwing an exception, ending the job and waiting for a restart.
The technical scheme provided by the application can have the following beneficial effects. The application discloses a Flink-based real-time whole-database lake ingestion method, comprising: connecting, through a Source operator in Flink, to a database supporting CDC and acquiring the real-time change data of the database, the real-time change data including data-change DML statements and table-structure-change DDL statements; performing keyBy on the real-time change data according to table name, and distributing real-time change data with different table names to different StreamDdlHandler operators (operators that process DDL statements) to process the table-structure-change DDL statements; sending a writer-commit marker to the writer instances through the StreamDdlHandler operator, so that each writer instance commits the current table data corresponding to the table-structure-change DDL statement, caching updated data in the table in real time until a preset condition is met, and then executing the table-structure-change DDL statement; sending the data-change DML statements to a DynamicStreamWriter operator (an operator that writes data to a distributed file system) through the StreamDdlHandler operator; writing the table data corresponding to the data-change DML statements into the distributed file system through the DynamicStreamWriter operator, acquiring or creating a writer instance in a cache map according to the table name of each piece of table data, completing the table data in the writer instances at each checkpoint, and sending a manifest to the committer; sending the manifest to a DynamicFilesCommitter operator (an operator that commits metadata) through the DynamicStreamWriter operator; caching the manifests through the DynamicFilesCommitter operator, storing the manifests of all currently cached tables in the checkpoint state when a new checkpoint starts, and committing the manifests of each table for the current checkpoint to the Iceberg tables after the current checkpoint is completed. The Source operator, the StreamDdlHandler operator, the DynamicStreamWriter operator and the DynamicFilesCommitter operator are all operators in Flink. On the basis of guaranteeing the performance of the Flink job, the technical scheme of the application realizes real-time synchronization of the entire source database into the Iceberg data lake.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic flow chart of a Flink-based real-time whole-database lake ingestion method according to an embodiment of the application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
Fig. 1 is a flow chart of a Flink-based real-time whole-database lake ingestion method according to an embodiment of the present application. Referring to Fig. 1, the Flink-based real-time whole-database lake ingestion method includes:
Step S1: connecting, through a Source operator, to a database supporting CDC, and acquiring the real-time change data of the database; the real-time change data includes data-change DML statements and table-structure-change DDL statements.
In this embodiment, the Source operator, the StreamDdlHandler operator, the DynamicStreamWriter operator and the DynamicFilesCommitter operator are all operators in Flink; there are a plurality of each of them, and each operator processes tasks in parallel.
The Source operator is used to connect to a database supporting CDC and acquire the real-time change data of that database; it can also interface with the Kafka of an external system having the CDC function and acquire the real-time change data in that Kafka.
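By way of illustration only, one way to obtain such a change stream for a whole MySQL database is the open-source Flink CDC connector; this is a sketch, the connection values are placeholders, and the patent does not mandate this particular connector:

    import com.ververica.cdc.connectors.mysql.source.MySqlSource;
    import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
            .hostname("localhost")                 // placeholder connection values
            .port(3306)
            .databaseList("appdb")                 // capture the whole database
            .tableList("appdb.*")                  // every table in the database
            .username("flink")
            .password("***")
            .includeSchemaChanges(true)            // emit DDL events alongside DML
            .deserializer(new JsonDebeziumDeserializationSchema())
            .build();

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "mysql-cdc-source");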
A data-change DML statement belongs to the data-level language of a table: operations on the data in the table, such as writing data, change the data.
A table-structure-change DDL statement belongs to the structure-level language of a table: it covers operations on the table structure, such as adding fields or modifying the table.
Step S2: performing keyBy on the real-time change data according to table name, and distributing real-time change data with different table names to different StreamDdlHandler operators to process the table-structure-change DDL statements; the StreamDdlHandler operator is an operator that processes DDL statements.
Performing keyBy on the real-time change data according to table name means hashing the table name, so that the real-time change data is distributed by table name to different StreamDdlHandler operators for processing.
In specific practice, different tables can also be processed in the same StreamDdlHandler, improving resource utilization through multiplexing.
Step S3: sending a writer-commit marker to the writer instances through the StreamDdlHandler operator, so that each writer instance commits the current table data corresponding to the table-structure-change DDL statement, and caching data updated in the table in real time until a preset condition is met, whereupon the table-structure-change DDL statement is executed.
Before step S3 is executed, the StreamDdlHandler judges whether there is a table-structure-change DDL statement that creates or deletes a table, or whether the table to which the table-structure-change DDL statement belongs has received a data-change DML statement;
if so, the table-structure-change DDL statement is executed; if not, step S3 is executed;
the StreamDdlHandler also judges whether repeated table-structure-change DDL statements exist;
if so, the repeated table-structure-change DDL statements are discarded; if not, step S3 is performed.
This logic ensures that each DDL statement is executed only once, preventing the repeated creation or deletion of tables.
It should be noted that the writer instances in step S3 are the pre-established relevant writer instances. In step S3, a writer-commit marker is sent to all relevant writer instances, so that each writer instance commits the current table data corresponding to the table-structure-change DDL statement.
In this step, the table-structure-change DDL statement is executed first so that the table structure is up to date before data is written, covering, at the column level, operations such as deleting, adding and modifying columns, and, at the table level, operations such as creating, deleting and modifying tables.
It should be noted that having a writer instance commit the current table data corresponding to the table-structure-change DDL statement means writing out the files that have not yet been written to the file system, and then synchronizing the metadata to the metadata center.
The preset condition in step S3 is as follows: a preset number of success markers has been generated by the writer instances in the distributed file system; or a preset cache duration has been exceeded; the cache duration is set according to the interval of checkpoint events.
In this embodiment, while the writer instances commit the current table data corresponding to the table-structure-change DDL statement, the number of success markers generated by the writer instances in the distributed file system is monitored, and updated data in the table is cached in real time at the same time.
The table-structure-change DDL statement is executed when the writer instances have generated a sufficient number of success markers, or when the pre-configured cache duration is exceeded.
It should be noted that the cache duration is set according to the interval of checkpoint events; preferably, the cache duration is three times the checkpoint interval.
It should be noted that, after a checkpoint event fires, the database writing process is triggered to write the dirty data blocks in the data buffer out to the data files.
It should be noted that, after executing the table-structure-change DDL statement, the method further includes:
notifying the operator that handles the cache of the aggregated commit result.
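By way of illustration only, the DDL-handling logic of steps S2 to S4 can be sketched as follows. This is a minimal sketch under assumed names: ChangeEvent and the helper methods are hypothetical, and the patent does not publish this class.

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    public class StreamDdlHandler extends KeyedProcessFunction<String, ChangeEvent, ChangeEvent> {
        private final Set<String> executedDdl = new HashSet<>(); // dedupe repeated DDL statements

        @Override
        public void processElement(ChangeEvent e, Context ctx, Collector<ChangeEvent> out) {
            if (!e.isDdl()) {
                out.collect(e);                      // step S4: forward DML downstream
                return;
            }
            if (!executedDdl.add(e.getStatement())) {
                return;                              // repeated DDL statement: discard
            }
            if (e.isCreateOrDropTable() || !hasReceivedDml(e.getTableName())) {
                executeDdl(e);                       // no in-flight data: execute immediately
            } else {
                // step S3: tell all relevant writer instances to commit their current data,
                // cache incoming DML, and execute the DDL once enough success markers appear
                // in the file system or the cache timeout (~3x checkpoint interval) expires.
                out.collect(ChangeEvent.writerCommitMarker(e.getTableName()));
            }
        }

        private boolean hasReceivedDml(String table) { return true; } // hypothetical helper
        private void executeDdl(ChangeEvent e) { /* apply the schema change to the table */ }
    }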
Step S4: the data-change DML statements are sent to the DynamicStreamWriter operator through the StreamDdlHandler operator; the DynamicStreamWriter operator is an operator that writes data to the distributed file system.
In this embodiment, the data-change DML statement is taken modulo the single-table parallelism to obtain a hashKey, the data-change DML statement is hashed by the hashKey, and step S4 is performed on the hashed data-change DML statements.
In this embodiment, the modulo of the single-table parallelism is taken over the primary key or other fields of the data-change DML statement to obtain the hashKey; the primary key uniquely identifies a single row of data. Taking the modulo of the single-table parallelism means that, on each commit, a single table writes a number of different files corresponding to its parallelism.
It should be noted that the hashKey is generated by taking the primary key or other fields modulo the single-table parallelism, so that a single table is written through several different operator instances, speeding up writing.
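A minimal sketch of this hashKey derivation follows; the ChangeEvent accessor names are assumptions:

    // Sketch: derive the hashKey for a DML event (assumed ChangeEvent accessors).
    static int hashKey(ChangeEvent event, int singleTableParallelism) {
        // Primary key hashed modulo the per-table writer parallelism: rows of one table
        // fan out over several writer subtasks, while rows sharing a primary key always
        // map to the same subtask, preserving per-key ordering.
        return Math.floorMod(event.getPrimaryKey().hashCode(), singleTableParallelism);
    }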
Step S5: the table data corresponding to the data-change DML statements are written into the distributed file system through the DynamicStreamWriter operator, a writer instance is acquired or created in a cache map according to the table name of each piece of table data, the table data in the writer instances are completed at each checkpoint, and a manifest is sent to the committer.
It should be noted that "complete" in this embodiment means writing all the data cached during each checkpoint period to the external storage system and returning the write statistics, including: how many files were written, the specific path of each file, and the size and type of each file. This information is used when updating the metadata of the Iceberg table.
It should be noted that the committer refers to the data submitter, that is, the DynamicFilesCommitter operator described below.
It should be noted that, in this embodiment, a plurality of task slots are established, and each DynamicStreamWriter operator is placed in a different task slot to process different table data, increasing the processing speed.
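An illustrative skeleton of such a writer operator is given below. It is a sketch under assumed names (ChangeEvent, TableManifest and TableWriter are hypothetical) that shows only the cache-map and checkpoint-flush behavior described above:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
    import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
    import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

    public class DynamicStreamWriter extends AbstractStreamOperator<TableManifest>
            implements OneInputStreamOperator<ChangeEvent, TableManifest> {

        private final Map<String, TableWriter> writers = new HashMap<>(); // cache map: table -> writer

        @Override
        public void processElement(StreamRecord<ChangeEvent> record) {
            ChangeEvent e = record.getValue();
            // acquire or create the writer instance for this table and append the row
            writers.computeIfAbsent(e.getTableName(), TableWriter::new).write(e);
        }

        @Override
        public void prepareSnapshotPreBarrier(long checkpointId) {
            // at every checkpoint, "complete" each writer: flush its data files to the
            // distributed file system and emit the resulting manifest to the committer
            for (TableWriter w : writers.values()) {
                output.collect(new StreamRecord<>(w.complete(checkpointId)));
            }
        }
    }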
Step S6: the manifest is sent to the DynamicFilesCommitter operator through the DynamicStreamWriter operator; the DynamicFilesCommitter operator is an operator that commits metadata.
It should be noted that the method further includes: hashing the table name of the Iceberg table;
step S6 is performed according to the hash of the Iceberg table name.
It should be noted that the Iceberg table is a table in the Iceberg data lake.
It should be noted that the DynamicStreamWriter operator sends the manifest to the DynamicFilesCommitter operator according to the Iceberg table name hash; a given table can only be processed by one DynamicFilesCommitter operator, otherwise a lock-contention problem occurs.
It should be noted that the manifest in this embodiment is the file metadata obtained by flushing the current data to the underlying storage.
Step S7: the manifests are cached through the DynamicFilesCommitter operator; when a new checkpoint starts, the manifests of all currently cached tables are stored in the checkpoint state; after the current checkpoint is completed, the manifests of each table for the current checkpoint are committed to the Iceberg tables.
After receiving the upstream data (the Iceberg table name hash and the processing result of the upstream DynamicStreamWriter operator), the DynamicFilesCommitter caches the manifest.
It should be noted that, when a new checkpoint snapshot starts, a thread pool can be used to write the manifest files of each table, improving write performance when the number of tables is large.
It should be noted that, before the manifests of all currently cached tables are stored in the checkpoint state, the checkpoint state needs to be cleared.
After all operators have completed the current checkpoint, the manifests of each table for the current checkpoint are committed to the Iceberg tables using a thread pool.
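By way of illustration, this committer behavior can be sketched as follows. Assumed names again: TableManifest and commitToIceberg are hypothetical; the structure mirrors the cache / snapshot / commit-on-checkpoint-complete cycle described in this step.

    import java.util.*;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.flink.runtime.state.StateSnapshotContext;
    import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
    import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
    import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

    public class DynamicFilesCommitter extends AbstractStreamOperator<String>
            implements OneInputStreamOperator<TableManifest, String> {

        // manifests received since the last checkpoint, grouped by table
        private final Map<String, List<TableManifest>> cache = new HashMap<>();
        // checkpoint id -> manifests snapshotted for that checkpoint (held in checkpointed state)
        private final NavigableMap<Long, Map<String, List<TableManifest>>> checkpointState = new TreeMap<>();
        private transient ExecutorService commitPool;

        @Override
        public void open() {
            commitPool = Executors.newFixedThreadPool(8); // parallel commits across tables
        }

        @Override
        public void processElement(StreamRecord<TableManifest> record) {
            TableManifest m = record.getValue();
            cache.computeIfAbsent(m.getTableName(), k -> new ArrayList<>()).add(m);
        }

        @Override
        public void snapshotState(StateSnapshotContext context) throws Exception {
            super.snapshotState(context);
            // store this checkpoint's manifests for all cached tables, then reset the cache
            checkpointState.put(context.getCheckpointId(), new HashMap<>(cache));
            cache.clear();
        }

        @Override
        public void notifyCheckpointComplete(long checkpointId) {
            // once the checkpoint has completed everywhere, commit each table's manifests
            Map<String, List<TableManifest>> manifests = checkpointState.get(checkpointId);
            if (manifests != null) {
                manifests.forEach((table, list) ->
                        commitPool.submit(() -> commitToIceberg(table, list, checkpointId)));
            }
        }

        private void commitToIceberg(String table, List<TableManifest> list, long ckpId) {
            // append the data files to the Iceberg table and record ckpId in its properties;
            // on failure, route the failed table name downstream to the checkCommitSink
        }
    }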
It should be noted that, in this embodiment, there is also a fault-tolerance mechanism for commit failures, including:
judging, through the DynamicFilesCommitter operator, whether the table name of the Iceberg table is received for the first time;
if so, initializing a cache map, and, when the current job is restarted from a snapshot, acquiring the maximum committed ckpID from the table properties of the Iceberg table;
if the maximum committed ckpID in the table properties of the Iceberg table is smaller than the maximum ckpID stored in the checkpoint state, committing the manifests corresponding to the uncommitted ckpIDs in the checkpoint state to the Iceberg table.
It should be noted that the snapshot in this step may be a checkpoint or a savepoint.
It should be noted that the ckpID refers to the ID used when the Flink system performs periodic checkpoints to save the state of the data flowing in real time; it can be used to automatically resume a job from a saved state, rather than from the initial state, when the job fails. Each checkpoint is marked by a unique, monotonically increasing ID; when an operator in the dataflow graph receives a checkpoint message, it can obtain the ID of the current checkpoint from that message, which facilitates logical decisions in the operators.
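By way of illustration, the restart check can be sketched as follows. This is a sketch only: the table property key "flink.max-committed-checkpoint-id", the checkpointState map and commitToIceberg are assumptions carried over from the sketches above, not details published by the patent.

    // On the first manifest seen for a table after a snapshot restart, compare the max
    // committed checkpoint id recorded in the Iceberg table's properties with the max
    // ckpID restored into checkpointState, and replay whatever was never committed.
    long committedInTable = Long.parseLong(
            icebergTable.properties().getOrDefault("flink.max-committed-checkpoint-id", "-1"));
    long restoredInState = checkpointState.isEmpty() ? -1L : checkpointState.lastKey();
    if (committedInTable < restoredInState) {
        checkpointState.tailMap(committedInTable, /* inclusive = */ false)
                .forEach((ckpId, byTable) -> commitToIceberg(tableName, byTable.get(tableName), ckpId));
    }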
It should be noted that the method further includes:
if the execution of step S7 fails, sending the failed table to a checkCommitSink operator with a parallelism of one;
printing a log through the checkCommitSink operator, throwing an exception, ending the job and waiting for a restart.
It should be noted that, if the execution of step S7 succeeds, the current whole-database lake ingestion procedure ends successfully.
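An illustrative sketch of such a failure sink follows (assumed names; the real operator may carry more context than the table name):

    import org.apache.flink.streaming.api.functions.sink.SinkFunction;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class CheckCommitSink implements SinkFunction<String> {
        private static final Logger LOG = LoggerFactory.getLogger(CheckCommitSink.class);

        @Override
        public void invoke(String failedTable, Context context) {
            // print the failure log, then throw so the whole job fails and is restarted
            LOG.error("Commit to Iceberg failed for table {}", failedTable);
            throw new RuntimeException("Iceberg commit failed for table: " + failedTable);
        }
    }
    // wired with parallelism one: failures.addSink(new CheckCommitSink()).setParallelism(1);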
It is to be understood that the same or similar parts in the above embodiments may be referred to each other, and that in some embodiments, the same or similar parts in other embodiments may be referred to.
It should be noted that in the description of the present application, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present application, unless otherwise indicated, the meaning of "plurality" means at least two.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (10)

1. A Flink-based real-time whole-database lake ingestion method, characterized by comprising the following steps:
step S1: connecting, through a Source operator, to a database supporting CDC, and acquiring the real-time change data of the database; the real-time change data includes data-change DML statements and table-structure-change DDL statements;
step S2: performing keyBy on the real-time change data according to table name, and distributing real-time change data with different table names to different StreamDdlHandler operators to process the table-structure-change DDL statements; the StreamDdlHandler operator is an operator that processes DDL statements;
step S3: sending a writer-commit marker to the writer instances through the StreamDdlHandler operator, so that each writer instance commits the current table data corresponding to the table-structure-change DDL statement, and caching data updated in the table in real time until a preset condition is met, whereupon the table-structure-change DDL statement is executed;
step S4: sending the data-change DML statements to a DynamicStreamWriter operator through the StreamDdlHandler operator; the DynamicStreamWriter operator is an operator that writes data to a distributed file system;
step S5: writing the table data corresponding to the data-change DML statements into the distributed file system through the DynamicStreamWriter operator, acquiring or creating a writer instance in a cache map according to the table name of each piece of table data, completing the table data in the writer instances at each checkpoint, and sending a manifest to the committer;
step S6: sending the manifest to a DynamicFilesCommitter operator through the DynamicStreamWriter operator; the DynamicFilesCommitter operator is an operator that commits metadata;
step S7: caching the manifests through the DynamicFilesCommitter operator; when a new checkpoint starts, storing the manifests of all currently cached tables in the checkpoint state; and, after the current checkpoint is completed, committing the manifests of each table for the current checkpoint to the corresponding Iceberg table;
wherein the Source operator, the StreamDdlHandler operator, the DynamicStreamWriter operator and the DynamicFilesCommitter operator are all operators in Flink.
2. The method according to claim 1, wherein the method further comprises:
based on the Source operator in the Flink, interfacing with the Kafka of the external system with the CDC function, acquiring real-time change data in the Kafka of the external system.
3. The method according to claim 1, wherein the method further comprises:
judging, through the StreamDdlHandler, whether there is a table-structure-change DDL statement that creates or deletes a table, or whether the table to which the table-structure-change DDL statement belongs has received a data-change DML statement;
if so, executing the table-structure-change DDL statement; if not, executing step S3;
judging, through the StreamDdlHandler, whether repeated table-structure-change DDL statements exist;
if so, discarding the repeated table-structure-change DDL statements; if not, performing step S3.
4. The method according to claim 1, wherein the preset conditions are:
monitoring whether a preset number of success markers has been generated by the writer instances in the distributed file system; or whether a preset cache duration has been exceeded; the cache duration being set according to the interval of checkpoint events.
5. The method of claim 1, wherein after executing the table structure change DDL statement, the method further comprises:
notifying the operator that handles the cache of the aggregated commit result.
6. The method according to claim 1, wherein the method further comprises:
taking the data-change DML statement modulo the single-table parallelism to obtain a hashKey, hashing the data-change DML statement by the hashKey, and performing step S4 on the hashed data-change DML statements.
7. The method of claim 1, wherein there are a plurality of each of the Source operator, the StreamDdlHandler operator, the DynamicStreamWriter operator and the DynamicFilesCommitter operator, each operator processing tasks in parallel;
the method further comprising:
establishing a plurality of task slots, and placing each DynamicStreamWriter operator in a different task slot to process different table data.
8. The method according to claim 1, wherein the method further comprises:
hashing the table name of the Iceberg table;
performing step S6 according to the hash of the Iceberg table name.
9. The method of claim 8, wherein the method further comprises:
judging, through the DynamicFilesCommitter operator, whether the table name of the Iceberg table is received for the first time;
if so, initializing a cache map, and, when the current job is restarted from a snapshot, acquiring the maximum committed ckpID from the table properties of the Iceberg table;
if the maximum committed ckpID in the table properties of the Iceberg table is smaller than the maximum ckpID stored in the checkpoint state, committing the manifests corresponding to the uncommitted ckpIDs in the checkpoint state to the Iceberg table.
10. The method according to claim 1, wherein the method further comprises:
if the execution of step S7 fails, sending the failed table to a checkCommitSink operator with a parallelism of one;
printing a log through the checkCommitSink operator, throwing an exception, ending the job and waiting for a restart.
CN202311133058.9A 2023-09-05 2023-09-05 Flink-based real-time whole-database lake ingestion method Active CN116881261B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311133058.9A CN116881261B (en) Flink-based real-time whole-database lake ingestion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311133058.9A CN116881261B (en) Flink-based real-time whole-database lake ingestion method

Publications (2)

Publication Number Publication Date
CN116881261A CN116881261A (en) 2023-10-13
CN116881261B true CN116881261B (en) 2023-11-07

Family

ID=88262433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311133058.9A Active CN116881261B (en) Flink-based real-time whole-database lake ingestion method

Country Status (1)

Country Link
CN (1) CN116881261B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114860654A (en) * 2022-04-20 2022-08-05 Beijing Deepexi Technology Co., Ltd. Method and system for dynamically changing Iceberg table Schema based on Flink data stream
CN115098505A (en) * 2022-06-28 2022-09-23 Ping An Bank Co., Ltd. Method and device for changing table structure of database and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10585873B2 (en) * 2017-05-08 2020-03-10 Sap Se Atomic processing of compound database transactions that modify a metadata entity

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114860654A (en) * 2022-04-20 2022-08-05 Beijing Deepexi Technology Co., Ltd. Method and system for dynamically changing Iceberg table Schema based on Flink data stream
CN115098505A (en) * 2022-06-28 2022-09-23 Ping An Bank Co., Ltd. Method and device for changing table structure of database and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Flink-based Network Content Analysis System; Zhao Jiaxin et al.; Computer Era, No. 12, pp. 24-17 *

Also Published As

Publication number Publication date
CN116881261A (en) 2023-10-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant