CN114968936B - Time line acquisition method and device based on data lake and execution node - Google Patents

Info

Publication number
CN114968936B
Authority
CN
China
Prior art keywords
timeline
time line
metadata
node
management node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210603049.0A
Other languages
Chinese (zh)
Other versions
CN114968936A (en)
Inventor
喻兆靖
郭俊
杨诗旻
罗旋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Volcano Engine Technology Co Ltd
Original Assignee
Beijing Volcano Engine Technology Co Ltd
Filing date
Publication date
Application filed by Beijing Volcano Engine Technology Co Ltd
Priority to CN202210603049.0A
Publication of CN114968936A
Application granted
Publication of CN114968936B
Legal status: Active
Anticipated expiration

Abstract

The application discloses a data lake-based timeline acquisition method and apparatus, an execution node, an electronic device, a computer readable medium, and a computer program product. The method comprises the following steps: when the execution node determines that a timeline pull condition is reached, the execution node creates a first metadata client and creates a timeline view in the first metadata client; and the execution node acquires the time line to be used from the built-in metadata service of the management node by utilizing the timeline view. Because the time line to be used represents at least one transaction in the management node, the execution node can subsequently learn from it which transactions need to be executed, so that the execution node can assist the management node in carrying out the write task, and the execution effect of the write task carried out in this way is effectively improved.

Description

Time line acquisition method and device based on data lake and execution node
Technical Field
The present application relates to the field of computer technology, and in particular, to a data lake-based timeline acquisition method, apparatus, execution node, electronic device, computer readable medium, and computer program product.
Background
A data lake (Data Lake) refers to a system that stores data as binary large objects (blobs) or files; the data lake is used to store data in a unified manner.
Apache Hudi (Hudi for short) is a streaming data lake platform; Hudi can ingest large amounts of data (e.g., from relational databases, logs, message queues, etc.) for storage through a variety of tools (e.g., Spark, Flink, etc.).
However, some Hudi schemes (e.g., Flink-based Hudi, etc.) suffer from drawbacks, resulting in poor execution of the write tasks carried out with these Hudi schemes.
Disclosure of Invention
In order to solve the technical problems, the application provides a time line acquisition method, a time line acquisition device, an execution node, electronic equipment, a computer readable medium and a computer program product based on a data lake, which can effectively improve the execution effect of a writing task.
In order to achieve the above object, the technical solution provided by the embodiments of the present application is as follows:
The embodiment of the application provides a time line acquisition method based on a data lake, which is applied to an execution node based on the data lake, and comprises the following steps:
When a timeline pull condition is reached, creating a first metadata client and creating a timeline view in the first metadata client;
Acquiring a time line to be used from built-in metadata service of a management node by utilizing the time line view; wherein the time line to be used is stored in the built-in metadata service; the time line to be used is used for recording at least one transaction in the management node.
In one possible implementation, the at least one transaction includes at least one transaction in an incomplete state.
In one possible implementation, the built-in metadata service is used to store a real-time timeline in the management node.
In a possible implementation manner, the updating process of the time line to be used includes:
When a timeline update condition is reached, the management node creates a second metadata client;
the management node pulls a metadata timeline from a metadata system by using the second metadata client;
and the management node updates the time line to be used stored in the built-in metadata service by utilizing the metadata time line.
In one possible embodiment, the method further comprises:
and executing the transaction to be processed when the time line to be used indicates that the transaction to be processed is created.
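The last implementation above can be pictured with a minimal Python sketch. It assumes, purely for illustration, that the time line to be used is a list of records carrying an instant time and a state; the names `Transaction`, `next_created_transaction`, `execute_if_created` and the three state strings are hypothetical and do not come from the application or from Hudi.

```python
from dataclasses import dataclass

# Hypothetical state names used only in this sketch.
REQUESTED, INFLIGHT, COMPLETED = "requested", "inflight", "completed"


@dataclass(frozen=True)
class Transaction:
    instant_time: str  # sortable identifier, e.g. a timestamp string
    state: str


def next_created_transaction(timeline, last_executed):
    """Return the oldest unfinished transaction newer than last_executed, or None."""
    candidates = [
        t for t in timeline
        if t.state != COMPLETED and t.instant_time > last_executed
    ]
    return min(candidates, key=lambda t: t.instant_time, default=None)


def execute_if_created(timeline, last_executed, execute):
    """Execute the pending transaction if the timeline shows one; return its instant time."""
    txn = next_created_transaction(timeline, last_executed)
    if txn is None:
        return None
    execute(txn)
    return txn.instant_time
```

Here `execute` is whatever callable the execution node uses to carry out the write; the sketch only shows the decision of whether a transaction to be processed has appeared on the timeline.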
The embodiment of the application also provides a time line acquisition device based on the data lake, which comprises:
a creation unit, configured to create a first metadata client when a timeline pull condition is reached, and create a timeline view in the first metadata client;
an obtaining unit, configured to obtain a time line to be used from a built-in metadata service of a management node by using the time line view; wherein the time line to be used is stored in the built-in metadata service; the time line to be used is used for recording at least one transaction in the management node.
The embodiment of the application also provides an execution node based on the data lake, which is used for creating a first metadata client when the time line pulling condition is reached and creating a time line view in the first metadata client;
the execution node is further configured to acquire a time line to be used from a built-in metadata service of the management node by using the time line view; wherein the time line to be used is stored in the built-in metadata service; the time line to be used is used for recording at least one transaction in the management node.
The embodiment of the application also provides electronic equipment, which comprises: a processor and a memory;
The memory is used for storing instructions or computer programs;
The processor is configured to execute the instructions or the computer program in the memory, so that the electronic device executes any implementation of the data lake-based timeline acquisition method provided by the embodiment of the present application.
The embodiment of the application also provides a computer readable medium, wherein instructions or a computer program are stored in the computer readable medium, and when the instructions or the computer program are run on a device, the device is caused to execute any implementation mode of the timeline acquisition method based on the data lake provided by the embodiment of the application.
The embodiment of the application also provides a computer program product, which when being run on a terminal device, causes the terminal device to execute any implementation mode of the time line acquisition method based on the data lake.
Compared with the prior art, the embodiment of the application has at least the following advantages:
In the technical scheme provided by the embodiment of the application, for an execution node (for example, a Flink) based on a data lake, when the execution node determines that a time line pulling condition is reached, a first metadata client is created by the execution node, and a time line view is created in the first metadata client; and the execution node acquires the time line to be used from the built-in metadata service of the management node by utilizing the time line view, so that the time line to be used can represent at least one transaction (such as at least one transaction in an unfinished state) in the management node, so that the subsequent execution node can know the transaction needing to be executed from the time line to be used, and the aim of assisting the management node to realize the writing task by the execution node can be realized.
The built-in metadata service of the management node can directly provide the timeline to the execution node based on the data lake through the timeline view, so that the execution node can directly acquire the timeline from the built-in metadata service of the management node, and the execution node does not need to request the timeline from the metadata system, thus adverse effects (such as unstable service, smaller task concurrency and the like of the metadata system) caused when the execution node directly requests the timeline from the metadata system can be effectively avoided, and the execution effect of writing tasks realized by the execution node auxiliary management node can be effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a Spark-based transaction commit process according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a transaction commit process based on a Flink provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a timeline acquisition process according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another timeline acquisition process provided by an embodiment of the present application;
FIG. 5 is a flowchart of a method for obtaining a timeline based on a data lake according to an embodiment of the present application;
FIG. 6 is a comparison diagram of two timeline acquisition flows provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a timeline acquisition device based on a data lake according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, some technical terms are first described below.
The data lake is a raw data layer and can store various structured, semi-structured, and even unstructured data.
Hudi is used to ingest and/or manage large analytical datasets on a distributed file system (Hadoop Distributed File System, HDFS), cloud storage, or the like; Hudi can also be used to construct a data lake.
In addition, in Hudi, the series of create, read, update, and delete (CRUD) operations performed on a table over time is referred to as a timeline (Timeline); and a single operation in the timeline is called a transaction (instant).
In addition, the file system service (File System) of Hudi includes at least a metadata system. The metadata system maintains, in the form of a timeline, metadata for the operations performed on the dataset, so as to support instantaneous views of the dataset.
The timeline (Timeline) is used to maintain all operations performed on the data lake table at different points in time.
HoodieTable is an abstraction of the data table; it defines the components and table operation interfaces on which writing a Hudi table depends. In addition, a user may access the embedded timeline service (Embedded Timeline Service) through the file system view (File System View) in HoodieTable; and obtaining the timeline needed to build the File System View typically requires accessing the metadata system via the metadata client.
Spark is a fast and versatile computing engine designed for large-scale data processing.
Flink is a computing engine based on an open-source stream processing framework; Flink executes arbitrary stream data programs in a data-parallel and pipelined manner. In addition, Flink's pipelined runtime system can execute both batch and stream processing programs. Furthermore, the Flink runtime itself also supports the execution of iterative algorithms.
Based on the technical terms described above, the technical solution of the application is described below.
In studying Hudi, the inventors found that, because Hudi was originally designed around the Spark micro-batch model, Spark-based Hudi transaction commit is performed as a two-phase commit (the transaction commit process shown in FIG. 1): for each batch of data, the management node first generates a transaction; the management node then distributes the transaction information to each task executor (i.e., Spark engine); and after determining that every task executor has completed its commit, the management node finalizes the transaction commit and consumes the next batch of data. The management node is used to manage the write tasks in Hudi (e.g., to create transactions, to record execution state information of transactions, etc.).
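As a generic illustration of the two-phase pattern described above (not Hudi's or Spark's actual commit code), the coordinator role of the management node might be sketched as follows. The function name `two_phase_commit` and the representation of executors as callables are assumptions made for this sketch.

```python
def two_phase_commit(batch_id, executors):
    """Commit batch_id only if every executor reports success.

    executors: callables that take the batch id and return True on success.
    """
    # Phase 1: distribute the transaction and collect each executor's result.
    results = [run(batch_id) for run in executors]
    # Phase 2: the coordinator finalizes the commit or abandons it.
    return "committed" if all(results) else "aborted"
```

The point of the sketch is that the coordinator pushes the transaction to the executors, so the executors never need to poll for it.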
The inventors have also found that, in some cases, the Flink engine may be used to assist Hudi with transaction commit (the transaction commit process shown in FIG. 2), so that write tasks in Hudi are completed with the Flink engine. Because Flink is a pure streaming model, the data stream processed by Flink is in practice divided into micro-batches by checkpoints, so that the commit of micro-batch data is asynchronous; however, for a given Flink task writing to Hudi, the transactions involved in the task are strictly ordered, and the next transaction is created only after the previous checkpoint has completed. It can be seen that each task executor (i.e., Flink engine) that completes its task early will continually poll the timeline in the management node to obtain transaction information (e.g., whether the previous transaction has completed, whether a new transaction has been opened, etc.).
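The polling behaviour described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the timeline is modelled as a list of sortable instant strings, and `fetch_timeline`, `poll_for_next_transaction`, `interval_s`, and `max_polls` are hypothetical names that do not appear in the application.

```python
import time


def poll_for_next_transaction(fetch_timeline, last_instant, interval_s=1.0, max_polls=5):
    """Poll until a transaction newer than last_instant appears; return it or None."""
    for attempt in range(max_polls):
        newer = [instant for instant in fetch_timeline() if instant > last_instant]
        if newer:
            return min(newer)
        if attempt < max_polls - 1:
            time.sleep(interval_s)  # wait before asking the management node again
    return None
```

Every call to `fetch_timeline` is one request against whatever serves the timeline, which is why the number of polling executors matters for the load discussed in the following paragraphs.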
The inventors have also found that the Flink-based transaction commit process (such as the transaction commit process shown in FIG. 2) differs from the Spark-based transaction commit process (such as the transaction commit process shown in FIG. 1), and that the differences between the two include at least those shown in ①-②:
① The manner in which transactions are obtained from the management node is different. The Spark engine obtains a transaction from the transaction information sent to it by the management node, whereas the Flink engine obtains a transaction from the timeline it pulls from the management node.
② The manner of transaction commit is different. The Spark engine completes transaction commit based on two-phase commit, whereas the Flink engine commits directly to the management node.
The inventors have also found that, because Hudi was designed based on Spark, the problems shown in (1)-(2) below arise when the Flink engine is used to assist Hudi with transaction commit:
(1) Because the timeline required to build the File System View can only be obtained by accessing the metadata system through the metadata client, the metadata system must be requested once every time the Flink engine creates a HoodieTable. With a large number of concurrent tasks, the metadata system is therefore requested frequently, its service load becomes too high, and its service may become unstable.
(2) Because the number of requests that the metadata system can respond to is limited, the number of concurrent tasks that Hudi can support with the Flink engine is small.
The inventors have also found that Hudi itself has an object that can implement a timeline storage function, i.e., the embedded timeline service (Embedded Timeline Service). However, because Hudi was designed based on Spark, and Spark does not need to poll the timeline at task granularity, the Embedded Timeline Service is used only to cache completed transactions and not in-progress transactions; as a result, the Embedded Timeline Service cannot provide the Flink engine with information about incomplete transactions (e.g., transactions that have just been opened or are being executed). In addition, as shown in FIG. 3, because the access entry of the Embedded Timeline Service is in the File System View, each time the Flink engine wants to acquire the timeline from the Embedded Timeline Service, it must first acquire the entire content of the File System View and then acquire the timeline by accessing the Embedded Timeline Service through the File System View. As can be seen, the Flink engine cannot directly access the Embedded Timeline Service through the metadata client, so each time it wants to obtain the timeline, it must obtain not only the Embedded Timeline Service but also the other, redundant content in the File System View, which wastes resources.
Based on the above findings, in order to solve the technical problems described in the background section, the embodiment of the present application provides a data lake-based timeline acquisition method applicable to Flink, so that the Flink engine can directly acquire the real-time timeline in the management node from a built-in metadata service (as shown in FIG. 4) by means of a newly added timeline view (Timeline View). The built-in metadata service is an improvement on the above-mentioned Embedded Timeline Service, and the improvement is specifically: the timeline stored in the built-in metadata service always remains synchronized with the actual timeline in the management node.
In addition, the method for acquiring the timeline based on the data lake, which is applicable to the Flink, provided by the embodiment of the application can specifically include: when a data lake-based execution node (e.g., a Flink engine) determines that a timeline pull condition is reached, a first metadata client (e.g., the metadata client shown in fig. 4) is first created by the execution node and a timeline view (e.g., the timeline view shown in fig. 4) is created in the first metadata client; and the execution node acquires the time line to be used from the built-in metadata service (such as the built-in metadata service shown in fig. 4) of the management node by utilizing the time line view, so that the time line to be used can represent the actual time line in the management node, and the time line to be used can represent at least one transaction (such as at least one transaction in an incomplete state) in the management node, so that the subsequent execution node can know the transaction needing to be executed by the execution node from the time line to be used, and the purpose of assisting the management node in realizing the writing task by the execution node can be realized.
The built-in metadata service of the management node can directly provide the timeline to the execution node based on the data lake through the timeline view, so that the execution node can directly acquire the timeline from the built-in metadata service of the management node, and the execution node does not need to request the timeline from the metadata system, thus adverse effects (such as unstable service, smaller task concurrency and the like of the metadata system) caused when the execution node directly requests the timeline from the metadata system can be effectively overcome, and the execution effect of writing tasks realized by the execution node auxiliary management node can be effectively improved.
It should be noted that, in FIG. 4, the write client (write function) refers to an object in the Flink engine that is used to perform the transactions that write data into Hudi.
In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In order to better understand the technical solution provided by the embodiments of the present application, a detailed description of the data lake-based timeline acquisition method provided by the embodiments of the present application is provided below with reference to fig. 4 to 6. FIG. 4 is a schematic diagram of another timeline acquisition process according to an embodiment of the present application; FIG. 5 is a flowchart of a method for obtaining a timeline based on a data lake according to an embodiment of the present application; fig. 6 is a comparison chart of two timeline acquisition flows according to an embodiment of the present application.
As shown in FIG. 5, the data lake-based time line acquisition method provided by the embodiment of the application comprises the following steps of S1-S2:
S1: upon reaching the timeline pull condition, the data lake-based execution node creates a first metadata client and creates a timeline view in the first metadata client.
The timeline pull condition refers to the trigger condition under which the data lake-based execution node needs to pull the timeline in the management node. The embodiment of the present application does not limit this timeline pull condition; for example, it may be implemented as any existing or future condition that can trigger the execution node to pull the timeline from the management node. For another example, when the execution node pulls the timeline in the management node once every preset time interval, the timeline pull condition may specifically be: the duration between the current time and the time point at which the timeline in the management node was last pulled reaches a preset duration. The preset duration may be set in advance, for example, to 10 seconds.
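The interval-based example of the pull condition reduces to a single comparison. A minimal sketch follows, in which the function and parameter names are hypothetical and times are expressed in seconds:

```python
def pull_condition_reached(now_s, last_pull_s, preset_s=10.0):
    """True when at least preset_s seconds have passed since the last pull."""
    return (now_s - last_pull_s) >= preset_s
```

In a running execution node, `now_s` would typically come from a monotonic clock such as `time.monotonic()`, so that wall-clock adjustments do not trigger or suppress a pull.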
The data lake-based execution node refers to a device used for executing write tasks in Hudi; the embodiment of the present application does not limit this execution node, which may be implemented, for example, by any device capable of processing streaming data (for example, a Flink engine).
The first metadata client refers to a metadata client (MetaClient) created on the data lake-based execution node side; the first metadata client can directly access the built-in metadata service in the management node by means of the timeline view object created in it.
The timeline view refers to an object created in the first metadata client for retrieving a timeline from the management node; and the timeline view has direct access to built-in metadata services in the management node.
The built-in metadata service refers to an improved Embedded Timeline Service in the management node (the built-in metadata service shown in FIG. 6); the timeline stored in the built-in metadata service is kept synchronized with the actual timeline in the management node, so that the built-in metadata service can store the real-time timeline in the management node.
It can be seen that, because the built-in metadata service can store the real-time timeline in the management node, it can store not only completed transactions but also incomplete transactions (for example, transactions just created and transactions in an executing state). The built-in metadata service can therefore handle timeline requests in place of the metadata system (for example, it can respond, in place of the metadata system, to the timeline-related requests of all execution nodes that assist Hudi in executing transactions), so that the metadata system only needs to process the timeline requests of the built-in metadata service. This not only effectively reduces the service load of the metadata system, but also effectively overcomes the small number of concurrent tasks caused by the limited number of requests that the metadata system can respond to, so that the number of concurrent tasks that can be supported is effectively increased (for example, in actual measurement, from 2,000 to 80,000 concurrent write tasks).
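The load argument above can be made concrete with a small counting sketch. It is a toy model under stated assumptions, not a measurement: `MetadataSystem`, `BuiltInMetadataService`, `direct_requests`, and `cached_requests` are hypothetical stand-ins, and the request counts are the only thing being modelled.

```python
class MetadataSystem:
    """Stand-in for the metadata system; counts the timeline requests it serves."""

    def __init__(self):
        self.requests = 0

    def get_timeline(self):
        self.requests += 1
        return ["001", "002"]


class BuiltInMetadataService:
    """Stand-in for the built-in metadata service: a copy refreshed by sync()."""

    def __init__(self, metadata_system):
        self._metadata_system = metadata_system
        self._timeline = []

    def sync(self):
        self._timeline = self._metadata_system.get_timeline()

    def get_timeline(self):
        return list(self._timeline)


def direct_requests(executors, polls):
    """Requests hitting the metadata system when every executor polls it directly."""
    system = MetadataSystem()
    for _ in range(executors * polls):
        system.get_timeline()
    return system.requests


def cached_requests(executors, polls, updates):
    """Requests hitting the metadata system when executors poll the built-in service."""
    system = MetadataSystem()
    service = BuiltInMetadataService(system)
    for _ in range(updates):
        service.sync()
    for _ in range(executors * polls):
        service.get_timeline()
    return system.requests
```

In this model the metadata system's load grows with the number of executors in the direct case, but only with the number of timeline updates when the built-in metadata service answers the polls.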
Based on the above description of S1, it is known that, for an execution node (e.g., flink, etc.) based on a data lake, the execution node may learn some transaction information (e.g., whether a transaction that needs to be executed by the execution node is created, execution state information of some transaction, etc.) by constantly polling the timeline in the management node. Based on this, when the executing node determines that a timeline pull condition is reached, the executing node may determine that it has a need to acquire a timeline from the managing node, so the executing node may create a first metadata client (e.g., a metadata client on the data lake-based executing node side as shown in fig. 6) and create a timeline view in the first metadata client so that the executing node can subsequently acquire the timeline from the managing node via the timeline view in the first metadata client.
S2: the executing node based on the data lake obtains a time line to be used from the built-in metadata service of the management node by utilizing the time line view.
Wherein, the time line to be used refers to a stored time line in the built-in metadata service; and the time line to be used is used to record at least one transaction (e.g., at least one transaction in a completed state and/or at least one transaction in an incomplete state) in the management node.
In addition, because the timeline stored in the built-in metadata service is kept synchronized with the actual timeline in the management node, the built-in metadata service stores the real-time timeline in the management node, and the time line to be used stored in the built-in metadata service is that real-time timeline; the time line to be used can therefore more accurately represent all transactions in the management node and the relevant information of those transactions (such as transaction description information, execution state information, and the like).
In addition, in order to ensure that the time line to be used is synchronous with the actual time line in the management node, the embodiment of the present application further provides a possible implementation manner of updating the time line to be used, which specifically may include steps 11-13:
Step 11: upon reaching the timeline update condition, the management node creates a second metadata client.
The time line updating condition refers to a condition which is reached when the time line stored in the built-in metadata service needs to be updated; moreover, the embodiment of the present application does not limit the time line update condition, and for example, it may specifically be: the actual timeline in the management node changes (e.g., the execution state of an existing transaction is modified, a new transaction is created, etc.).
In practice, in some cases, when the management node determines that the actual timeline in the management node has changed, the management node may automatically trigger a timeline update request, so that the time line to be used stored in the built-in metadata service of the management node is automatically updated and kept synchronized with the actual timeline in the management node. The timeline update request is used for requesting that the time line to be used stored in the built-in metadata service of the management node be synchronized with the actual timeline in the management node. It should be noted that the embodiment of the present application does not limit the timeline update request; for example, it may be an EmbeddedTimelineService#sync instruction.
Based on the situation shown in the previous paragraph, in one possible implementation, the timeline update condition may specifically be: a timeline update request is triggered.
The second metadata client refers to a metadata client at the management node side; and the second metadata client can pull timeline directly from the metadata system of the management node.
Step 12: the management node pulls the metadata timeline from the metadata system with the second metadata client. The metadata timeline refers to an actual timeline in the management node.
Step 13: the management node updates the time line to be used stored in the built-in metadata service by using the metadata time line.
It should be noted that the embodiment of the present application does not limit the implementation of step 13. For example, it may specifically be: directly replacing the time line to be used stored in the built-in metadata service with the metadata timeline, to obtain an updated time line to be used. As another example, step 13 may specifically be: adjusting the time line to be used stored in the built-in metadata service according to data characterizing the difference between the metadata timeline and the time line to be used, to obtain an updated time line to be used that is consistent with the metadata timeline.
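The two options for step 13 can be compared in a short sketch. It assumes, for illustration only, that a timeline is a dictionary mapping an instant time to a state string; the function names are hypothetical.

```python
def replace_timeline(stored, metadata_timeline):
    """Option 1: discard the stored copy and adopt the metadata timeline."""
    return dict(metadata_timeline)


def diff_timelines(stored, metadata_timeline):
    """Compute the difference data needed to turn stored into metadata_timeline."""
    upserts = {k: v for k, v in metadata_timeline.items() if stored.get(k) != v}
    removals = [k for k in stored if k not in metadata_timeline]
    return upserts, removals


def apply_diff(stored, upserts, removals):
    """Option 2: adjust the stored copy using only the differences."""
    updated = {k: v for k, v in stored.items() if k not in removals}
    updated.update(upserts)
    return updated
```

Both options end with a stored timeline equal to the metadata timeline; the difference-based option only touches the entries that changed, which may matter when the timeline is long.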
Based on the related content of steps 11 to 13, for the management node, when the management node determines that the time line updating condition is reached, the management node may determine that the update processing needs to be performed on the time line to be used stored in the built-in metadata service, so the management node may first create a second metadata client, then pull the metadata time line from the metadata system by means of the second metadata client, so that the metadata time line can represent the time line stored by the metadata system at the current time, and finally, update the time line to be used stored in the built-in metadata service by using the metadata time line by the management node, so that the updated time line to be used can keep synchronous with the actual time line in the management node, and the built-in metadata service can better replace the metadata system to feed back the real time line in the management node to each execution node based on the data lake.
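Steps 11 to 13 can be sketched end to end as follows. This is a minimal model under stated assumptions: all class and method names are hypothetical, the timeline is a dictionary from instant time to state, the update condition is "the actual timeline changed", and step 13 uses full replacement.

```python
class MetadataSystem:
    """Stand-in for the metadata system holding the actual timeline."""

    def __init__(self):
        self.timeline = {}


class MetaClient:
    """Stand-in for the second metadata client (management-node side)."""

    def __init__(self, metadata_system):
        self._system = metadata_system

    def pull_timeline(self):
        return dict(self._system.timeline)


class BuiltInMetadataService:
    """Stand-in for the built-in metadata service storing the time line to be used."""

    def __init__(self):
        self.timeline = {}


class ManagementNode:
    def __init__(self, metadata_system):
        self.metadata_system = metadata_system
        self.service = BuiltInMetadataService()

    def record(self, instant_time, state):
        """Change the actual timeline; the change triggers a timeline update."""
        self.metadata_system.timeline[instant_time] = state
        self.sync()

    def sync(self):
        client = MetaClient(self.metadata_system)   # step 11: create second client
        metadata_timeline = client.pull_timeline()  # step 12: pull metadata timeline
        self.service.timeline = metadata_timeline   # step 13: update stored timeline
```

After each call to `record`, the timeline held by the built-in metadata service matches the actual timeline, including transactions that are not yet completed.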
Based on the above description of S2, it is known that, for an execution node based on a data lake, after the execution node creates a timeline view in a first metadata client, the execution node may directly access a built-in metadata service of a management node by means of the timeline view, so that the timeline view can directly obtain a time line to be used from the built-in metadata service (such as "obtain time line" shown in fig. 6), so that the time line to be used can represent some transactions (for example, some incomplete transactions, some completed transactions, etc.) in the management node and relevant information of the transactions, so that the subsequent execution node can learn relevant information of the transactions from the time line to be used.
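The S1-S2 flow on the execution node side can be sketched as follows. The class names mirror the terms used in the description, but their methods and signatures are assumptions made for this sketch rather than the interfaces of the application or of Hudi; the timeline is again modelled as a dictionary from instant time to state.

```python
class BuiltInMetadataService:
    """Stand-in for the management node's built-in metadata service."""

    def __init__(self, timeline):
        self._timeline = dict(timeline)

    def get_timeline(self):
        return dict(self._timeline)


class TimelineView:
    """Object inside the metadata client that talks to the built-in service directly."""

    def __init__(self, service):
        self._service = service

    def get_timeline(self):
        return self._service.get_timeline()


class MetaClient:
    """Stand-in for the first metadata client (execution-node side)."""

    def __init__(self, service):
        self._service = service
        self.timeline_view = None

    def create_timeline_view(self):
        self.timeline_view = TimelineView(self._service)
        return self.timeline_view


def acquire_timeline(service):
    client = MetaClient(service)          # S1: create the first metadata client
    view = client.create_timeline_view()  # S1: create the timeline view in it
    return view.get_timeline()            # S2: fetch the time line to be used
```

Note that nothing in `acquire_timeline` touches a file system view or the metadata system; the timeline view is the only access path used.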
Based on the above description of S1 to S2, in the method for obtaining a timeline based on a data lake according to the embodiment of the present application, when an execution node (for example, a Flink engine) based on the data lake determines that a timeline pulling condition is reached, a first metadata client (such as the metadata client shown in fig. 4) is created by the execution node, and a timeline view (such as the timeline view shown in fig. 4) is created in the first metadata client; and the execution node acquires the time line to be used from the built-in metadata service (such as the built-in metadata service shown in fig. 4) of the management node by utilizing the time line view, so that the time line to be used can represent the actual time line in the management node, and the time line to be used can represent at least one transaction (such as at least one transaction in an incomplete state) in the management node, so that the subsequent execution node can know the transaction needing to be executed by the execution node from the time line to be used, and the purpose of assisting the management node in realizing the writing task by the execution node can be realized.
Because the built-in metadata service of the management node can provide the timeline directly to the data lake-based execution node through the timeline view, the execution node can obtain the timeline directly from the built-in metadata service and does not need to request it from the metadata system. This effectively overcomes the adverse effects caused when the execution node requests the timeline directly from the metadata system (for example, unstable service of the metadata system and lower task concurrency), and thereby effectively improves how well the execution node assists the management node in carrying out the writing task.
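The execution-node side of S1 and S2 can be illustrated with a short Python sketch. All class and function names here (`BuiltInMetadataService`, `TimelineView`, `FirstMetadataClient`, `pull_timeline`) are hypothetical, and the timeline is again assumed to be a mapping from transaction identifier to state. The point of the sketch is that the read is served by the management node's built-in service, so the metadata system is not contacted.

```python
class BuiltInMetadataService:
    """Service embedded in the management node; stores the timeline to be used."""

    def __init__(self, timeline):
        self._timeline = dict(timeline)
        self.requests_served = 0  # counts reads, for illustration only

    def get_timeline(self):
        self.requests_served += 1
        return dict(self._timeline)


class TimelineView:
    """View created inside the first metadata client; reads from the built-in service."""

    def __init__(self, service):
        self._service = service

    def fetch(self):
        return self._service.get_timeline()


class FirstMetadataClient:
    """Client created by the execution node when the pull condition is reached."""

    def __init__(self, service):
        self._service = service

    def create_timeline_view(self):
        return TimelineView(self._service)


def pull_timeline(pull_condition_reached, service):
    """S1-S2: create the client and the view, then fetch the timeline to be used."""
    if not pull_condition_reached:
        return None
    client = FirstMetadataClient(service)   # S1: create the first metadata client
    view = client.create_timeline_view()    # S1: create the timeline view in it
    return view.fetch()                     # S2: read from the built-in service
```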
In fact, after the data lake-based execution node obtains the timeline from the management node, the execution node also analyzes the timeline for transactions that it needs to complete. Based on this, the embodiment of the present application also provides a possible implementation of the data lake-based timeline acquisition method. In this implementation, the timeline acquisition method may include not only S1 to S2 above, but also S3:
S3: When the timeline to be used indicates that a pending transaction has been created, the data lake-based execution node executes the pending transaction.
Here, the pending transaction refers to a transaction created by the management node that needs to be executed with the assistance of the data lake-based execution node. For example, if the data lake-based execution node is Flink, the pending transaction may refer to a transaction created by the management node that needs to be executed with the assistance of Flink.
Based on the above description of S3, it is known that, for the execution node based on the data lake, after the execution node obtains the timeline from the management node, if the to-be-used timeline indicates that the to-be-processed transaction has been created, the execution node may determine that the to-be-processed transaction needs to be executed by the execution node, so that the execution node may execute the to-be-processed transaction, and thus the execution node can assist the management node to implement a writing task for a certain data table.
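S3 can be sketched as a filter over the timeline followed by execution of each match. The sketch assumes a timeline is a mapping from transaction identifier to state and that a state named `"REQUESTED"` marks a pending transaction; both the state name and the function names are hypothetical, since the embodiment does not define how a pending transaction is encoded.

```python
def find_pending_transactions(timeline_to_use, pending_states=("REQUESTED",)):
    """Return ids of transactions whose state marks them as pending, in sorted order."""
    return sorted(tx for tx, state in timeline_to_use.items() if state in pending_states)


def execute_pending(timeline_to_use, execute):
    """S3: run `execute` on every pending transaction found in the timeline."""
    executed = []
    for tx in find_pending_transactions(timeline_to_use):
        execute(tx)          # the execution node performs the transaction
        executed.append(tx)
    return executed
```

Sorting the identifiers is a design choice of the sketch so that execution order is deterministic; the embodiment does not prescribe an order.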
Based on the above-mentioned time line obtaining method based on the data lake, the embodiment of the application further provides a time line obtaining device based on the data lake, which is explained and illustrated below with reference to fig. 7. Fig. 7 is a schematic structural diagram of a timeline acquiring device based on a data lake according to an embodiment of the present application. In addition, for technical details of the timeline acquisition device, please refer to the relevant content of the timeline acquisition method based on the data lake.
As shown in fig. 7, a data lake-based timeline acquisition apparatus 700 provided in an embodiment of the present application includes:
A creating unit 701, configured to create a first metadata client when a timeline pull condition is reached, and create a timeline view in the first metadata client;
an obtaining unit 702, configured to obtain a timeline to be used from a built-in metadata service of a management node by using the timeline view; wherein the time line to be used is stored in the built-in metadata service; the time line to be used is used for recording at least one transaction in the management node.
In one possible implementation, the at least one transaction includes at least one transaction in an incomplete state.
In one possible implementation, the built-in metadata service is used to store a real-time timeline in the management node.
In one possible implementation manner, the updating process of the time line to be used includes:
When a timeline update condition is reached, the management node creates a second metadata client;
the management node pulls a metadata timeline from a metadata system by using the second metadata client;
and the management node updates the time line to be used stored in the built-in metadata service by utilizing the metadata time line.
In one possible implementation, the data lake-based timeline acquisition device 700 further includes:
an execution unit, configured to execute the pending transaction when the timeline to be used indicates that the pending transaction has been created.
Based on the above description of the data lake-based timeline acquisition device 700, when the device determines that a timeline pull condition is reached, it creates a first metadata client and creates a timeline view in the first metadata client. The device then uses the timeline view to obtain the timeline to be used from the built-in metadata service of the management node, so that the timeline to be used represents at least one transaction in the management node (for example, at least one transaction in an incomplete state). The timeline acquisition device 700 can subsequently learn from the timeline to be used which transactions it needs to execute, and can thereby assist the management node in carrying out the writing task.
Because the built-in metadata service of the management node can provide the timeline directly to the data lake-based timeline acquisition device 700 through the timeline view, the timeline acquisition device 700 can obtain the timeline directly from the built-in metadata service and does not need to request it from the metadata system. This effectively avoids the adverse effects caused when the timeline acquisition device 700 requests the timeline directly from the metadata system (for example, unstable service of the metadata system and lower task concurrency), and thereby effectively improves how well the timeline acquisition device 700 assists the management node in carrying out the writing task.
Based on the data lake-based time line acquisition method, the embodiment of the application also provides an execution node based on the data lake, and the execution node can assist the management node in realizing the writing task.
In addition, data lake based executing nodes need to learn some transaction information in the management node by continually polling the timelines in the management node.
In addition, the execution node based on the data lake can utilize any implementation mode of the time line acquisition method based on the data lake provided by the embodiment of the application to achieve the purpose of acquiring the time line from the management node. For ease of understanding, some possible implementations of the execution node are described below.
In one possible implementation, an execution node based on a data lake is configured to create a first metadata client upon reaching a timeline pull condition, and create a timeline view in the first metadata client;
the execution node is further configured to acquire a time line to be used from a built-in metadata service of the management node by using the time line view; wherein the time line to be used is stored in the built-in metadata service; the time line to be used is used for recording at least one transaction in the management node.
In one possible implementation, the at least one transaction includes at least one transaction in an incomplete state.
In one possible implementation, the built-in metadata service is used to store a real-time timeline in the management node.
In a possible implementation manner, the updating process of the time line to be used includes:
When a timeline update condition is reached, the management node creates a second metadata client;
the management node pulls a metadata timeline from a metadata system by using the second metadata client;
and the management node updates the time line to be used stored in the built-in metadata service by utilizing the metadata time line.
In a possible implementation manner, the execution node is further configured to execute the pending transaction when the timeline to be used indicates that the pending transaction has been created.
In one possible implementation, the execution node is a Flink engine.
Based on the related content of the execution node, for the execution node based on the data lake provided by the embodiment of the application, when the execution node determines that the timeline pulling condition is reached, a first metadata client is created by the execution node, and a timeline view is created in the first metadata client; and the execution node acquires the time line to be used from the built-in metadata service of the management node by utilizing the time line view, so that the time line to be used can represent at least one transaction (such as at least one transaction in an unfinished state) in the management node, so that the subsequent execution node can know the transaction needing to be executed from the time line to be used, and the aim of assisting the management node to realize the writing task by the execution node can be realized.
Because the built-in metadata service of the management node can provide the timeline directly to the data lake-based execution node through the timeline view, the execution node can obtain the timeline directly from the built-in metadata service and does not need to request it from the metadata system. This effectively avoids the adverse effects caused when the execution node requests the timeline directly from the metadata system (for example, unstable service of the metadata system and lower task concurrency), and thereby effectively improves how well the execution node assists the management node in carrying out the writing task.
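The continual polling that the execution node performs to learn about transactions in the management node could look like the following Python sketch. The function `poll_timeline` and its parameters are hypothetical; the sketch assumes a fixed polling interval and a bounded number of rounds so that it terminates, neither of which is stated in the embodiment.

```python
import time


def poll_timeline(fetch_timeline, on_new_transaction, rounds,
                  interval_s=1.0, sleep=time.sleep):
    """Poll the timeline `rounds` times and report each transaction once.

    fetch_timeline:      callable returning a mapping of transaction id to state,
                         e.g. a read from the built-in metadata service
    on_new_transaction:  callback invoked with (transaction id, state) the first
                         time a transaction is seen
    """
    seen = set()
    for _ in range(rounds):
        timeline = fetch_timeline()
        for tx in sorted(timeline):
            if tx not in seen:
                seen.add(tx)
                on_new_transaction(tx, timeline[tx])
        sleep(interval_s)  # wait before the next pull
    return seen
```

Because every round issues one read, many execution nodes polling at once produce a steady request load; in the scheme described above that load falls on the management node's built-in metadata service rather than on the metadata system.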
In addition, the embodiment of the application also provides electronic equipment, which comprises a processor and a memory: the memory is used for storing instructions or computer programs; the processor is configured to execute the instructions or the computer program in the memory, so that the electronic device executes any implementation of the data lake-based timeline acquisition method provided by the embodiment of the present application.
Referring to fig. 8, a schematic structural diagram of an electronic device 800 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 8, the electronic device 800 may include a processing means (e.g., a central processor, a graphics processor, etc.) 801, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM803, various programs and data required for the operation of the electronic device 800 are also stored. The processing device 801, the ROM 802, and the RAM803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
In general, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, etc.; storage 808 including, for example, magnetic tape, hard disk, etc.; communication means 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 shows an electronic device 800 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 809, or installed from storage device 808, or installed from ROM 802. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 801.
The electronic device provided by the embodiment of the present disclosure belongs to the same inventive concept as the method provided by the above embodiment, and technical details not described in detail in the present embodiment can be seen in the above embodiment, and the present embodiment has the same beneficial effects as the above embodiment.
The embodiment of the application also provides a computer readable medium, wherein instructions or a computer program are stored in the computer readable medium, and when the instructions or the computer program are run on a device, the device is caused to execute any implementation mode of the timeline acquisition method based on the data lake provided by the embodiment of the application.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method described above.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software or by means of hardware. In some cases, the names of the units/modules do not constitute a limitation on the units themselves.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The embodiment of the application also provides a computer program product, which when being run on a terminal device, causes the terminal device to execute any implementation mode of the time line acquisition method based on the data lake.
It should be noted that the embodiments in the present description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are identical or similar between embodiments, the embodiments may be referred to one another. Because the system or device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant points, refer to the description of the method section.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A, only B, or both A and B, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" or a similar expression means any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.
It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A data lake-based timeline acquisition method, applied to data lake-based execution nodes, the method comprising:
When a timeline pull condition is reached, creating a first metadata client and creating a timeline view in the first metadata client;
Acquiring a time line to be used from built-in metadata service of a management node by utilizing the time line view; wherein the time line to be used is stored in the built-in metadata service; the time line to be used is used for recording at least one transaction in the management node.
2. The method of claim 1, wherein the at least one transaction comprises at least one transaction in an incomplete state.
3. The method of claim 1, wherein the built-in metadata service is used to store a real-time timeline in the management node.
4. The method of claim 1, wherein the updating of the time line to be used comprises:
When a timeline update condition is reached, the management node creates a second metadata client;
the management node pulls a metadata timeline from a metadata system by using the second metadata client;
and the management node updates the time line to be used stored in the built-in metadata service by utilizing the metadata time line.
5. The method according to claim 1, wherein the method further comprises:
and executing the transaction to be processed when the time line to be used indicates that the transaction to be processed is created.
6. A data lake-based timeline acquisition device, comprising:
a creation unit, configured to create a first metadata client when a timeline pull condition is reached, and create a timeline view in the first metadata client;
an obtaining unit, configured to obtain a time line to be used from a built-in metadata service of a management node by using the time line view; wherein the time line to be used is stored in the built-in metadata service; the time line to be used is used for recording at least one transaction in the management node.
7. An execution node based on a data lake, wherein the execution node is configured to create a first metadata client and create a timeline view in the first metadata client when a timeline pull condition is reached;
the execution node is further configured to acquire a time line to be used from a built-in metadata service of the management node by using the time line view; wherein the time line to be used is stored in the built-in metadata service; the time line to be used is used for recording at least one transaction in the management node.
8. An electronic device, the device comprising: a processor and a memory;
The memory is used for storing instructions or computer programs;
the processor for executing the instructions or computer program in the memory to cause the electronic device to perform the method of any of claims 1-5.
9. A computer readable medium, characterized in that it has stored therein instructions or a computer program which, when run on a device, causes the device to perform the method of any of claims 1-5.
10. A computer program product, characterized in that the computer program product, when run on a terminal device, causes the terminal device to perform the method of any of claims 1-5.
CN202210603049.0A 2022-05-30 Time line acquisition method and device based on data lake and execution node Active CN114968936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210603049.0A CN114968936B (en) 2022-05-30 Time line acquisition method and device based on data lake and execution node

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210603049.0A CN114968936B (en) 2022-05-30 Time line acquisition method and device based on data lake and execution node

Publications (2)

Publication Number Publication Date
CN114968936A CN114968936A (en) 2022-08-30
CN114968936B true CN114968936B (en) 2024-07-02

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115152A (en) * 2020-09-15 2020-12-22 招商局金融科技有限公司 Data increment updating and querying method and device, electronic equipment and storage medium
CN114341999A (en) * 2019-08-30 2022-04-12 通用电气精准医疗有限责任公司 System and method for graphical user interface for medical device trending

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant