CN117216034A - Data lineage analysis method and system - Google Patents
Data lineage analysis method and system
- Publication number
- CN117216034A (application number CN202311219895.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- task
- task instance
- parsing
- lineage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data lineage analysis method and system for parsing the data lineage information of a task instance, comprising the following steps: monitoring the state of the task instance and selecting a corresponding parsing method according to the parameters of the task instance, the parameters including at least the type of the task instance; parsing the task instance using the parsing method to obtain an input table/output table; and constructing data lineage information according to the input table/output table. The scheme provided by the invention improves the compatibility of data lineage technology, giving it a wider range of applicability and better data processing accuracy.
Description
Technical Field
The present invention relates to the field of data processing, and in particular to a method and system for analyzing data lineage.
Background
The intelligent age places ever-growing demands on data processing; accurate tracking and tracing of data can effectively improve the localization of data problems in a big data system and at the same time improve data quality. Data lineage technology is a method of tracking and recording the flow and transformation of data throughout its life cycle. The methods commonly used at present to generate data lineage are:
1. Manually entering data lineage: a data developer writes documents to record the lineage relationships, conversion logic and other information among data tables, and data entities and their lineage relationships are created and managed manually, for example through drag-and-drop operations or form input. This method relies on manual operations, is prone to errors and omissions, and is difficult to maintain and update.
2. Using the log and audit functions of the data warehouse: some data warehouse and database management systems (e.g., Snowflake, Redshift, MySQL, etc.) provide log and audit functions that record data manipulation and change history; data lineage information can also be extracted by writing scripts or tools that analyze these log and audit records. However, this approach is limited to specific systems that support log and audit functions, so compatibility is limited.
3. Parsing SQL to produce data lineage: data lineage information is automatically extracted and generated by analyzing the SQL statements in data processing tasks. This approach can infer the lineage relationships between data by parsing table names, field names, join operations, etc. in SQL statements. However, it relies on correct SQL statements and exact table and field naming conventions, and is not applicable to other types of tasks or development languages.
4. Using a data exploration tool: data exploration tools (e.g., Trifacta, DataRobot, etc.) can automatically analyze the statistical information, distribution, association and other characteristics of data sets and infer their lineage relationships. These tools can automatically explore and infer data lineage, reducing the effort of manual input and parsing. However, these tools may have compatibility issues with specific development languages or data warehouses, and they do not work with scheduling systems, so lineage information cannot be used to control task scheduling.
It can be seen that the above prior-art solutions have some inherent drawbacks, such as:
1. Low degree of automation and difficult maintenance: manual entry of lineage information, or methods that rely on log analysis, require manual operations or script writing, and maintaining and updating the lineage information is difficult.
2. Applicability only to specific development languages or data warehouses, with poor compatibility: some methods are only applicable to specific development languages or data warehouses and have limited compatibility with other systems.
3. No integration with the scheduling system, so lineage information cannot control task scheduling: the prior-art schemes lack tight integration between lineage information and the task scheduling system and cannot control the scheduling and execution of tasks through lineage information.
Therefore, how to provide a highly adaptable and accurate method for analyzing data lineage is a problem to be solved in the art.
Disclosure of Invention
The invention provides a data lineage analysis method and a data lineage analysis system, which aim to improve the compatibility of data lineage technology so that it has wider applicability and better data processing accuracy.
In order to achieve the above purpose, one technical scheme adopted by the invention is: a data lineage resolution method for resolving data lineage information of a task instance, comprising: monitoring the state of the task instance, and selecting a corresponding parsing method according to the parameters of the task instance, the parameters comprising at least the type of the task instance; parsing the task instance using the parsing method to obtain an input table/output table; and constructing data lineage information according to the input table/output table.
In a preferred embodiment, the parsing method includes at least a first parsing method for task instances based on non-database management and operating languages; the database management and operation language comprises at least an SQL class language.
In a preferred embodiment, the first parsing method includes: generating a logic execution plan tree according to the state of the task instance; traversing the logic execution plan tree and searching a first relevant node, and generating the input table/output table according to the first relevant node.
In a preferred embodiment, the parsing method includes at least a second parsing method for task instances based on the database management and operation language.
In a preferred embodiment, the second parsing method includes: constructing a language parser; parsing the task instance into an Abstract Syntax Tree (AST) using the language parser; traversing the Abstract Syntax Tree (AST) and looking for a second relevant node, generating the input/output table from the second relevant node.
In a preferred embodiment, the parsing method includes a third parsing method for log task instances, the third parsing method including: invoking an underlying computing engine to query and parse the execution information of the task instance; and generating the input table/output table according to the execution information.
In a preferred embodiment, said constructing data lineage information from the input table/output table comprises: storing the data lineage information in a relational database and/or a graph database.
In order to achieve the above purpose, another technical scheme adopted by the invention is as follows: a data lineage resolution system for performing any of the methods described above; the data lineage resolution system selects the corresponding parsing method according to the type of the task instance, and each parsing method is independently designed in the form of a lineage-parsing plug-in.
In order to achieve the above purpose, another technical scheme adopted by the invention is as follows: a scheduling system utilizing the method of any of the above or utilizing the data lineage resolution system of the above, the scheduling system resolving data lineage information for the task instance using the method or the data lineage resolution system.
In a preferred embodiment, when the dependency rules of the task instance include data-based dependencies, the scheduling system parses the dependency rules of the task instance based on the data lineage information.
Compared with the prior art, the invention has the following advantages: (1) through task classification and the establishment of corresponding parsing strategies, the invention can easily support multiple data development languages and multiple data sources; (2) the method can parse tasks based on non-database management and operation languages, breaking through the limitation of task types; (3) the invention can realize column-level lineage analysis and improve the accuracy of data processing task execution; (4) by tightly combining lineage analysis with the scheduling system, the complete lineage relationships of the whole data warehouse are constructed automatically; (5) compared with merely displaying data lineage relationships, the combination of lineage analysis and the scheduling system allows data lineage relationships to be applied effectively in many different scenarios on a big data platform, thereby improving data development.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the embodiments or the description of the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the invention, and that other drawings may be obtained from them without inventive effort by a person skilled in the art.
Fig. 1: a schematic flow chart of a data lineage analysis method provided by the invention.
Fig. 2: a schematic flow chart of a data lineage analysis method provided by the invention.
Fig. 3: a schematic flow chart of the determination of task instance dependency rules by a scheduling system provided by the invention.
Detailed Description
The following is a clear and complete description of the technical solutions in the embodiments of the present invention. It is obvious that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As used herein, "at least one" means one or more than one item.
The terms "comprises," "comprising," "including," and "having" are intended to be inclusive and mean that the described features, integers, steps, operations, elements, and/or components may be present, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used herein, a "task instance" refers to the execution of a particular task operation that accomplishes desired data processing, computation, analysis, or other business logic. A task instance is a specific execution unit of a task; it is responsible for executing a predefined task flow or operation and generating the corresponding result. Illustratively, in a data warehouse platform, task instances include, but are not limited to, data cleansing tasks, data analysis tasks, data import/export tasks, resource scheduling tasks, and business process tasks.
As used herein, a "scheduling system" refers to software or a service that manages and coordinates task execution in a computer system. It is responsible for triggering, scheduling, and monitoring the execution of tasks according to predetermined rules and conditions, following a set schedule or event triggers.
In this document, a "development language" refers to a formal language for writing computer programs that is used to interact with a computer and instruct it to perform specific tasks; it includes, but is not limited to, the following elements: grammar rules, semantic rules, keywords and identifiers, data types, control structures, libraries and frameworks, and compilation or interpretation.
As used herein, a "data source" refers to a source or carrier that provides data or information, including an accessible collection of data, a database, a file, a sensor, a web service, or another data provider.
Herein, "SQL-class language" refers to the standard query language for managing relational databases, or a programming language with syntax and features similar to SQL (Structured Query Language), used for storing, retrieving, modifying, and managing data in databases.
As used herein, a "dependency" or "dependency relationship" refers to a relationship between a task or operation and other tasks, in which the execution or completion of one task depends on the state, result, or execution order of another task; common dependencies include task-based dependencies, time-window-based dependencies, state-based dependencies, and the like.
The invention provides a data lineage analysis method and system that can parse data lineage for any type of task instance, without being limited by the development language, data source, etc. of the task instance.
Furthermore, the invention divides the data lineage analysis methods into several types, so that the matching data lineage analysis method can be selected according to the type of the task instance, thereby avoiding limitations imposed on data lineage analysis by the task type.
Furthermore, the invention provides a data lineage analysis method for task instances based on non-database management and operation languages, which can serve task instances based on non-database management and operation languages, such as unstructured data processing, file processing, API calls, and the like.
Still further, the lineage analysis method for task instances based on non-database management and operation languages is combined with the lineage analysis method for task instances based on database management and operation languages and the lineage analysis method for log task instances, so that together they can meet the data lineage analysis requirements of all types of task instances.
Referring to fig. 1, fig. 1 is a flow chart of a data lineage analysis method according to an embodiment of the invention. The method includes the following steps:
S1-1: the state of the task instance is monitored, and a corresponding parsing method is selected according to the parameters of the task instance.
Specifically, this embodiment selects a corresponding parsing method according to the parameters of the task instance, where the parameters of the task instance include at least the type of the task instance. At the same time, this embodiment monitors the state of the task instance.
In alternative embodiments, the parameters of a task instance may generally include an instance ID, a task type, a task configuration, a scheduling configuration, a timestamp, an owner, etc., which may be used as the basis for selecting a parsing method. In this embodiment, the data lineage analysis methods are divided according to task type; for example, for a Spark computing task, the physical execution plan of the task is parsed in order to infer the task's input tables and output tables.
In an alternative embodiment, one or more data lineage analysis methods are preset in the system, and after the parameters of the task instance to be parsed are determined, the corresponding data lineage analysis method can be selected for parsing.
A task instance may pass through multiple states during creation and execution, and this embodiment monitors these states in order to (1) ensure the accuracy of the data lineage: if a task fails or an abnormality occurs, incomplete or wrong data lineage information may result; by monitoring the state of the task instance, problems in task execution can be found and handled in time, ensuring the accuracy of the data lineage; (2) track data changes and flows: the state of the task instance can provide detailed information about the data processing procedure, including the start time, end time, execution progress, etc. of the task; by monitoring the task state, the change and flow path of the data can be tracked, and it can be understood how the data is converted and processed from the input tables to the output tables, which facilitates constructing the complete data lineage relationship. In an alternative embodiment, the monitoring of task instances may be performed by a scheduling system; see the description of the embodiments below. In an alternative embodiment, monitoring the state of the task instance serves as a pre-check of the data lineage analysis method: the data lineage analysis method is executed only after the task instance has run successfully; if the task instance fails to run, the data lineage analysis method is not executed, as illustrated in the sketch below.
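In this sketch, the TaskInstance fields, state strings, and parser registry are illustrative assumptions rather than part of the claimed method; it only shows the selection and pre-check logic described above.

```python
from dataclasses import dataclass

# Hypothetical task-instance record; field names and state strings are assumptions.
@dataclass
class TaskInstance:
    instance_id: str
    task_type: str   # e.g. "spark", "sql", "log"
    state: str       # e.g. "RUNNING", "SUCCESS", "FAILED"
    config: dict

# Registry mapping a task type to its lineage parser (the "first", "second"
# and "third" parsing methods described in the embodiments).
PARSER_REGISTRY = {}

def register_parser(task_type):
    def wrapper(fn):
        PARSER_REGISTRY[task_type] = fn
        return fn
    return wrapper

def parse_lineage(instance: TaskInstance):
    # Pre-check: lineage is parsed only for instances that ran successfully.
    if instance.state != "SUCCESS":
        return None
    parser = PARSER_REGISTRY.get(instance.task_type)
    if parser is None:
        raise ValueError(f"no lineage parser registered for {instance.task_type!r}")
    return parser(instance)  # each parser returns (input_tables, output_tables)
```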
S1-2: the task instance is parsed using a parsing method to obtain an input/output table.
This embodiment parses the task instance using a parsing method to obtain an input table/output table. Specifically, after the task instance has run, this embodiment parses the task instance according to the selected data lineage parsing method, where the parsed content includes the configuration, definition, metadata, and the like of the task instance, so as to determine the input tables and output tables of the task instance respectively.
Different types of task instances have different characteristics and requirements, the data processing manners and techniques involved differ, and different types of task instances may require different parsing methods and logic to parse their inputs and outputs. According to the task type, the invention selects the parsing method suited to that type, to ensure that correct data lineage information is acquired and, at the same time, to improve the compatibility and adaptability of data lineage analysis.
S1-3: data lineage information is constructed according to the input table/output table.
This embodiment constructs data lineage information according to the input table/output table. Specifically, the relevant information of the input table/output table generally includes table names, fields, data types, sources, etc.; using this information, this embodiment analyzes the logic of the task instance, the data conversion rules, or other metadata, and determines the relationships between the input tables and output tables, thereby obtaining the data lineage information.
In an alternative embodiment, building data lineage information from the input table/output table includes storing the data lineage information in a relational database and/or a graph database. In particular, the data lineage information can be expressed and stored in a variety of ways; for example, the source, destination, and conversion path of the data can be visualized by constructing a lineage graph. In an alternative embodiment, after obtaining the input/output tables of multiple task instances, the data information in the tables (which may be accurate to the table level and field level) is organized into a global directed acyclic graph, which is stored in a graph database for use. In alternative embodiments, the data lineage information can be used at least for lineage graph querying, lineage graph searching, task dependency prompting, impact analysis, and the like. A small illustrative sketch follows.
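Built here with the networkx library purely for illustration (the table and task names are made up), such a global table-level lineage graph and two typical queries might look like:

```python
import networkx as nx

# Table-level lineage DAG: an edge u -> v means table v is derived from table u.
lineage = nx.DiGraph()
lineage.add_edge("ods.orders", "dwd.orders_cleaned", task="clean_orders")
lineage.add_edge("ods.users", "dwd.user_profile", task="build_profile")
lineage.add_edge("dwd.orders_cleaned", "dws.daily_gmv", task="agg_gmv")
lineage.add_edge("dwd.user_profile", "dws.daily_gmv", task="agg_gmv")

# Lineage-graph queries: all upstream sources of a table, and the downstream
# tables affected when a source changes (impact analysis).
print(nx.ancestors(lineage, "dws.daily_gmv"))
print(nx.descendants(lineage, "ods.orders"))
```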
Referring to fig. 2, fig. 2 is a flow chart of a data lineage analysis method according to an embodiment of the invention. The method includes the following steps:
S2-1: the state of the task instance is monitored, and a corresponding parsing method is selected according to the parameters of the task instance.
Specifically, the embodiment selects a corresponding parsing method according to parameters of the task instance, where the parameters of the task instance at least include a type of the task instance. At the same time, the present embodiment monitors the status of the task instance.
In this embodiment, the parsing methods include at least a first parsing method for task instances based on a non-database management and operation language, a second parsing method for task instances based on a database management and operation language, and a third parsing method for task instances of a log; wherein the database management and operation language at least comprises SQL class language.
In an alternative embodiment, when the type of the task instance is not a database management or operation class or a log class, the parsing method will select the first parsing method, and execute steps S2-2 to S2-3; in an alternative embodiment, when the type of the task instance is a database management or operation class, the parsing method will select the second parsing method, and steps S2-4 to S2-6 are performed; in an alternative embodiment, when the type of task instance is a log class, the parsing method will select the third parsing method, and steps S2-7 to S2-8 are performed.
S2-2: a logical execution plan tree is generated based on the state of the task instance.
In this embodiment, generating a logical execution plan tree from the state of a task instance includes: acquiring a task instance state, analyzing the configuration of the task instance and constructing a logic execution plan tree.
Specifically, the present embodiment monitors the state of the task instance, and thus, when the task instance is executed, the current state of the task instance including the running state of the task instance, execution progress and procedure, execution result, and the like can be easily obtained. Configuration of task instances typically includes input parameters, output parameters, data source connections, conversion rules, etc., and by parsing the configuration of task instances in conjunction with the running process of task instances, a logical execution plan tree may be constructed.
In an alternative embodiment, a logical execution plan tree for the task instance is constructed based on the configuration and state information for the task instance. The logic execution plan tree consists of operation nodes and edges of task instances, each node represents a task operation step, and the edges represent the dependency relationship between task operations.
S2-3: traversing the logic execution plan tree and searching for a first relevant node, and generating an input table/output table according to the first relevant node.
In this embodiment, the first relevant node may be a node in the logical execution plan tree that reflects the flow and conversion relationships of the data in the task instance; after the input table/output table is generated from the first relevant node, step S2-9 is performed.
In an alternative embodiment, the first correlation node comprises at least a data manipulation node, a data conversion node, etc. Wherein the data operation node represents a data read or write operation related to the task instance; for example, a node reading an input table may be represented as a data manipulation node for obtaining input data of a task; similarly, a node written to the output table may also be denoted as a data manipulation node for writing output data of a task to a specified target table. The data conversion node represents data conversion operation in the task instance, and processes, calculates or converts input data to generate output data. Data conversion nodes typically involve a series of data processing steps such as data cleansing, data filtering, data aggregation, data computation, and the like.
In alternative embodiments, the manner of traversing the logical execution plan tree may be selected based on actual requirements, the data structure, and/or the characteristics of the programming language; for example, it may be implemented by recursion, iteration, and specific data structures (e.g., stacks or queues), as in the sketch below.
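A minimal sketch of such a traversal, assuming a simplified, engine-agnostic plan-node structure; real engines such as Spark expose their own logical plan classes, so PlanNode and its node kinds are assumptions of this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlanNode:
    kind: str                    # e.g. "TableScan", "Filter", "Join", "InsertInto"
    table: Optional[str] = None  # set only on data-operation nodes
    children: List["PlanNode"] = field(default_factory=list)

def collect_io_tables(root: PlanNode):
    """Iteratively traverse the plan tree and collect input/output tables."""
    inputs, outputs = set(), set()
    stack = [root]               # explicit stack instead of recursion
    while stack:
        node = stack.pop()
        # First relevant nodes: data-operation nodes that read or write tables.
        if node.kind == "TableScan" and node.table:
            inputs.add(node.table)
        elif node.kind in ("InsertInto", "CreateTableAsSelect") and node.table:
            outputs.add(node.table)
        stack.extend(node.children)
    return inputs, outputs
```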
S2-4: a language parser is built.
In this embodiment, for the task instance based on the database management and operation language, a language parser is used to parse the syntax structure and semantic information of the task instance.
In an alternative embodiment, existing tools and techniques may be selected for the construction, based on the grammar and semantic rules of the task instance's development language; for example, using tools such as ANTLR or JavaCC, the parser code can be generated automatically from the grammar rules.
S2-5: the task instance is parsed into an abstract syntax tree using a language parser.
The present embodiment uses a language parser to parse task instances into Abstract Syntax Trees (AST). Wherein each node of the abstract syntax tree represents a syntax structure of the language, such as a sentence, an expression, a function declaration, etc.
S2-6: traversing the abstract syntax tree and searching for a second related node, and generating the input table/output table according to the second related node.
In this embodiment, the second relevant node may be a node related to data read and write operations in the abstract syntax tree; after the input table/output table is generated from the second relevant node, step S2-9 is performed.
In an alternative embodiment, the second relevant node includes at least a read node, a write node, an expression node, a function call node, and a variable node. A read node represents an operation that reads data from an input table or data source; the node typically contains information about the table or data source read, such as the table name, file path, connection information, etc. A write node represents an operation that writes data into an output table or target data source; the node contains information about the table or data source written, such as the table name, file path, connection information, etc. An expression node represents an expression that operates on or processes the input table data; such nodes may contain logic that filters, converts, aggregates, etc. columns in the input table. A function call node represents an operation that calls a specific function or method to process the input table data; these nodes may perform some specific data manipulation, conversion, or computation on the input table. A variable node represents an operation that references a particular variable or column in the input table; these nodes are used to represent data fields or columns in the input table.
S2-7: the underlying computing engine is invoked, and the execution information of the task instance is queried and parsed.
This embodiment uses the underlying computing engine to query and parse the execution information of the task instance. Specifically, after establishing a connection with the underlying computing engine, the task instance is parsed through specific instructions.
S2-8: an input/output table is generated from the execution information.
Each step in the execution information provides relevant information such as input tables, output tables, connection operations, etc. By analyzing the execution information, the tables involved in the task instance and the operations between them can be understood.
When analyzing the execution information, attention may be paid to specific operators (e.g., JOIN, UNION, etc.) and the input/output tables of those operators in order to understand the inputs and outputs of the task instance. After the input table/output table is generated, step S2-9 is performed.
S2-9: data lineage information is constructed according to the input table/output table.
In this embodiment, constructing data lineage information from the input table/output table includes: storing the data lineage information in a relational database and/or a graph database. Specifically, the relevant information of the input table/output table generally includes table names, fields, data types, sources, etc.; using this information, this embodiment analyzes the logic of the task instance, the data conversion rules, or other metadata, and determines the relationships between the input tables and output tables, thereby obtaining the data lineage information. After the data lineage analysis result is obtained, the data lineage information is stored in a database for use. The scheduling system selects one or more of a relational database and a graph database for storage according to the specific requirements and scenario.
The following examples of an SQL task instance (representing task instances based on database management and operation languages), a log task instance, and a Spark task instance (representing task instances based on non-database management and operation languages) are described for illustration and do not limit the data lineage analysis method provided by any embodiment of the present invention.
This embodiment selects the corresponding parsing method according to the parameters of the task instance, where the parsing methods include at least a first parsing method for task instances based on non-database management and operation languages, a second parsing method for task instances based on database management and operation languages, and a third parsing method for log task instances; the database management and operation languages include at least SQL-class languages.
Specifically, for the three task instances illustrated in this embodiment (the Spark task instance, the SQL task instance, and the log task instance), the first parsing method, the second parsing method, and the third parsing method will be used for parsing, respectively, according to the type of task instance.
In an alternative embodiment, the method for parsing a Spark task instance using the first parsing method is as follows: the logical execution plan generated during execution by the Spark engine is acquired by registering an Event Listener with the Spark platform. Specifically, during Spark task execution, a custom Event Listener may be registered to listen for events generated by the Spark engine, including the generation of the logical execution plan. This embodiment obtains the generated logical execution plan tree by registering such a Listener. The logical execution plan is then traversed, using the listener mechanism of the Spark platform, to acquire the first relevant nodes of the task and generate the input table/output table. Specifically, this embodiment traverses the logical execution plan tree of the Spark task using the visitor pattern to access its nodes. During lineage parsing, the nodes related to the inputs and outputs of the task are of interest; these nodes may be data source nodes, conversion operation nodes, output nodes, and the like.
In an alternative embodiment, the method for parsing an SQL task instance using the second parsing method comprises: constructing an SQL Parser using the ANTLR4 tool; parsing the SQL task instance into an Abstract Syntax Tree (AST) using the SQL Parser; and traversing the abstract syntax tree using the visitor pattern, searching for the second relevant nodes, and generating the input table/output table, as illustrated in the sketch below.
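ANTLR4 grammar construction is too long for a short example; as a stand-in that follows the same parse-to-AST-to-tables flow, the sketch below uses the open-source sqlglot parser instead of the embodiment's custom SQL Parser (sqlglot, the dialect, and the sample statement are assumptions of this illustration, not the tools named by the patent).

```python
import sqlglot
from sqlglot import exp

def qualified(table: exp.Table) -> str:
    # Fully qualified name, e.g. "dwd.orders_cleaned"; the db part may be empty.
    return f"{table.db}.{table.name}" if table.db else table.name

def sql_io_tables(sql: str, dialect: str = "hive"):
    """Parse one SQL statement into an AST and collect its input/output tables."""
    tree = sqlglot.parse_one(sql, read=dialect)

    # Write node: the target of an INSERT / CREATE TABLE AS statement, if any.
    outputs = set()
    if isinstance(tree, (exp.Insert, exp.Create)) and tree.this is not None:
        target = tree.this.find(exp.Table)
        if target is not None:
            outputs.add(qualified(target))

    # Read nodes: every other table reference in the tree.
    inputs = {qualified(t) for t in tree.find_all(exp.Table)} - outputs
    return inputs, outputs

print(sql_io_tables(
    "INSERT INTO dws.daily_gmv "
    "SELECT order_date, SUM(amount) FROM dwd.orders_cleaned GROUP BY order_date"
))
```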
In an alternative embodiment, the method for parsing a log task instance using the third parsing method comprises: using the Presto underlying computing engine, analyzing the query's execution information through an EXPLAIN command, the execution information comprising the execution details of the query, so as to obtain the analysis result of the query; the analysis output is then parsed and the input table/output table is generated, as sketched below.
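A rough sketch of this EXPLAIN-based approach using the Trino/Presto Python client; the EXPLAIN (TYPE IO, FORMAT JSON) output structure varies across Presto/Trino versions, so the JSON field names used here, as well as the connection parameters and table names, are assumptions.

```python
import json
import trino  # open-source Trino/Presto client; its use here is an assumption

def io_tables_from_explain(cursor, query: str):
    """Ask the underlying engine for IO-level execution information.

    Field names such as "inputTableColumnInfos" and "outputTable" are
    assumptions and may differ between Presto/Trino versions.
    """
    cursor.execute(f"EXPLAIN (TYPE IO, FORMAT JSON) {query}")
    plan = json.loads(cursor.fetchone()[0])

    inputs = set()
    for info in plan.get("inputTableColumnInfos", []):
        st = info["table"]["schemaTable"]
        inputs.add(f'{st["schema"]}.{st["table"]}')

    outputs = set()
    if plan.get("outputTable"):
        st = plan["outputTable"]["schemaTable"]
        outputs.add(f'{st["schema"]}.{st["table"]}')
    return inputs, outputs

# Connection parameters are placeholders.
conn = trino.dbapi.connect(host="coordinator", port=8080, user="lineage", catalog="hive")
print(io_tables_from_explain(
    conn.cursor(),
    "INSERT INTO dws.daily_gmv SELECT order_date, sum(amount) "
    "FROM dwd.orders_cleaned GROUP BY order_date",
))
```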
In an alternative embodiment, in this step, the data lineage information is constructed by analyzing the attributes of the input/output nodes to determine the mutual generation relationships of the input/output columns. Specifically, once the nodes associated with the task's inputs and outputs are determined, the attributes of those nodes, such as column name, column type, etc., need to be analyzed. By analyzing the attributes, the mutual generation relationships between columns of data, i.e., which columns are generated from which other columns, can be determined. From these generation relationships, a data lineage graph can be constructed and stored to represent the lineage relationships between data. In an alternative embodiment, the data lineage information is stored by combining a mature relational database (PostgreSQL), a graph database (such as Neo4j), and a full-text search engine (Elasticsearch), so that query and search requirements in different modes can be met; a persistence sketch follows.
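A minimal sketch of persisting column-level generation relationships to a graph database with the official neo4j Python driver; the connection parameters, node label, relationship type, and sample columns are illustrative assumptions.

```python
from neo4j import GraphDatabase

# Connection parameters are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_column_lineage(edges):
    """edges: iterable of (src_table, src_column, dst_table, dst_column, task_id)."""
    with driver.session() as session:
        for src_t, src_c, dst_t, dst_c, task in edges:
            # MERGE keeps the graph idempotent when a task is re-parsed.
            session.run(
                "MERGE (s:Column {table: $src_t, name: $src_c}) "
                "MERGE (d:Column {table: $dst_t, name: $dst_c}) "
                "MERGE (s)-[:GENERATES {task: $task}]->(d)",
                src_t=src_t, src_c=src_c, dst_t=dst_t, dst_c=dst_c, task=task,
            )

store_column_lineage([
    ("dwd.orders_cleaned", "amount", "dws.daily_gmv", "gmv", "agg_gmv"),
])
```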
It should be understood that the task examples described above are only for illustrating the implementation steps of the present invention, and the method provided by the present invention may be applied to a wider range of task examples.
The invention also provides a data lineage analysis system for executing the method provided by any of the above embodiments. The data lineage analysis system can select the corresponding parsing method according to the type of the task instance, and each parsing method is independently designed in the form of a lineage-parsing plug-in.
In alternative embodiments, the corresponding parsing algorithms, logic or rules needed to parse a particular task type are designed in the plug-in. These parsing logic may involve parsing code, scripts, instructions, configuration files, etc. of the task. Meanwhile, the design of the plug-in considers the data processing characteristics of specific task types, such as data conversion, data migration, data analysis and the like, so as to ensure the accuracy and the integrity of analysis.
In alternative embodiments, the lineage-parsing logic for different task types may be designed as separate plug-ins, giving the system a well-modularized structure. When a new task type is introduced, the system can be extended by writing a new plug-in without modifying the entire parsing system. Meanwhile, each plug-in can apply customized parsing logic to the characteristics and requirements of its specific task type, so that the input tables and output tables of the task can be parsed more accurately, providing a more accurate lineage relationship. In an alternative embodiment, the plug-ins interact with the scheduling system through a unified interface, so that the parsing results of different task types can be processed and stored in a unified format and manner, providing consistent lineage information output and usage, as sketched below.
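One possible realization of this per-task-type plug-in design with a unified interface, sketched in Python; the class names and the parse() signature are assumptions, and the sketch complements the dispatch example shown earlier.

```python
from abc import ABC, abstractmethod
from typing import Dict, Set, Tuple

class LineageParserPlugin(ABC):
    """Unified interface that every lineage-parsing plug-in implements."""
    task_type: str = ""

    @abstractmethod
    def parse(self, instance) -> Tuple[Set[str], Set[str]]:
        """Return (input_tables, output_tables) for one task instance."""

PLUGINS: Dict[str, LineageParserPlugin] = {}

def register(plugin: LineageParserPlugin) -> None:
    PLUGINS[plugin.task_type] = plugin

class SparkPlanPlugin(LineageParserPlugin):
    task_type = "spark"
    def parse(self, instance):
        return set(), set()   # would walk the logical execution plan tree (first method)

class SqlAstPlugin(LineageParserPlugin):
    task_type = "sql"
    def parse(self, instance):
        return set(), set()   # would parse the statement into an AST (second method)

# Adding a new task type only requires registering a new plug-in; the scheduling
# system keeps consuming results through the same parse() interface.
register(SparkPlanPlugin())
register(SqlAstPlugin())
```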
The invention also provides a scheduling system which uses the data lineage analysis method provided by any of the above embodiments, or the data lineage analysis system provided by any of the above embodiments, to parse the data lineage information of task instances, thereby completing the scheduling of the system.
The scheduling system provided by the invention is suitable for various scenarios requiring task scheduling and task dependency management, such as big data processing, distributed computing, and data warehouse construction, and is particularly suitable for environments such as cloud servers. It can help manage and track the lineage relationships between tasks and accurately determine the dependency relationships and data flow between task instances. This is significant for task scheduling, performance optimization, fault detection, data tracing, and the like. In addition, the invention tightly combines lineage analysis with the scheduling system and can automatically construct the complete lineage relationships of the whole data management system.
Referring to fig. 3, fig. 3 is a schematic flow chart of determining a task instance dependency rule by a scheduling system according to an embodiment of the present invention.
In this embodiment, the scheduling system is responsible for the creation, scheduling, and monitoring of tasks, determining when tasks run and their execution periods. Task instances are created by the scheduling system according to the scheduling period. The invention establishes the dependency relationships between tasks by parsing the dependency rules of the tasks, so that task instances are created and executed in the correct order and with the correct dependencies, ensuring the correct execution order of tasks, avoiding data loss or errors, and improving the reliability and efficiency of the whole task flow.
Specifically, after creating a task instance (e.g., step S3-1), the scheduling system parses the dependency rules of the task instance (e.g., step S3-2) in order to determine the upstream task instances on which that task instance depends. The scheduling system begins running the current task instance when all the upstream dependencies of the task instance are satisfied (i.e., all the depended-on task instances have completed). For the avoidance of doubt, the dependency relationships between tasks and the lineage relationships between data are independent of each other (although closely related): the former only define the execution order between tasks (specifically, task instances), while the derivation relationships between the data produced by the tasks are obtained by lineage analysis. The various dependency rules mentioned in the present invention refer to the dependency relationships between tasks. A task's dependency relationships may differ from its lineage relationships, and for a scheduling system, task scheduling depends only on the task dependency relationships.
When the dependency rules of the task instance include data-based dependencies, the scheduling system parses the dependency rules of the task instance from the data lineage information (e.g., step S3-3). Specifically, the dependency rules of this embodiment include at least data-based dependency rules. Data-based dependencies involve relationships between tasks and data; for example, some tasks may require data generated by other tasks as input, so the availability of that data must be guaranteed before those tasks are performed. In this case, the scheduling system may use the data lineage analysis method or system provided by the present invention to parse the data lineage information between tasks and determine the upstream task instances on which the current task instance depends, so as to ensure the correct availability of the data; a sketch of this resolution step follows.
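The sketch below reuses the table-level graph from the earlier networkx example; the edge attribute carrying the producing task's ID is an assumption of that illustration.

```python
import networkx as nx

def resolve_data_dependencies(input_tables, lineage: nx.DiGraph):
    """Derive upstream task IDs from the table-level lineage graph.

    Assumes each lineage edge carries a 'task' attribute naming the task that
    produced the downstream table, as in the earlier graph-building sketch.
    """
    upstream_tasks = set()
    for table in input_tables:
        if table in lineage:
            for _, _, attrs in lineage.in_edges(table, data=True):
                if "task" in attrs:
                    upstream_tasks.add(attrs["task"])
    return upstream_tasks

def ready_to_run(input_tables, finished_tasks, lineage):
    # Start the instance only after every data-dependent upstream instance finishes.
    return resolve_data_dependencies(input_tables, lineage).issubset(finished_tasks)
```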
More specifically, the scheduling system determines the data lineage analysis method for a task instance at the time the task instance is created. The scheduling system comprises at least a task scheduler and a dependency resolution engine; the task scheduler performs full-life-cycle scheduling operations for the system, including resource allocation, dependency analysis, state tracking, log collection, error analysis, alarms, and the like; the dependency resolution engine is invoked by the task scheduler to parse the task dependency rules and determine the final task dependencies. In alternative embodiments, the task scheduler may be implemented with a known scheduling framework, such as Argo Workflows or Prefect.
In an alternative embodiment, the scheduling system optionally sets up a task execution monitoring mechanism for capturing the execution process and related information of task instances. The monitoring can be realized by means of logging, event listening, hook functions, and the like.
The foregoing describes the principles and embodiments of the present invention so that they may be better understood. It will be apparent to those skilled in the art that various changes and modifications can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A data lineage resolution method, wherein the method is used for resolving data lineage information of a task instance, comprising:
monitoring the state of the task instance, and selecting a corresponding parsing method according to the parameters of the task instance; the parameters comprise at least the type of the task instance;
parsing the task instance using the parsing method to obtain an input/output table;
and constructing data lineage information according to the input table/output table.
2. The method of claim 1, wherein the parsing method comprises at least a first parsing method for non-database management and operation language based task instances; the database management and operation language comprises at least an SQL class language.
3. The method of claim 2, wherein the first parsing method comprises:
generating a logic execution plan tree according to the state of the task instance;
traversing the logic execution plan tree and searching a first relevant node, and generating the input table/output table according to the first relevant node.
4. A method according to any of claims 1-3, characterized in that the parsing method comprises at least a second parsing method for task instances based on the database management and operation language.
5. The method of claim 4, wherein the second parsing method comprises:
constructing a language parser;
parsing the task instance into an Abstract Syntax Tree (AST) using the language parser;
traversing the Abstract Syntax Tree (AST) and looking for a second relevant node, generating the input/output table from the second relevant node.
6. The method of any of claims 1-5, wherein the parsing method comprises a third parsing method for log task instances, the third parsing method comprising:
invoking an underlying computing engine, and querying and parsing the execution information of the task instance;
and generating the input table/output table according to the execution information.
7. The method of any one of claims 1-6, wherein said constructing data lineage information from the input/output tables includes: storing the data lineage information in a relational database and/or a graph database.
8. A data lineage resolution system, characterized in that the data lineage resolution system is adapted to perform the method of any one of claims 1-7; the data lineage resolution system selects the corresponding parsing method according to the type of the task instance, and each parsing method is independently designed in the form of a lineage-parsing plug-in.
9. A scheduling system using the method of any one of claims 1-7 or using the data lineage resolution system of claim 8, wherein the scheduling system uses the method or the data lineage resolution system to resolve data lineage information for the task instance.
10. The scheduling system of claim 9, wherein when the dependency rules for the task instance include data-based dependencies, the scheduling system parses the dependency rules for the task instance based on the data lineage information.
Priority Applications (1)
- Application number: CN202311219895.3A | Priority date: 2023-09-20 | Filing date: 2023-09-20 | Title: Data lineage analysis method and system

Publications (1)
- Publication number: CN117216034A | Publication date: 2023-12-12

Family
- ID=89047812

Country Status (1)
- CN: application CN202311219895.3A filed 2023-09-20, published as CN117216034A (en), status Pending
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- REG: Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40103235; Country of ref document: HK)