CN110555032A

CN110555032A - Data blood relationship analysis method and system based on metadata

Info

Publication number: CN110555032A
Application number: CN201910850181.XA
Authority: CN
Inventors: 郑波; 张强; 饶鑫淞; 杨川明
Original assignee: Beijing Sohu New Media Information Technology Co Ltd
Current assignee: Beijing Sohu New Media Information Technology Co Ltd
Priority date: 2019-09-09
Filing date: 2019-09-09
Publication date: 2019-12-10

Abstract

The invention discloses a metadata-based data blood relationship analysis method and system. The method comprises the following steps: defining the lexical rules and grammar rules of the structured query language through an open-source parser, performing lexical and syntactic analysis of the structured query language, and converting it into an abstract syntax tree; traversing the abstract syntax tree to abstract out the basic composition units of the query; traversing the basic composition units of the query to generate an execution operation tree; transforming the execution operation tree through a logic layer optimizer; traversing the execution operation tree and translating it into a task tree; transforming the task tree through a physical layer optimizer to generate a final execution plan; and analyzing the statements of the Hive query language based on the final execution plan to obtain the input and output tables, the fields and the corresponding processing conditions. The invention can effectively sort out the relationships among data tables and fields and analyze the blood relationship of the data.

Description

Data blood relationship analysis method and system based on metadata

Technical Field

The invention relates to the technical field of data processing, and in particular to a metadata-based data blood relationship analysis method and system.

Background

Simply put, the data blood relationship (data lineage) is the upstream and downstream, source-to-destination relationship between data, that is, where data comes from and where it goes. Its importance is self-evident. For example, if a piece of data has a problem, it can be traced upstream along the blood relationship to find the link where the problem arose. In addition, the dependency relationships between the tasks that produce the data can be established through the blood relationship of the data, so as to assist job scheduling in the scheduling system, or to judge which downstream data a failed or faulty task may affect. As the number of tables and models built in the data warehouse grows, metadata management becomes more and more important, and the blood relationship of metadata tables maintains the relationships between tables. Good metadata management makes the relationship between each table and model clearly visible. Mining the blood relationship of metadata plays an important role in tracking data flow, troubleshooting business problems, reducing maintenance cost and improving development efficiency.
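The upstream check and the downstream impact judgment described above can be pictured as reachability queries on a directed graph. The following is only a minimal sketch under stated assumptions: the class `LineageGraph`, its method names and the table names are hypothetical and do not come from the invention.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph: an edge src -> dst means dst is computed from src."""

    def __init__(self):
        self.downstream = defaultdict(set)
        self.upstream = defaultdict(set)

    def add_edge(self, src, dst):
        self.downstream[src].add(dst)
        self.upstream[dst].add(src)

    def _reach(self, start, edges):
        # Breadth-first search over the chosen edge direction.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def impacted_by(self, table):
        """Tables that may be affected if `table` is wrong or its task fails."""
        return self._reach(table, self.downstream)

    def sources_of(self, table):
        """Tables to check upstream when `table` looks wrong."""
        return self._reach(table, self.upstream)


# Hypothetical example tables.
graph = LineageGraph()
graph.add_edge("ods.log", "dwd.pv")
graph.add_edge("dwd.pv", "ads.report")
graph.add_edge("dim.user", "ads.report")
```

Here `graph.impacted_by("ods.log")` answers the downstream question and `graph.sources_of("ads.report")` the upstream one.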

At present, several types of problems are often encountered in the data warehouse:

1. When two data reports are compared and their results differ greatly, the dimension information of the analysis indicators has to be checked manually, for example where a data indicator originally comes from and what its processing conditions are, before the cause of the problem can finally be determined.

2. When a field of a basic data table has to be modified for some reason, the impact of the field on the data warehouse must be evaluated before a plan can be made, which is time-consuming and labor-intensive.

At present, only manual maintenance can be relied on; once a script is changed and the manual maintenance is omitted or not timely, the recorded relationships become inaccurate.

Therefore, how to effectively sort out the relationships among data tables and fields and analyze the blood relationship of data is an urgent problem to be solved.

Disclosure of Invention

In view of this, the present invention provides a metadata-based data blood relationship analysis method, which can effectively sort out the relationships among data tables and fields and analyze the blood relationship of the data.

The invention provides a metadata-based data blood relationship analysis method, which comprises the following steps:

defining the lexical rules and grammar rules of the structured query language through an open-source parser, performing lexical and syntactic analysis of the structured query language, and converting the structured query language into an abstract syntax tree;

traversing the abstract syntax tree to abstract out the basic composition units of the query;

traversing the basic composition units of the query to generate an execution operation tree;

transforming the execution operation tree through a logic layer optimizer;

traversing the execution operation tree and translating it into a task tree;

transforming the task tree through a physical layer optimizer to generate a final execution plan;

and analyzing the statements of the Hive query language based on the final execution plan to obtain the input and output tables, the fields and the corresponding processing conditions.

Preferably, the basic composition unit of the query is the basic composition unit of the structured query language, and includes: an input source, a calculation process and an output.

Preferably, traversing the abstract syntax tree to abstract out the basic composition units of the query includes:

traversing the abstract syntax tree in pre-order, and storing different Token nodes into the corresponding attributes.

Preferably, traversing the basic composition units of the query to generate an execution operation tree includes:

traversing the syntax attributes stored in the QB and QBParseInfo objects generated in the previous process, and generating the execution operation tree.

A metadata-based data blood relationship analysis system, comprising:

the first parsing module is used for defining the lexical rules and grammar rules of the structured query language through an open-source parser, performing lexical and syntactic analysis of the structured query language, and converting the structured query language into an abstract syntax tree;

the first traversal module is used for traversing the abstract syntax tree and abstracting a basic composition unit of the query;

the second traversal module is used for traversing the basic composition unit of the query and generating an execution operation tree;

A first transformation module for performing the execution operation tree transformation by a logic layer optimizer;

The third traversal module is used for traversing the execution operation tree and translating the execution operation tree into a task tree;

the second transformation module is used for transforming the task tree through the physical layer optimizer to generate a final execution plan;

and the second analysis module is used for analyzing the statement of the Hive query language based on the final execution plan to analyze an input/output table, a field and a corresponding processing condition.

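Preferably, the first traversal module is specifically configured to:

traverse the abstract syntax tree in pre-order, and store different Token nodes into the corresponding attributes.

Preferably, the second traversal module is specifically configured to:

traverse the syntax attributes stored in the QB and QBParseInfo objects generated in the previous process, and generate the execution operation tree.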

In summary, the invention discloses a metadata-based data blood relationship analysis method. When the blood relationship of data needs to be analyzed, the lexical rules and grammar rules of the structured query language are first defined through an open-source parser, lexical and syntactic analysis is performed, and the structured query language is converted into an abstract syntax tree. The abstract syntax tree is then traversed to abstract out the basic composition units of the query, and these units are traversed to generate an execution operation tree. The execution operation tree is transformed by a logic layer optimizer, then traversed and translated into a task tree, and the task tree is transformed by a physical layer optimizer to generate a final execution plan. Finally, the statements of the Hive query language are analyzed based on the final execution plan to obtain the input and output tables, the fields and the corresponding processing conditions. The invention can effectively sort out the relationships among data tables and fields and analyze the blood relationship of the data.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of embodiment 1 of the metadata-based data blood relationship analysis method disclosed in the present invention;

FIG. 2 is a schematic structural diagram of embodiment 1 of the metadata-based data blood relationship analysis system disclosed in the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

FIG. 1 is a flowchart of embodiment 1 of the metadata-based data blood relationship analysis method disclosed in the present invention. As shown in FIG. 1, the method may include the following steps:

S101, defining the lexical rules and grammar rules of the structured query language through an open-source parser, performing lexical and syntactic analysis of the structured query language, and converting the structured query language into an abstract syntax tree;

When the blood relationship of data needs to be analyzed, the lexical rules and syntax rules of SQL (Structured Query Language) are first defined through Antlr (an open-source parser that can automatically generate a syntax tree from its input and display it visually), lexical and syntactic analysis is performed, and the SQL is converted into an abstract syntax tree, the AST Tree.

S102, traversing an abstract syntax tree to abstract a basic composition unit of a query;

Then, the AST Tree is traversed, and a basic composition unit QueryBlock of the query is abstracted.

S103, traversing the basic composition units of the query to generate an execution operation tree;

After the basic composition unit QueryBlock of the query is abstracted out, the logical execution plan is generated, that is, the QueryBlock is traversed and translated into the execution operation tree OperatorTree.

S104, transforming the execution operation tree through a logic layer optimizer;

After the execution operation tree is generated, the logical execution plan is optimized: the logic layer optimizer transforms the OperatorTree and merges unnecessary ReduceSinkOperators, thereby reducing the amount of shuffle data.
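To make the merging of unnecessary ReduceSinkOperators concrete, the sketch below works on a linear chain of operators rather than a full tree. It is an illustration under a strong simplifying assumption (every operator between two shuffles keeps the rows partitioned by the same keys), not the optimizer rule used by Hive; the `Op` class, the short operator codes and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str          # hypothetical short codes: "TS", "RS", "GBY", "SEL", "FS"
    keys: tuple = ()   # partition keys, only meaningful for "RS" (ReduceSink)


def merge_reduce_sinks(chain):
    """Drop a ReduceSink whose keys equal those of the closest earlier ReduceSink.

    Simplifying assumption: the operators in between keep the data
    partitioned by the same keys, so the second shuffle is redundant.
    """
    result, last_rs_keys = [], None
    for op in chain:
        if op.kind == "RS":
            if op.keys == last_rs_keys:
                continue            # redundant shuffle: skip it
            last_rs_keys = op.keys
        result.append(op)
    return result


chain = [Op("TS"), Op("RS", ("uid",)), Op("GBY"),
         Op("RS", ("uid",)), Op("GBY"), Op("FS")]
optimized = merge_reduce_sinks(chain)
```

In this example the second shuffle on `uid` is removed, so only one shuffle remains; a chain whose two ReduceSinks use different keys is left unchanged.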

S105, traversing the execution operation tree and translating the operation tree into a task tree;

The OperatorTree is then traversed and the physical execution plan is generated, that is, the OperatorTree is translated into the TaskTree.

S106, transforming the task tree through a physical layer optimizer to generate a final execution plan;

Then, the physical execution plan is optimized, that is, the physical layer optimizer transforms the TaskTree to generate the final execution plan.

S107, analyzing the statements of the Hive query language based on the final execution plan to obtain the input and output tables, the fields and the corresponding processing conditions.

Finally, an application programming interface (API) provided by Hive is used to obtain the final execution plan of the HQL (Hive Query Language) and thereby the number of jobs of the HQL. (Hive is a Hadoop-based data warehouse tool that can map structured data files into database tables and provides SQL-like query functions.) The statements of the Hive query language are then analyzed according to the final execution plan to obtain the input and output tables, the fields and the corresponding processing conditions, which serve as the metadata management and blood relationship analysis information of the Hive tables.
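The translation of an operator tree into tasks can be pictured as cutting the operators at shuffle boundaries. The sketch below is a simplified illustration on a linear chain, assuming that each ReduceSink ("RS") ends a map phase and that a second shuffle forces a new MapReduce job; the function name and the short operator codes are hypothetical, and the temporary-file scan that would start a follow-up job is omitted.

```python
def split_into_jobs(ops):
    """Cut a linear operator chain into MapReduce jobs at ReduceSink boundaries."""
    jobs = [{"map": [], "reduce": []}]
    phase = "map"
    for op in ops:
        if op == "RS" and phase == "reduce":
            # A second shuffle cannot run inside the same job: start a new one.
            jobs.append({"map": [], "reduce": []})
            phase = "map"
        jobs[-1][phase].append(op)
        if op == "RS":
            phase = "reduce"   # everything after the shuffle runs reduce-side
    return jobs


# Two shuffles give two jobs; after merging the shuffles only one job remains.
jobs = split_into_jobs(["TS", "RS", "GBY", "RS", "GBY", "FS"])
merged_jobs = split_into_jobs(["TS", "RS", "GBY", "GBY", "FS"])
```

Counting the entries of `jobs` gives the number of jobs in this simplified model, which is why removing a redundant shuffle also removes a job.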
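How input and output tables, fields and processing conditions might be read off a final plan can be sketched as a walk over the plan's operators. The nested-dictionary plan format below is purely hypothetical (it is not the output format of any Hive API), and the table, column and predicate values are invented for illustration.

```python
def extract_lineage(plan):
    """Walk a plan given as nested dicts and collect tables, fields and conditions."""
    lineage = {"inputs": set(), "outputs": set(), "fields": set(), "conditions": []}
    stack = [plan]
    while stack:
        node = stack.pop()
        kind = node["op"]
        if kind == "TableScan":        # where the data is read from
            lineage["inputs"].add(node["table"])
        elif kind == "FileSink":       # where the data is written to
            lineage["outputs"].add(node["table"])
        elif kind == "Filter":         # processing condition
            lineage["conditions"].append(node["predicate"])
        elif kind == "Select":         # fields that flow through
            lineage["fields"].update(node["columns"])
        stack.extend(node.get("children", []))
    return lineage


# Hypothetical plan for: INSERT ... dwd.pv SELECT uid, url FROM ods.log WHERE dt = ...
plan = {
    "op": "TableScan", "table": "ods.log", "children": [
        {"op": "Filter", "predicate": "dt = '2019-09-09'", "children": [
            {"op": "Select", "columns": ["uid", "url"], "children": [
                {"op": "FileSink", "table": "dwd.pv"}]}]}],
}
info = extract_lineage(plan)
```

The resulting record pairs each output table with its input tables, fields and conditions, which is the kind of information the step above stores as lineage metadata.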

In summary, in the above embodiment, when the blood relationship of data needs to be analyzed, the lexical rules and grammar rules of the structured query language are first defined through an open-source parser, lexical and syntactic analysis is performed, and the structured query language is converted into an abstract syntax tree. The abstract syntax tree is then traversed to abstract out the basic composition units of the query, and these units are traversed to generate an execution operation tree. The execution operation tree is transformed by the logic layer optimizer, then traversed and translated into a task tree, and the task tree is transformed by the physical layer optimizer to generate the final execution plan. Finally, the statements of the Hive query language are analyzed based on the final execution plan to obtain the input and output tables, the fields and the corresponding processing conditions. The invention can effectively sort out the relationships among data tables and fields and analyze the blood relationship of the data.

Specifically, in the above embodiment, the AST is traversed depth-first. When the token of an operation is encountered, the current operation is determined; when a clause is encountered, the current processing state is pushed onto a stack and the clause is processed, and after the clause has been processed the stack is popped. While a clause is being processed, when a sub-query is encountered, the information of the current sub-query is saved and its relationship with the parent query is determined, so that a tree structure is finally formed; when a field or condition is encountered, the current field and condition information is recorded, forming a Block, and the procedure is called in a nested manner.
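The push and pop behaviour described above can be sketched as follows. This is only an illustration: the AST is written by hand as nested tuples `(token, child, ...)`, the token names follow Hive's `TOK_` naming style but the tree shape is simplified, and `new_block` and `build_blocks` are hypothetical helper names.

```python
def new_block():
    return {"inputs": [], "output": None, "columns": [],
            "conditions": [], "subqueries": []}


def build_blocks(ast):
    root = new_block()
    stack = [root]                      # stack[-1] is the block being filled

    def visit(node):
        if isinstance(node, str):       # leaf text such as a sub-query alias
            return
        token, children = node[0], node[1:]
        block = stack[-1]
        if token == "TOK_SUBQUERY":
            child = new_block()
            block["subqueries"].append(child)   # record the parent/child relation
            stack.append(child)                 # push: enter the sub-query
            for c in children:
                visit(c)
            stack.pop()                         # pop: back to the parent query
        elif token == "TOK_TABREF":
            block["inputs"].append(children[0][1])
        elif token == "TOK_TAB":
            block["output"] = children[0][1]
        elif token == "TOK_SELEXPR":
            block["columns"].append(children[0])
        elif token == "TOK_WHERE":
            block["conditions"].append(children[0])
        else:
            for c in children:          # structural node: keep descending
                visit(c)

    visit(ast)
    return root


# INSERT OVERWRITE TABLE dst SELECT uid FROM (SELECT uid FROM src WHERE uid > 0) t
inner = ("TOK_QUERY",
         ("TOK_FROM", ("TOK_TABREF", ("TOK_TABNAME", "src"))),
         ("TOK_INSERT",
          ("TOK_DESTINATION", ("TOK_DIR", "TOK_TMP_FILE")),
          ("TOK_SELECT", ("TOK_SELEXPR", "uid")),
          ("TOK_WHERE", (">", "uid", "0"))))
outer = ("TOK_QUERY",
         ("TOK_FROM", ("TOK_SUBQUERY", inner, "t")),
         ("TOK_INSERT",
          ("TOK_DESTINATION", ("TOK_TAB", ("TOK_TABNAME", "dst"))),
          ("TOK_SELECT", ("TOK_SELEXPR", "uid"))))
root = build_blocks(outer)
```

The outer block ends up with the output table and one nested sub-query block, and the nested block holds the real input table and the filter condition, which gives the tree structure mentioned above.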

Hive uses Antlr to implement the lexical and syntactic parsing of SQL. To construct a specific language with Antlr, one only needs to write a grammar file that defines the lexical rules and the grammar replacement rules; Antlr then completes lexical analysis, syntax analysis, semantic analysis and intermediate code generation. The SQL parsing scheme here takes Hive as an example: the lexical-rule and grammar-rule files are defined first, Antlr is then used to perform lexical and syntactic analysis of the SQL and generate the AST, and the subsequent operations are completed by traversing the AST. After lexical and syntactic analysis, if the expression needs further processing, Antlr's abstract syntax tree (AST) is used: the input statement is converted into an abstract syntax tree during syntactic analysis, and the further processing is completed while the tree is traversed. In the code with which Antlr parses Hive SQL, HiveLexerX and HiveParser are the lexical-analysis and syntax-analysis classes generated automatically after Antlr compiles the Hive grammar file.
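As an illustration of the lexing and parsing stage, the sketch below uses a small hand-written tokenizer and recursive-descent parser as a stand-in for the classes that Antlr would generate; it is not the Hive grammar. It assumes a tiny SQL subset (an optional INSERT target, a column list, one source table and one simple WHERE comparison), and the tuple-based AST with `TOK_` node names is a simplified imitation of Hive's tree.

```python
import re

TOKEN_RE = re.compile(r"\s*([A-Za-z_][\w.]*|\d+|[<>=!]+|[(),*])")


def tokenize(sql):
    """Lexical analysis: split the statement into words, numbers and symbols."""
    pos, tokens, sql = 0, [], sql.strip()
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}: {sql[pos]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    """Syntax analysis for: [INSERT OVERWRITE|INTO TABLE t] SELECT cols FROM s [WHERE a op b]."""

    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i].upper() if self.i < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.tokens[self.i] if self.i < len(self.tokens) else None
        if tok is None or (expected and tok.upper() != expected):
            raise SyntaxError(f"expected {expected}, got {tok}")
        self.i += 1
        return tok

    def parse_query(self):
        # Without an INSERT target the result goes to a temporary file.
        dest = ("TOK_DESTINATION", ("TOK_DIR", "TOK_TMP_FILE"))
        if self.peek() == "INSERT":
            self.take("INSERT")
            self.take()                      # OVERWRITE or INTO
            self.take("TABLE")
            dest = ("TOK_DESTINATION", ("TOK_TAB", ("TOK_TABNAME", self.take())))
        self.take("SELECT")
        cols = [("TOK_SELEXPR", self.take())]
        while self.peek() == ",":
            self.take(",")
            cols.append(("TOK_SELEXPR", self.take()))
        self.take("FROM")
        source = ("TOK_FROM", ("TOK_TABREF", ("TOK_TABNAME", self.take())))
        insert = ["TOK_INSERT", dest, ("TOK_SELECT", *cols)]
        if self.peek() == "WHERE":
            self.take("WHERE")
            left, op, right = self.take(), self.take(), self.take()
            insert.append(("TOK_WHERE", (op, left, right)))
        return ("TOK_QUERY", source, tuple(insert))


ast = Parser(tokenize(
    "INSERT OVERWRITE TABLE dst SELECT a, b FROM src WHERE a > 1")).parse_query()
```

The returned `ast` is a nested tuple that later stages can traverse; a real grammar file would cover joins, sub-queries, expressions and many more statement types.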

It should be noted that the inner sub-query also generates a TOK_DESTINATION node, which is a node intentionally added during syntax rewriting. The reason is that all query data in Hive, whether from an intermediate sub-query or the final query result, is stored in temporary files on HDFS, and the Insert statement finally writes the data into the HDFS directory where the table is located. In detail, after the from clause of the inner sub-query is expanded, an AST Tree is obtained in which each table generates a TOK_TABREF node and the Join condition generates an "=" node. The other parts of the SQL are similar and are not described in detail.

Converting the AST Tree into QueryBlocks makes the SQL more abstract and structured. The QueryBlock is the most basic composition unit of SQL and comprises three parts: an input source, a calculation process and an output. Simply put, a QueryBlock is a sub-query. Generating QueryBlocks from the AST Tree is a recursive process: the AST Tree is traversed in pre-order, and the different Token nodes encountered are stored into the corresponding attributes. Generating the OperatorTree from the QueryBlock consists of traversing the syntax attributes stored in the QB and QBParseInfo objects generated in the previous process. Most logic layer optimizers reduce the number of MapReduce jobs and the amount of shuffle data by transforming the OperatorTree and merging operators.
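The three parts of a QueryBlock and its translation into operators can be sketched as below. The `QueryBlock` dataclass and `to_operator_chain` are hypothetical simplifications (a single input table and a linear chain instead of a tree); the operator names mirror Hive's operator naming but the construction logic is only an assumption for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QueryBlock:
    inputs: list                                     # input source
    columns: list                                    # calculation: selected fields
    output: str                                      # output table
    conditions: list = field(default_factory=list)   # calculation: filters
    group_by: list = field(default_factory=list)     # calculation: grouping keys


def to_operator_chain(qb):
    """Translate one QueryBlock into a linear chain of operator descriptions."""
    ops = [("TableScanOperator", qb.inputs[0])]      # sketch: first input only
    for cond in qb.conditions:
        ops.append(("FilterOperator", cond))
    if qb.group_by:
        # Grouping needs a shuffle by key before the aggregation.
        ops.append(("ReduceSinkOperator", tuple(qb.group_by)))
        ops.append(("GroupByOperator", tuple(qb.group_by)))
    ops.append(("SelectOperator", tuple(qb.columns)))
    ops.append(("FileSinkOperator", qb.output))
    return ops


qb = QueryBlock(inputs=["src"], columns=["uid", "cnt"], output="dst",
                conditions=["dt = '2019-09-09'"], group_by=["uid"])
ops = to_operator_chain(qb)
```

Each stored attribute contributes its own operator, which is the sense in which the operator tree is produced by traversing the attributes saved during the previous step.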

In Atlas and Navigator, data-related metadata and blood relationship information are mainly obtained through runtime hooks supported by the computing framework itself; for example, the hook of Hive is in the syntax parsing stage, and the hook of Storm is in the topology submit stage. The advantage is that blood relationship tracking is based on information from tasks that actually run, so that omissions are unlikely if the plug-ins are fully deployed. However, some problems are not handled well in this way, such as:

(1) how to update a blood relationship in which a dependency once existed but no longer does;

(2) for a task that has not yet been run, the blood relationship information cannot be obtained in advance;

(3) the blood relationship data is contaminated by temporary scripts or wrong script logic.

In short, because the blood relationship is collected from runtime information without the assistance of static business information, discriminating and updating the life cycle and validity of a blood relationship is a troublesome problem, and the range of application is limited to a certain extent. In the technical solution provided by the invention, the blood relationship information is not collected at runtime; instead, a timed task is configured that periodically collects all task scripts configured on the scheduling system. Because the scheduling system uniformly manages the task scripts of all users, the invention can perform static analysis on the scripts, and, with the business information of the scripts added, the execution status and life cycle of the scripts are known to the development platform, so the invention can solve the above problems to a certain extent.
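The effect of periodically re-analyzing all scheduled scripts can be sketched as rebuilding a lineage snapshot and comparing it with the previous one. In the sketch below, two regular expressions stand in for the full parsing pipeline (a deliberate simplification that handles only plain INSERT ... SELECT ... FROM scripts), and the function names, job name and table names are hypothetical.

```python
import re

INSERT_RE = re.compile(r"insert\s+(?:overwrite|into)\s+table\s+([\w.]+)", re.I)
FROM_RE = re.compile(r"(?:from|join)\s+([\w.]+)", re.I)


def edges_of(script):
    """Very rough static extraction of (source, target) table pairs from one script."""
    targets = INSERT_RE.findall(script)
    sources = FROM_RE.findall(script)
    return {(s, t) for t in targets for s in sources}


def refresh(previous, scripts):
    """Rebuild lineage from all scheduled scripts and report what changed."""
    current = set()
    for script in scripts.values():
        current |= edges_of(script)
    # snapshot, newly added edges, edges that no longer exist
    return current, current - previous, previous - current


# First scheduled run: the script has never executed, yet its lineage is known.
scripts_v1 = {"job_a": "INSERT OVERWRITE TABLE dwd.pv SELECT uid FROM ods.log"}
snap1, added1, retired1 = refresh(set(), scripts_v1)

# Later run: the script now reads a different table, so the old edge is retired.
scripts_v2 = {"job_a": "INSERT OVERWRITE TABLE dwd.pv SELECT uid FROM ods.log_v2"}
snap2, added2, retired2 = refresh(snap1, scripts_v2)
```

Because the snapshot is rebuilt from the scripts that are currently scheduled, a dependency that has disappeared shows up in `retired2`, and temporary scripts that are not registered with the scheduler never enter the snapshot.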

FIG. 2 is a schematic structural diagram of embodiment 1 of the metadata-based data blood relationship analysis system disclosed in the present invention. As shown in FIG. 2, the system may include:

The first parsing module 201 is configured to define lexical rules and grammar rules of the structured query language through the open source parser, parse the lexical rules and the grammar rules of the structured query language, and convert the structured query language into an abstract syntax tree;

The first traversal module 202 is configured to traverse the abstract syntax tree to abstract out a basic composition unit of the query;

the second traversal module 203 is used for traversing the basic composition units of the query and generating an execution operation tree;

A first transformation module 204, configured to perform an operation tree transformation by a logic layer optimizer;

a third traversal module 205, configured to traverse the execution operation tree and translate the execution operation tree into a task tree;

a second transformation module 206, configured to perform transformation on the task tree through the physical layer optimizer to generate a final execution plan;

Here the physical execution plan is optimized, that is, the physical layer optimizer transforms the TaskTree to generate the final execution plan.

and the second analysis module 207 is used for analyzing the statements of the Hive query language based on the final execution plan to obtain the input and output tables, the fields and the corresponding processing conditions.

Finally, an application programming interface (API) provided by Hive is used to obtain the final execution plan of the HQL (Hive Query Language) and thereby the number of jobs of the HQL. (Hive is a Hadoop-based data warehouse tool that can map structured data files into database tables and provides SQL-like query functions.) The statements of the Hive query language are then analyzed according to the final execution plan to obtain the input and output tables, the fields and the corresponding processing conditions, which serve as the metadata management and blood relationship analysis information of the Hive tables.

The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another. Since the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is brief, and the relevant points can be found in the description of the method.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

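1. A data blood relationship analysis method based on metadata, characterized by comprising the following steps:

defining the lexical rules and grammar rules of the structured query language through an open-source parser, performing lexical and syntactic analysis of the structured query language, and converting the structured query language into an abstract syntax tree;

traversing the abstract syntax tree to abstract out the basic composition units of the query;

traversing the basic composition units of the query to generate an execution operation tree;

transforming the execution operation tree through a logic layer optimizer;

traversing the execution operation tree and translating it into a task tree;

transforming the task tree through a physical layer optimizer to generate a final execution plan;

and analyzing the statements of the Hive query language based on the final execution plan to obtain the input and output tables, the fields and the corresponding processing conditions.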

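2. The method of claim 1, wherein the basic composition unit of the query is the basic composition unit of the structured query language, comprising: an input source, a calculation process and an output.

3. The method of claim 1, wherein traversing the abstract syntax tree to abstract out the basic composition units of the query comprises:

traversing the abstract syntax tree in pre-order, and storing different Token nodes into the corresponding attributes.

4. The method of claim 1, wherein traversing the basic composition units of the query to generate an execution operation tree comprises:

traversing the syntax attributes stored in the QB and QBParseInfo objects generated in the previous process, and generating the execution operation tree.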

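5. A metadata-based data blood relationship analysis system, comprising:

a first parsing module, used for defining the lexical rules and grammar rules of the structured query language through an open-source parser, performing lexical and syntactic analysis of the structured query language, and converting the structured query language into an abstract syntax tree;

a first traversal module, used for traversing the abstract syntax tree to abstract out the basic composition units of the query;

a second traversal module, used for traversing the basic composition units of the query to generate an execution operation tree;

a first transformation module, used for transforming the execution operation tree through a logic layer optimizer;

a third traversal module, used for traversing the execution operation tree and translating it into a task tree;

a second transformation module, used for transforming the task tree through a physical layer optimizer to generate a final execution plan;

and a second analysis module, used for analyzing the statements of the Hive query language based on the final execution plan to obtain the input and output tables, the fields and the corresponding processing conditions.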

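6. The system of claim 5, wherein the basic composition unit of the query is the basic composition unit of the structured query language, comprising: an input source, a calculation process and an output.

7. The system of claim 5, wherein the first traversal module is specifically configured to:

traverse the abstract syntax tree in pre-order, and store different Token nodes into the corresponding attributes.

8. The system of claim 6, wherein the second traversal module is specifically configured to:

traverse the syntax attributes stored in the QB and QBParseInfo objects generated in the previous process, and generate the execution operation tree.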