CN110555032A - Data blood relationship analysis method and system based on metadata - Google Patents

Publication number
CN110555032A
Authority
CN
China
Prior art keywords
tree
traversing
query language
query
execution operation
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910850181.XA
Other languages
Chinese (zh)
Inventor
郑波
张强
饶鑫淞
杨川明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sohu New Media Information Technology Co Ltd
Original Assignee
Beijing Sohu New Media Information Technology Co Ltd
Application filed by Beijing Sohu New Media Information Technology Co Ltd
Priority to CN201910850181.XA
Publication of CN110555032A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/2228: Indexing structures
    • G06F16/2246: Trees, e.g. B+trees
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/242: Query formulation
    • G06F16/2433: Query languages
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a metadata-based data blood relationship (data lineage) analysis method and system. The method comprises the following steps: defining the lexical rules and grammar rules of a structured query language through an open-source parser, performing lexical and syntactic analysis of the structured query language, and converting the structured query language into an abstract syntax tree; traversing the abstract syntax tree to abstract out the basic composition units of the query; traversing the basic composition units of the query to generate an execution operation tree; transforming the execution operation tree with a logic layer optimizer; traversing the execution operation tree and translating it into a task tree; transforming the task tree with a physical layer optimizer to generate a final execution plan; and analyzing the statement of the Hive query language based on the final execution plan to obtain the input and output tables, the fields, and the corresponding processing conditions. The invention can effectively sort out the relationships among data tables and fields and analyze the blood relationship of the data.

Description

Data blood relationship analysis method and system based on metadata
Technical Field
The invention relates to the technical field of data processing, and in particular to a metadata-based data blood relationship analysis method and system.
Background
Data blood relationship, also known as data lineage, is simply the upstream and downstream (source-to-destination) relationship between data, that is, where data comes from and where it goes. Its importance is self-evident. For example, if a piece of data has a problem, the blood relationship can be followed upstream to find the link at which the problem arose. In addition, the dependencies between the tasks that produce the data can be established from the blood relationship of the data, to assist the job scheduling of a scheduling system or to determine which downstream data a failed or faulty task may affect. As the number of tables and models built in a data warehouse grows, metadata management becomes more and more important, and the blood relationship between metadata tables maintains the relationships among the tables. Good metadata management makes the relationship between each table and model clearly visible. Mining the blood relationship of metadata plays an important role in tracking the flow of data, troubleshooting business problems, reducing maintenance costs, and improving development efficiency.
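The upstream check and downstream impact lookup described above can be pictured as graph traversals. The following is a minimal sketch, not part of the disclosed method; the table names and the `EDGES` list are hypothetical, and lineage is assumed to be stored as (upstream table, downstream table) pairs.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream table, downstream table).
EDGES = [
    ("ods.click_log", "dwd.click_detail"),
    ("ods.user_info", "dwd.click_detail"),
    ("dwd.click_detail", "ads.daily_report"),
]


def build_index(edges):
    """Index the edges in both directions for upstream and downstream lookups."""
    upstream, downstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        upstream[dst].add(src)
        downstream[src].add(dst)
    return upstream, downstream


def reachable(start, index):
    """Breadth-first search returning every table reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in index.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


UPSTREAM, DOWNSTREAM = build_index(EDGES)
```

Calling `reachable("ads.daily_report", UPSTREAM)` lists the tables to inspect when the report looks wrong, while `reachable("ods.user_info", DOWNSTREAM)` lists the tables that a change to the base table may affect.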
At present, several types of problems are often encountered in a data warehouse:
1. When two data reports are compared and their results differ greatly, the dimension information of the analysis indicators has to be checked manually, for example where an indicator originally comes from and what processing conditions were applied, before the cause of the problem can finally be determined.
2. When a field of a base data table has to be modified for some reason, the impact of the change on the data warehouse has to be evaluated before a plan can be made, which is time-consuming and labor-intensive.
At present, the relationships can only be maintained manually; once a script changes and the manual maintenance is omitted or delayed, the recorded relationships become inaccurate.
Therefore, how to effectively sort out the relationships among data tables and fields and analyze the blood relationship of the data is an urgent problem to be solved.
Disclosure of Invention
In view of this, the present invention provides a metadata-based data blood relationship analysis method, which can effectively sort out the relationships among data tables and fields and analyze the blood relationship of the data.
The invention provides a metadata-based data blood relationship analysis method, which comprises the following steps:
defining the lexical rules and grammar rules of a structured query language through an open-source parser, performing lexical and syntactic analysis of the structured query language, and converting the structured query language into an abstract syntax tree;
traversing the abstract syntax tree to abstract out the basic composition units of the query;
traversing the basic composition units of the query to generate an execution operation tree;
transforming the execution operation tree with a logic layer optimizer;
traversing the execution operation tree and translating it into a task tree;
transforming the task tree with a physical layer optimizer to generate a final execution plan;
and analyzing the statement of the Hive query language based on the final execution plan to obtain the input and output tables, the fields, and the corresponding processing conditions.
Preferably, the basic composition unit of the query is a composition unit of the structured query language and includes: an input source, a calculation process, and an output.
Preferably, traversing the abstract syntax tree to abstract out the basic composition units of the query includes:
traversing the abstract syntax tree in pre-order and storing the different Token nodes in the corresponding attributes.
Preferably, traversing the basic composition units of the query to generate an execution operation tree includes:
traversing the attributes, saved on the QB and QBParseInfo objects generated in the previous step, that hold the syntax information, and generating the execution operation tree.
A metadata-based data blood relationship analysis system, comprising:
a first parsing module, configured to define the lexical rules and grammar rules of a structured query language through an open-source parser, perform lexical and syntactic analysis of the structured query language, and convert the structured query language into an abstract syntax tree;
a first traversal module, configured to traverse the abstract syntax tree and abstract out the basic composition units of the query;
a second traversal module, configured to traverse the basic composition units of the query and generate an execution operation tree;
a first transformation module, configured to transform the execution operation tree with a logic layer optimizer;
a third traversal module, configured to traverse the execution operation tree and translate it into a task tree;
a second transformation module, configured to transform the task tree with a physical layer optimizer to generate a final execution plan;
and a second analysis module, configured to analyze the statement of the Hive query language based on the final execution plan to obtain the input and output tables, the fields, and the corresponding processing conditions.
Preferably, the basic composition unit of the query is a composition unit of the structured query language and includes: an input source, a calculation process, and an output.
Preferably, the first traversal module is specifically configured to:
traverse the abstract syntax tree in pre-order and store the different Token nodes in the corresponding attributes.
Preferably, the second traversal module is specifically configured to:
traverse the attributes, saved on the QB and QBParseInfo objects generated in the previous step, that hold the syntax information, and generate the execution operation tree.
In summary, the present invention discloses a metadata-based data blood relationship analysis method. When the blood relationship of data needs to be analyzed, the lexical rules and grammar rules of the structured query language are first defined through an open-source parser, lexical and syntactic analysis is performed, and the structured query language is converted into an abstract syntax tree. The abstract syntax tree is then traversed to abstract out the basic composition units of the query, and these units are traversed to generate an execution operation tree. The execution operation tree is transformed by a logic layer optimizer, then traversed and translated into a task tree, and the task tree is transformed by a physical layer optimizer to generate a final execution plan. Finally, the statement of the Hive query language is analyzed based on the final execution plan to obtain the input and output tables, the fields, and the corresponding processing conditions. The invention can effectively sort out the relationships among data tables and fields and analyze the blood relationship of the data.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of embodiment 1 of the metadata-based data blood relationship analysis method disclosed in the present invention;
FIG. 2 is a schematic structural diagram of embodiment 1 of the metadata-based data blood relationship analysis system disclosed in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a flowchart of embodiment 1 of the metadata-based data blood relationship analysis method disclosed in the present invention. As shown in FIG. 1, the method may include the following steps:
S101, defining the lexical rules and grammar rules of the structured query language through an open-source parser, performing lexical and syntactic analysis of the structured query language, and converting the structured query language into an abstract syntax tree;
When the blood relationship of data needs to be analyzed, the lexical and syntactic analysis of SQL (Structured Query Language) is first performed through Antlr (an open-source parser that can automatically generate a syntax tree from its input and display it visually), and the SQL is converted into an abstract syntax tree, the AST Tree.
S102, traversing the abstract syntax tree to abstract out the basic composition units of the query;
Then the AST Tree is traversed, and the basic composition unit of the query, QueryBlock, is abstracted out.
S103, traversing the basic composition units of the query to generate an execution operation tree;
After QueryBlock has been abstracted out, a logical execution plan is generated, that is, QueryBlock is traversed and translated into an execution operation tree, the OperatorTree.
S104, transforming the execution operation tree with a logic layer optimizer;
After the execution operation tree is generated, the logical execution plan is optimized, that is, the logic layer optimizer transforms the OperatorTree and merges unnecessary ReduceSinkOperator nodes, thereby reducing the amount of shuffle data.
S105, traversing the execution operation tree and translating it into a task tree;
The OperatorTree is traversed and the physical execution plan, that is, the TaskTree, is generated.
S106, transforming the task tree with a physical layer optimizer to generate a final execution plan;
Then the physical execution plan is optimized, that is, the physical layer optimizer transforms the TaskTree to generate the final execution plan.
S107, analyzing the statement of the Hive query language based on the final execution plan to obtain the input and output tables, the fields, and the corresponding processing conditions.
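To make the idea of merging unnecessary ReduceSinkOperator nodes concrete, here is a deliberately simplified sketch. It is not Hive's optimizer: the plan is modelled as a linear list rather than a tree, the `Op` class and `merge_reduce_sinks` function are hypothetical names, and it is assumed that the operators between two shuffles keep the data partitioned by the same keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One operator in a linear plan; `keys` is only meaningful for ReduceSink."""
    kind: str
    keys: tuple = ()


def merge_reduce_sinks(plan):
    """Drop a ReduceSink whose keys match the closest preceding ReduceSink.

    Simplifying assumption: the operators in between preserve the partitioning
    by those keys, so the second shuffle adds nothing.
    """
    result, last_keys = [], None
    for op in plan:
        if op.kind == "ReduceSink":
            if op.keys == last_keys:
                continue  # redundant shuffle: skip it
            last_keys = op.keys
        result.append(op)
    return result


PLAN = [
    Op("TableScan"),
    Op("ReduceSink", ("uid",)),
    Op("GroupBy"),
    Op("ReduceSink", ("uid",)),  # same key as the previous shuffle
    Op("GroupBy"),
    Op("FileSink"),
]
optimized = merge_reduce_sinks(PLAN)
```

In this toy plan the second shuffle on `uid` is removed, so the optimized plan contains a single ReduceSink; a shuffle on a different key would be kept.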
Finally, an application programming interface (API) provided by Hive (a Hadoop-based data warehouse tool that can map structured data files to database tables and offers SQL-like query functionality) is used to obtain the final execution plan of the HQL (Hive Query Language) statement and thus the number of jobs of the HQL. The statement of the Hive query language is then analyzed according to the final execution plan to obtain the input and output tables, the fields, and the corresponding processing conditions, which serve as the metadata management and blood relationship analysis information of the Hive tables.
To sum up, in the above embodiment, when the blood relationship of data needs to be analyzed, lexical and syntactic analysis of the structured query language is first performed by the open-source parser, and the structured query language is converted into an abstract syntax tree. The abstract syntax tree is then traversed to abstract out the basic composition units of the query, and these units are traversed to generate an execution operation tree. The execution operation tree is transformed by the logic layer optimizer, then traversed and translated into a task tree, and the task tree is transformed by the physical layer optimizer to generate the final execution plan. The statement of the Hive query language is analyzed based on the final execution plan to obtain the input and output tables, the fields, and the corresponding processing conditions. The invention can effectively sort out the relationships among data tables and fields and analyze the blood relationship of the data.
Specifically, in the above embodiment, the AST is traversed depth-first. When the token of an operation is encountered, the current operation is determined; when a clause is encountered, the current processing state is pushed onto a stack and the clause is processed, and the stack is popped after the clause has been processed. While a clause is being processed, when a subquery is encountered, the information of the current subquery is saved and its relationship to the parent query is determined, so that a tree structure is finally formed; when a field or a condition is encountered, the current field and condition information is recorded, forming a Block, and the processing is invoked recursively.
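The push and pop behaviour described above can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the AST is a hand-written nested tuple of the form `(token, child, ...)`, the subquery alias is assumed to be the last child of a `TOK_SUBQUERY` node, and `QueryNode` and `walk` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNode:
    """One query level: the tables it reads and its nested subqueries."""
    name: str
    tables: list = field(default_factory=list)
    children: list = field(default_factory=list)


def walk(ast, root_name="root"):
    """Depth-first walk that pushes a new context when a subquery starts."""
    root = QueryNode(root_name)
    stack = [root]  # stack[-1] is the query currently being processed

    def visit(node):
        token, *kids = node
        if token == "TOK_SUBQUERY":
            child = QueryNode(kids[-1])       # alias is the last child (assumption)
            stack[-1].children.append(child)  # record the relation to the parent
            stack.append(child)               # push before processing the clause
            for kid in kids[:-1]:
                visit(kid)
            stack.pop()                       # pop once the clause is done
            return
        if token == "TOK_TABNAME":
            stack[-1].tables.append(kids[0])
            return
        for kid in kids:
            if isinstance(kid, tuple):
                visit(kid)

    visit(ast)
    return root


# Hand-written toy AST for: SELECT ... FROM orders JOIN (SELECT ... FROM users) u
AST = (
    "TOK_QUERY",
    ("TOK_FROM",
     ("TOK_JOIN",
      ("TOK_TABREF", ("TOK_TABNAME", "orders")),
      ("TOK_SUBQUERY",
       ("TOK_QUERY", ("TOK_FROM", ("TOK_TABREF", ("TOK_TABNAME", "users")))),
       "u"))),
)
tree = walk(AST)
```

The result is a small tree of queries: the root reads `orders` directly, and its child `u` reads `users`, which mirrors the parent and child relationship the paragraph describes.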
Hive uses Antlr to implement the lexical and syntactic parsing of SQL. To construct a specific language with Antlr, only a grammar file defining the lexical rules and the grammar replacement rules needs to be written; Antlr then completes lexical analysis, syntax analysis, semantic analysis, and intermediate code generation. The SQL parsing scheme takes Hive as an example: the lexical rule and grammar rule files are defined first, Antlr is then used to perform the lexical and syntactic analysis of the SQL and generate an AST, and the AST is traversed to complete the subsequent operations. If the expression needs further processing after lexical and syntactic analysis, Antlr's abstract syntax tree (AST) is used: the input statement is converted into an abstract syntax tree during syntax analysis, and the further processing is completed while the tree is traversed. The code with which Antlr parses Hive SQL is as follows; HiveLexerX and HiveParser are the lexical analysis and syntax analysis classes generated automatically after Antlr compiles the grammar file Hive.
It should be noted that the inner sub-query also generates a TOK_DESTINATION node, which is deliberately added during syntax rewriting. The reason is that all queried data in Hive is stored in temporary files on HDFS; whether for an intermediate sub-query or the final query result, the Insert statement ultimately writes the data into the HDFS directory where the table is located. In detail, after the from clause of the inner sub-query is expanded, an AST Tree is obtained in which each table generates a TOK_TABREF node and the Join condition generates an "=" node. The other parts of the SQL are similar and are not detailed here.
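As an illustration of the node types mentioned above, the sketch below searches a hand-written tree for TOK_TABREF, "=" and TOK_DESTINATION nodes. The tuple layout is an approximation written for this example, not output captured from Hive, and `find` and `render` are hypothetical helper names.

```python
# Hand-written approximation of the AST for:
#   SELECT a.id FROM a JOIN b ON a.id = b.id
AST = (
    "TOK_QUERY",
    ("TOK_FROM",
     ("TOK_JOIN",
      ("TOK_TABREF", ("TOK_TABNAME", "a")),
      ("TOK_TABREF", ("TOK_TABNAME", "b")),
      ("=",
       (".", ("TOK_TABLE_OR_COL", "a"), "id"),
       (".", ("TOK_TABLE_OR_COL", "b"), "id")))),
    ("TOK_INSERT",
     ("TOK_DESTINATION", ("TOK_DIR", "TOK_TMP_FILE")),
     ("TOK_SELECT", ("TOK_SELEXPR", (".", ("TOK_TABLE_OR_COL", "a"), "id")))),
)


def find(node, token):
    """Yield every subtree whose root token equals `token`, in pre-order."""
    if not isinstance(node, tuple):
        return
    if node[0] == token:
        yield node
    for kid in node[1:]:
        yield from find(kid, token)


def render(node):
    """Render an expression subtree such as a.id = b.id as text."""
    if not isinstance(node, tuple):
        return node
    if node[0] == "TOK_TABLE_OR_COL":
        return node[1]
    if node[0] == ".":
        return f"{render(node[1])}.{render(node[2])}"
    return f"{render(node[1])} {node[0]} {render(node[2])}"


tables = [ref[1][1] for ref in find(AST, "TOK_TABREF")]
join_conditions = [render(n) for n in find(AST, "=")]
writes_to_temp = any(
    d[1] == ("TOK_DIR", "TOK_TMP_FILE") for d in find(AST, "TOK_DESTINATION")
)
```

Here `tables` holds the two input tables, `join_conditions` holds the rendered join predicate, and `writes_to_temp` reflects that a query without an explicit target still carries a destination node pointing at a temporary file.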
Converting the AST Tree into QueryBlock abstracts and structures the SQL further. QueryBlock is the most basic composition unit of an SQL statement and comprises three parts: input source, calculation process, and output. Simply put, a QueryBlock is a sub-query. Generating QueryBlock from the AST Tree is a recursive process: the AST Tree is traversed in pre-order, and the different Token nodes encountered are stored in the corresponding attributes. Generating the OperatorTree from QueryBlock consists of traversing the attributes, saved on the QB and QBParseInfo objects generated in the previous step, that hold the syntax information. Most logic layer optimizers reduce the number of MapReduce jobs and the amount of shuffle data by transforming the OperatorTree and merging operators.
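The three-part structure of a query block and the recursive, pre-order construction can be sketched as below. The `QueryBlock` dataclass here is a toy stand-in written for this example and does not reproduce Hive's QB or QBParseInfo classes; the simplified AST stores table names and expressions as plain strings, which is an assumption made for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class QueryBlock:
    """Toy query block: input source, calculation process, and output."""
    inputs: list = field(default_factory=list)   # table names or nested blocks
    compute: dict = field(default_factory=dict)  # select list, filter, ...
    output: str = ""                             # destination


def to_query_block(node):
    """Pre-order walk that stores each token's payload in the matching attribute."""
    qb = QueryBlock()

    def visit(n):
        token, *kids = n
        if token == "TOK_TABREF":
            qb.inputs.append(kids[0])
        elif token == "TOK_SUBQUERY":
            qb.inputs.append(to_query_block(kids[0]))  # one block per sub-query
        elif token == "TOK_SELECT":
            qb.compute["select"] = list(kids)
        elif token == "TOK_WHERE":
            qb.compute["where"] = kids[0]
        elif token == "TOK_DESTINATION":
            qb.output = kids[0]
        else:
            for kid in kids:
                visit(kid)

    visit(node)
    return qb


# Toy AST for: INSERT ... report SELECT uid FROM (SELECT uid FROM logs WHERE ...) t
AST = (
    "TOK_QUERY",
    ("TOK_FROM",
     ("TOK_SUBQUERY",
      ("TOK_QUERY",
       ("TOK_FROM", ("TOK_TABREF", "logs")),
       ("TOK_INSERT",
        ("TOK_DESTINATION", "TOK_TMP_FILE"),
        ("TOK_SELECT", "uid"),
        ("TOK_WHERE", "dt = '2019-09-09'"))))),
    ("TOK_INSERT",
     ("TOK_DESTINATION", "report"),
     ("TOK_SELECT", "uid")),
)
outer = to_query_block(AST)
inner = outer.inputs[0]
```

The outer block takes the inner block as its input source and writes to `report`, while the inner block reads `logs`, applies the filter, and writes to a temporary destination.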
In summary, in Atlas and Navigator, data-related metadata and blood relationship information are obtained mainly through runtime hooks supported by the computing framework itself; for example, the Hive hook is in the syntax parsing stage, and the Storm hook is in the topology submission stage. The advantage is that blood relationship tracking is based on information from tasks that actually ran, so nothing is likely to be missed if the plug-ins are fully deployed. However, some problems are not handled well in this way, for example:
(1) how to update a blood relationship in which a dependency once existed but no longer does;
(2) for a task that has not yet run, the blood relationship information cannot be obtained in advance;
(3) temporary scripts or faulty script logic contaminate the blood relationship data.
In short, because the blood relationship is collected from runtime information and static business information is lacking, determining and updating the life cycle and validity of a blood relationship is a thorny problem, and the scope of application is limited to some extent. In the technical scheme provided by the invention, blood relationship information is not collected at run time; instead, a scheduled task is configured that periodically collects all task scripts configured on the scheduling system. Because the scheduling system manages the task scripts of all users in a unified way, the invention can analyze the scripts statically, and, with the business information of the scripts added, the development platform knows the execution status and life cycle of the scripts. The invention can therefore solve the above problems to a certain extent.
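The effect of rebuilding lineage periodically from the scripts currently configured on the scheduler can be sketched as follows. This is a minimal illustration, not the disclosed analysis: naive regular expressions stand in for the AST-based parsing, and the script contents, job names, and the `rebuild_lineage` function are hypothetical.

```python
import re

# Naive regexes standing in for the AST-based analysis (an assumption made to
# keep the sketch short; they do not handle subqueries, CTEs, or comments).
_OUT = re.compile(r"\binsert\s+(?:overwrite|into)\s+table\s+([\w.]+)", re.I)
_IN = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.I)


def extract_edges(sql):
    """Return (input table, output table) pairs for one script."""
    outputs = _OUT.findall(sql)
    inputs = _IN.findall(sql)
    return {(src, dst) for dst in outputs for src in inputs}


def rebuild_lineage(scripts):
    """Recompute all edges from the scripts currently configured on the scheduler."""
    edges = set()
    for sql in scripts.values():
        edges |= extract_edges(sql)
    return edges


# Snapshot 1: the job joins two source tables.
day1 = {
    "job1": "INSERT OVERWRITE TABLE dw.report "
            "SELECT l.uid FROM ods.log l JOIN ods.users u ON l.uid = u.uid",
}
# Snapshot 2: the join was removed and a new job was configured but has not run yet.
day2 = {
    "job1": "INSERT OVERWRITE TABLE dw.report SELECT uid FROM ods.log",
    "job2": "INSERT INTO TABLE dw.summary SELECT uid FROM dw.report",
}
edges_day1 = rebuild_lineage(day1)
edges_day2 = rebuild_lineage(day2)
```

Because each run recomputes the edges from scratch, the dependency on `ods.users` disappears once the script no longer contains it, and the edge for `job2` is available before that job has ever run.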
FIG. 2 is a schematic structural diagram of embodiment 1 of the metadata-based data blood relationship analysis system disclosed in the present invention. As shown in FIG. 2, the system may include:
The first parsing module 201 is configured to define the lexical rules and grammar rules of the structured query language through the open-source parser, perform lexical and syntactic analysis of the structured query language, and convert the structured query language into an abstract syntax tree;
When the blood relationship of data needs to be analyzed, the lexical and syntactic analysis of SQL (Structured Query Language) is first performed through Antlr (an open-source parser that can automatically generate a syntax tree from its input and display it visually), and the SQL is converted into an abstract syntax tree, the AST Tree.
The first traversal module 202 is configured to traverse the abstract syntax tree and abstract out the basic composition units of the query;
Then the AST Tree is traversed, and the basic composition unit of the query, QueryBlock, is abstracted out.
The second traversal module 203 is configured to traverse the basic composition units of the query and generate an execution operation tree;
After QueryBlock has been abstracted out, a logical execution plan is generated, that is, QueryBlock is traversed and translated into an execution operation tree, the OperatorTree.
The first transformation module 204 is configured to transform the execution operation tree with a logic layer optimizer;
After the execution operation tree is generated, the logical execution plan is optimized, that is, the logic layer optimizer transforms the OperatorTree and merges unnecessary ReduceSinkOperator nodes, thereby reducing the amount of shuffle data.
The third traversal module 205 is configured to traverse the execution operation tree and translate it into a task tree;
The OperatorTree is traversed and the physical execution plan, that is, the TaskTree, is generated.
The second transformation module 206 is configured to transform the task tree with a physical layer optimizer to generate a final execution plan;
Then the physical execution plan is optimized, that is, the physical layer optimizer transforms the TaskTree to generate the final execution plan.
The second analysis module 207 is configured to analyze the statement of the Hive query language based on the final execution plan to obtain the input and output tables, the fields, and the corresponding processing conditions.
Finally, an application programming interface (API) provided by Hive (a Hadoop-based data warehouse tool that can map structured data files to database tables and offers SQL-like query functionality) is used to obtain the final execution plan of the HQL (Hive Query Language) statement and thus the number of jobs of the HQL. The statement of the Hive query language is then analyzed according to the final execution plan to obtain the input and output tables, the fields, and the corresponding processing conditions, which serve as the metadata management and blood relationship analysis information of the Hive tables.
To sum up, in the above embodiment, when the blood relationship of data needs to be analyzed, lexical and syntactic analysis of the structured query language is first performed by the open-source parser, and the structured query language is converted into an abstract syntax tree. The abstract syntax tree is then traversed to abstract out the basic composition units of the query, and these units are traversed to generate an execution operation tree. The execution operation tree is transformed by the logic layer optimizer, then traversed and translated into a task tree, and the task tree is transformed by the physical layer optimizer to generate the final execution plan. The statement of the Hive query language is analyzed based on the final execution plan to obtain the input and output tables, the fields, and the corresponding processing conditions. The invention can effectively sort out the relationships among data tables and fields and analyze the blood relationship of the data.
The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant points can be found in the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A data blood relationship analysis method based on metadata, characterized by comprising the following steps:
defining lexical rules and grammatical rules of a structured query language through an open-source grammar analyzer, performing lexical and grammatical analysis of the structured query language, and converting the structured query language into an abstract syntax tree;
traversing the abstract syntax tree to abstract the basic composition units of the query;
traversing the basic composition units of the query to generate an execution operation tree;
transforming the execution operation tree through a logic layer optimizer;
traversing the execution operation tree and translating it into a task tree;
transforming the task tree through a physical layer optimizer to generate a final execution plan;
and analyzing the statement of the Hive query language based on the final execution plan to obtain the input and output tables, the fields, and the corresponding processing conditions.
2. The method of claim 1, wherein the basic composition unit of the query is the most basic composition unit of the structured query language, comprising: an input source, a computation process, and an output.
3. The method of claim 1, wherein traversing the abstract syntax tree to abstract the basic composition units of the query comprises:
traversing the abstract syntax tree in pre-order, and storing different Token nodes into corresponding attributes.
4. The method of claim 1, wherein traversing the basic composition units of the query to generate an execution operation tree comprises:
traversing the syntax attributes stored in the QB and QBParseInfo objects generated in the previous process, and generating the execution operation tree.
5. A metadata-based data blood relationship analysis system, comprising:
a first analysis module, used for defining lexical rules and grammatical rules of a structured query language through an open-source grammar analyzer, performing lexical and grammatical analysis of the structured query language, and converting the structured query language into an abstract syntax tree;
a first traversal module, used for traversing the abstract syntax tree to abstract the basic composition units of the query;
a second traversal module, used for traversing the basic composition units of the query to generate an execution operation tree;
a first transformation module, used for transforming the execution operation tree through a logic layer optimizer;
a third traversal module, used for traversing the execution operation tree and translating it into a task tree;
a second transformation module, used for transforming the task tree through a physical layer optimizer to generate a final execution plan;
and a second analysis module, used for analyzing the statement of the Hive query language based on the final execution plan to obtain the input and output tables, the fields, and the corresponding processing conditions.
6. The system of claim 5, wherein the basic composition unit of the query is the most basic composition unit of the structured query language, comprising: an input source, a computation process, and an output.
7. The system of claim 5, wherein the first traversal module is specifically configured to:
traverse the abstract syntax tree in pre-order, and store different Token nodes into corresponding attributes.
8. The system of claim 6, wherein the second traversal module is specifically configured to:
traverse the syntax attributes stored in the QB and QBParseInfo objects generated in the previous process, and generate the execution operation tree.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910850181.XA CN110555032A (en) 2019-09-09 2019-09-09 Data blood relationship analysis method and system based on metadata

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910850181.XA CN110555032A (en) 2019-09-09 2019-09-09 Data blood relationship analysis method and system based on metadata

Publications (1)

Publication Number Publication Date
CN110555032A true CN110555032A (en) 2019-12-10

Family

ID=68739637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910850181.XA Pending CN110555032A (en) 2019-09-09 2019-09-09 Data blood relationship analysis method and system based on metadata

Country Status (1)

Country Link
CN (1) CN110555032A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644073A (en) * 2017-09-18 2018-01-30 广东中标数据科技股份有限公司 A kind of field consanguinity analysis method, system and device based on depth-first traversal
CN107688660A (en) * 2017-09-08 2018-02-13 上海达梦数据库有限公司 The execution method and device of parallel executive plan
CN108052618A (en) * 2017-12-15 2018-05-18 北京搜狐新媒体信息技术有限公司 Data managing method and device
CN109614432A (en) * 2018-12-05 2019-04-12 北京百分点信息科技有限公司 A kind of system and method for the acquisition data genetic connection based on syntactic analysis
CN110196888A (en) * 2019-05-27 2019-09-03 深圳前海微众银行股份有限公司 Data-updating method, device, system and medium based on Hadoop

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
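CXY的大数据实践田: "Big Data Blood Relationship Analysis System Design (Part 4)", 《HTTP://CXY7.COM/ARTICLES/2018/05/26/1527312732448.HTML#B3_SOLO_H4_7》 *
I000ZHENG: "In-Depth Analysis of Converting Hive SQL into a MapReduce Execution Plan", 《HTTPS://BLOG.CSDN.NET/I000ZHENG/ARTICLE/DETAILS/81082774/》 *
THOMAS0YANG: "Hive Warehouse Data Blood Relationship Analysis Tool: SQL Parsing", 《HTTPS://BLOG.CSDN.NET/THOMAS0YANG/ARTICLE/DETAILS/49449723/》 *
THY822: "[Learning Hive Together] Part 19: Using the Hive API to Analyze the Execution Plan, Job Count, and Table Blood Relationship of HQL", 《HTTPS://BLOG.CSDN.NET/THY822/ARTICLE/DETAILS/72420900》 *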

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032642A (en) * 2019-12-24 2021-06-25 医渡云(北京)技术有限公司 Data processing method, device and medium for target object and electronic equipment
CN113032642B (en) * 2019-12-24 2024-02-09 医渡云(北京)技术有限公司 Data processing method and device for target object, medium and electronic equipment
CN111158653A (en) * 2019-12-30 2020-05-15 上海金仕达软件科技有限公司 SQL language-based integrated development and execution system for real-time computing program
CN111400338A (en) * 2020-03-04 2020-07-10 平安医疗健康管理股份有限公司 SQL optimization method, device, storage medium and computer equipment
CN111400338B (en) * 2020-03-04 2022-11-22 深圳平安医疗健康科技服务有限公司 SQL optimization method, device, storage medium and computer equipment
CN111399843A (en) * 2020-03-11 2020-07-10 中国邮政储蓄银行股份有限公司 Method, system and electronic device for mapping SQL operation information to SQL file
CN111399843B (en) * 2020-03-11 2023-08-01 中国邮政储蓄银行股份有限公司 Method, system and electronic equipment for mapping SQL running information to SQL file
CN113515285A (en) * 2020-04-10 2021-10-19 北京沃东天骏信息技术有限公司 Method and device for generating real-time calculation logic data
CN111538743A (en) * 2020-04-22 2020-08-14 电子科技大学 SQL-based data blood relationship analysis method and system
CN111538743B (en) * 2020-04-22 2023-08-18 电子科技大学 SQL-based data blood relationship analysis method and system
CN113760960A (en) * 2020-06-01 2021-12-07 北京搜狗科技发展有限公司 Information generation method and device for generating information
CN111813796B (en) * 2020-06-15 2022-11-18 北京邮电大学 Data column level blood margin processing system and method based on Hive data warehouse
CN111813796A (en) * 2020-06-15 2020-10-23 北京邮电大学 Data column level blood margin processing system and method based on Hive data warehouse
CN111782265A (en) * 2020-06-28 2020-10-16 中国工商银行股份有限公司 Software resource system based on field level blood relationship and establishment method thereof
CN111782265B (en) * 2020-06-28 2024-02-02 中国工商银行股份有限公司 Software resource system based on field-level blood-relation and establishment method thereof
CN112328667A (en) * 2020-07-17 2021-02-05 四川长宁天然气开发有限责任公司 Shale gas field ground engineering digital handover method based on data blooding margin
CN112328667B (en) * 2020-07-17 2023-09-08 四川长宁天然气开发有限责任公司 Shale gas field ground engineering digital handover method based on data blood margin
CN111859929B (en) * 2020-08-05 2024-04-09 杭州安恒信息技术股份有限公司 Data visualization method and device and related equipment thereof
CN111859929A (en) * 2020-08-05 2020-10-30 杭州安恒信息技术股份有限公司 Data visualization method and device and related equipment
CN112035416A (en) * 2020-08-31 2020-12-04 北京嘀嘀无限科技发展有限公司 Data blood margin analysis method and device, electronic equipment and storage medium
CN113032362A (en) * 2021-03-18 2021-06-25 广州虎牙科技有限公司 Data blood margin analysis method and device, electronic equipment and storage medium
CN113032362B (en) * 2021-03-18 2024-01-19 广州虎牙科技有限公司 Data blood edge analysis method, device, electronic equipment and storage medium
CN113177057A (en) * 2021-04-28 2021-07-27 深圳依时货拉拉科技有限公司 SQL statement syntax visualization analysis method, system and computer readable storage medium
CN113467785A (en) * 2021-07-19 2021-10-01 上海红阵信息科技有限公司 SQL translation method and system for mimicry database
CN114861229A (en) * 2022-06-08 2022-08-05 杭州比智科技有限公司 Hive dynamic desensitization method and system
CN116010428A (en) * 2023-02-24 2023-04-25 杭州比智科技有限公司 Data blood margin analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191210