CN115757525A - Column operator blood relationship construction method, server and computer readable storage medium - Google Patents

Column operator blood relationship construction method, server and computer readable storage medium Download PDF

Info

Publication number
CN115757525A
CN115757525A CN202211526166.8A CN202211526166A CN115757525A CN 115757525 A CN115757525 A CN 115757525A CN 202211526166 A CN202211526166 A CN 202211526166A CN 115757525 A CN115757525 A CN 115757525A
Authority
CN
China
Prior art keywords
scope
column
operator
child
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211526166.8A
Other languages
Chinese (zh)
Inventor
王明灿
张乐
徐保荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Daying Technology Co ltd
Original Assignee
Zhejiang Daying Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Daying Technology Co ltd filed Critical Zhejiang Daying Technology Co ltd
Priority to CN202211526166.8A priority Critical patent/CN115757525A/en
Publication of CN115757525A publication Critical patent/CN115757525A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a column operator blood margin construction method, which is applied to the technical field of computers and comprises the following steps of S1) generating an analytic tree ParseTree through Antlr analysis SQL; s2) designing an abstract syntax tree AST for constructing a blood vessel boundary link from an input column to an output column and a middle processing logic; s3) recursively traversing the parse tree ParseTree obtained in S1) to construct an abstract syntax tree AST designed in S2); s4) traversing the constructed abstract syntax tree AST in the S3) and extracting a column operator blood relationship model; s5) traversing the column operator blood margin model, constructing a point-edge relation, and storing the column operator blood margin into a database. The SQL parsing capability with the finest granularity is provided, the user is helped to know operator-level blood relationship of the data assets, including direct column source, used function, processing caliber and indirect influence, and the problem that the user cannot effectively enable business due to incomprehensible understanding of the data assets is solved.

Description

Column operator blood relationship construction method, server and computer readable storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a column operator blood relationship construction method, a server and a computer readable storage medium.
Background
More and more enterprises need to be transformed digitally to enable business development through data analysis, which is a necessary path for enterprise development in the future. However, in order to adapt to enterprise development, data processing modes and processing logics inside enterprises are more and more complex, and as time goes on, users often cannot quickly and conveniently understand the processing logics of historical data, and cannot quickly respond to more and more agile business requirements. In order to not affect the agile implementation of data analysis, a user needs to write an annotation for a process script or write a complete document to back up the thinking logic of the process data at that time, so as to avoid as much as possible that an analysis result is wrong due to incomprehensible understanding of the data when a data analyst analyzes the data. But this relies on the user's consciousness and knowledge within the enterprise is not currently scalable.
The blooding relationship of the data can be inquired, and the upstream and downstream blooding relationship of the data assets can be inquired, and the prior art can only probe the upstream and downstream paths of the data assets, such as the upstream and the downstream of a library, a table and a column; the existing solutions are basically: analyzing the processing logic script of the data assets to form a table-level or column-level data consanguinity, so that a user can timely know the source and application of data in a visual mode, and a basis is provided for subsequent data analysis; the disadvantages of the above method are as follows:
INSERT INTO T1(C1)
SELECT MAX(T2.C1+T3.C1)as C1 FROM T2,T3 WHERE T2.C2>1;
the table run-out can only show the processing relationship between tables, for example, the T1 table is derived from the upstream T2 and T3 tables, the column run-out can show that T1.C1 is derived from T2.C1 and T3.C1, and it cannot be known that T1.C1 is obtained by adding T2.C1 and T3.C1 and then obtaining the MAX function, and it cannot be known that the data range of T1.C1 is affected by the screening of T2. C2.
Therefore, the processing logics of different data assets may be consistent in use source, for example, the same table or column is used, but the real business logic of the data assets has a large number of conversion rules, and a user cannot intuitively know how the data is affected by the dependency relationships of each conversion, calculation, movement and the like of the data through the existing mode, so that the user understanding is incomplete, and still needs to consult a developer online, so that the existing agile business requirement cannot be met, namely, only where the column comes from is known, but not so.
On the other hand, the direct data source of the data asset and the data source which is depended on by indirect processing represent different data meanings and business meanings, and through the existing mode, a user can develop large-scale expansion of the range of upstream sources and downstream applications, and cannot intuitively perceive the direct sources of the data, so that data analysis cannot be normally carried out, and a manual query path is required to record processing logic and query where the data sources are;
furthermore, label diffusion based on column-level bloods only is inaccurate, since data itself carries variable meaning through layer-by-layer processing, since the program cannot know what processing the data has resulted from, and the accuracy of diffusion is greatly reduced by layer-by-layer downward of the label diffusion layer.
Disclosure of Invention
In order to solve the technical problems, the invention provides the SQL analysis capability with the finest granularity, helps the user to know operator-level blood relationship of the data assets, including direct column source, used function, processing caliber and indirect influence, and solves the problem that the user cannot understand the data assets and cannot effectively enable services.
In order to solve the main technical problems, the following technical scheme is adopted:
the column operator blood margin construction method comprises the following steps,
s1) generating a ParseTree of an analytic tree by analyzing SQL through Antler;
s2) designing an abstract syntax tree AST for constructing a blood margin link from an input column to an output column and a middle processing logic;
s3) recursively traversing the parse tree ParseTree obtained in S1) to construct an abstract syntax tree AST designed in S2);
s4) traversing the constructed abstract syntax tree AST in the S3) and extracting a column operator blood relationship model;
s5) traversing the column operator blood margin model, constructing a point-edge relation, and storing the column operator blood margin into a database.
Preferably, S2) comprises the steps of:
s21) based on relational algebra, dividing a complete SQL into at least one paragraph, wherein each paragraph is a trunk, designing a tree structure Scope for abstracting the corresponding trunk of the SQL, and establishing a hierarchical relationship between the trunks;
s22) dividing the Scope into an input Scope and an output Scope;
wherein: the input type Scope is divided into Projectscope, joinscope, unionscope and Scanscope and is used for abstracting an input part in the SQL sentence;
the output type Scope is divided into CreateAsscope and Insertscope and is used for abstracting an output part in the SQL statement;
wherein: each Scope comprises the type of the Scope, the alias of the Scope and the externally exposed field of the Scope, the parent-child relationship among different scopes is recorded through indexes, and the outer layer trunk set of the Scope is set to be a parentsscopelist, namely a father Scope set; setting an inner-layer trunk set of scopes as a child Scope List, namely a child Scope set, so as to construct parent-child relations among different scopes;
s23) defining the 6 scopes divided in the S22) and corresponding to the AST;
s24) defining the externally exposed field of each Scope as the holding field set of the Scope, wherein the holding field set of each Scope can be referred by the parent Scope in the parentgeoList of the Scope;
s25) adding a source attribute to each holding field in the holding field set
Expressionoorigin to record the source information of the holding field for constructing a bloodline link between the input column and the output column;
s26) establishing an expression for recording the processing logic of the column and the source of the column;
s27) each Scope uses the holding field information of the child Scope in the child Scope list to fill the holding field information of the current Scope, and then transmits the holding field information of the current Scope to the parent Scope, so that a processing link between an output column and a source column is constructed, and further the blood relationship information of the column operator is extracted.
Preferably, S3) comprises the steps of:
s31) recursively traversing the resolution tree ParseTree generated in S1) from the root node;
s32) according to the type of SQL, the root node is interpreted as CreateAsScope or InsertScope, and then the child nodes are traversed;
s33) if the node type is the query node SELECT xx from, interpreting the node type as project scope, and constructing expressions after SELECT, WHERE, GROUP BY, HAVING and ORDER BY; analyzing the child node after FROM as the corresponding Scope, adding the child node into the child Scope list of the current Projectscope, waiting for the holding field information of the Scope in the child Scope list to fill the source information of the expression of the current Projectscope, and then filling the holding field of the current Projectscope with the expression information;
s34) if the child node after the FROM is the physical table node, creating a Scope as a current Scope, and filling a holding field of the current Scope according to metadata information;
if the child node after the FROM is the child query node, jumping to S33) to perform recursion processing;
if the child node after FROM contains more than one JOIN node, creating a JoinScope as the current Scope, constructing all expressions after ON condition, and jumping the JoinItems ON two sides of each JOIN one by one to S33) for recursive processing; finally, the holding field of the JoinScope is constructed in a mode that the holding field of each sub-Scope in the child ScopeList is widened, and the source information of all ON expressions is filled;
if the child node after FROM comprises more than one UNION node, creating a Unionscope as the current Scope, and jumping UnionItems on two sides of each UNION one by one to S33) for recursive processing; finally, the holding field of the UnionScope is constructed in a mode of overlapping the holding field of each sub-Scope in the child ScopeList;
s35) traversing the ParseTree by S32) -S34) to construct the abstract syntax tree AST designed by S2).
Preferably, S4) comprises the steps of:
s41) extracting information from Scope to construct a Column Operator blood margin model, and dividing the Column Operator blood margin model into an Operator, a physical Column and a virtual Column Virtualcolumn;
the Operator is divided into 6 types, including:
SELECT operator: extracting the content of each projection item, namely the content before AS after the SELECT, from the Projectscope;
the WHERE operator: extracting the contents after WHERE from Projectscope;
GROUP operator: extracting the content after GROUP from Projectscope;
the HAVING operator: HAVING content extracted from ProjectScope;
JOIN operator: extracting the content after the ON condition from the JoinScope;
UNION operator: the method is used for aggregating a plurality of SELECT statements and extracting the SELECT statements from the Unionscope;
wherein, the SELECT operator and the UNION operator belong to a direct blood relationship and are used for tracing the processing logic and the processing link of the column, and the WHERE operator, the GROUP operator, the HAVING operator and the JOIN operator belong to an indirect blood relationship and are used for analyzing the influence surface of the column;
the physical Column comprises a real input physical Column and an output physical Column, belongs to metadata information, the input physical Column is extracted from ScanScope, and the output physical Column is extracted from CreateAsScope or InsertScope;
virtual column VirtualColumn: the alias part in the SELECT item AS alias in the subquery is extracted from the ProjectScope.
In a second aspect of the present invention, a server is further provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps of the method are implemented.
In a third aspect of the present invention, a computer-readable storage medium is also presented, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of the above-mentioned method.
The invention has the beneficial effects that: compared with the prior art, the column operator blood margin construction method 1) judges detailed logics of all direct and indirect dependency relationships among data assets in an enterprise complex data environment through deep analysis of processing logic scripts of the data assets, translates real data processing codes into user-friendly expression capacity, and avoids the problem that self-service data analysis cannot be carried out due to incomprehensible data or expanded data source application in the data analysis process to influence agile service demand response;
2) The column operator blood margin is a blood margin with finer granularity, and the direct source, the used function, the processing caliber and the indirect influence can be analyzed through the column operator blood margin.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some examples of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive labor.
FIG. 1 is a flow chart of the system of the present invention;
FIG. 2 is a system diagram of 6 Scope frameworks according to the present invention;
FIG. 3 is a relational diagram of 6 scopes of the present invention;
FIG. 4 is a flowchart illustrating the step 3) of recursive traversal according to the present invention;
FIG. 5 is a diagram of Scope construction and relationship between blood margin filling according to the present invention;
FIG. 6 is a flowchart of extracting operators in Scope according to the present invention;
FIG. 7 is a schematic diagram illustrating the structure comparison between the extracted operators in Scope and Scope according to the present invention;
FIG. 8 is a schematic diagram of point-edge storage in a graph database.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1
Referring to fig. 1, the method for constructing the blood relationship of the column operator includes the following steps:
s1) generating a ParseTree of an analytic tree by analyzing SQL through Antlr; the ParseTree generated after SQL is analyzed by the Antlr is only an abstract display of information of each node of the original SQL, and column operator blood relationship information cannot be directly extracted from the original SQL. Therefore, an abstract syntax tree AST needs to be redesigned according to the basic syntax of the relational algebra and the SQL, the abstract syntax tree AST decomposes the SQL into a plurality of trunks and expressions, and at the same time, the blood-border relations between the trunks and between the expressions are recorded, so that the purpose of extracting the blood-border of the column operators is achieved.
The SQL for extracting the blood relationship of the column operator requires that the SQL has an input part and an output part, namely a query part and an output part, so the SQL for extracting the blood relationship of the column operator can be divided into two types: INSERT INTO TABLE SELECT 8230and CREATE TABLE AS SELECT 8230. The input part is DQL-like statements and the output part is INSERT or CREATE.
S2) designing an abstract syntax tree AST for constructing a blood margin link from an input column to an output column and a middle processing logic;
the structure of the abstract syntax tree AST is as follows:
1) Based on relational algebra, a complete SQL is divided into a plurality of paragraphs, each paragraph is a main trunk, and a new tree structure Scope is defined in the patent for abstracting the main trunk corresponding to the SQL and reflecting the hierarchical relationship between the main trunk and the main trunk. Each Scope comprises the type of the Scope, the alias of the Scope and the externally exposed field of the Scope, the parent-child relationship among different scopes is recorded through indexes, and the outer layer trunk set of the Scope is set to be a parentsscope list, namely a father Scope set; setting an inner layer trunk set of Scope to be a child Scope List, namely a child Scope set, thereby constructing a parent-child relationship between different scopes;
2) The SQL for extracting the blood relationship of the column operator comprises an input part and an output part. Therefore, scope is divided into an input Scope and an output Scope; the input type Scope is divided into Projectscope, joinscope, unionscope and Scanscope, and the input part in the SQL statement is mainly abstracted; the output type Scope is divided into CreateAsscope and Insertscope, and two main output parts in the SQL statement are mainly abstracted;
3) Please refer to fig. 2, the 6 scopes declared in fig. 2) are defined, if SQL is CREATE TBALE table _ name AS, it is interpreted AS CREATE asscope, since the CREATE TBALE table _ name AS is the beginning position in a segment of SQL AS the output part, it is the root node in the Scope of the tree structure, the part behind AS is interpreted AS the Scope of other types, and the index records the parent-child relationship. If SQL is INSERT INTO table _ name (col 1, col2 \8230;) SELECT, the part before SELECT is interpreted AS Insertscope, similarly, because INSERT INTO table _ name (col 1, col2 \8230;) is also used AS the output part in a segment of SQL, the part after AS is interpreted AS the root node in the tree structure Scope, and the parent-child relationship is recorded by the index. The ProjectScope is used to abstract the structure of the SELECT statement, and comprises a projection part, a WHERE part, a GROUP BY part, a HAVING part and an ORDER BY part of the SELECT statement. If the SELECT statement is ended and then is UNION SELECT 8230, abstracting the first SELECT statement part to the last SELECT statement part into UnionScope together, wherein each SELECT statement part is a single ProjectScope and is recorded into a child ScopeList set of the UnionScope through an index. In project Scope, the contents after FROM and before WHERE are interpreted as a new Scope and recorded into the child Scope set of the Scope of the present layer, i.e. the inner layer backbone, by the index. If the FROM is followed by a physical table name table _ name, the physical table name table _ name is interpreted as a ScanScope, and then the parent-child relationship is recorded through an index, and in the tree structure Scope designed by the invention, leaf nodes are all the ScanScope; if the JOIN part exists behind the physical table name behind the FROM, the part behind the FROM and before the WHERE is interpreted as the JoinScope, the alias of the JoinScope is empty, the contents ON the two sides of each JOIN are continuously analyzed as the corresponding Scope, and meanwhile, the ON condition content of each JOIN is recorded. If FROM is followed by a sub-query (subQuery) AS alias, then this subQuery is likewise interpreted AS ProjectScreen and the alias is recorded AS alias, and then the parent-child relationship is recorded by the index, if FROM is followed by multiple sub-query UNIONs together, i.e. (subQuery UNION subQuery 8230;) AS alias, then this portion is interpreted AS UnionScreen, and so on.
Referring to fig. 3, in the Scope abstract syntax tree AST, parent-child relationships of different scopes are as follows:
CreateAsScope or InsertScope, their parentsscopelist must be empty, childrenScopeList must not be empty and is one of projescope or UnionScope.
The parentsscope list of ProjectScreen is not empty and is one of CreateAsScreen, insertScreen, joinScreen, projectScreen, and UnionScreen, and the childrenScopeList is not empty and is one of JoinScreen, projectScreen, unionScreen, and ScanScreen.
The parentsscope list of the JoinScope is not empty and must be ProjectScope, and the childrenscope list is not empty and has at least two or more, which are any combination of ProjectScope, unionScope and ScanScope.
The parentsscope list of the UnionScreen is not empty and is one of CreateAsScreen, insertScreen, joinScreen, projectScreen and UnionScreen, and the childrenScopelist is not empty and has at least more than two elements which are any combination of ProjectScreen and JionScreen.
The parentsscopelist of the ScanScope is not empty, and is one of the ProjectScope and the JoinScope, and the child ScanScope list is definitely empty.
4) Defining the externally exposed field of each Scope as a held field set threshold of the Scope, wherein the held field set of each Scope can be referred by a parentgeoList thereof; the set of held fields of the ScanScope is metadata information, that is, the set of column names under the physical table is the set of held fields of the ScanScope. For example, the t1 table has three columns of a, b, and c, then { "t1" [ "a", "b", "c" ] } is the set of holding fields of the ScanScope. The holding field set of the ProjectScope is the projection part in the SELECT statement, such as SELECT a, b, c as c1 FROM, alias, if not, null string, then the holding field set of the ProjectScope is { "alias": [ "a", "b", "c1" ] }. The set of holding fields of the UnionScope is the superposition of the holding fields of each sub-Scope, e.g., (SELECT d, e, f) UNION (SELECT g, h, i), then the set of holding fields of the UnionScope is { "alias": D "," e "," f "] }. The holding field of the JoinScope is stored in a width mode after the holding field set of each sub-Scope is collected, for example, t1 JOIN (SELECT d, e, f), wherein t1 has three columns of a, b and c, then the holding field set of the JoinScope is { "[ a, b, c, d, e, f ] }, and the alias of the JoinScope is determined to be empty, so the key of the holding field set is empty.
The holding field set of CreateAsScope is the holding field set of childrenScope, and the columns are, for example, CRETE TABLE TABLE _ name AS SELECT a, b, c, then the holding field set of Scope is { "TABLE _ name" [ "a", "b", "c" ] }. The holding field set of an InsertScope is a target table metadata column information set, such as INSERT INTO table _ name (a, b, c) SELECT d, e, f, then the holding field set of the Scope is { "table _ name" [ "a", "b", "c" ] }.
5) The Scope designed above has parent-child relationship, i.e. source relationship, and the holding field of each Scope also has source relationship, where the holding field in the holding field set is added with a source attribute expressorigin, i.e. which holding field of which Scope in childrenscope list is originated from, so that it can trace from the holding field of root Scope, i.e. CreateAsScope or InsertScope, to the holding field of physical layer Scope, thereby constructing the bloodline link between the input column and the output column.
6) The column's processing logic cannot be viewed directly by holding the field alone, and the patent defines an Expression to record the Expression form of the column. The expression may be a projection term, such AS SELECT MAX (a + b) AS col1, c AS col2,1AS col3, where MAX (a + b) is an expression and c and 1 are also expressions. Similarly, the content after the WHERE is also an expression, for example, WHERE c1> c2 AND c3= c4 OR c5+ c6-MAX (c 7) is an expression as a whole.
One for each sub-part after GROUP BY and ORDER BY. The expression is designed as tree structure on the storage structure, each node has the type of the current expression, the reference of each sub-expression and the source attribute expressionooriginof the expression, which is used to store the source information of the expression, and is the same attribute with the source attribute expressionooriginof the holding field. The processing logic of the columns can be clearly known through the expression, the source of the columns can be known through the source information of the expression, finally, the processing link from the output columns to the source columns can be constructed through upward transmission of the holding fields layer by layer, and the blood margin of the column operators can be further extracted.
S3) recursively traversing the parse tree ParseTree obtained in S1) to construct an abstract syntax tree AST designed in S2); the procedure for constructing Scope and Expression is as follows: constructing the column operator consanguinity between the output column and the source column can be converted into constructing the dependency relationship between the columns in the present layer Scope and the inner layer Scope in the childrenScopeList, further, each holding field can only be derived from his "subquery", namely, from the childrenScopeList, so that the "column operator consanguinity construction" can be converted into a "filling problem of the subquery holding original field". The specific process is as follows:
with an emphasis on referring to figures 4 and 5,
s31) recursively traversing the resolution tree ParseTree generated in S1) from the root node;
s32) according to the type of SQL, the root node is interpreted as CreateAsScope or InsertScope, and then the child nodes are traversed;
s33) if the node type is the query node SELECT xx from, interpreting the node type as project scope, and constructing expressions after SELECT, WHERE, GROUP BY, HAVING and ORDER BY; continuously explaining the child nodes behind the FROM as corresponding scopes, adding the child nodes into a child Scope list of the current Projectscope, waiting for the holding field information of the Scope in the child Scope list to fill the source information of the expression of the current Projectscope, and then filling the holding field of the current Projectscope with the expression information;
s34) if the child node behind the FROM is a physical table node, creating a Scope and setting the Scope as a current Scope, and filling a holding field of the current Scope according to metadata information;
if the child node after the FROM is the child query node, jumping to S33) to perform recursive processing;
if the child node after FROM contains more than one JOIN node, creating a JoinScope as the current Scope, constructing all expressions after ON condition, and jumping the JoinItems ON two sides of each JOIN one by one to S33) for recursive processing; finally, the holding field of the JoinScope is constructed in a mode that the holding field of each sub-Scope in the child ScopeList is widened, and the source information of all ON expressions is filled;
if the child node after FROM comprises more than one UNION node, creating a Unionscope as the current Scope, and jumping UnionItems on two sides of each UNION one by one to S33) for recursive processing; finally, the holding field of the UnionScope is constructed in a mode of overlapping the holding field of each sub-Scope in the child ScopeList;
please refer to fig. 6 and 7, S4) traverse S3) the column blood margin link model to extract the column operator blood margin; after the abstract syntax tree AST is obtained, a column operator blood relationship model needs to be extracted from the root Scope in a traversal mode, and the column operator blood relationship model comprises three contents: operator, physical Column and virtual Column VirtualColumn.
Operators in the column Operator blood margin are named as operators in 6 types:
define the SELECT operator: extracting the content of each projection item, namely the content after the SELECT and before the AS, from the project scope; the WHERE operator: extracting the contents after WHERE from Projectscope; GROUP operator: extracting the content after GROUP from Projectscope; HAVING operator: havinging content extracted from the project scope; JOIN operator: extracting the content after the ON condition from the JoinScope; UNION operator: used for aggregating multiple SELECT statements, extracted from Unionscope.
The SELECT operator and the UNION operator belong to a direct blood relationship and are used for tracing the processing logic and the processing link of the column, and the WHERE operator, the GROUP operator, the HAVING operator and the JOIN operator belong to an indirect blood relationship and are used for analyzing the influence of the column.
The physical columns Column are real input physical columns and output physical columns, and belong to metadata information. The alias part in the SELECT item AS alias in the sub-query is defined AS the virtual column VirtualColumn.
Please refer to fig. 7 with emphasis, for example: INSERT INTO t2 (a, b) SELECT tt1.A AS a, tt1.B AS b FROM (SELECT a AS a, MAX (b + c) AS b FROM t1 WHERE d > 1) AS tt1.
The column operator blood margin model is an input physical column, a group of operators and virtual columns are output after each layer of sub-query, wherein the operators are output to the virtual columns, the operators queried at the outermost layer are directly output to the output physical columns and are not output to the virtual columns, and links of the column operator blood margin model are as follows:
input physical column- > (operator- > virtual column) - > (\8230;) - > operator- > output physical column.
Namely, column- - > (Operator- - > VirtualColumn) - - > (\8230; - - > Operator- - > Column.
Please refer to fig. 8, S5) traversing operator blood margins to construct point-edge relationships, and storing the column operator blood margins into a graph database.
In a graph database, three point types of a corresponding physical column, an operator and a virtual column are created, a column operator blood relationship model obtained in S4) is traversed, a corresponding point-edge relationship is constructed, and the point-edge relationship is stored in the graph database:
INSERT INTO t2(a,b)SELECT tt1.a AS a,tt1.b AS b FROM(SELECT a AS a,MAX(b+c)AS b FROM t1 WHERE d>1)AS tt1。
FIG. 8 is a schematic diagram of the SQL column operator blood relationship stored in the graph database.
Example 2
The invention provides a server comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of embodiment 1 when executing the program.
Example 3
The invention provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as described in embodiment 1.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims (6)

1. The column operator blood margin construction method is characterized by comprising the following steps,
s1) generating a ParseTree of an analytic tree by analyzing SQL through Antler;
s2) designing an abstract syntax tree AST for constructing a blood margin link from an input column to an output column and a middle processing logic;
s3) recursively traversing the parse tree ParseTree obtained in S1) to construct an abstract syntax tree AST designed in S2);
s4) traversing the constructed abstract syntax tree AST in the S3) and extracting a column operator blood relationship model;
s5) traversing the column operator blood relationship model, constructing a point-edge relation, and storing the column operator blood relationship into a database.
2. The column operator blood relationship construction method according to claim 1, wherein S2) comprises the steps of:
s21) based on relational algebra, dividing a complete SQL into at least one paragraph, wherein each paragraph is a trunk, designing a tree structure Scope for abstracting the corresponding trunk of the SQL, and establishing a hierarchical relationship between the trunks;
s22) Scope is divided into an input Scope and an output Scope;
wherein: the input type Scope is divided into Projectscope, joinscope, unionscope and Scanscope and is used for abstracting an input part in the SQL statement;
the output type Scope is divided into CreateAsScope and InsertScope and is used for abstracting an output part in an SQL statement;
wherein: each Scope comprises the type of the Scope, the alias of the Scope and the externally exposed field of the Scope, the parent-child relationship among different scopes is recorded through indexes, and the outer layer trunk set of the Scope is set to be a parentsscopelist, namely a father Scope set; setting an inner layer trunk set of Scope to be a child Scope List, namely a child Scope set, thereby constructing a parent-child relationship between different scopes;
s23) defining the 6 scopes divided in the S22) and corresponding to the AST;
s24) defining the externally exposed field of each Scope as the holding field set of the Scope, wherein the holding field set of each Scope can be referred by the parent Scope in the parentgeoList of the Scope;
s25) adding a source attribute expressionoorigin to each holding field in the holding field set to record the source information of the holding field, wherein the source attribute expressionoorigin is used for constructing a blood-edge link between the input column and the output column;
s26) establishing an expression for recording the processing logic of the column and the source of the column;
s27) each Scope uses the held field information of the child Scope in the child Scope List to fill the held field information of the current Scope, and then the held field information of the current Scope is transmitted to the parent Scope, so that a processing link between an output column and a source column is constructed, and further the blood relationship information of a column operator is extracted.
3. The column operator consanguinity construction method according to claim 1, characterized in that S3) comprises the following steps:
s31) recursively traversing the resolution tree ParseTree generated in S1) from the root node;
s32) according to the type of SQL, the root node is interpreted as CreateAsScope or InsertScope, and then the child nodes are traversed;
s33) if the node type is the query node SELECT xx from, interpreting the node type as project scope, and constructing expressions after SELECT, WHERE, GROUP BY, HAVING and ORDER BY; analyzing the child node after FROM as the corresponding Scope, adding the child node into the child Scope list of the current Projectscope, waiting for the holding field information of the Scope in the child Scope list to fill the source information of the expression of the current Projectscope, and then filling the holding field of the current Projectscope with the expression information;
s34) if the child node after the FROM is the physical table node, creating a Scope as a current Scope, and filling a holding field of the current Scope according to metadata information;
if the child node after the FROM is the child query node, jumping to S33) to perform recursion processing;
if the child node after FROM contains more than one JOIN node, creating a JoinScope as the current Scope, constructing all expressions after ON condition, and jumping the JoinItems ON two sides of each JOIN one by one to S33) for recursive processing; finally, the holding field of the JoinScope is constructed in a mode that the holding field of each sub-Scope in the child Scope List is widened, and the source information of all ON expressions is filled;
if the child node after FROM comprises more than one UNION node, creating a Unionscope as the current Scope, and jumping UnionItems on two sides of each UNION one by one to S33) for recursive processing; finally, the holding field of the UnionScope is constructed in a mode of overlapping the holding field of each sub-Scope in the child ScopeList;
s35) traversing the parse tree ParseTree through S32) -S34) to construct the abstract syntax tree AST designed by S2).
4. The column operator consanguinity construction method according to claim 1, characterized in that S4) comprises the following steps:
s41) extracting information from Scope to construct a Column Operator blood margin model, and dividing the Column Operator blood margin model into an Operator, a physical Column and a virtual Column Virtualcolumn;
the Operator is divided into 6 types, including:
SELECT operator: extracting the content of each projection item, namely the content before AS after the SELECT, from the Projectscope;
the WHERE operator: extracting the content after WHERE from Projectscope;
GROUP operator: extracting the content after GROUP BY from Projectscope;
HAVING operator: havinging content extracted from the project scope;
JOIN operator: extracting the content after the ON condition from the JoinScope;
UNION operator: the method is used for aggregating a plurality of SELECT statements and extracting the SELECT statements from the Unionscope;
wherein, the SELECT operator and the UNION operator belong to a direct blood margin and are used for tracing the processing logic and the processing link of the column, and the WHERE operator, the GROUP operator, the HAVING operator and the JOIN operator belong to an indirect blood margin and are used for analyzing the influence surface of the column;
the physical Column comprises a real input physical Column and an output physical Column, and belongs to metadata information, wherein the input physical Column is extracted from ScanScape, and the output physical Column is extracted from CreateAsScope or InsertScope;
virtual column VirtualColumn: the alias part in the SELECTitemAS alias in the subquery is extracted from the ProjectScope.
5. A server comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of claims 1 to 4 when the program is executed by the processor.
6. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
CN202211526166.8A 2022-11-30 2022-11-30 Column operator blood relationship construction method, server and computer readable storage medium Pending CN115757525A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211526166.8A CN115757525A (en) 2022-11-30 2022-11-30 Column operator blood relationship construction method, server and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211526166.8A CN115757525A (en) 2022-11-30 2022-11-30 Column operator blood relationship construction method, server and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115757525A true CN115757525A (en) 2023-03-07

Family

ID=85341764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211526166.8A Pending CN115757525A (en) 2022-11-30 2022-11-30 Column operator blood relationship construction method, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115757525A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118012964A (en) * 2024-04-08 2024-05-10 天津南大通用数据技术股份有限公司 Blood relationship generation method based on relational algebra

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118012964A (en) * 2024-04-08 2024-05-10 天津南大通用数据技术股份有限公司 Blood relationship generation method based on relational algebra

Similar Documents

Publication Publication Date Title
US11086751B2 (en) Intelligent metadata management and data lineage tracing
CN106897322B (en) A kind of access method and device of database and file system
US8005818B2 (en) Apparatus and method for maintaining metadata version awareness during set evaluation for OLAP hierarchies
US11847040B2 (en) Systems and methods for detecting data alteration from source to target
US10789295B2 (en) Pattern-based searching of log-based representations of graph databases
US20130124521A1 (en) Method, apparatus, and program for supporting creation and management of metadata for correcting problem in dynamic web application
WO2014186057A1 (en) Supporting combination of flow based etl and entity relationship based etl
US9182947B2 (en) Program source code navigation
Rozsnyai et al. Large-scale distributed storage system for business provenance
US20180357278A1 (en) Processing aggregate queries in a graph database
CN110889013B (en) Data association method, device, server and storage medium based on XML
Cheney et al. Database queries that explain their work
CN115757525A (en) Column operator blood relationship construction method, server and computer readable storage medium
US10983997B2 (en) Path query evaluation in graph databases
US20210264312A1 (en) Facilitating machine learning using remote data
US20180113908A1 (en) Transforming and evaluating missing values in graph databases
CN113343036B (en) Data blood relationship analysis method and system based on key topological structure analysis
Dhakal et al. Library Tweets Conversion
Baqasah et al. Maintaining schema versions compatibility in cloud applications collaborative framework
Massari et al. Performing live time-traversal queries via SPARQL on RDF datasets
Aljarallah Comparative study of database modeling approaches
Anzum Systems for Graph Extraction from Tabular Data
Rahman et al. Model migration approach for database preservation
Schuchardt et al. Applying content management to automated provenance capture
US11250010B2 (en) Data access generation providing enhanced search models

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination