CN112347123A - Data blood margin analysis method and device and server - Google Patents

Info

Publication number
CN112347123A
Authority
CN
China
Prior art keywords
target
statement
data
data table
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011249929.XA
Other languages
Chinese (zh)
Other versions
CN112347123B (en)
Inventor
果然
孙茂林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd filed Critical Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202011249929.XA priority Critical patent/CN112347123B/en
Publication of CN112347123A publication Critical patent/CN112347123A/en
Application granted granted Critical
Publication of CN112347123B publication Critical patent/CN112347123B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/242: Query formulation
    • G06F16/2433: Query languages
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data lineage (rendered literally as "data blood margin") analysis method, device and server. Based on an abstract syntax tree established for a target SQL statement, the method extracts the query statement related to the query operation in the target SQL statement; according to the target column name contained in the query statement, it acquires the target data table associated with the target column name and the associated target data column; and it then determines the data lineage analysis result of the target SQL statement. Because the column names involved in the SQL statement are analyzed on the basis of the corresponding abstract syntax tree to obtain the data tables and data columns related to those column names, a more accurate data lineage analysis result can be obtained even for complex SQL statements. The method is also flexible: it is applicable to a variety of SQL statement source codes, can meet the requirements of more complex data lineage analysis, and has a wide application range.

Description

Data lineage analysis method, device and server
Technical Field
The invention relates to the technical field of data processing, and in particular to a data lineage analysis method, device and server.
Background
Data lineage (rendered literally as "data blood margin" or "data blood relationship") refers to the relationships formed between data as the data is generated, processed and merged, circulated, and finally retired. Data lineage analysis can be used to analyze the impact of changes in upstream data on downstream data, to trace the upstream source of a problem when downstream data changes, and so on.
In the related art, when data lineage analysis is performed to determine which physical table a given column in an SQL (Structured Query Language) statement originates from, a string matching method may be used; for example, the column names and table names contained in the SQL statement are matched against each physical table. However, the SQL statements of different databases and versions may differ, and because of this diversity the string matching method cannot handle complex SQL statements and cannot guarantee the accuracy of the analysis result. In another approach, for an open-source database, an input table and an output table can be obtained from an internal interface in the database's source code, and the physical table from which the data originates is analyzed based on the obtained input and output tables. However, this approach lacks flexibility, cannot meet the requirements of more complex data lineage analysis, and has a limited application range.
Disclosure of Invention
In view of the above, the present invention provides a data lineage analysis method, device and server, so as to improve the flexibility with which data lineage analysis can be applied and the accuracy of the analysis result.
In a first aspect, an embodiment of the present invention provides a data lineage analysis method, including: establishing an abstract syntax tree of a target SQL statement; extracting, based on the abstract syntax tree, the query statement related to the query operation in the target SQL statement; acquiring, according to the target column name contained in the query statement, a target data table associated with the target column name and a target data column associated with the target column name in the target data table; and determining a data lineage analysis result of the target SQL statement according to the target data table and the target data column.
Further, the step of extracting the query statement related to the query operation in the target SQL statement includes: if the operation type of the target SQL statement is a query operation, determining the target SQL statement as the query statement related to the query operation; and if the operation type of the target SQL statement is an insertion operation, an update operation or a deletion operation, extracting a query statement related to the query operation from the target SQL statement based on a preset query operation keyword.
Further, the step of acquiring, according to the target column name contained in the query statement, a target data table associated with the target column name and a target data column associated with the target column name in the target data table includes: extracting, from the abstract syntax tree, the target data tables corresponding to the target column name contained in the query statement and the hierarchical relationship between the target data tables; establishing a multi-way tree of the query statement according to the from statements in the query statement, where each from statement corresponds to one node in the multi-way tree; and traversing the multi-way tree according to the target data tables and the hierarchical relationship between them, obtaining from the multi-way tree the target data table associated with the target column name and the target data column associated with the target column name in that table, until the data source table at the bottommost layer is reached.
Further, the step of extracting, from the abstract syntax tree, a target data table corresponding to a target column name included in the query statement and a hierarchical relationship between the target data tables includes: extracting item entries and types of the item entries contained in the query statement from the abstract syntax tree; for each item entry, extracting a data column related to the item entry from the item entry in a manner of matching with the type of the item entry; determining the data table to which the data column related to each item entry belongs as a target data table corresponding to a target column name contained in the query statement; and determining the hierarchical relationship between the data tables of the data columns related to the item entries as the hierarchical relationship between the target data tables corresponding to the target column names.
Further, the step of extracting the item entries contained in the query statement includes: if the query statement includes a union statement, extracting item entries separately from the statements on the two sides of the union statement.
Further, the step of establishing a multi-way tree of the query statement according to the from statement in the query statement includes: establishing an initial node of the multi-way tree and determining the initial node as the current node; obtaining, according to the hierarchical structure of the query statement, a from statement in the query statement from the abstract syntax tree; if a from statement is obtained, determining the from statement as a child node of the current node; and performing subsequent processing on the child node according to the type of the statement connected to the from statement.
Further, the step of obtaining the from statement in the query statement from the abstract syntax tree includes: if the query statement contains a union statement, respectively taking statements at two sides of the union statement as updated query statements, and executing the step of acquiring from statements in the query statement from the abstract syntax tree.
Further, the step of performing subsequent processing on the child node according to the type of the statement connected to the from statement includes: if the from statement is connected to a data table, extracting the data table and recording it in the child node; if the from statement is connected to a join statement, performing subsequent processing on the statements on the two sides of the join statement according to their respective types; and if the from statement is connected to a subquery statement, determining the child node as the updated current node, determining the subquery statement as the updated query statement, and continuing to execute the step of obtaining the from statement in the query statement from the abstract syntax tree until the lowest level of the abstract syntax tree is reached.
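The construction of the multi-way tree described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the node classes `Table`, `Join`, `Select`, `UnionQuery` and `Node`, and the functions `add_query` and `attach_source`, are hypothetical stand-ins for whatever the AST parser actually produces.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Table:            # hypothetical AST node: a physical table after "from"
    name: str


@dataclass
class Join:             # hypothetical AST node: <left> join <right>
    left: object
    right: object


@dataclass
class Select:           # hypothetical AST node: select ... from <source>
    source: object      # a Table, Join, Select or UnionQuery


@dataclass
class UnionQuery:       # hypothetical AST node: <left> union <right>
    left: object
    right: object


@dataclass
class Node:             # one node of the multi-way tree (one from statement)
    tables: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def add_query(current: Node, query) -> None:
    """Add the from statement(s) of `query` as child node(s) of `current`."""
    if isinstance(query, UnionQuery):
        # Each side of a union is handled as its own query.
        add_query(current, query.left)
        add_query(current, query.right)
        return
    child = Node()
    current.children.append(child)
    attach_source(child, query.source)


def attach_source(node: Node, source) -> None:
    """Process whatever is connected to the from statement."""
    if isinstance(source, Table):
        node.tables.append(source.name)        # record the data table
    elif isinstance(source, Join):
        attach_source(node, source.left)       # handle both sides of the join
        attach_source(node, source.right)
    else:
        add_query(node, source)                # subquery: descend one level


# select ... from t1 join (select ... from t2 union select ... from t3)
query = Select(Join(Table("t1"),
                    UnionQuery(Select(Table("t2")), Select(Table("t3")))))
root = Node()
add_query(root, query)
```

Here the root has one child holding `t1`, and that child has two children holding `t2` and `t3`, so walking downward from any node eventually reaches the bottommost source tables.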
Further, the step of determining a data lineage analysis result of the target SQL statement according to the target data table and the target data column includes: acquiring data source information and database information of the target data table; and determining the data lineage analysis result of the target SQL statement according to the target data table, the target data column, and the data source information and database information of the target data table.
Further, before the step of determining the data lineage analysis result of the target SQL statement according to the target data table, the target data column, and the data source information and database information of the target data table, the method further includes: if the operation type of the target SQL statement is an insert, update or delete operation, acquiring the data table to be written that corresponds to the target column name and the column name to be written in that data table; and acquiring data source information and database information of the data table to be written. The step of determining the data lineage analysis result of the target SQL statement then includes: determining the target data table, the target data column, the data source information and database information of the target data table, the data table to be written, the column name to be written, and the data source information and database information of the data table to be written as the data lineage analysis result of the target SQL statement.
Further, the step of obtaining the to-be-written column name in the to-be-written data table corresponding to the target column name includes: acquiring a to-be-written column name in the to-be-written data table corresponding to the target column name from the target SQL statement; and if the column name to be written cannot be obtained from the target SQL statement, obtaining all column names of the data table to be written according to the metadata of the data table to be written, and determining all the column names as the column name to be written.
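The fallback from the statement's own column list to the table metadata can be sketched as follows. This is only an illustration under simplifying assumptions: the function name `columns_to_write` and the `metadata` dictionary are hypothetical, and a regular expression stands in for reading the column list from the abstract syntax tree.

```python
import re

# Matches "insert into <table>" optionally followed by "(<col>, <col>, ...)".
_INSERT_HEAD = re.compile(
    r"\s*insert\s+into\s+([\w.]+)\s*(?:\(([^)]*)\))?",
    re.IGNORECASE,
)


def columns_to_write(insert_sql, metadata):
    """Return (table to be written, column names to be written).

    `metadata` is an assumed mapping from table name to its full column list.
    """
    match = _INSERT_HEAD.match(insert_sql)
    if match is None:
        raise ValueError("not an insert statement")
    table, column_list = match.group(1), match.group(2)
    if column_list:
        # The statement names the columns to be written explicitly.
        return table, [c.strip() for c in column_list.split(",")]
    # No column list in the statement: fall back to all columns from metadata.
    return table, list(metadata[table])


explicit = columns_to_write("INSERT INTO dst (a, b) SELECT x, y FROM src", {})
implicit = columns_to_write("insert into dst select x, y, z from src",
                            {"dst": ["a", "b", "c"]})
```

In the second call the statement carries no column list, so all columns recorded for `dst` in the metadata are treated as the columns to be written.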
In a second aspect, an embodiment of the present invention provides a data lineage analysis device, including: an establishing module, configured to establish an abstract syntax tree of a target SQL statement; an extraction module, configured to extract, based on the abstract syntax tree, the query statement related to the query operation in the target SQL statement; an acquisition module, configured to acquire, according to the target column name contained in the query statement, a target data table associated with the target column name and a target data column associated with the target column name in the target data table; and a determining module, configured to determine a data lineage analysis result of the target SQL statement according to the target data table and the target data column.
In a third aspect, an embodiment of the present invention provides a server, including a processor and a memory, where the memory stores machine-executable instructions executable by the processor, and the processor executes the machine-executable instructions to implement the data lineage analysis method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a machine-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the data lineage analysis method according to any one of the first aspect.
According to the data lineage analysis method, device and server provided by the embodiments of the invention, the query statement related to the query operation in the target SQL statement is extracted based on the established abstract syntax tree of the target SQL statement; the target data table associated with the target column name and the associated target data column are acquired according to the target column name contained in the query statement; and the data lineage analysis result of the target SQL statement is then determined. Because the column names involved in the SQL statement are analyzed on the basis of the corresponding abstract syntax tree to obtain the data tables and data columns related to those column names, a more accurate data lineage analysis result can be obtained even for complex SQL statements. The method is also flexible: it is applicable to a variety of SQL statement source codes, can meet the requirements of more complex data lineage analysis, and has a wide application range.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a data lineage analysis method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another data lineage analysis method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another data lineage analysis method according to an embodiment of the present invention;
FIG. 4 is a flowchart of another data lineage analysis method according to an embodiment of the present invention;
FIG. 5 is an overall flowchart of a data lineage analysis method according to an embodiment of the present invention;
FIG. 6 is a flowchart of acquiring the correspondence between a column and a physical table according to an embodiment of the present invention;
FIG. 7 is a flowchart of obtaining the correspondence between a downstream column and the tables of the current level according to an embodiment of the present invention;
FIG. 8 is a flowchart of a method for constructing a multi-way tree according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a data lineage analysis device according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Data lineage (rendered literally as "data blood margin") refers to the relationships formed between data as the data is generated, processed and merged, circulated, and finally retired. These relationships are described by analogy with kinship in human society, hence the literal term "blood relationship" of data. Data lineage is one component of metadata. It can be used to analyze the lineage path of a table or field from the data source to the current table, whether the relationships that should hold between the fields along that path are satisfied, the data consistency of concern, and the soundness of the table design. It can also be used to analyze the impact of changes in upstream data on downstream data, to trace the upstream source of a problem when downstream data changes, and so on. In a big data cloud product, data developers develop various jobs to process data, and these jobs depend on one another. When one of the jobs fails, the data developer needs to locate the problem; when the structure of an upstream table is modified, it is necessary to find the associated downstream tables and evaluate the impact on them; and when the data of an upstream table changes, it is necessary to determine which associated data needs to be rerun.
In the related art, when data lineage analysis is performed, a user needs to determine which physical table a given column of data in an SQL statement actually comes from, and one of the following two approaches is usually adopted. The first approach analyzes the statement by string matching. However, SQL statements may contain many subqueries, the SQL statements of different databases and versions differ considerably, and the flexibility and diversity of SQL make complete string-based matching difficult to implement; this approach therefore cannot handle complex SQL statements and cannot guarantee the accuracy of the analysis result.
In the second approach, data lineage is analyzed using an input table and an output table obtained from an internal interface in the source code of some open-source databases. For a given column of data, the input table can be understood as the source table of that column, and the output table as the data table into which the column is stored; for example, LineageInfo in Hive can be used to obtain the input table and the output table. However, this approach lacks flexibility. For example, when tracing the source of a column, it is often also necessary to trace which methods and functions were applied to the column as it was passed along the lineage, and supporting such customized lineage relationships requires going deep into each code base and doing a large amount of development. In addition, the SQL dialects of different data sources are not consistent and may include their own specific syntax or functions, so a lineage parsing function has to be developed separately for each data source; for example, the SQL syntax of MySQL and Hive usually differs, and each may include custom functions or special syntax. Moreover, for non-open-source products, the internal parsing method is not available, and the result cannot be associated with database metadata.
Based on this, the data lineage analysis method, device and server provided by the embodiments of the invention can be applied to scenarios in which data lineage is analyzed. To facilitate understanding of the present embodiment, the data lineage analysis method disclosed in the embodiment is first described in detail. As shown in FIG. 1, the method includes the following steps:
Step S102, establishing an abstract syntax tree of the target SQL statement.
SQL can be understood as a database query and programming language used to access data and to query, update and manage relational database systems. An Abstract Syntax Tree (AST) can be understood as a tree representation of the abstract syntactic structure of source code, in which each node represents one construct in the source code. The abstract syntax tree does not depend on the concrete grammar of the source language and does not represent every detail of the real syntax; once the source code has been converted into an AST, operations can be performed on the AST to implement the required functions. In actual implementation, the target SQL statement may be an SQL statement input by a user or an SQL statement received through an API (Application Programming Interface) call; when data lineage analysis needs to be performed on the target SQL statement, an abstract syntax tree matching the SQL statement first needs to be established.
Step S104, extracting the query statement related to the query operation in the target SQL statement based on the abstract syntax tree.
The query statement related to the query operation may be, for example, the statement of the select part of an SQL statement. After the abstract syntax tree matching the target SQL statement has been established, the query statement related to the query operation, such as the select part, can be extracted from the abstract syntax tree.
Step S106, according to the target column name contained in the query statement, acquiring a target data table associated with the target column name and a target data column associated with the target column name in the target data table.
The target column name can be understood as a column name to be queried that is contained in the query statement; it can be obtained by parsing the query statement in the target SQL statement. The target data table is a table that contains the target column name, and the data column in the target data table that is associated with the target column name can be understood as the target data column. For example, in select c1 from t, the output column c1 is obtained from column c1 of table t: c1 is the target column name, t is the target data table, and c1 in table t is the target data column. In actual implementation, there may be multiple target data tables associated with a target column name and, correspondingly, multiple target data columns associated with that column name in those tables.
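The idea of following a target column name down to its source table can be illustrated with a small sketch. The representation below is an assumption made for illustration only: each query level is a dictionary mapping an output column to a list of (source, source column) pairs, where the source is either a physical table name or a nested query level; the function name `resolve` is hypothetical.

```python
def resolve(level, column):
    """Return the physical (table, column) pairs that `column` originates from."""
    origins = []
    for source, source_column in level[column]:
        if isinstance(source, str):
            # The source is a physical table: this is a target data column.
            origins.append((source, source_column))
        else:
            # The source is a nested query level: keep tracing downward.
            origins.extend(resolve(source, source_column))
    return origins


# select c1 from t
simple = {"c1": [("t", "c1")]}

# select x from (select c1 as x from t union select c2 as x from u) s
inner = {"x": [("t", "c1"), ("u", "c2")]}
outer = {"x": [(inner, "x")]}

simple_origin = resolve(simple, "c1")
nested_origin = resolve(outer, "x")
```

The second query shows the case mentioned above in which one target column name is associated with several target data tables and target data columns.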
Step S108, determining a data lineage analysis result of the target SQL statement according to the target data table and the target data column.
The data lineage analysis result generally includes the link relationships between the obtained target data tables, the processing methods or processes applied between the target data columns of target data tables at different levels, and so on. The data lineage analysis result of the target SQL statement is determined based on the target data table and the target data column acquired in the preceding step.
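One possible way to hold such a result is a list of column-level edges. The sketch below is an assumed representation, not one prescribed by the source: the class `LineageEdge`, its field names and the helper `downstream_of` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One link of the result: a source column feeding a downstream column."""
    source_table: str
    source_column: str
    target_table: str
    target_column: str
    transform: str = ""   # processing applied on the way, if any


def downstream_of(edges, table, column):
    """List the downstream columns affected by a change to table.column."""
    return [(e.target_table, e.target_column)
            for e in edges
            if e.source_table == table and e.source_column == column]


# insert into dst (a, b) select upper(c1), c2 from t
edges = [
    LineageEdge("t", "c1", "dst", "a", transform="upper"),
    LineageEdge("t", "c2", "dst", "b"),
]
affected = downstream_of(edges, "t", "c1")
```

With edges stored this way, the impact analysis mentioned earlier (which downstream data is affected when upstream data changes) becomes a simple filter over the edge list.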
According to the data lineage analysis method provided by the embodiment of the invention, the query statement related to the query operation in the target SQL statement is extracted based on the established abstract syntax tree of the target SQL statement; the target data table associated with the target column name and the associated target data column are acquired according to the target column name contained in the query statement; and the data lineage analysis result of the target SQL statement is then determined. Because the column names involved in the SQL statement are analyzed on the basis of the corresponding abstract syntax tree to obtain the data tables and data columns related to those column names, a more accurate data lineage analysis result can be obtained even for complex SQL statements. The method is also flexible: it is applicable to a variety of SQL statement source codes, can meet the requirements of more complex data lineage analysis, and has a wide application range.
The embodiment of the invention further provides another data lineage analysis method, implemented on the basis of the method of the above embodiment. This method mainly describes the specific process of extracting the query statement related to the query operation in the target SQL statement. As shown in FIG. 2, the method includes the following steps:
Step S202, establishing an abstract syntax tree of the target SQL statement.
In actual implementation, Druid (an open-source database connection pool) may be used to construct an SQLStatement, that is, the abstract syntax tree, according to the target SQL statement and the data source type; the abstract syntax tree of the target SQL statement may also be established in other ways. Here, a data source can be understood as the database or database server used by a database application, and the data source type may be, for example, MySQL or Hive. For the same SQL statement under different data source types, the constructed ASTs are usually essentially the same, although the construction process usually differs. Druid includes the Druid SQL Parser; the built-in SQL parser is used to implement functions such as defending against SQL injection, merging statistics for non-parameterized SQL, SQL formatting, and database and table sharding.
Step S204, based on the abstract syntax tree, if the operation type of the target SQL statement is the query operation, the target SQL statement is determined as the query statement related to the query operation.
In actual implementation, it is usually determined whether the operation type of the target SQL statement is a query operation, an insert operation, an update operation, or a delete operation, and if the operation type is a query operation, such as a select query, the target SQL statement is determined as a query statement related to the query operation.
Step S206, if the operation type of the target SQL statement is an insert operation, an update operation or a delete operation, extracting the query statement related to the query operation from the target SQL statement based on a preset query operation keyword.
The query operation keyword can be understood as a keyword of a query operation, such as select, that appears in a target SQL statement whose operation type is an insert, update or delete operation; the keywords may be preset according to the actual situation. If the operation type of the target SQL statement is insert, update or delete, the query statement related to the query operation can be extracted from the target SQL statement based on the preset query operation keywords and the abstract syntax tree. In other words, when the operation type of the target SQL statement is an insert, update or delete operation, the query part generally needs to be parsed out of the target SQL statement first.
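The branching in steps S204 and S206 can be sketched as below. This is a simplified illustration only: it searches the statement text for the preset keyword, whereas the method described here works on the abstract syntax tree. The names `QUERY_KEYWORDS` and `extract_query` are hypothetical, and the text search would be misled by, for example, the word select inside a string literal.

```python
import re

QUERY_KEYWORDS = ("select",)   # preset query operation keywords (assumed list)


def extract_query(sql):
    """Return the query part of `sql`, or None if it contains no query."""
    text = sql.strip()
    operation = text.split(None, 1)[0].lower()
    if operation in QUERY_KEYWORDS:
        # Query operation: the whole statement is the query statement.
        return text
    if operation in ("insert", "update", "delete"):
        # Write operation: locate the first preset keyword and keep the rest.
        pattern = r"\b(?:" + "|".join(QUERY_KEYWORDS) + r")\b"
        match = re.search(pattern, text, re.IGNORECASE)
        return text[match.start():] if match else None
    raise ValueError("unsupported operation type: " + operation)


from_select = extract_query("select c1 from t")
from_insert = extract_query("INSERT INTO dst SELECT c1 FROM t")
from_delete = extract_query("DELETE FROM t WHERE id = 1")
```

A delete statement without any nested query yields `None`, meaning there is no query part to analyze.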
Step S208, according to the target column name included in the query statement, acquiring a target data table associated with the target column name and a target data column associated with the target column name in the target data table.
Step S210, determining a data blood relationship analysis result of the target SQL statement according to the target data table and the target data column.
In another method for analyzing data consanguinity provided in the embodiments of the present invention, a specific process of extracting a query statement related to a query operation in a target SQL statement is mainly described, and after an abstract syntax tree of the target SQL statement is established, if an operation type of the target SQL statement is the query operation, the target SQL statement is determined as the query statement related to the query operation. If the operation type of the target SQL statement is an inserting operation, an updating operation or a deleting operation, extracting a query statement related to the query operation from the target SQL statement based on a preset query operation keyword; and acquiring a target data table associated with the target column name and the associated target data column according to the target column name contained in the query statement, and further determining a data blood relationship analysis result of the target SQL statement. In the method, based on an abstract syntax tree corresponding to an SQL statement, a column name related to the SQL statement is analyzed to obtain a data list and a data column related to the column name, and further obtain a data blood margin analysis result of the SQL statement; even if the SQL sentence is more complex, a more accurate data blood relationship analysis result can be obtained through the method; meanwhile, the method is flexible, suitable for various SQL statement source codes, capable of meeting the requirement of more complex data blood relationship analysis and wide in application range.
The embodiment of the invention also provides another data lineage analysis method, which is implemented on the basis of the method of the above embodiment. This method mainly describes the specific process of acquiring, according to the target column name contained in the query statement, the target data table associated with the target column name and the target data column associated with the target column name in the target data table. As shown in FIG. 3, the method includes the following steps:
Step S302, establishing an abstract syntax tree of the target SQL statement.
Step S304, extracting the query statement related to the query operation from the target SQL statement based on the abstract syntax tree.
Step S306, extracting the target data table corresponding to the target column name contained in the query statement and the hierarchical relation between the target data tables from the abstract syntax tree.
The hierarchical relationship between the target data tables can arise through subqueries. A subquery can be understood as a way of writing a complex query in which several smaller queries with different functions are nested inside one complete query statement so that they accomplish the query together; it can also be understood as one select query statement nested inside another select query statement. In actual implementation, the target data tables corresponding to the target column names contained in the query statement are extracted from the abstract syntax tree; there may be multiple target data tables, in which case the hierarchical relationship among them is also obtained. Specifically, step S306 can be implemented by the following steps one to four:
Step one, extracting the item entries contained in the query statement, and the item entry types, from the abstract syntax tree.
An item entry can be a target column name, a function, or a subquery contained in the query statement; the item entry type can be a constant, a method, an aggregation function, and the like. In practical implementation, it is usually necessary to first determine whether the query statement includes a union statement; a union statement generally joins two or more select statements and merges their result sets. If the query statement does not include a union statement, the item entries and item entry types contained in the select query statement are extracted; if the query statement includes a union statement, item entries and item entry types are extracted separately from the statements on the two sides of the union statement.
Step two, for each item entry, extracting the data columns involved in the item entry in a manner that matches the type of the item entry.
In practical implementation, since item entries generally come in multiple types, the data columns involved in each item entry need to be extracted according to its type. For example, if the item entry is a constant, such as 1 or "a", no data column is usually involved; if the item entry is a method, such as place (t.a, "a"), the data columns involved in the method, such as t.a, are extracted; if the item entry is an aggregation function, a type-conversion method, case syntax, a binary operator, a multi-operand operator, or the like, corresponding processing is performed according to the specific type so as to extract the data columns involved in the item entry.
Step three, determining the data table to which the data column involved in each item entry belongs as a target data table corresponding to the target column name contained in the query statement.
Based on the data columns extracted from each item entry in the above steps, the data table to which each data column belongs is determined, and these data tables are taken as the target data tables corresponding to the target column names.
Step four, determining the hierarchical relationship between the data tables to which the data columns involved in the item entries belong as the hierarchical relationship between the target data tables corresponding to the target column names.
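Steps one to three can be sketched in Python as a type-directed walk over item entries. This is only an illustration under stated assumptions: the classes `Constant`, `Column`, `Call`, and `BinaryOp` are a hypothetical miniature stand-in for the item nodes of a real abstract syntax tree, and `columns_of` and `tables_of` are illustrative helper names, not part of the described method.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Constant:          # e.g. 1 or "a": involves no data column
    value: object


@dataclass
class Column:            # e.g. t.a
    table: str           # table name or alias as written in the query
    name: str


@dataclass
class Call:              # a method or aggregation function, e.g. sum(t.a)
    func: str
    args: List["Item"] = field(default_factory=list)


@dataclass
class BinaryOp:          # e.g. t.b + s.c
    op: str
    left: "Item"
    right: "Item"


Item = Union[Constant, Column, Call, BinaryOp]


def columns_of(item):
    """Step two: extract the data columns of one item, by item type."""
    if isinstance(item, Constant):
        return []
    if isinstance(item, Column):
        return [item]
    if isinstance(item, Call):
        cols = []
        for arg in item.args:
            cols.extend(columns_of(arg))
        return cols
    if isinstance(item, BinaryOp):
        return columns_of(item.left) + columns_of(item.right)
    raise TypeError(f"unsupported item type: {type(item).__name__}")


def tables_of(items):
    """Step three: tables that the extracted columns belong to, first-seen order."""
    seen = []
    for item in items:
        for col in columns_of(item):
            if col.table not in seen:
                seen.append(col.table)
    return seen
```

Other item types mentioned in the text (case syntax, type conversion, subqueries) would each need their own branch in `columns_of`; they are omitted here to keep the sketch short.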
Step S308, establishing a multi-branch tree of the query statement according to the from statement in the query statement; wherein, each from statement corresponds to a node in the multi-branch tree.
Each node in the multi-branch tree usually holds a data item, and each node can have two or more child nodes. A from statement usually contains a table name, and the data table corresponding to that table name can be resolved from the AST; if the from statement contains a subquery, the data tables involved in the subquery become child nodes. In other words, each from statement corresponds to one node in the multi-branch tree, and the multi-branch tree corresponding to the query statement, which records the relationships between the subqueries, can be constructed from the from statements in the query statement. Specifically, step S308 can be implemented by the following steps five to eight:
Step five, establishing an initial node of the multi-branch tree, and determining the initial node as the current node.
The initial node may be understood as a root node of the multi-way tree, and the subsequent steps of establishing the multi-way tree are continuously performed with the initial node as a current node.
Step six, acquiring the from statement in the query statement from the abstract syntax tree according to the hierarchical structure of the query statement.
The hierarchical structure can be understood as the recursive, level-by-level structure of the query statement. In practical implementation, it is usually necessary to first determine whether the query statement includes a union statement. If it does not, the from statement in the query statement is obtained from the abstract syntax tree according to the hierarchical structure; if it does, the statements on the two sides of the union statement are each taken as an updated query statement, and the step of obtaining the from statement in the query statement from the abstract syntax tree is executed for each.
Step seven, if the from statement is obtained, determining the from statement as a child node of the current node.
The child node can be understood as a next-level node of the current node; and if the from statement in the query statement is acquired from the abstract syntax tree, inserting the from statement as a child node of the current node.
Step eight, performing subsequent processing on the child node according to the type of the statement connected with the from statement.
In practical implementation, the statement connected with the from statement may be of several different types, for example a single-table type, a join statement type, or a subquery type, and the child node needs to be processed according to the type. Specifically, step eight can be implemented by the following steps A to C:
Step A, if the from statement is connected with a data table, extracting the data table and recording the data table into the child node.
If the from statement is connected with a data table, which is equivalent to the type of the statement connected with the from statement being the single table type, then the data table can be extracted and recorded into the corresponding child node.
Step B, if the from statement is connected with a join statement, performing subsequent processing on the statements on the two sides of the join statement according to their respective types.
A from statement specifies the table or tables from which data is queried, and a query over multiple tables is usually expressed with join. If the from statement is connected with a join statement, the statements on the two sides of the join statement are processed according to their types; for example, if both sides are of the single-table type, the data tables on the two sides of the join statement can each be extracted.
Step C, if the from statement is connected with a subquery statement, determining the child node as the updated current node, determining the subquery statement as the updated query statement, and continuing to execute the step of acquiring the from statement in the query statement from the abstract syntax tree until the lowest level of the abstract syntax tree is reached.
If the from statement is connected with subquery statements, each subquery statement needs to be processed separately: the from statement in the subquery statement is obtained from the abstract syntax tree and, if obtained, is determined as a next-level child node of the child node, and corresponding processing is performed according to the type of the statement connected with it. During execution, it is judged whether the from statement is in the most basic form, from t; if not, the from statement is processed recursively according to its type until the lowest level of the abstract syntax tree is reached.
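Steps five to eight, including steps A to C, can be sketched in Python as a short recursion. The nested-dictionary query shape used here (`{"from": ...}`, `{"union": [...]}`, `{"table": ...}`, `{"join": [...]}`, `{"subquery": ...}`) is a hypothetical stand-in for the abstract syntax tree, and `Node`, `build_tree`, and `attach_source` are illustrative names rather than part of the described method.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of the multi-branch tree; a non-root node is one from statement."""
    label: str = "root"
    tables: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def build_tree(query, current):
    """Steps six and seven: find the from part and hang it under `current`."""
    if "union" in query:
        # Each side of the union is handled as an updated query statement.
        for side in query["union"]:
            build_tree(side, current)
        return current
    source = query.get("from")
    if source is None:
        return current
    child = Node(label="from")
    current.children.append(child)      # step seven
    attach_source(source, child)        # step eight
    return current


def attach_source(source, node):
    """Step eight: process the statement connected with the from statement."""
    if "table" in source:               # step A: single table
        node.tables.append(source["table"])
    elif "join" in source:              # step B: handle each side by its type
        for side in source["join"]:
            attach_source(side, node)
    elif "subquery" in source:          # step C: recurse with node as current
        build_tree(source["subquery"], node)
    else:
        raise ValueError(f"unknown from source: {source!r}")
```

For a query of the shape `select ... from (select ... from t1 join t2) x join t3`, the root gets one child recording `t3`, and that child gets one child recording `t1` and `t2`.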
Step S310, traversing the multi-branch tree according to the target data table and the hierarchical relation between the target data tables, and obtaining the target data table associated with the target column name and the target data column associated with the target column name in the target data table from the multi-branch tree until reaching the data source table at the bottommost layer.
The multi-branch tree obtained in the above steps is traversed according to the target data tables extracted from the abstract syntax tree and the hierarchical relationship between them. For example, a recursive traversal can be used that visits the current node first and then traverses its child nodes; alternatively, a level-order traversal can be used, proceeding downward layer by layer from the current node. A suitable traversal mode can be selected according to actual requirements. By traversing the multi-branch tree, the target data table associated with the target column name, and the target data column associated with the target column name in that table, can be obtained from the multi-branch tree until the final physical table associated with the target column name is found; this final physical table corresponds to the data source table at the bottommost layer. For example, the target column name c is derived from table t1, while column c in table t1 may in turn be derived from table t2, and so on, until the bottommost data source table associated with the target column name c is found.
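The "follow the column down until a physical table is reached" idea can be sketched in Python as follows. This is a simplification under stated assumptions: instead of walking the multi-branch tree itself, it uses a hypothetical flattened mapping `derived_from` from each (table, column) pair to the pairs it is derived from one level down, and treats a pair with no entry as belonging to a bottommost data source table. The function name `resolve_sources` is illustrative.

```python
def resolve_sources(column, derived_from, seen=None):
    """Return the bottommost (table, column) pairs that `column` derives from.

    column       -- a (table, column_name) tuple
    derived_from -- dict mapping a (table, column_name) tuple to the list of
                    tuples it is computed from one level down (assumed input)
    """
    seen = set() if seen is None else seen
    if column in seen:                 # guard against cyclic mappings
        return []
    seen.add(column)
    upstream = derived_from.get(column)
    if not upstream:                   # no further derivation: a source column
        return [column]
    result = []
    for parent in upstream:
        for source in resolve_sources(parent, derived_from, seen):
            if source not in result:
                result.append(source)
    return result
```

With the example from the text, a mapping in which column c of the result comes from t1.c and t1.c in turn comes from t2.c resolves to the single source pair `("t2", "c")`.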
Step S312, determining a data lineage analysis result of the target SQL statement according to the target data table and the target data column.
This embodiment of the data lineage analysis method mainly describes the specific process of acquiring, according to the target column name contained in the query statement, the target data table associated with the target column name and the target data column associated with the target column name in the target data table. Based on the established abstract syntax tree of the target SQL statement, the query statement related to the query operation is extracted from the target SQL statement, and the target data tables corresponding to the target column names contained in the query statement, together with the hierarchical relationship between the target data tables, are extracted from the abstract syntax tree. A multi-branch tree of the query statement is established according to the from statements in the query statement. The multi-branch tree is traversed, and the target data table associated with the target column name and the target data column associated with the target column name in the target data table are obtained from it until the data source table at the bottommost layer is reached. The data lineage analysis result of the target SQL statement is then determined. In this method, the column names involved in an SQL statement are analyzed on the basis of the abstract syntax tree corresponding to that statement to obtain the data tables and data columns associated with those column names, and from these the data lineage analysis result of the SQL statement; a relatively accurate data lineage analysis result can be obtained even when the SQL statement is complex. The method is also flexible: it is suitable for the source code of various SQL statements, can meet the needs of more complex data lineage analysis, and has a wide range of application.
The embodiment of the invention also provides another data lineage analysis method, which is implemented on the basis of the method of the above embodiment. This method mainly describes the specific process of determining the data lineage analysis result of the target SQL statement according to the target data table and the target data column. As shown in FIG. 4, the method includes the following steps:
Step S402, establishing an abstract syntax tree of the target SQL statement.
Step S404, extracting the query statement related to the query operation from the target SQL statement based on the abstract syntax tree.
Step S406, obtaining a target data table associated with the target column name and a target data column associated with the target column name in the target data table according to the target column name included in the query statement.
Step S408, data source information and database information of the target data table are obtained.
The database information generally includes information such as the database name. The data source information usually includes the information needed to establish a database connection, such as the data source name; the corresponding database connection can be found by providing the correct data source name. A data source name is a data structure that usually contains the information about the database that an Open Database Connectivity (ODBC) driver needs in order to connect to it. In actual implementation, the target SQL statement generally contains only the table name of the target data table, and specific database information such as the database name does not need to be written; the user can instead describe the database information and data source information of the database to which the target data table belongs in the task or elsewhere, and can also set default data source information and database information. After the target data table associated with the target column name is obtained, the database information and data source information of the target data table are obtained so as to complete the information about the target data table.
Step S410, if the operation type of the target SQL statement is an insert operation, an update operation, or a delete operation, acquiring the data table to be written corresponding to the target column name and the column name to be written, in the data table to be written, corresponding to the target column name.
The data table to be written can be understood as the data table into which the target column name and its data are to be inserted, the data table in which they are to be updated, or the data table from which they are to be deleted. The column name to be written can be understood as the name of the column in the data table to be written that corresponds to the target column name. Specifically, the step of acquiring the column name to be written in the data table to be written corresponding to the target column name can be implemented by the following steps nine to ten:
Step nine, acquiring, from the target SQL statement, the column name to be written in the data table to be written corresponding to the target column name.
If the target SQL statement specifies the column name to be written of the data table to be written, for example if it indicates that the target column name and its data are to be inserted into a certain column of the data table to be written, or updated in a certain column of the data table to be written, or that a certain column corresponding to the target column name in the data table to be written is to be deleted, then the column name to be written corresponding to the target column name in the data table to be written can be obtained from the target SQL statement.
Step ten, if the column name to be written is not obtained from the target SQL statement, obtaining all the column names of the data table to be written according to the metadata of the data table to be written, and determining all the column names as the column name to be written.
The metadata can be understood as structured data extracted from the data table to be written that describes the characteristics and content of the data, such as the column names. If the target SQL statement does not specify the column name to be written of the data table to be written, all column names can be obtained from the metadata of the data table to be written and determined as the column names to be written. In practical implementation, the SQL statement may name only the data table to be written without specifying the column name to be written; this is generally understood to mean that all column names of the data table to be written are to be determined as the column names to be written.
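The fallback in steps nine and ten can be sketched in Python in a few lines. The `metadata` dictionary, which maps a table name to its ordered column names, is a hypothetical stand-in for whatever metadata service is actually queried, and `columns_to_write` is an illustrative name.

```python
def columns_to_write(explicit_columns, table, metadata):
    """Return the column names to be written for `table`.

    explicit_columns -- columns named in the SQL statement, possibly empty
    metadata         -- assumed mapping: table name -> list of all column names
    """
    if explicit_columns:                 # step nine: the statement names them
        return list(explicit_columns)
    return list(metadata[table])         # step ten: fall back to all columns
```

For `insert into t2 (a, b) select ...` the explicit list `["a", "b"]` is used as is, whereas for `insert into t2 select ...` every column recorded for `t2` in the metadata is returned.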
Step S412, acquiring the data source information and database information of the data table to be written.
In actual implementation, the target SQL statement generally contains only the table name of the data table to be written, and specific database information such as the database name does not need to be written; the user can describe the database information and data source information of the database to which the data table to be written belongs in the task or elsewhere, and can also set default data source information and database information. If the data table to be written parsed from the target SQL statement carries no data source information or database information, this information can be obtained from what the user has described elsewhere or from the default settings, so as to complete the information about the data table to be written.
Step S414, determining the data lineage analysis result of the target SQL statement according to the target data table, the target data column, and the data source information and database information of the target data table. Specifically, step S414 can be implemented by the following step eleven:
Step eleven, determining the target data table, the target data column, the data source information and database information of the target data table, the data table to be written, the column name to be written, and the data source information and database information of the data table to be written as the data lineage analysis result of the target SQL statement.
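The completion with default data source and database information, and the assembly of the result in step eleven, can be sketched in Python as follows. The record layout is an assumption made for illustration: `TableRef`, `complete`, and `lineage_result`, as well as the dictionary keys of the result, are hypothetical names, and the text does not prescribe any particular result format.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TableRef:
    table: str
    datasource: Optional[str] = None   # e.g. a data source name
    database: Optional[str] = None     # e.g. a database name


def complete(ref, default_datasource, default_database):
    """Fill in missing data source / database information from user defaults."""
    return replace(
        ref,
        datasource=ref.datasource or default_datasource,
        database=ref.database or default_database,
    )


def lineage_result(sources, target, target_columns,
                   default_datasource, default_database):
    """Assemble one lineage record.

    sources -- list of (TableRef, column_name) pairs for the upstream columns
    target  -- TableRef of the data table to be written, or None for a select
    """
    return {
        "sources": [
            (complete(table, default_datasource, default_database), column)
            for table, column in sources
        ],
        "target": None if target is None
        else complete(target, default_datasource, default_database),
        "target_columns": list(target_columns),
    }
```

Information that is already present on a table reference is kept; the defaults only fill the gaps, which mirrors the description that the user may either state the information explicitly or rely on default settings.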
This embodiment of the data lineage analysis method mainly describes the specific process of determining the data lineage analysis result of the target SQL statement according to the target data table and the target data column. Based on the established abstract syntax tree of the target SQL statement, the query statement related to the query operation is extracted from the target SQL statement; the target data table associated with the target column name, and the associated target data column, are acquired according to the target column name contained in the query statement; and the data source information and database information of the target data table are acquired. The data lineage analysis result of the target SQL statement is then determined according to the target data table, the target data column, and the data source information and database information of the target data table. In this method, the column names involved in an SQL statement are analyzed on the basis of the abstract syntax tree corresponding to that statement to obtain the data tables and data columns associated with those column names, and from these the data lineage analysis result of the SQL statement; a relatively accurate data lineage analysis result can be obtained even when the SQL statement is complex. The method is also flexible: it is suitable for the source code of various SQL statements, can meet the needs of more complex data lineage analysis, and has a wide range of application.
To further explain the above embodiments, an overall flowchart of the data lineage analysis method is provided below. As shown in FIG. 5, Druid is used to construct the SQLStatement, i.e. the AST, according to the SQL statement and the data source type (equivalent to establishing the abstract syntax tree of the target SQL statement above). It is then judged whether the operation type of the SQL statement is a select query operation. If so, the correspondence between the downstream columns and the columns of the upstream physical tables is acquired (equivalent to acquiring, according to the target column name contained in the query statement, the target data table associated with the target column name and the target data column associated with the target column name in the target data table; here the downstream column corresponds to the target column name, and the upstream physical table corresponds to the target data table). The result is then completed with the user's default data source and database information (equivalent to obtaining the data source information and database information of the target data table above, and determining the data lineage analysis result of the target SQL statement according to the target data table, the target data column, and the data source information and database information of the target data table).
If the operation type of the SQL statement is not a select query operation, the statement of the query part of the SQL is acquired according to the AST (equivalent to extracting the query statement related to the query operation from the target SQL statement based on a preset query operation keyword when the operation type of the target SQL statement is an insert operation, an update operation, or a delete operation). The correspondence between the columns queried by the query part and the columns of the upstream physical tables is then acquired (equivalent to acquiring, according to the target column name contained in the query statement, the target data table associated with the target column name and the target data column associated with the target column name in the target data table). Next, it is judged whether the insert/update/delete SQL statement specifies the columns of the downstream table. If so, the result is completed with the user's default data source and database information (equivalent to acquiring, from the target SQL statement, the column name to be written in the data table to be written corresponding to the target column name). If the SQL statement does not specify the columns of the downstream table, the metadata of the table is acquired to obtain all columns in the table (equivalent to the above case in which, when the column name to be written is not obtained from the target SQL statement, all column names of the data table to be written are obtained from its metadata and determined as the column names to be written), and the result is then completed with the user's default data source and database information.
Further, the step of acquiring the correspondence between the downstream columns and the upstream physical table columns in FIG. 5 can be implemented by the flowchart shown in FIG. 6 for acquiring the correspondence between columns and physical tables. After the AST of the SQL statement is constructed, the correspondence between the downstream columns and the current hierarchical table is acquired (equivalent to extracting from the abstract syntax tree, as above, the target data tables corresponding to the target column names contained in the query statement and the hierarchical relationship between the target data tables); the current hierarchical table can be understood as the data table in which the downstream column is located. The tables corresponding to the from part of the AST are parsed, and a multi-branch tree is constructed according to the from statements (equivalent to establishing the multi-branch tree of the query statement according to the from statements in the query statement). Using the results of these two steps, the correspondence between the downstream columns and the current hierarchical table is traversed layer by layer until the correspondence between the downstream columns and the final physical table is found (equivalent to traversing the multi-branch tree according to the target data tables and the hierarchical relationship between them, and obtaining from the multi-branch tree the target data table associated with the target column name and the target data column associated with the target column name in the target data table, until the data source table at the bottommost layer is reached).
Further, the step of acquiring the correspondence between the downstream columns and the current hierarchical table in FIG. 6 can be implemented by the flowchart shown in FIG. 7. After the AST of the SQL statement is constructed, it is judged whether the query statement is a union statement. If not, the items of the query part are acquired (equivalent to extracting the item entries and item entry types contained in the query statement above). If the query statement is a union statement, the SQL of the left and right parts of the union is obtained and each part is processed to acquire the items of its query part (equivalent to extracting item entries separately from the statements on the two sides of the union statement when the query statement includes a union statement). All columns are then acquired according to the different types of the items (equivalent to extracting, for each item entry, the data columns involved in the item entry in a manner that matches its type). Finally, the related hierarchical tables are determined according to the acquired columns, and the correspondence with the current hierarchical tables is thereby determined.
Further, the step of constructing the multi-branch tree in FIG. 6 can be implemented by the flowchart for constructing a multi-branch tree shown in FIG. 8. After an initial node of the multi-branch tree is established and determined as the current node, it is judged whether the query statement is a union statement. If not, the from part is acquired (equivalent to the above step of acquiring the from statement in the query statement from the abstract syntax tree according to the hierarchical structure of the query statement). If the query statement contains a union statement, the SQL of the left and right parts of the union statement is obtained and each part is processed to acquire its from part (equivalent to taking the statements on the two sides of the union statement as updated query statements and executing the step of acquiring the from statement in the query statement from the abstract syntax tree). It is then judged whether the acquired from part is in the most basic form, from t. If so, the from part is inserted as a child node of the current node (equivalent to extracting the data table and recording it into the child node when the from statement is connected with a data table). If it is not in the most basic form, the from part is processed recursively according to its type, and the step of acquiring the from part continues to be executed (equivalent to the above processing in which, if the from statement is connected with a join statement, the statements on the two sides of the join statement are each processed further, and if the from statement is connected with a subquery statement, the child node is determined as the updated current node, the subquery statement is determined as the updated query statement, and the step of acquiring the from statement in the query statement from the abstract syntax tree continues to be executed until the lowest level of the abstract syntax tree is reached). In this way the from parts are assembled into a multi-branch tree according to their hierarchical structure.
With the above data lineage analysis method, a graph of the flow of data between database tables and columns can be constructed, which ensures the traceability of data. The method can support lineage analysis for data sources such as MySQL, Oracle, MPP, Hive, and Spark. In addition, the method can analyze the lineage of the data returned by SQL statements during execution, and can also analyze the lineage of a task. For example, a task may be created during data development to run over some data; such a task usually contains SQL statements, and the lineage of the task can be analyzed on the basis of those statements. It should be noted that the above data lineage analysis method operates on a single SQL statement; if there are multiple SQL statements, the method needs to be executed for each of them. Specifically, data lineage analysis mainly serves the following purposes:
1. Data tracing
Data lineage shows where data comes from and where it goes, and can be used to trace the source of data and to follow the data processing flow. When data is abnormal, the cause of the abnormality needs to be tracked down so that the risk is kept at an appropriate level. As an enterprise grows, its data sources multiply and their quality becomes uneven, and these factors affect data results to some extent, so the ability to trace data is of high value.
2. Impact analysis
As data is used more and more widely, the chain along which it flows becomes longer and longer. When a core service at the source changes, all downstream analysis applications must stay in step with it; if impact analysis is not performed, data services may fail to access the data they depend on. Because the data is traceable, the development department can conveniently evaluate the impact of a change.
3. Life cycle
The whole life cycle of the data can be intuitively obtained through the data blood margin; for those data that are not of great value, layering, archiving, and even destruction may be considered.
4. Security management and control
Safety compliance departments typically need information about how the data is used, what audiences have, and the like, and this information is available from the data bloodline. The data consanguinity also provides basis for authority management (table-level and field-level authorization), and further guarantees data security from a higher level. And global security control can be performed by matching with the security identification in the metadata (for example, which data need desensitization and the like).
5. Data assets
For company management, the data blood margin gives a view of how data circulates as a whole, which helps in formulating the company's data asset strategy.
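As an illustration of the data tracing and impact analysis functions listed above, the following is a minimal sketch that stores column-level blood margin relationships as directed edges and walks them in either direction. The edge list, the column names and the helper names `build_index` and `reachable` are hypothetical and are not taken from the embodiments.

```python
from collections import defaultdict, deque

# Hypothetical column-level lineage edges: (source column, derived column).
EDGES = [
    ("ods.orders.amount", "dw.order_fact.amount"),
    ("ods.orders.user_id", "dw.order_fact.user_id"),
    ("dw.order_fact.amount", "ads.daily_sales.total_amount"),
    ("dw.order_fact.user_id", "ads.daily_sales.buyer_count"),
]


def build_index(edges):
    """Index the edges in both directions for downstream and upstream lookups."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(start, adjacency):
    """Breadth-first search returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


downstream, upstream = build_index(EDGES)
# Impact analysis: everything affected if ods.orders.amount changes.
impacted = reachable("ods.orders.amount", downstream)
# Data tracing: every column that feeds ads.daily_sales.buyer_count.
sources = reachable("ads.daily_sales.buyer_count", upstream)
```

Keeping both a downstream and an upstream index lets the same traversal serve impact analysis and data tracing without reversing the graph at query time.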
Corresponding to the above method embodiment, an embodiment of the present invention provides an apparatus for analyzing a data blood margin, whose schematic structural diagram is shown in fig. 9. The apparatus includes: a building module 90, configured to build an abstract syntax tree of the target SQL statement; an extracting module 91, configured to extract a query statement related to a query operation in the target SQL statement based on the abstract syntax tree; an obtaining module 92, configured to obtain, according to the target column name included in the query statement, a target data table associated with the target column name and a target data column associated with the target column name in the target data table; and a determining module 93, configured to determine a data blood margin analysis result of the target SQL statement according to the target data table and the target data column.
The data blood margin analysis device provided by the embodiment of the invention extracts, based on the established abstract syntax tree of the target SQL statement, the query statement related to the query operation in the target SQL statement; acquires, according to the target column name contained in the query statement, the target data table and the target data column associated with that column name; and then determines the data blood margin analysis result of the target SQL statement. In this device, the column names involved in an SQL statement are analyzed based on the abstract syntax tree corresponding to the statement to obtain the data tables and data columns associated with those column names, and from them the data blood margin analysis result of the SQL statement. Even when the SQL statement is complex, a more accurate data blood margin analysis result can be obtained in this way. The approach is also flexible, applies to SQL statement source code of various kinds, can meet the needs of more complex data blood margin analysis, and has a wide range of application.
Further, the extracting module 91 is further configured to: if the operation type of the target SQL statement is the query operation, determining the target SQL statement as the query statement related to the query operation; and if the operation type of the target SQL statement is an inserting operation, an updating operation or a deleting operation, extracting the query statement related to the query operation from the target SQL statement based on a preset query operation keyword.
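The branching performed by the extracting module can be sketched as follows. This is only an illustrative approximation: it assumes the preset query operation keyword is `select`, it works on the raw SQL text rather than on the abstract syntax tree, and it ignores parentheses inside string literals. The function names are hypothetical.

```python
import re

# Assumed preset query operation keyword; \b avoids matching names like select_flag.
QUERY_KEYWORD = re.compile(r"\bselect\b", re.IGNORECASE)


def operation_type(sql):
    """Return the leading keyword of the statement in lower case."""
    parts = sql.split(None, 1)
    return parts[0].lower() if parts else ""


def extract_query_statement(sql):
    """Return the query part of an SQL statement, or None if it has none."""
    op = operation_type(sql)
    if op == "select":
        return sql.strip()  # the statement itself is the query statement
    if op not in {"insert", "update", "delete"}:
        return None
    match = QUERY_KEYWORD.search(sql)
    if match is None:
        return None
    # If the query sits inside parentheses, stop at the parenthesis that closes it.
    depth, end = 0, len(sql)
    for i in range(match.start(), len(sql)):
        ch = sql[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                end = i
                break
            depth -= 1
    return sql[match.start():end].strip().rstrip(";")
```

For example, `extract_query_statement("INSERT INTO dw.t (a, b) SELECT x, y FROM ods.s WHERE y > 1;")` yields the embedded `SELECT` text, while an `UPDATE` with no embedded query yields `None`.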
Further, the obtaining module 92 is further configured to: extracting a target data table corresponding to a target column name contained in the query statement and a hierarchical relation between the target data tables from the abstract syntax tree; establishing a multi-branch tree of the query statement according to a from statement in the query statement; each from statement corresponds to a node in the multi-branch tree; and traversing the multi-branch tree according to the target data table and the hierarchical relation between the target data tables, and obtaining the target data table associated with the target column name and the target data column associated with the target column name in the target data table from the multi-branch tree until reaching the data source table at the bottommost layer.
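The traversal described here can be sketched on a hand-built tree. The `FromNode` structure below is an assumption made for illustration: each node records, for the query level owning one from statement, how each output column maps to a source alias, and whether that alias is a physical table or a nested subquery node.

```python
from dataclasses import dataclass


@dataclass
class FromNode:
    """One node per from statement (hypothetical structure)."""
    # Output column -> (source alias, source column) at this query level.
    projection: dict
    # Source alias -> physical table name (str) or a child FromNode (subquery).
    sources: dict


def resolve(node, column):
    """Follow one output column down the tree to its bottom-level source table."""
    alias, source_column = node.projection[column]
    source = node.sources[alias]
    if isinstance(source, FromNode):
        return resolve(source, source_column)  # descend one level
    return source, source_column  # reached a data source table


# SELECT t.total FROM (SELECT o.amount AS total FROM ods.orders o) t
inner = FromNode(projection={"total": ("o", "amount")}, sources={"o": "ods.orders"})
outer = FromNode(projection={"total": ("t", "total")}, sources={"t": inner})
origin = resolve(outer, "total")
```

Here `origin` is the pair of bottom-level table and column that the outer column `total` derives from, namely `ods.orders` and `amount`.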
Further, the obtaining module 92 is further configured to: extracting item entries and item entry types contained in the query statement from the abstract syntax tree; for each item entry, extracting a data column related to the item entry from the item entry in a mode of matching with the type of the item entry; determining the data table to which the data column related to each item entry belongs as a target data table corresponding to a target column name contained in the query statement; and determining the hierarchical relationship between the data tables to which the data columns related to the item entries belong as the hierarchical relationship between the target data tables corresponding to the target column names.
Further, the obtaining module 92 is further configured to: if the query statement comprises a union statement, extract item entries from the statements on both sides of the union statement respectively.
Further, the obtaining module 92 is further configured to: establishing an initial node of the multi-branch tree, and determining the initial node as a current node; acquiring a from statement in the query statement from the abstract syntax tree according to the hierarchical structure of the query statement; if the from statement is obtained, determining the from statement as a child node of the current node; and performing subsequent processing on the child node according to the type of the statement connected with the from statement.
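The type-dependent extraction of data columns from item entries, including the union case, can be illustrated with a toy syntax tree made of dictionaries. The entry types `column`, `function`, `binary` and `literal` and the dictionary layout are assumptions chosen for this sketch; a real SQL parser would expose its own node classes.

```python
def columns_of(item):
    """Extract (table alias, column name) pairs from one item entry by its type."""
    kind = item["type"]
    if kind == "column":
        return [(item.get("table"), item["name"])]
    if kind == "function":
        # A function call contributes the columns of all of its arguments.
        return [col for arg in item["args"] for col in columns_of(arg)]
    if kind == "binary":
        return columns_of(item["left"]) + columns_of(item["right"])
    return []  # literals and other entry types reference no column


def select_items(query):
    """Collect item entries; for a union, take them from both sides."""
    if query["type"] == "union":
        return select_items(query["left"]) + select_items(query["right"])
    return query["items"]


# SELECT a.id, SUM(a.price * a.qty) FROM a UNION SELECT b.id, 0 FROM b
price = {"type": "column", "table": "a", "name": "price"}
qty = {"type": "column", "table": "a", "name": "qty"}
query = {
    "type": "union",
    "left": {"type": "select", "items": [
        {"type": "column", "table": "a", "name": "id"},
        {"type": "function", "name": "sum",
         "args": [{"type": "binary", "op": "*", "left": price, "right": qty}]},
    ]},
    "right": {"type": "select", "items": [
        {"type": "column", "table": "b", "name": "id"},
        {"type": "literal", "value": 0},
    ]},
}
columns_per_item = [columns_of(item) for item in select_items(query)]
```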
Further, the obtaining module 92 is further configured to: if the query statement contains the union statement, the statements on two sides of the union statement are respectively used as updated query statements, and the step of obtaining the from statement in the query statement from the abstract syntax tree is executed.
Further, the obtaining module 92 is further configured to: if the from statement is connected with a data table, extracting the data table, and recording the data table into the child node; if the from statement is connected with a join statement, respectively carrying out subsequent processing on the statements on the two sides of the join statement according to the types of the statements on the two sides of the join statement; and if the from statement is connected with a subquery statement, determining the child node as the updated current node, determining the subquery statement as the updated query statement, and continuing to execute the step of acquiring the from statement in the query statement from the abstract syntax tree until the lowest level of the abstract syntax tree is reached.
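The construction of the multi-branch tree from from statements, with the table, join, subquery and union cases, might be sketched as below. The dictionary-based syntax tree and the names `TreeNode`, `attach_query` and `attach_source` are hypothetical; the sketch only mirrors the control flow described above.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    label: str
    tables: list = field(default_factory=list)    # data tables recorded in this node
    children: list = field(default_factory=list)  # nodes of nested from statements


def attach_query(query, current):
    """Create one child node per from statement found under the current node."""
    if query["type"] == "union":
        # Each side of a union is treated as an updated query statement.
        attach_query(query["left"], current)
        attach_query(query["right"], current)
        return
    child = TreeNode(label="from")
    current.children.append(child)
    attach_source(query["from"], child)


def attach_source(source, node):
    """Process whatever the from statement is connected with."""
    kind = source["type"]
    if kind == "table":
        node.tables.append(source["name"])
    elif kind == "join":
        attach_source(source["left"], node)
        attach_source(source["right"], node)
    elif kind == "subquery":
        attach_query(source["query"], node)  # node becomes the current node
    else:
        raise ValueError(f"unsupported source type: {kind}")


def build_tree(query):
    root = TreeNode(label="root")  # initial node
    attach_query(query, root)
    return root


# SELECT ... FROM ods.users JOIN
#   (SELECT ... FROM ods.orders UNION SELECT ... FROM ods.refunds) t ON ...
query = {"type": "select", "from": {
    "type": "join",
    "left": {"type": "table", "name": "ods.users"},
    "right": {"type": "subquery", "query": {
        "type": "union",
        "left": {"type": "select", "from": {"type": "table", "name": "ods.orders"}},
        "right": {"type": "select", "from": {"type": "table", "name": "ods.refunds"}},
    }},
}}
tree = build_tree(query)
```

In this example the root has a single child holding `ods.users`, and that child has two children, one for each side of the union inside the subquery.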
Further, the determining module 93 is further configured to: acquiring data source information and database information of a target data table; and determining a data blood margin analysis result of the target SQL statement according to the target data table, the target data column, and the data source information and the database information of the target data table.
Further, the determining module 93 is further configured to: if the operation type of the target SQL statement is an inserting operation, an updating operation or a deleting operation, acquiring the data table to be written corresponding to the target column name and the column name to be written in that data table; acquiring data source information and database information of the data table to be written; and determining the target data table, the target data column, the data source information and the database information of the target data table, the data table to be written, the column name to be written, and the data source information and the database information of the data table to be written as the data blood margin analysis result of the target SQL statement.
Further, the determining module 93 is further configured to: acquiring a to-be-written column name in a to-be-written data table corresponding to the target column name from the target SQL statement; and if the column name to be written cannot be obtained from the target SQL statement, obtaining all the column names of the data table to be written according to the metadata of the data table to be written, and determining all the column names as the column name to be written.
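How the determining module might assemble its result, including the fallback to metadata when the statement names no columns to write, is sketched below. The in-memory `METADATA` dictionary stands in for a real metadata service, and every table name, data source name and field name in it is invented for illustration.

```python
# Hypothetical metadata store: table -> data source, database and column list.
METADATA = {
    "dw.order_fact": {"data_source": "hive_prod", "database": "dw",
                      "columns": ["order_id", "amount"]},
    "ods.orders": {"data_source": "mysql_orders", "database": "ods",
                   "columns": ["id", "amount", "user_id"]},
}


def describe(table):
    """Attach data source information and database information to a table."""
    meta = METADATA[table]
    return {"table": table, "data_source": meta["data_source"],
            "database": meta["database"]}


def lineage_result(sources, write_table=None, write_columns=None):
    """Assemble the analysis result from (target table, target column) pairs."""
    result = {"sources": [dict(describe(t), column=c) for t, c in sources]}
    if write_table is not None:
        if not write_columns:
            # The statement listed no columns: fall back to all columns in metadata.
            write_columns = list(METADATA[write_table]["columns"])
        result["target"] = dict(describe(write_table), columns=write_columns)
    return result


# INSERT INTO dw.order_fact SELECT id, amount FROM ods.orders
result = lineage_result(
    sources=[("ods.orders", "id"), ("ods.orders", "amount")],
    write_table="dw.order_fact",
)
```

For a pure query statement, `write_table` is left as `None` and the result contains only the source side.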
The implementation principle and technical effects of the data blood margin analysis device provided by the embodiment of the invention are the same as those of the method embodiment; for brevity, where the device embodiment does not mention something, reference may be made to the corresponding content in the method embodiment.
The embodiment of the invention also provides a server for running the data blood margin analysis method. Referring to fig. 10, the server comprises a processor 101 and a memory 100; the memory 100 stores machine-executable instructions that can be executed by the processor 101, and the processor 101 executes the machine-executable instructions to implement the data blood margin analysis method of the above embodiment.
Further, the server shown in fig. 10 further includes a bus 102 and a communication interface 103, and the processor 101, the communication interface 103, and the memory 100 are connected through the bus 102.
The memory 100 may include a high-speed random access memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. A communication connection between this system network element and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), and may use the internet, a wide area network, a local area network, a metropolitan area network, and the like. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one double-headed arrow is shown in fig. 10, but this does not mean that there is only one bus or one type of bus.
The processor 101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Such a processor can implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 100; the processor 101 reads the information in the memory 100 and completes the steps of the method of the foregoing embodiment in combination with its hardware.
An embodiment of the present invention further provides a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions, and when the machine-executable instructions are called and executed by a processor, the machine-executable instructions cause the processor to implement the above method for analyzing the data blood margin.
The computer program product of the data blood margin analysis method, device and server provided in the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the method described in the foregoing method embodiments. For the specific implementation, refer to the method embodiments, which are not repeated here.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
Finally, it should be noted that the above embodiments are merely specific implementations of the present invention, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that anyone skilled in the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as falling within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (14)

1. A method of analyzing a data blood margin, the method comprising:
establishing an abstract syntax tree of a target SQL statement;
extracting a query statement related to a query operation in the target SQL statement based on the abstract syntax tree;
acquiring, according to a target column name contained in the query statement, a target data table associated with the target column name and a target data column associated with the target column name in the target data table;
and determining a data blood margin analysis result of the target SQL statement according to the target data table and the target data column.
2. The method according to claim 1, wherein the step of extracting the query statement related to the query operation in the target SQL statement comprises:
if the operation type of the target SQL statement is a query operation, determining the target SQL statement as the query statement related to the query operation;
and if the operation type of the target SQL statement is an insertion operation, an update operation or a deletion operation, extracting a query statement related to the query operation from the target SQL statement based on a preset query operation keyword.
3. The method according to claim 1, wherein the step of obtaining a target data table associated with a target column name according to the target column name included in the query statement, and a target data column associated with the target column name in the target data table comprises:
extracting a target data table corresponding to a target column name contained in the query statement and a hierarchical relation between the target data tables from the abstract syntax tree;
establishing a multi-branch tree of the query statement according to a from statement in the query statement; wherein each from statement corresponds to a node in the multi-branch tree;
traversing the multi-branch tree according to the target data table and the hierarchical relation between the target data tables, and obtaining the target data table associated with the target column name and the target data column associated with the target column name in the target data table from the multi-branch tree until reaching the data source table at the bottommost layer.
4. The method according to claim 3, wherein the step of extracting, from the abstract syntax tree, the target data table corresponding to the target column name included in the query statement and the hierarchical relationship between the target data tables comprises:
extracting item entries and types of the item entries contained in the query statement from the abstract syntax tree;
for each item entry, extracting a data column related to the item entry from the item entry in a manner of matching with the type of the item entry;
determining the data table to which the data column related to each item entry belongs as a target data table corresponding to a target column name contained in the query statement;
and determining the hierarchical relationship between the data tables of the data columns related to the item entries as the hierarchical relationship between the target data tables corresponding to the target column names.
5. The method of claim 4, wherein the step of extracting item entries contained in the query statement comprises: if the query statement comprises a union statement, respectively extracting item entries from the statements on both sides of the union statement.
6. The method of claim 3, wherein the step of establishing a multi-branch tree of the query statement according to a from statement in the query statement comprises:
establishing an initial node of a multi-branch tree, and determining the initial node as a current node;
according to the hierarchical structure of the query statement, obtaining a from statement in the query statement from the abstract syntax tree;
if a from statement is obtained, determining the from statement as a child node of the current node; and performing subsequent processing on the child node according to the type of the statement connected with the from statement.
7. The method of claim 6, wherein the step of obtaining a from statement in the query statement from the abstract syntax tree comprises:
if the query statement contains a union statement, respectively taking the statements on both sides of the union statement as updated query statements, and executing the step of obtaining a from statement in the query statement from the abstract syntax tree.
8. The method of claim 6, wherein the step of performing subsequent processing on the child node according to the type of the statement connected with the from statement comprises:
if the from statement is connected with a data table, extracting the data table, and recording the data table into the child node;
if the from statement is connected with a join statement, respectively carrying out subsequent processing on the statements on two sides of the join statement according to the types of the statements on two sides of the join statement;
and if the from statement is connected with a subquery statement, determining the child node as an updated current node, determining the subquery statement as an updated query statement, and continuing to execute the step of obtaining a from statement in the query statement from the abstract syntax tree until the lowest level of the abstract syntax tree is reached.
9. The method according to claim 1, wherein the step of determining the data blood margin analysis result of the target SQL statement according to the target data table and the target data column comprises:
acquiring data source information and database information of the target data table;
and determining a data blood margin analysis result of the target SQL statement according to the target data table, the target data column, and the data source information and the database information of the target data table.
10. The method of claim 9, wherein prior to the step of determining the data blood margin analysis result of the target SQL statement based on the target data table, the target data column, and data source information and database information of the target data table, the method further comprises:
if the operation type of the target SQL statement is an insertion operation, an update operation or a deletion operation, acquiring a data table to be written corresponding to the target column name and a column name to be written in the data table to be written corresponding to the target column name;
acquiring data source information and database information of the data table to be written;
the step of determining the data blood margin analysis result of the target SQL statement according to the target data table, the target data column, and the data source information and the database information of the target data table comprises the following steps:
and determining the target data table, the target data column, the data source information and the database information of the target data table, the data table to be written, the column name to be written and the data source information and the database information of the data table to be written as the data blood margin analysis result of the target SQL statement.
11. The method according to claim 10, wherein the step of obtaining the to-be-written column name in the to-be-written data table corresponding to the target column name comprises:
acquiring a to-be-written column name in the to-be-written data table corresponding to the target column name from the target SQL statement;
and if the column name to be written cannot be obtained from the target SQL statement, obtaining all column names of the data table to be written according to the metadata of the data table to be written, and determining all the column names as the column name to be written.
12. An apparatus for analyzing data blood margins, the apparatus comprising:
the establishing module is used for establishing an abstract syntax tree of the target SQL statement;
the extraction module is used for extracting a query statement related to a query operation in the target SQL statement based on the abstract syntax tree;
the acquisition module is used for acquiring, according to a target column name contained in the query statement, a target data table associated with the target column name and a target data column associated with the target column name in the target data table;
and the determining module is used for determining a data blood margin analysis result of the target SQL statement according to the target data table and the target data column.
13. A server comprising a processor and a memory, the memory storing machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the method of analyzing a data blood margin of any one of claims 1-11.
14. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of analyzing a data blood margin of any one of claims 1 to 11.
CN202011249929.XA 2020-11-10 2020-11-10 Data blood edge analysis method, device and server Active CN112347123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011249929.XA CN112347123B (en) 2020-11-10 2020-11-10 Data blood edge analysis method, device and server

Publications (2)

Publication Number Publication Date
CN112347123A true CN112347123A (en) 2021-02-09
CN112347123B CN112347123B (en) 2024-10-29

Family

ID=74362516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011249929.XA Active CN112347123B (en) 2020-11-10 2020-11-10 Data blood edge analysis method, device and server

Country Status (1)

Country Link
CN (1) CN112347123B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160063063A1 (en) * 2014-09-02 2016-03-03 Salesforce.Com, Inc. Database query system
CN107644073A (en) * 2017-09-18 2018-01-30 广东中标数据科技股份有限公司 A kind of field consanguinity analysis method, system and device based on depth-first traversal
CN109582691A (en) * 2018-11-15 2019-04-05 百度在线网络技术(北京)有限公司 Method and apparatus for controlling data query
US20190228008A1 (en) * 2016-09-28 2019-07-25 Ping An Technology (Shenzhen) Co., Ltd. Method, device, server and storage apparatus of reviewing sql
CN110908997A (en) * 2019-10-09 2020-03-24 支付宝(杭州)信息技术有限公司 Data blood margin construction method and device, server and readable storage medium
CN111078729A (en) * 2019-12-19 2020-04-28 医渡云(北京)技术有限公司 Medical data tracing method, device, system, storage medium and electronic equipment
CN111538743A (en) * 2020-04-22 2020-08-14 电子科技大学 SQL-based data blood relationship analysis method and system
CN111782265A (en) * 2020-06-28 2020-10-16 中国工商银行股份有限公司 Software resource system based on field level blood relationship and establishment method thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MIAOMIAO HONG et al.: "STNS-CSG: syntax tree networks with self-attention for complex SQL generation", 《 2019 IEEE FOURTH INTERNATIONAL CONFERENCE ON DATA SCIENCE IN CYBERSPACE (DSC). PROCEEDINGS》, 5 December 2019 (2019-12-05), pages 388 - 395 *
崔娜: "SQL statement parsing and translation for database performance", 现代电子技术, no. 11, 1 June 2016 (2016-06-01), pages 107 - 110 *
李瑞: "Research and implementation of database query optimization technology", 电子科学技术, no. 01, 10 January 2017 (2017-01-10), pages 81 - 84 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326261A (en) * 2021-04-29 2021-08-31 上海淇馥信息技术有限公司 Data blood relationship extraction method and device and electronic equipment
CN113326261B (en) * 2021-04-29 2024-03-08 奇富数科(上海)科技有限公司 Data blood relationship extraction method and device and electronic equipment
CN113434533A (en) * 2021-07-22 2021-09-24 支付宝(杭州)信息技术有限公司 Data tracing tool construction method, data processing method, device and equipment
CN114185958A (en) * 2021-11-18 2022-03-15 招联消费金融有限公司 Blood relationship generation method and device, computer equipment and storage medium
CN114185958B (en) * 2021-11-18 2024-04-02 招联消费金融股份有限公司 Blood relationship generation method, device, computer equipment and storage medium
CN113918577A (en) * 2021-12-15 2022-01-11 北京新唐思创教育科技有限公司 Data table identification method and device, electronic equipment and storage medium
CN113918577B (en) * 2021-12-15 2022-03-11 北京新唐思创教育科技有限公司 Data table identification method and device, electronic equipment and storage medium
CN115292353A (en) * 2022-10-09 2022-11-04 腾讯科技(深圳)有限公司 Data query method and device, computer equipment and storage medium
CN115994152A (en) * 2023-03-24 2023-04-21 云账户技术(天津)有限公司 Verification method, device, equipment and storage medium of MySQL query statement
CN116089476A (en) * 2023-04-07 2023-05-09 北京宝兰德软件股份有限公司 Data query method and device and electronic equipment
CN117931898B (en) * 2024-03-25 2024-06-07 成都同步新创科技股份有限公司 Multidimensional database statistical analysis method based on large model

Also Published As

Publication number Publication date
CN112347123B (en) 2024-10-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant