WO2021179722A1 - Sql statement parsing method and system, and computer device and storage medium - Google Patents

Sql statement parsing method and system, and computer device and storage medium Download PDF

Info

Publication number
WO2021179722A1
WO2021179722A1 PCT/CN2020/135735 CN2020135735W WO2021179722A1 WO 2021179722 A1 WO2021179722 A1 WO 2021179722A1 CN 2020135735 W CN2020135735 W CN 2020135735W WO 2021179722 A1 WO2021179722 A1 WO 2021179722A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
sql
field
directed acyclic
acyclic graph
Prior art date
Application number
PCT/CN2020/135735
Other languages
French (fr)
Chinese (zh)
Inventor
陈玉
张茜
凌海挺
杜均
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021179722A1 publication Critical patent/WO2021179722A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • G06F8/425Lexical analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • G06F8/427Parsing

Definitions

  • This application relates to the field of data processing technology, in particular to SQL statement parsing methods, systems, computer equipment and storage media.
  • data needs to be managed, maintained and used through data governance.
  • data blood relationship is analyzed through the SQL statements used in the data processing process to analyze the source and destination of the data. Govern other tasks to provide an important data foundation.
  • the SQL statements used in data blood relationship analysis can be divided into original SQL statements and SQL statements converted by the database system.
  • This method uses the original SQL statement for analysis.
  • the original SQL basically has no dependence on the database software, which reduces the complexity of the data blood relationship analysis system.
  • the inventor realizes that the data dictionary can provide auxiliary data for implicit references, making the analysis results of blood data more comprehensive and detailed. However, it may cause unnecessary data information leakage.
  • the data dictionary enlarges the range of metadata that can be accessed during the SQL parsing process, which poses a certain security risk.
  • this method does not use the data dictionary, try to fully resolve the implicit references between the fields appearing in the SQL statement, and generate the data blood relationship.
  • this application provides a SQL statement parsing method, system, computer equipment, and storage medium, by which SQL statement parsing method can avoid interaction with database software, without having to access additional metadata, and try to protect user metadata Security.
  • the present application provides a method for parsing SQL statements based on the blood relationship of data, and the method for parsing SQL statements includes:
  • Step 1 Obtain a directed acyclic graph, which is obtained by combing SQL statements;
  • Step 2 Traverse the layers of the directed acyclic graph with SQL sub-query to obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer;
  • Step 3 Obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer in the current layer;
  • Step 4 Continue to obtain the remaining layers where the substitute symbols appear, and perform the substitution of the substitute symbols, until all the substitute symbols in the directed acyclic graph are resolved.
  • the present application also provides a SQL statement analysis system based on the blood relationship of the data, and the SQL statement analysis system includes:
  • the data set module is used to obtain a directed acyclic graph, which is obtained by combing SQL statements
  • the traversal module is used to traverse the layers of the directed acyclic graph with SQL sub-queries, and obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer;
  • the substitution module is used to obtain the field corresponding to the substitute symbol in the current layer according to the field of the previous layer of the current layer; The substitution of substitute characters until all substitute characters in the directed acyclic graph are resolved.
  • the present application also provides a computer device, including a memory and a processor.
  • the memory stores computer-readable instructions.
  • the processor executes the following steps:
  • Step 1 Obtain a directed acyclic graph, which is obtained by combing SQL statements;
  • Step 2 Traverse the layers of the directed acyclic graph with SQL sub-query to obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer;
  • Step 3 Obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer in the current layer;
  • Step 4 Continue to obtain the remaining layers where the substitute symbols appear, and perform the substitution of the substitute symbols, until all the substitute symbols in the directed acyclic graph are resolved.
  • this application also provides a storage medium storing a program file capable of implementing the following steps, the steps including:
  • Step 1 Obtain a directed acyclic graph, which is obtained by combing SQL statements;
  • Step 2 Traverse the layers of the directed acyclic graph with SQL sub-query to obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer;
  • Step 3 Obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer in the current layer;
  • Step 4 Continue to obtain the remaining layers where the substitute symbols appear, and perform the substitution of the substitute symbols, until all the substitute symbols in the directed acyclic graph are resolved.
  • the above application provides a SQL statement analysis method, system, computer device and storage medium, wherein the SQL statement analysis method obtains a directed acyclic graph, and the directed acyclic graph is obtained by combing SQL statements; To an acyclic graph layer with SQL sub-query, get the layer where the substitution symbol appears, and take the layer where the substitution symbol appears first during the traversal as the current layer; according to the field of the previous layer of the current layer, get the current layer The field corresponding to the substitution symbol; continue to obtain the remaining layers where the substitution symbol appears, and replace it until all the substitution symbols in the directed acyclic graph are resolved.
  • the SQL statement analysis method used in the SQL-based data blood relationship analysis software tool or system does not need to interact with the database software and does not need to access additional metadata to achieve data blood relationship analysis; in addition, in data security
  • this SQL statement parsing method can achieve high cohesion and low coupling data blood relationship analysis function, which can reduce external dependence, and does not need to obtain other metadata that does not appear in the data blood relationship, so it can be guaranteed as much as possible
  • the security of user metadata will not expose metadata information that does not appear in the relevant SQL due to the analysis of the blood relationship of the data.
  • this application also involves blockchain technology.
  • Figure 1 is an implementation environment diagram of a SQL statement parsing method provided in an embodiment
  • Figure 2 is a block diagram of the internal structure of a computer device in an embodiment
  • Figure 3 is a flowchart of a method for parsing SQL statements in an embodiment
  • Figure 4 is a flow chart for generating a directed acyclic graph in an embodiment
  • Figure 5 is a flowchart of SQL code cleaning in an embodiment
  • Fig. 6 is a schematic diagram of a SQL nested query structure in an embodiment
  • Fig. 7 is a schematic diagram of a directed acyclic graph generated according to Fig. 6;
  • Figure 8 is a schematic diagram of a directed acyclic graph in an embodiment
  • Example 9 is a schematic diagram of the directed acyclic graph of Example 1 in an embodiment
  • Figure 10 is a schematic diagram of the directed acyclic graph of Example 2 in an embodiment
  • Example 11 is a schematic diagram of the directed acyclic graph of Example 3 in an embodiment
  • Figure 12 is a schematic diagram of a SQL statement parsing system in an embodiment
  • Figure 13 is a schematic diagram of a data set module in an embodiment
  • Figure 14 is a schematic diagram of a cleaning module in an embodiment
  • 15 is a schematic diagram of the structure of a computer device in an embodiment
  • FIG. 16 is a schematic diagram of the structure of a storage medium in an embodiment.
  • FIG. 1 is an implementation environment diagram of an SQL statement parsing method based on data blood relationship provided in an embodiment. As shown in FIG. 1, the implementation environment includes a computer device 110 and a display device 120.
  • the computer device 110 may be a computer device used by the user, such as a computer, and the computer device 110 is equipped with a SQL statement parsing system based on the blood relationship of the data.
  • the user can perform analysis on the computer device 110 according to the SQL statement analysis method based on the blood relationship of the data, and display the analysis result on the display device 120.
  • the combination of the computer device 110 and the display device 120 can be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc., but is not limited to this.
  • Figure 2 is a schematic diagram of the internal structure of a computer device in an embodiment.
  • the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus.
  • the non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions.
  • the database may store control information sequences.
  • the processor can realize a A SQL statement parsing method based on the blood relationship of the data.
  • the processor of the computer equipment is used to provide calculation and control capabilities, and supports the operation of the entire computer equipment.
  • a computer readable instruction may be stored in the memory of the computer device, and when the computer readable instruction is executed by the processor, the processor may execute a SQL statement parsing method based on the blood relationship of the data.
  • the network interface of the computer device is used to connect and communicate with the terminal.
  • a method for parsing SQL statements based on data blood relationship is proposed.
  • This application can also be applied to data warehouse scenarios, thereby promoting the purpose of building big data.
  • the SQL statement parsing method can be applied to the aforementioned computer device 110 and display device 120, and specifically can include the following steps:
  • Step 31 Obtain a directed acyclic graph, which is obtained by combing SQL statements.
  • a complex SQL may be composed of multiple sub-SQL (query statements).
  • the most basic and atomic SQL may be composed of basic elements such as SQL keywords, fields, tables, functions, etc., in order to record the basics of SQL as completely as possible. Elements and the relationship between them, resulting in a field-level mapping relationship.
  • a directed acyclic graph can be obtained, specifically, including:
  • the S311 includes:
  • the script file may be a script such as perl.
  • S312 Perform lexical analysis on the regularized SQL statement to produce an abstract syntax tree, and generate a directed acyclic graph according to the abstract syntax tree.
  • FIG. 6 illustrate how to form a directed acyclic graph through SQL statements, that is, the SQL nested query structure can be converted into the following tree structure.
  • FIG. 6 is assumed for the SQL query S, S is the field that contains the C ij (i, j ⁇ N), comprising a source table T i (i ⁇ N); where S represents a C i0 query
  • the i-th field of the result, C ij , (i ⁇ N,j ⁇ N*) represents the j-th field referenced by the i-th field of the S query result;
  • T i represents the i-th source table that appears in S;
  • S i with different subscripts is used to indicate.
  • the SQL query in the above figure is S 0
  • the source tables of S 0 are S 1 and S 2
  • S 1 comes from T 1 and S 2 comes from T 2
  • T 1 and T 2 represent table aliases
  • S 0 includes fields C 00 and C 10
  • S 1 includes fields C 01 and C 02
  • S 2 includes fields C 11 .
  • C 00 represents 0 field S 0 query results
  • C 10 refers to the first field S 0 query results
  • C 01 represents 0 field S 1 search results referenced by the first field
  • C 02 represents the second field referenced by the 0th field of the S 1 query result
  • C 11 represents the first field referenced by the first field of the S 2 query result
  • T 1 represents the first field that appears in S 1 Source table
  • T 2 represents the second source table appearing in S 2 ; when the source table in S 0 is a sub-query, there are two sub-queries S 1 and S 2.
  • the SQL nested query structure in Figure 6 is transformed into a tree structure with three layers, that is, the generated directed acyclic graph has three layers, and the generated directed acyclic graph is shown in Figure 7. Show.
  • the SQL statement parsing method is mainly used when there is "*" (full field substitution symbol) in a complex SQL statement.
  • the two-way inference is carried out based on the field names used in the upper and lower tables and subqueries, and the "*" "Convert to a real field reference.
  • the SQL statement parsing method has two situations where the field represented by "*" cannot be inferred, and the details are as follows:
  • the SQL statement parsing method can also be said to have three main cases that can be inferred.
  • the first case is that there are table aliases, which can be inferred based on the mapping relationship;
  • the second case is that the source tables or subqueries are between The set relationship can be inferred based on the mapping relationship;
  • the third case is when the column listed in the query in the SQL statement does not specify the alias of the table, and at the same time, there are multiple source tables or subqueries, and the source tables or subqueries are between Association relationship, and only one subquery field uses "*" at the same time, it can be inferred from the correlation between the upper and lower levels and the same level.
  • Step 32 Traverse the layers of the directed acyclic graph with the SQL subquery to obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer.
  • the layers of the directed acyclic graph with SQL subqueries can be traversed from top to bottom or the layers of the directed acyclic graph with SQL subqueries can be traversed from bottom to top.
  • the directed acyclic graph has five layers, where the second and third layers have substitution symbols. If the directed acyclic graph is traversed from top to bottom, the second layer is the current layer. Traverse the directed acyclic graph from bottom to top, and the third layer is the current layer.
  • Step 33 Obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer in the current layer.
  • the previous layer of the current layer is the field layer
  • the field layer can map the corresponding field to the subquery of the current layer according to the subquery alias, and use the fields of the field layer to Replace the substitute character of the current layer.
  • the current layer is the first layer to be traversed, then continue to look for the layer without the substitution symbol, and the first layer without the substitution symbol is the field layer, and the previous layer of the field layer
  • the layer is an alternative layer, to infer the field corresponding to the alternative symbol in the alternative layer according to the field layer.
  • field mapping can be performed between the sub-queries at the same level to obtain the field corresponding to the substitute character.
  • the final layer must not be a subquery but a table, and there will be no fields or "*".
  • the final layer may not be considered in the inference process. Refer to Figure 8 for details.
  • Step 34 continue to obtain the remaining layers where the substitute symbols appear, and perform the substitution of the substitute symbols, until all the substitute symbols in the directed acyclic graph are resolved.
  • step 32 and step 33 if there are alternative symbols in other layers, field inference can be performed through the mapping relationship between the upper and lower layers until all alternative symbols are resolved.
  • Figure 8 is a tree structure.
  • the non-"*" field appears in the L 0 layer (that is, the top layer), L m of the query statement.
  • L n layer ie the bottom layer
  • the field represented by "*" is inferred to the upper and lower layers, and the result of data blood relationship is generated.
  • L n+1 is the final layer that does not contain sub-queries.
  • the corresponding field can be mapped to the lower-level sub-query according to the sub-query alias.
  • the lower-level sub-query contains "*”
  • the field C ij is added to the field list of the lower-level sub-query, and the above steps are performed recursively downward.
  • Example 1 For example:
  • the top level is set to level 0, which means that table0 data query is generated.
  • Table0 contains three fields, namely col3, col4, and col5; according to the expression T 1 .col1+ T at level 0 2.
  • col2 as col3 It can be seen that the data of col3 comes from T 1 .col1 and T 2 .col2; the data of col4 comes from S 1 .col4, the data of col5 comes from S 1 .col5; the next layer (layer 1) consists of 2 sheets
  • the two tables are table1 alias T 1 and table2 alias T 2 , and the subquery alias S 1 ; the field queried by the S 1 subquery in the first level is "*", and it is searched in the 0th level
  • the field reference aliased as S 1 has S 1 .col4 and S 2 in the 0th level.
  • col5 refers to the field in the S 1 subquery, so use col4 and col5 to replace the "*" in the S 1 subquery, S 1 is replaced by select col4,col5 ...; the next layer of T 3 (the second layer) is the subquery S 2 , the field queried in the subquery S 2 is "*", so look for the reference field in the upper layer ; Because the source table of S 1 only has sub-query S 2 , all the fields appearing in S 1 should come from sub-query S 2 , so the "*" in S 2 should be replaced with the field appearing in T 3 , S 2 After replacement, select col4, col5 from table3.
  • Example 2 For example:
  • the above SQL statement is converted into a tree structure (directed acyclic graph) shown in FIG. 10, according to the SQL statement, inferred table0 the '*' S 1 according to the representative of col4 col4.
  • Example 3 please refer to Example 3, for example:
  • the corresponding summary information is obtained based on the analysis result of the SQL statement analysis method based on the blood relationship of the data.
  • the summary information is obtained by hashing the analysis result of the SQL statement analysis method based on the data blood relationship, such as Use sha256s algorithm processing to get.
  • Uploading summary information to the blockchain can ensure its security and fairness and transparency to users.
  • the user can download the summary information from the blockchain to verify whether the analysis result of the SQL statement analysis method based on the blood relationship of the data has been tampered with.
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain essentially a decentralized database, is a series of data blocks associated with cryptographic methods. Each data block contains a batch of network transaction information for verification. The validity of the information (anti-counterfeiting) and the generation of the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer
  • This application provides a SQL statement parsing method based on data blood relationship, by obtaining a directed acyclic graph, which is obtained by combing SQL statements; traversing a directed acyclic graph with layers of SQL sub-queries, Get the layer where the substitute character appears, and take the layer where the substitute character appears first during the traversal as the current layer; obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer of the current layer; continue to obtain the remaining appearances Replace the layer of substitute symbols and perform replacement until all the substitute symbols in the directed acyclic graph are resolved.
  • the SQL statement analysis method used in the SQL-based data blood relationship analysis software tool or system does not need to interact with the database software or access additional metadata to achieve data blood relationship analysis; in addition, in data security
  • this SQL statement parsing method can achieve high-cohesion, low-coupling data kinship analysis function, which can reduce external dependence, and at the same time, it does not need to obtain other metadata that does not appear in the data kinship, so it can be guaranteed as much as possible
  • the security of user metadata will not expose metadata information that does not appear in the relevant SQL due to the analysis of the blood relationship of the data.
  • this application also involves blockchain technology.
  • the present application also provides a SQL statement parsing system based on data blood relationship.
  • the SQL statement parsing system can be integrated into the above-mentioned computer device 110, and specifically can include a data set module 20, a traversal module 30, and an alternative Module 40.
  • the data set module 20 is configured to obtain a directed acyclic graph, which is obtained by combing SQL statements;
  • the traversal module 30 is used to traverse the layers of the directed acyclic graph with SQL sub-queries, and obtain the layer where the substitution symbol appears, and the layer where the substitution symbol appears first during the traversal is the current layer;
  • the substitution module 40 is configured to obtain the field corresponding to the substitute symbol in the current layer according to the field of the previous layer of the current layer; the substitution module continues to obtain the remaining layers marked by the traversal module where the substitute symbol appears , And replace the substitute characters until all the substitute characters in the directed acyclic graph are resolved.
  • the data set module 20 includes a cleaning module 21 and a generating module 22.
  • the cleaning module 21 extracts regularized SQL statements from script files containing SQL codes to complete the cleaning of the SQL statements;
  • the generating module 22 is used to perform lexical analysis on the regularized SQL statements to produce abstract grammars Trees, and can generate directed acyclic graphs based on abstract syntax trees.
  • the cleaning module 21 includes a searching module 211 and a rule module 212.
  • the search module 211 is used to obtain a script file containing SQL code and search for the flag bit of the SQL code;
  • the rule module 212 is used to use the flag bit to filter irrelevant content in the script file, and retain the regularized SQL code Statement.
  • the traversal module 30 further implements: traversing the layers of the directed acyclic graph with SQL subquery from top to bottom or traversing the layers of the directed acyclic graph with SQL subquery from bottom to top.
  • the replacement module 40 further implements: if the previous layer of the current layer is a field layer, the field layer can map the corresponding field to the subquery of the current layer according to the subquery alias, And use the field of the field level to replace the current level of substitution.
  • the substitution module 40 further implements: if the current layer is the first layer to be traversed, continue to search for the layer without the substitution symbol, and take the first layer without the substitution symbol as the field layer, and The previous layer of the field layer is the substitution layer, so that the field corresponding to the substitution symbol in the substitution layer can be inferred from the field layer.
  • the computing system further includes a display module (not shown) for displaying calculation results.
  • the display module may be a display of a desktop computer or a display device of other computer equipment.
  • FIG. 15 is a schematic structural diagram of a device according to an embodiment of the application.
  • the device 200 includes a processor 201 and a memory 202 coupled to the processor 201.
  • the memory 202 stores program instructions for implementing the SQL statement parsing method based on data blood relationship described in any of the above embodiments.
  • the processor 201 is configured to execute program instructions stored in the memory 202.
  • the processor 201 may also be referred to as a CPU (Central Processing Unit, central processing unit).
  • the processor 201 may be an integrated circuit chip with signal processing capability.
  • the processor 201 may also be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware component .
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • FIG. 16 is a schematic structural diagram of a storage medium according to an embodiment of the application.
  • the storage medium of this embodiment of the present application stores a program file 301 that can implement all the above methods, where the program file 301 may be stored in the above storage medium in the form of a software product, and the computer-readable storage medium may be a non-volatile , It can also be volatile, which includes several instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) or a processor to execute all or all of the methods described in the various embodiments of this application. Part of the steps.
  • the aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disks or optical disks and other media that can store program codes.
  • terminal devices such as computers, servers, mobile phones, and tablets.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are an SQL statement parsing method and system, and a computer device and a storage medium. The SQL statement parsing method comprises: acquiring a directed acyclic graph, wherein the directed acyclic graph is obtained by combing an SQL statement (S31); traversing layers, which have an SQL sub-query, of the directed acyclic graph to acquire a layer where a substitute symbol appears, and taking the layer where a substitute symbol first appears during traversal as the current layer (S32); according to a field of the previous layer of the current layer, acquiring a field corresponding to the substitute symbol in the current layer (S33); and continuing to acquire the remaining layers where a substitute symbol appears, and replacing the substitute symbols until the parsing of all the substitute symbols in the directed acyclic graph is completed (S34). Therefore, by using an SQL statement parsing method in an SQL-based data lineage analysis software tool or system, neither interaction with database software nor accessing additional metadata is necessary, such that the security of user metadata can be safeguarded as much as possible, and exposure, due to data lineage parsing, of metadata information that does not appear in a related SQL does not occur.

Description

SQL语句解析方法、系统、计算机设备和存储介质SQL statement parsing method, system, computer equipment and storage medium
本申请要求于2020年10月21日提交中国专利局、申请号为202011134553.8、申请名称为“SQL语句解析方法、系统、计算机设备和存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on October 21, 2020, the application number is 202011134553.8, and the application name is "SQL statement parsing method, system, computer equipment and storage medium", the entire content of which is incorporated by reference Incorporated in this application.
 To
技术领域Technical field
本申请涉及数据处理技术领域,特别是涉及SQL语句解析方法、系统、计算机设备和存储介质。 This application relates to the field of data processing technology, in particular to SQL statement parsing methods, systems, computer equipment and storage media.
背景技术Background technique
数据作为企业重要的资产,需要通过数据治理的方式进行管理、维护和使用,数据血缘作为数据治理的重要组成部分,通过数据加工过程中使用的SQL语句,分析出数据的来源和去向,为数据治理其他工作提供重要的数据基础。数据血缘解析使用的SQL语句可分为原始SQL语句和数据库系统转换后的SQL语句。As an important asset of an enterprise, data needs to be managed, maintained and used through data governance. As an important part of data governance, data blood relationship is analyzed through the SQL statements used in the data processing process to analyze the source and destination of the data. Govern other tasks to provide an important data foundation. The SQL statements used in data blood relationship analysis can be divided into original SQL statements and SQL statements converted by the database system.
本方法使用原始SQL语句进行分析,与数据库软件转换后的SQL语句相比,原始SQL对数据库软件基本无依赖,降低了数据血缘分析系统的复杂程度。发明人意识到,数据字典能够对隐式引用提供辅助数据,使血缘数据的分析结果更加全面详尽。但有可能会造成不必要的数据信息泄露,在数据安全较为敏感的场景中,数据字典放大了SQL解析过程中可访问的元数据范围,存在一定安全风险。This method uses the original SQL statement for analysis. Compared with the SQL statement converted by the database software, the original SQL basically has no dependence on the database software, which reduces the complexity of the data blood relationship analysis system. The inventor realizes that the data dictionary can provide auxiliary data for implicit references, making the analysis results of blood data more comprehensive and detailed. However, it may cause unnecessary data information leakage. In scenarios where data security is more sensitive, the data dictionary enlarges the range of metadata that can be accessed during the SQL parsing process, which poses a certain security risk.
因此,本方法在不使用数据字典的情况下,尽量完整解析出SQL语句中出现的字段间隐式引用,并生成数据血缘关系。Therefore, this method does not use the data dictionary, try to fully resolve the implicit references between the fields appearing in the SQL statement, and generate the data blood relationship.
技术问题technical problem
基于此,本申请提供了一种SQL语句解析方法、系统、计算机设备和存储介质,通过该SQL语句解析方法能避免与数据库软件进行交互,也不必访问额外的元数据,并尽量保障用户元数据的安全性。Based on this, this application provides a SQL statement parsing method, system, computer equipment, and storage medium, by which SQL statement parsing method can avoid interaction with database software, without having to access additional metadata, and try to protect user metadata Security.
技术解决方案Technical solutions
为实现上述目的,本申请提供一种基于数据血缘的SQL语句解析方法,所述SQL语句解析方法包括:In order to achieve the above-mentioned purpose, the present application provides a method for parsing SQL statements based on the blood relationship of data, and the method for parsing SQL statements includes:
步骤一、获取有向无环图,所述有向无环图通过梳理SQL语句得到;Step 1: Obtain a directed acyclic graph, which is obtained by combing SQL statements;
步骤二、遍历有向无环图具有SQL子查询的图层,获取出现替代符的图层,以遍历时最先出现替代符的图层为当前层;Step 2: Traverse the layers of the directed acyclic graph with SQL sub-query to obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer;
步骤三、根据当前层的前一图层的字段,获取当前层中替代符对应的字段;Step 3: Obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer in the current layer;
步骤四、继续获取剩下的出现替代符的图层,并进行替代符的替换,直到有向无环图中所有的替代符完成解析。Step 4: Continue to obtain the remaining layers where the substitute symbols appear, and perform the substitution of the substitute symbols, until all the substitute symbols in the directed acyclic graph are resolved.
本申请还提供一种基于数据血缘的SQL语句解析系统,所述SQL语句解析系统包括:The present application also provides a SQL statement analysis system based on the blood relationship of the data, and the SQL statement analysis system includes:
数据集模块,用于获取有向无环图,所述有向无环图通过梳理SQL语句得到The data set module is used to obtain a directed acyclic graph, which is obtained by combing SQL statements
遍历模块,用于遍历有向无环图具有SQL子查询的图层,获取出现替代符的图层,以遍历时最先出现替代符的图层为当前层;The traversal module is used to traverse the layers of the directed acyclic graph with SQL sub-queries, and obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer;
替代模块,用于根据当前层的前一图层的字段,获取当前层中替代符对应的字段;所述替代模块继续获取所述遍历模块标记的剩下的出现替代符的图层,并进行替代符的替换,直到有向无环图中所有的替代符完成解析。The substitution module is used to obtain the field corresponding to the substitute symbol in the current layer according to the field of the previous layer of the current layer; The substitution of substitute characters until all substitute characters in the directed acyclic graph are resolved.
为实现上述目的,本申请还提供一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如下步骤:In order to achieve the above objective, the present application also provides a computer device, including a memory and a processor. The memory stores computer-readable instructions. When the computer-readable instructions are executed by the processor, the processor Perform the following steps:
步骤一、获取有向无环图,所述有向无环图通过梳理SQL语句得到;Step 1: Obtain a directed acyclic graph, which is obtained by combing SQL statements;
步骤二、遍历有向无环图具有SQL子查询的图层,获取出现替代符的图层,以遍历时最先出现替代符的图层为当前层;Step 2: Traverse the layers of the directed acyclic graph with SQL sub-query to obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer;
步骤三、根据当前层的前一图层的字段,获取当前层中替代符对应的字段;Step 3: Obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer in the current layer;
步骤四、继续获取剩下的出现替代符的图层,并进行替代符的替换,直到有向无环图中所有的替代符完成解析。Step 4: Continue to obtain the remaining layers where the substitute symbols appear, and perform the substitution of the substitute symbols, until all the substitute symbols in the directed acyclic graph are resolved.
为实现上述目的,本申请还提供一种存储介质,存储有能够实现如下步骤的程序文件,所述步骤包括:In order to achieve the above objective, this application also provides a storage medium storing a program file capable of implementing the following steps, the steps including:
步骤一、获取有向无环图,所述有向无环图通过梳理SQL语句得到;Step 1: Obtain a directed acyclic graph, which is obtained by combing SQL statements;
步骤二、遍历有向无环图具有SQL子查询的图层,获取出现替代符的图层,以遍历时最先出现替代符的图层为当前层;Step 2: Traverse the layers of the directed acyclic graph with SQL sub-query to obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer;
步骤三、根据当前层的前一图层的字段,获取当前层中替代符对应的字段;Step 3: Obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer in the current layer;
步骤四、继续获取剩下的出现替代符的图层,并进行替代符的替换,直到有向无环图中所有的替代符完成解析。Step 4: Continue to obtain the remaining layers where the substitute symbols appear, and perform the substitution of the substitute symbols, until all the substitute symbols in the directed acyclic graph are resolved.
有益效果Beneficial effect
上述本申请提供了一种SQL语句解析方法、系统、计算机设备和存储介质,其中,所述SQL语句解析方法获取有向无环图,所述有向无环图通过梳理SQL语句得到;遍历有向无环图具有SQL子查询的图层,获取出现替代符的图层,以遍历时最先出现替代符的图层为当前层;根据当前层的前一图层的字段,获取当前层中替代符对应的字段;继续获取剩下的出现替代符的图层,并进行替换,直到有向无环图中所有的替代符完成解析。由此,在基于SQL的数据血缘分析软件工具或系统中采用所述SQL语句解析方法不必与数据库软件进行交互,也不必访问额外的元数据,便可实现数据血缘关系解析;此外,在数据安全要求较高的场景,通过该SQL语句解析方法可实现高内聚、低耦合的数据血缘解析功能,能减少外部依赖,同时不需要获取未在数据血缘关系中出现的其他元数据,可尽量保障用户元数据的安全性,不会因数据血缘解析而暴露未出现在相关SQL中的元数据信息。同时,本申请还涉及区块链技术。The above application provides a SQL statement analysis method, system, computer device and storage medium, wherein the SQL statement analysis method obtains a directed acyclic graph, and the directed acyclic graph is obtained by combing SQL statements; To an acyclic graph layer with SQL sub-query, get the layer where the substitution symbol appears, and take the layer where the substitution symbol appears first during the traversal as the current layer; according to the field of the previous layer of the current layer, get the current layer The field corresponding to the substitution symbol; continue to obtain the remaining layers where the substitution symbol appears, and replace it until all the substitution symbols in the directed acyclic graph are resolved. Therefore, the SQL statement analysis method used in the SQL-based data blood relationship analysis software tool or system does not need to interact with the database software and does not need to access additional metadata to achieve data blood relationship analysis; in addition, in data security In demanding scenarios, this SQL statement parsing method can achieve high cohesion and low coupling data blood relationship analysis function, which can reduce external dependence, and does not need to obtain other metadata that does not appear in the data blood relationship, so it can be guaranteed as much as possible The security of user metadata will not expose metadata information that does not appear in the relevant SQL due to the analysis of the blood relationship of the data. At the same time, this application also involves blockchain technology.
附图说明Description of the drawings
图1为一个实施例中提供的SQL语句解析方法的实施环境图;Figure 1 is an implementation environment diagram of a SQL statement parsing method provided in an embodiment;
图2为一个实施例中计算机设备的内部结构框图;Figure 2 is a block diagram of the internal structure of a computer device in an embodiment;
图3为一个实施例中SQL语句解析方法的流程图;Figure 3 is a flowchart of a method for parsing SQL statements in an embodiment;
图4为一个实施例中有向无环图生成的流程图;Figure 4 is a flow chart for generating a directed acyclic graph in an embodiment;
图5为一个实施例中SQL代码清洗的流程图;Figure 5 is a flowchart of SQL code cleaning in an embodiment;
图6为一个实施例中SQL嵌套查询结构的示意图;Fig. 6 is a schematic diagram of a SQL nested query structure in an embodiment;
图7为根据图6生成的有向无环图的示意图;Fig. 7 is a schematic diagram of a directed acyclic graph generated according to Fig. 6;
图8为一个实施例中有向无环图的示意图;Figure 8 is a schematic diagram of a directed acyclic graph in an embodiment;
图9为一个实施例中实例1的有向无环图的示意图;9 is a schematic diagram of the directed acyclic graph of Example 1 in an embodiment;
图10为一个实施例中实例2的有向无环图的示意图;Figure 10 is a schematic diagram of the directed acyclic graph of Example 2 in an embodiment;
图11为一个实施例中实例3的有向无环图的示意图;11 is a schematic diagram of the directed acyclic graph of Example 3 in an embodiment;
图12为一个实施例中SQL语句解析系统的示意图;Figure 12 is a schematic diagram of a SQL statement parsing system in an embodiment;
图13为一个实施例中数据集模块的示意图;Figure 13 is a schematic diagram of a data set module in an embodiment;
图14为一个实施例中清洗模块的示意图;Figure 14 is a schematic diagram of a cleaning module in an embodiment;
图15为一个实施例中的计算机设备的结构示意图;15 is a schematic diagram of the structure of a computer device in an embodiment;
图16为一个实施例中的存储介质的结构示意图。FIG. 16 is a schematic diagram of the structure of a storage medium in an embodiment.
本发明的实施方式Embodiments of the present invention
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。In order to make the purpose, technical solutions, and advantages of this application clearer and clearer, the following further describes the application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application.
可以理解,本申请所使用的术语“第一”、“第二”等可在本文中用于描述各种元件,但这些元件不受这些术语限制。这些术语仅用于将第一个元件与另一个元件区分。It can be understood that the terms "first", "second", etc. used in this application can be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish the first element from another element.
图1为一个实施例中提供的基于数据血缘的SQL语句解析方法的实施环境图,如图1所示,在该实施环境中,包括计算机设备110和显示设备120。FIG. 1 is an implementation environment diagram of an SQL statement parsing method based on data blood relationship provided in an embodiment. As shown in FIG. 1, the implementation environment includes a computer device 110 and a display device 120.
计算机设备110可以为用户使用的电脑等计算机设备,计算机设备110上安装有基于数据血缘的SQL语句解析系统。当计算时,用户可以在计算机设备110依照基于数据血缘的SQL语句解析方法进行解析,并通过显示设备120显示解析结果。The computer device 110 may be a computer device used by the user, such as a computer, and the computer device 110 is equipped with a SQL statement parsing system based on the blood relationship of the data. When calculating, the user can perform analysis on the computer device 110 according to the SQL statement analysis method based on the blood relationship of the data, and display the analysis result on the display device 120.
需要说明的是,计算机设备110和显示设备120组合起来可以为智能手机、平板电脑、笔记本电脑、台式计算机等,但并不局限于此。It should be noted that the combination of the computer device 110 and the display device 120 can be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc., but is not limited to this.
图2为一个实施例中计算机设备的内部结构示意图。如图2所示,该计算机设备包括通过系统总线连接的处理器、非易失性存储介质、存储器和网络接口。其中,该计算机设备的非易失性存储介质存储有操作系统、数据库和计算机可读指令,数据库中可存储有控件信息序列,该计算机可读指令被处理器执行时,可使得处理器实现一种基于数据血缘的SQL语句解析方法。该计算机设备的处理器用于提供计算和控制能力,支撑整个计算机设备的运行。该计算机设备的存储器中可存储有计算机可读指令,该计算机可读指令被处理器执行时,可使得处理器执行一种基于数据血缘的SQL语句解析方法。该计算机设备的网络接口用于与终端连接通信。本领域技术人员可以理解,图2中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。Figure 2 is a schematic diagram of the internal structure of a computer device in an embodiment. As shown in Figure 2, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus. Wherein, the non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions. The database may store control information sequences. When the computer-readable instructions are executed by the processor, the processor can realize a A SQL statement parsing method based on the blood relationship of the data. The processor of the computer equipment is used to provide calculation and control capabilities, and supports the operation of the entire computer equipment. A computer readable instruction may be stored in the memory of the computer device, and when the computer readable instruction is executed by the processor, the processor may execute a SQL statement parsing method based on the blood relationship of the data. The network interface of the computer device is used to connect and communicate with the terminal. Those skilled in the art can understand that the structure shown in FIG. 2 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied. The specific computer device may Including more or fewer parts than shown in the figure, or combining some parts, or having a different arrangement of parts.
如图3所示,在一个实施例中,提出了一种基于数据血缘的SQL语句解析方法,本申请还可应用于数据仓库场景中,从而推动大数据的建设的目的。所述SQL语句解析方法可以应用于上述的计算机设备110和显示设备120中,具体可以包括以下步骤:As shown in FIG. 3, in one embodiment, a method for parsing SQL statements based on data blood relationship is proposed. This application can also be applied to data warehouse scenarios, thereby promoting the purpose of building big data. The SQL statement parsing method can be applied to the aforementioned computer device 110 and display device 120, and specifically can include the following steps:
步骤31,获取有向无环图,所述有向无环图通过梳理SQL语句得到。Step 31: Obtain a directed acyclic graph, which is obtained by combing SQL statements.
具体的,一个复杂的SQL可能由多个子SQL(查询语句)组成的,最基础、最原子的SQL可能由SQL关键字、字段、表、函数等基层元素组成,为了尽量完整记录SQL中各基础元素及其之间的关系,从而产生字段级别的映射关系。进一步的,请参考图4和图5,通过从含有SQL代码的脚本提取SQL语句,并通过梳理SQL语句可以得到有向无环图,具体的,包括:Specifically, a complex SQL may be composed of multiple sub-SQL (query statements). The most basic and atomic SQL may be composed of basic elements such as SQL keywords, fields, tables, functions, etc., in order to record the basics of SQL as completely as possible. Elements and the relationship between them, resulting in a field-level mapping relationship. Further, please refer to Figures 4 and 5, by extracting SQL statements from scripts containing SQL code, and by combing SQL statements, a directed acyclic graph can be obtained, specifically, including:
S311、从含有SQL代码的脚本文件中提取得到规则化的SQL语句,完成对SQL语句的清洗;S311. Extract regularized SQL statements from the script file containing the SQL code, and complete the cleaning of the SQL statements;
进一步的,所述S311包括:Further, the S311 includes:
S3111、获取含有SQL代码的脚本文件,并寻找SQL代码的标志位;S3111, obtain the script file containing the SQL code, and look for the flag bit of the SQL code;
优选的,脚本文件可为perl等脚本。Preferably, the script file may be a script such as perl.
S3112、利用标志位过滤脚本文件中的无关内容,保留得到规则化的SQL代码语句。S3112. Use the flag bit to filter irrelevant content in the script file, and retain the regularized SQL code statements.
S312、对规则化的SQL语句进行词法分析,生产抽象语法树,并根据抽象语法树生成有向无环图。S312: Perform lexical analysis on the regularized SQL statement to produce an abstract syntax tree, and generate a directed acyclic graph according to the abstract syntax tree.
根据所述SQL语句解析方法,请参考图6,将具体举例说明如何通过SQL语句形成有向无环图的,即将SQL嵌套查询结构可转换为以下树形结构。According to the SQL statement parsing method, please refer to FIG. 6 to illustrate how to form a directed acyclic graph through SQL statements, that is, the SQL nested query structure can be converted into the following tree structure.
具体的,请参考图6,假设图6中SQL查询语句为S,S中包含字段为C ij (i,j∈N),包含来源表T i (i∈N);其中C i0表示S查询结果的第i个字段,C ij,(i∈N,j∈N* )表示S查询结果的第i个字段所引用的第j个字段;T i表示S中出现的第i个来源表;当S中的来源表为子查询时使用不同下标的S i来表示。 Specifically, referring to FIG. 6, FIG. 6 is assumed for the SQL query S, S is the field that contains the C ij (i, j∈N), comprising a source table T i (i∈N); where S represents a C i0 query The i-th field of the result, C ij , (i∈N,j∈N*) represents the j-th field referenced by the i-th field of the S query result; T i represents the i-th source table that appears in S; When the source table in S is a subquery, S i with different subscripts is used to indicate.
根据图6来看,上图SQL查询语句为S 0,S 0的来源表为S 1和S 2,即S 1来自T 1,S 2来自T 2(T 1和T 2表示表别名),S 0包括字段C 00和C 10,S 1包括字段C 01和字段C 02,S 2包括字段C 11。进一步的,C 00表示S 0查询结果的第0个字段,C 10表示S 0查询结果的第1个字段;C 01表示S 1查询结果的第0个字段所引用的第1个字段,C 02表示S 1查询结果的第0个字段所引用的第2个字段,C 11表示S 2查询结果的第1个字段所引用的第1个字段;T 1表示S 1中出现的第1个来源表,T 2表示S 2中出现的第2个来源表;S 0中的来源表为子查询时具有S 1和S 2两个子查询。 According to Figure 6, the SQL query in the above figure is S 0 , and the source tables of S 0 are S 1 and S 2 , that is, S 1 comes from T 1 and S 2 comes from T 2 (T 1 and T 2 represent table aliases), S 0 includes fields C 00 and C 10 , S 1 includes fields C 01 and C 02 , and S 2 includes fields C 11 . Further, C 00 represents 0 field S 0 query results, C 10 refers to the first field S 0 query results; C 01 represents 0 field S 1 search results referenced by the first field, C 02 represents the second field referenced by the 0th field of the S 1 query result, C 11 represents the first field referenced by the first field of the S 2 query result; T 1 represents the first field that appears in S 1 Source table, T 2 represents the second source table appearing in S 2 ; when the source table in S 0 is a sub-query, there are two sub-queries S 1 and S 2.
根据以上描述,可以看出,图6中SQL嵌套查询结构转为树形架构为三层,即生成的有向无环图具有三层,所述生成的有向无环图如图7所示。According to the above description, it can be seen that the SQL nested query structure in Figure 6 is transformed into a tree structure with three layers, that is, the generated directed acyclic graph has three layers, and the generated directed acyclic graph is shown in Figure 7. Show.
在此需要说明的是,该SQL语句解析方法主要用于复杂SQL语句中存在“*”(全字段替代符号)的情况,根据上下层表和子查询中使用的字段名称进行双向推断,将“*”转换换为真实的字段引用。但是,所述SQL语句解析方法具有两种情况无法推断“*”所表示的字段,具体如下:What needs to be explained here is that the SQL statement parsing method is mainly used when there is "*" (full field substitution symbol) in a complex SQL statement. The two-way inference is carried out based on the field names used in the upper and lower tables and subqueries, and the "*" "Convert to a real field reference. However, the SQL statement parsing method has two situations where the field represented by "*" cannot be inferred, and the details are as follows:
第一种情况,当SQL语句中所有查询字段都使用“*”替代,则无法进行推断。In the first case, when all query fields in the SQL statement are replaced by "*", inference cannot be made.
例如以图8为例,如果S 0是“*”,如果S 1、S 2和S 3都是“*”,则无法进行推断。 For example, taking Fig. 8 as an example, if S 0 is "*", if S 1 , S 2 and S 3 are all "*", no inference can be made.
第二种情况,当SQL语句中查询列出的字段未指定表的别名,同时,来源表或子查询有多个,且来源表或子查询之间是关联关系,并且同时至少二个子查询中字段都使用“*”表示,则无法准确推断“*”所代表的字段。In the second case, when the columns listed in the query in the SQL statement do not specify the alias of the table, at the same time, there are multiple source tables or sub-queries, and the source tables or sub-queries are related, and at least two sub-queries are in the same time If the fields are all represented by "*", the field represented by "*" cannot be accurately inferred.
例如以图8为例,如果S 0是“a+b+c”,如果S 1是“*”,S 2是“*”,S 3显示c,那么S 1和S 2是无法被推断出来的。 For example, take Figure 8 as an example. If S 0 is "a+b+c", if S 1 is "*", S 2 is "*", and S 3 shows c, then S 1 and S 2 cannot be inferred. of.
同理,所述SQL语句解析方法也可以说是主要有三种情况可以推断出来,第一种情况是具有表别名,可根据映射关系进行推断;第二种情况是来源表或子查询之间是集合关系,可以根据映射关系进行推断;第三种情况是当SQL语句中查询列出的字段未指定表的别名,同时,来源表或子查询有多个,且来源表或子查询之间是关联关系,并且同时只有一个子查询中字段使用“*”,则可通过上下层和同层之间的关联性进行推断。In the same way, the SQL statement parsing method can also be said to have three main cases that can be inferred. The first case is that there are table aliases, which can be inferred based on the mapping relationship; the second case is that the source tables or subqueries are between The set relationship can be inferred based on the mapping relationship; the third case is when the column listed in the query in the SQL statement does not specify the alias of the table, and at the same time, there are multiple source tables or subqueries, and the source tables or subqueries are between Association relationship, and only one subquery field uses "*" at the same time, it can be inferred from the correlation between the upper and lower levels and the same level.
步骤32,遍历有向无环图具有SQL子查询的图层,获取出现替代符的图层,以遍历时最先出现替代符的图层为当前层。Step 32: Traverse the layers of the directed acyclic graph with the SQL subquery to obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer.
具体的,由于有向无环图自身的缘故,可以从上到下遍历有向无环图具有SQL子查询的图层或从下到上遍历有向无环图具有SQL子查询的图层。Specifically, due to the directed acyclic graph itself, the layers of the directed acyclic graph with SQL subqueries can be traversed from top to bottom or the layers of the directed acyclic graph with SQL subqueries can be traversed from bottom to top.
更具体的,假设有向无环图具有五层,其中第二层和第三层具有替代符,如果从上到下遍历所述有向无环图,则第二层为当前层,如果从下到上遍历所述有向无环图,则第三层为当前层。More specifically, suppose that the directed acyclic graph has five layers, where the second and third layers have substitution symbols. If the directed acyclic graph is traversed from top to bottom, the second layer is the current layer. Traverse the directed acyclic graph from bottom to top, and the third layer is the current layer.
步骤33,根据当前层的前一图层的字段,获取当前层中替代符对应的字段。Step 33: Obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer in the current layer.
具体的,其中一实施例中,所述当前层的前一图层为字段层,所述字段层可根据子查询别名将对应字段映射至当前层的子查询中,并采用字段层的字段来替代当前层的替代符。Specifically, in one of the embodiments, the previous layer of the current layer is the field layer, and the field layer can map the corresponding field to the subquery of the current layer according to the subquery alias, and use the fields of the field layer to Replace the substitute character of the current layer.
具体的,其中一实施例中,若当前层为遍历的第一层,则继续寻找没有替代符的图层,以最先出现的没有替代符的图层为字段层,以字段层的前一层为替代层,以根据字段层推断出替代层中的替代符对应的字段。Specifically, in one of the embodiments, if the current layer is the first layer to be traversed, then continue to look for the layer without the substitution symbol, and the first layer without the substitution symbol is the field layer, and the previous layer of the field layer The layer is an alternative layer, to infer the field corresponding to the alternative symbol in the alternative layer according to the field layer.
进一步的,如果同层存在多个子查询,且同层级的子查询之间如果是集合操作关系,则可以在同一层子查询之间进行字段映射,以获得替代符对应的字段。Further, if there are multiple sub-queries at the same level, and if the sub-queries at the same level have a set operation relationship, field mapping can be performed between the sub-queries at the same level to obtain the field corresponding to the substitute character.
需要额外说明的是,根据SQL语法规则,最终层一定不会是子查询而是表,不会出现字段或“*”的情况,在推断过程中可以不考虑最终层,具体可参考图8。It should be noted that, according to the SQL grammar rules, the final layer must not be a subquery but a table, and there will be no fields or "*". The final layer may not be considered in the inference process. Refer to Figure 8 for details.
步骤34,继续获取剩下的出现替代符的图层,并进行替代符的替换,直到有向无环图中所有的替代符完成解析。Step 34, continue to obtain the remaining layers where the substitute symbols appear, and perform the substitution of the substitute symbols, until all the substitute symbols in the directed acyclic graph are resolved.
具体的,根据步骤32和步骤33,如果其他图层还存在替代符,均可通过上下层之间映射关系进行字段推断,直到所有替代符完成解析。Specifically, according to step 32 and step 33, if there are alternative symbols in other layers, field inference can be performed through the mapping relationship between the upper and lower layers until all alternative symbols are resolved.
根据以上四个步骤,请继续参考图8,为一种树形结构,在本SQL语句解析方法中,可解决非“*”字段出现在查询语句的L 0层(即最上层)、L m层(即中间层)、L n层(即最下层)时,向上下层推断“*”所代表的字段,并生成数据血缘关系结果,L n+1为不包含子查询的最终层。 According to the above four steps, please continue to refer to Figure 8, which is a tree structure. In this SQL statement parsing method, it can be solved that the non-"*" field appears in the L 0 layer (that is, the top layer), L m of the query statement. When the layer (ie middle layer) and L n layer (ie the bottom layer) are used, the field represented by "*" is inferred to the upper and lower layers, and the result of data blood relationship is generated. L n+1 is the final layer that does not contain sub-queries.
其中,请当非“*”字段C ij出现在L 0时,可根据子查询别名将对应字段映射至下层的子查询中,当下层子查询中包含“*”时,使用字段C ij替代;当下层子查询中不包含字段C ij时,将字段C ij加入下层子查询的字段列表中,以此类推递归向下执行上述步骤。 Among them, when the non-"*" field C ij appears in L 0 , the corresponding field can be mapped to the lower-level sub-query according to the sub-query alias. When the lower-level sub-query contains "*", use the field C ij instead; When the field C ij is not included in the lower-level sub-query, the field C ij is added to the field list of the lower-level sub-query, and the above steps are performed recursively downward.
具体的,请参考实例1,例如:Specifically, please refer to Example 1, for example:
insert into table0insert into table0
selectselect
t1.col1+t2.col2 as col3,s1.col4,s1.col5t1.col1+t2.col2 as col3,s1.col4,s1.col5
from table1 t1 from table1 t1
join table2 t2join table2 t2
on t1.id = t2.idon t1.id = t2.id
join ( select * from  (join (select * from (
select *select *
from table3 t3 ) s2 from table3 t3) s2
    ) s1) S1
on t2.id = t3.id;on t2.id = t3.id;
请参考图9,以上SQL语句转化成树形结构(有向无环图)如图9所示,其中,图9中的实线表示字段的依赖关系,虚线表示字段间的推断关系,根据以上SQL语句进行说明如下:Please refer to Figure 9. The above SQL statement is transformed into a tree structure (directed acyclic graph) as shown in Figure 9. The solid line in Figure 9 represents the dependency relationship of the fields, and the dashed line represents the inferred relationship between the fields. According to the above The SQL statement is explained as follows:
具体的,最顶层设为第0层,表示生成table0的数据查询,此查询中可得出Table0包含三个字段,分别是col3、col4、col5;根据第0层的表达式T 1.col1+ T 2.col2 as col3可知,col3的数据来自T 1.col1和T 2.col2;col4的数据来自S 1.col4,col5的数据来自S 1.col5;下一层(第1层)由2张表和1个子查询组成,两张表分别是table1别名T 1和table2别名T 2,子查询别名S 1;第1层中S 1子查询所查询的字段是“*”,在第0层查找别名为S 1的字段引用,在第0层中有S 1.col4和S 2.col5是引用S 1子查询中的字段,则使用col4和col5替换S 1子查询中的“*”,S 1替换后为select col4,col5 …;T 3的下一层(第2层)是子查询S 2,子查询S 2中所查询的字段是“*”,所以在上一层中查找引用字段;因S 1的来源表只有子查询S 2,那么在S 1中出现的所有字段都应来自子查询S 2,所以S 2中的“*”应使用T 3出现的字段进行替换,S 2替换后为select col4,col5 from table3。 Specifically, the top level is set to level 0, which means that table0 data query is generated. In this query, Table0 contains three fields, namely col3, col4, and col5; according to the expression T 1 .col1+ T at level 0 2. col2 as col3 It can be seen that the data of col3 comes from T 1 .col1 and T 2 .col2; the data of col4 comes from S 1 .col4, the data of col5 comes from S 1 .col5; the next layer (layer 1) consists of 2 sheets The two tables are table1 alias T 1 and table2 alias T 2 , and the subquery alias S 1 ; the field queried by the S 1 subquery in the first level is "*", and it is searched in the 0th level The field reference aliased as S 1 has S 1 .col4 and S 2 in the 0th level. col5 refers to the field in the S 1 subquery, so use col4 and col5 to replace the "*" in the S 1 subquery, S 1 is replaced by select col4,col5 …; the next layer of T 3 (the second layer) is the subquery S 2 , the field queried in the subquery S 2 is "*", so look for the reference field in the upper layer ; Because the source table of S 1 only has sub-query S 2 , all the fields appearing in S 1 should come from sub-query S 2 , so the "*" in S 2 should be replaced with the field appearing in T 3 , S 2 After replacement, select col4, col5 from table3.
其中,当中间层L m的字段为非“*”时,若上一层查询中“*”带有表别名(如T 3.*)则可根据表别名推断中间层中列出的字段是否为上一层“*”应该表示的字段;若上一层“*”无表别名标识,则确认当前查询是否为上层查询唯一的来源表,若是唯一来源表,则当前查询出现的字段为上一层“*”应该表示的字段;若不是唯一来源表,那么所有来源表(子查询)所列出的所有字段都是上一层“*”应该表示的字段。 Among them, when the field of the middle layer L m is not "*", if the "*" in the previous query has a table alias (such as T 3 .*), you can infer whether the field listed in the middle layer is based on the table alias It is the field that the upper layer "*" should represent; if the upper layer "*" has no table alias identification, confirm whether the current query is the only source table of the upper query, if it is the only source table, the field that appears in the current query is the upper Fields that should be represented by "*" at one level; if it is not the only source table, all fields listed in all source tables (subqueries) are the fields that should be represented by "*" at the previous level.
具体的,请参考实例2,例如:Specifically, please refer to Example 2, for example:
insert into table0insert into table0
SELECTSELECT
* *
from(from(
select col1+col2+col3 as col4 from (select col1+col2+col3 as col4 from (
select col1,col2,col3 from table1 t1select col1,col2,col3 from table1 t1
)s2)s2
) s1;) s1;
请参考图10,以上SQL语句转化成树形结构(有向无环图)如图10所示,根据SQL语句,根据S 1中的col4可推断出table0中的‘*’应代表col4。 Please refer to FIG. 10, the above SQL statement is converted into a tree structure (directed acyclic graph) shown in FIG. 10, according to the SQL statement, inferred table0 the '*' S 1 according to the representative of col4 col4.
具体的,从SQL中可知,第0层table0的字段使用‘*’表示;第1层的子查询S 1包含一个字段col4,因table0的下一层数据来源只有子查询S 1,那么S 1中的所有字段都应出现在上一层的查询中;使用第1层子查询S 1的全部字段替代上层的‘*’,替换后第0层的查询应为select col4 from ……。 Specifically, from SQL, we can see that the field of table0 at level 0 is represented by'*'; the subquery S 1 of level 1 contains a field col4, because the data source of the next level of table0 is only subquery S 1 , then S 1 All the fields in should appear in the query of the upper layer; use all the fields of the subquery S 1 of the first layer to replace the'*' in the upper layer, and the query of the 0th layer after the replacement should be select col4 from …….
其中,当多个查询间关系如S 0,S 1,S 2的形式时,同层级的子查询之间如果是集合(union\intersect\minus)操作,则可以在同一层子查询之间进行字段映射。因集合操作要求字段名称和字段顺序完全相同,则可根据非“*”字段在同层级进行集合操作的子查询间进行字段引用映射,可根据S 3中的字段推断出S 2中应包含的字段,在使用S 3和S 2中的字段推断出table0中应包含的字段。 Among them, when the relationship between multiple queries is in the form of S 0 , S 1 , S 2 , if the sub-queries at the same level are set (union\intersect\minus) operations, they can be performed between sub-queries at the same level Field mapping. Operational requirements set by the field names and field sequence are identical, the query can be reference mapping between fields in accordance with the non-child "*" fields set of operations in the same level, can be inferred from the field S 2 S 3 should be included Fields, use the fields in S 3 and S 2 to infer the fields that should be included in table0.
具体的,请参考实例3,例如:Specifically, please refer to Example 3, for example:
insert into table0insert into table0
select * from (select * from (
    (select * from table1 t1) s2(Select * from table1 t1) s2
    UNIONUNION
    (select col1,col2 from table2 t2) s3(Select col1,col2 from table2 t2) s3
)s1;)s1;
请参考图11,UNION在SQL语句代表集合关系中,以上SQL语句转化成树形结构(有向无环图)如图11所示。Please refer to Figure 11. In the SQL statement representing the set relationship of UNION, the above SQL statement is transformed into a tree structure (directed acyclic graph) as shown in Figure 11.
以上三种实例只是列举,进一步的,在实际应用中可能根据SQL复杂程度需要使用上述方式进行组合多次迭代才能将隐式字段引用关系完全列举。The above three examples are just examples. Further, in actual applications, it may be necessary to use the above method to combine multiple iterations according to the complexity of the SQL to fully enumerate the implicit field reference relationship.
在一个可选的实施方式中,还可以:将所述基于数据血缘的SQL语句解析方法的解析结果上传至区块链中。In an optional implementation manner, it is also possible to upload the analysis result of the SQL statement analysis method based on the blood relationship of the data to the blockchain.
具体地,基于所述基于数据血缘的SQL语句解析方法的解析结果得到对应的摘要信息,具体来说,摘要信息由所述基于数据血缘的SQL语句解析方法的解析结果进行散列处理得到,比如利用sha256s算法处理得到。将摘要信息上传至区块链可保证其安全性和对用户的公正透明性。用户可以从区块链中下载得该摘要信息,以便查证所述基于数据血缘的SQL语句解析方法的解析结果是否被篡改。本示例所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。Specifically, the corresponding summary information is obtained based on the analysis result of the SQL statement analysis method based on the blood relationship of the data. Specifically, the summary information is obtained by hashing the analysis result of the SQL statement analysis method based on the data blood relationship, such as Use sha256s algorithm processing to get. Uploading summary information to the blockchain can ensure its security and fairness and transparency to users. The user can download the summary information from the blockchain to verify whether the analysis result of the SQL statement analysis method based on the blood relationship of the data has been tampered with. The blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm. Blockchain, essentially a decentralized database, is a series of data blocks associated with cryptographic methods. Each data block contains a batch of network transaction information for verification. The validity of the information (anti-counterfeiting) and the generation of the next block. The blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
本申请提供了一种基于数据血缘的SQL语句解析方法,通过获取有向无环图,所述有向无环图通过梳理SQL语句得到;遍历有向无环图具有SQL子查询的图层,获取出现替代符的图层,以遍历时最先出现替代符的图层为当前层;根据当前层的前一图层的字段,获取当前层中替代符对应的字段;继续获取剩下的出现替代符的图层,并进行替换,直到有向无环图中所有的替代符完成解析。由此,在基于SQL的数据血缘分析软件工具或系统中采用所述SQL语句解析方法不必与数据库软件进行交互,也不必访问额外的元数据,便可实现数据血缘关系解析;此外,在数据安全要求较高的场景,通过该SQL语句解析方法可实现高内聚、低耦合的数据血缘解析功能,能减少外部依赖,同时不需要获取未在数据血缘关系中出现的其他元数据,可尽量保障用户元数据的安全性,不会因数据血缘解析而暴露未出现在相关SQL中的元数据信息。同时,本申请还涉及区块链技术。This application provides a SQL statement parsing method based on data blood relationship, by obtaining a directed acyclic graph, which is obtained by combing SQL statements; traversing a directed acyclic graph with layers of SQL sub-queries, Get the layer where the substitute character appears, and take the layer where the substitute character appears first during the traversal as the current layer; obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer of the current layer; continue to obtain the remaining appearances Replace the layer of substitute symbols and perform replacement until all the substitute symbols in the directed acyclic graph are resolved. Therefore, the SQL statement analysis method used in the SQL-based data blood relationship analysis software tool or system does not need to interact with the database software or access additional metadata to achieve data blood relationship analysis; in addition, in data security In demanding scenarios, this SQL statement parsing method can achieve high-cohesion, low-coupling data kinship analysis function, which can reduce external dependence, and at the same time, it does not need to obtain other metadata that does not appear in the data kinship, so it can be guaranteed as much as possible The security of user metadata will not expose metadata information that does not appear in the relevant SQL due to the analysis of the blood relationship of the data. At the same time, this application also involves blockchain technology.
如图12所示,本申请还提供了一种基于数据血缘的SQL语句解析系统,该SQL语句解析系统可以集成于上述的计算机设备110中,具体可以包括数据集模块20、遍历模块30以及替代模块40。As shown in FIG. 12, the present application also provides a SQL statement parsing system based on data blood relationship. The SQL statement parsing system can be integrated into the above-mentioned computer device 110, and specifically can include a data set module 20, a traversal module 30, and an alternative Module 40.
所述数据集模块20,用于获取有向无环图,所述有向无环图通过梳理SQL语句得到;The data set module 20 is configured to obtain a directed acyclic graph, which is obtained by combing SQL statements;
所述遍历模块30,用于遍历有向无环图具有SQL子查询的图层,获取出现替代符的图层,以遍历时最先出现替代符的图层为当前层;The traversal module 30 is used to traverse the layers of the directed acyclic graph with SQL sub-queries, and obtain the layer where the substitution symbol appears, and the layer where the substitution symbol appears first during the traversal is the current layer;
所述替代模块40,用于根据当前层的前一图层的字段,获取当前层中替代符对应的字段;所述替代模块继续获取所述遍历模块标记的剩下的出现替代符的图层,并进行替代符的替换,直到有向无环图中所有的替代符完成解析。The substitution module 40 is configured to obtain the field corresponding to the substitute symbol in the current layer according to the field of the previous layer of the current layer; the substitution module continues to obtain the remaining layers marked by the traversal module where the substitute symbol appears , And replace the substitute characters until all the substitute characters in the directed acyclic graph are resolved.
在一个实施例中,请参考图13,所述数据集模块20包括清洗模块21和生成模块22。其中,所述清洗模块21从含有SQL代码的脚本文件中提取得到规则化的SQL语句,完成对SQL语句的清洗;所述生成模块22用于对规则化的SQL语句进行词法分析,生产抽象语法树,并可根据抽象语法树生成有向无环图。In one embodiment, referring to FIG. 13, the data set module 20 includes a cleaning module 21 and a generating module 22. Wherein, the cleaning module 21 extracts regularized SQL statements from script files containing SQL codes to complete the cleaning of the SQL statements; the generating module 22 is used to perform lexical analysis on the regularized SQL statements to produce abstract grammars Trees, and can generate directed acyclic graphs based on abstract syntax trees.
在一个实施例中,请参考图14,所述清洗模块21包括寻找模块211和规则模块212。其中,所述寻找模块211用于获取含有SQL代码的脚本文件,并寻找SQL代码的标志位;所述规则模块212用于利用标志位过滤脚本文件中的无关内容,保留得到规则化的SQL代码语句。In one embodiment, please refer to FIG. 14, the cleaning module 21 includes a searching module 211 and a rule module 212. Wherein, the search module 211 is used to obtain a script file containing SQL code and search for the flag bit of the SQL code; the rule module 212 is used to use the flag bit to filter irrelevant content in the script file, and retain the regularized SQL code Statement.
在一个实施例中,所述遍历模块30进一步实现:从上到下遍历有向无环图具有SQL子查询的图层或从下到上遍历有向无环图具有SQL子查询的图层。In one embodiment, the traversal module 30 further implements: traversing the layers of the directed acyclic graph with SQL subquery from top to bottom or traversing the layers of the directed acyclic graph with SQL subquery from bottom to top.
在一个实施例中,所述替代模块40进一步实现:若所述当前层的前一图层为字段层,则所述字段层可根据子查询别名将对应字段映射至当前层的子查询中,并采用字段层的字段来替代当前层的替代符。In one embodiment, the replacement module 40 further implements: if the previous layer of the current layer is a field layer, the field layer can map the corresponding field to the subquery of the current layer according to the subquery alias, And use the field of the field level to replace the current level of substitution.
在一个实施例中,所述替代模块40进一步实现:若当前层为遍历的第一层,则继续寻找没有替代符的图层,以最先出现的没有替代符的图层为字段层,以字段层的前一层为替代层,以根据字段层推断出替代层中的替代符对应的字段。In one embodiment, the substitution module 40 further implements: if the current layer is the first layer to be traversed, continue to search for the layer without the substitution symbol, and take the first layer without the substitution symbol as the field layer, and The previous layer of the field layer is the substitution layer, so that the field corresponding to the substitution symbol in the substitution layer can be inferred from the field layer.
在一个实施例中,所述计算系统还包括显示模块(未图示),用于显示计算结果,所述显示模块可以是台式电脑的显示器,也可以是其他计算机设备的显示装置。In one embodiment, the computing system further includes a display module (not shown) for displaying calculation results. The display module may be a display of a desktop computer or a display device of other computer equipment.
请参考图15,图15为本申请实施例的设备的结构示意图。如图15所示,该设备200包括处理器201及和处理器201耦接的存储器202。Please refer to FIG. 15, which is a schematic structural diagram of a device according to an embodiment of the application. As shown in FIG. 15, the device 200 includes a processor 201 and a memory 202 coupled to the processor 201.
存储器202存储有用于实现上述任一实施例所述基于数据血缘的SQL语句解析方法的程序指令。The memory 202 stores program instructions for implementing the SQL statement parsing method based on data blood relationship described in any of the above embodiments.
处理器201用于执行存储器202存储的程序指令。The processor 201 is configured to execute program instructions stored in the memory 202.
其中,处理器201还可以称为CPU(Central Processing Unit,中央处理单元)。处理器201可能是一种集成电路芯片,具有信号的处理能力。处理器201还可以是通用处理器、数字信号处理器(DSP)、专用集成电路(ASIC)、现成可编程门阵列(FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The processor 201 may also be referred to as a CPU (Central Processing Unit, central processing unit). The processor 201 may be an integrated circuit chip with signal processing capability. The processor 201 may also be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware component . The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
参阅图16,图16为本申请实施例的存储介质的结构示意图。本申请实施例的存储介质存储有能够实现上述所有方法的程序文件301,其中,该程序文件301可以以软件产品的形式存储在上述存储介质中,所述计算机可读存储介质可以是非易失性,也可以是易失性,其包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本申请各个实施方式所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质,或者是计算机、服务器、手机、平板等终端设备。Refer to FIG. 16, which is a schematic structural diagram of a storage medium according to an embodiment of the application. The storage medium of this embodiment of the present application stores a program file 301 that can implement all the above methods, where the program file 301 may be stored in the above storage medium in the form of a software product, and the computer-readable storage medium may be a non-volatile , It can also be volatile, which includes several instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) or a processor to execute all or all of the methods described in the various embodiments of this application. Part of the steps. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disks or optical disks and other media that can store program codes. , Or terminal devices such as computers, servers, mobile phones, and tablets.
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、装置、物品或者方法不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、装置、物品或者方法所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、装置、物品或者方法中还存在另外的相同要素。It should be noted that in this article, the terms "include", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements not only includes those elements, It also includes other elements not explicitly listed, or elements inherent to the process, device, article, or method. If there are no more restrictions, the element defined by the sentence "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在如上所述的一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。The serial numbers of the foregoing embodiments of the present application are for description only, and do not represent the superiority or inferiority of the embodiments. Through the description of the above implementation manners, those skilled in the art can clearly understand that the above-mentioned embodiment method can be implemented by means of software plus the necessary general hardware platform, of course, it can also be implemented by hardware, but in many cases the former is better.的实施方式。 Based on this understanding, the technical solution of this application essentially or the part that contributes to the existing technology can be embodied in the form of a software product, and the computer software product is stored in a storage medium (such as ROM/RAM) as described above. , Magnetic disk, optical disk), including a number of instructions to make a terminal device (which can be a mobile phone, a computer, a server, or a network device, etc.) execute the method described in each embodiment of the present application.

Claims (20)

1、一种基于数据血缘的SQL语句解析方法,其中,所述SQL语句解析方法包括:1. A SQL statement parsing method based on data blood relationship, wherein the SQL statement parsing method includes:
步骤一、获取有向无环图,所述有向无环图通过梳理SQL语句得到;Step 1: Obtain a directed acyclic graph, which is obtained by combing SQL statements;
步骤二、遍历有向无环图具有SQL子查询的图层,获取出现替代符的图层,以遍历时最先出现替代符的图层为当前层;Step 2: Traverse the layers of the directed acyclic graph with SQL sub-query to obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer;
步骤三、根据当前层的前一图层的字段,获取当前层中替代符对应的字段;Step 3: Obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer in the current layer;
步骤四、继续获取剩下的出现替代符的图层,并进行替代符的替换,直到有向无环图中所有的替代符完成解析。Step 4: Continue to obtain the remaining layers where the substitute symbols appear, and perform the substitution of the substitute symbols, until all the substitute symbols in the directed acyclic graph are resolved.
2、如权利要求1所述的SQL语句解析方法,其中,所述有向无环图通过梳理SQL语句得到,包括:2. The SQL statement parsing method according to claim 1, wherein the directed acyclic graph is obtained by combing SQL statements, comprising:
从含有SQL代码的脚本文件中提取得到规则化的SQL语句,完成对SQL语句的清洗;Extract regularized SQL statements from script files containing SQL codes to complete the cleaning of SQL statements;
对规则化的SQL语句进行词法分析,生成抽象语法树,并根据抽象语法树生成有向无环图。Perform lexical analysis on regularized SQL statements to generate an abstract syntax tree, and generate a directed acyclic graph according to the abstract syntax tree.
3、如权利要求2所述的SQL语句解析方法,其中,所述从含有SQL代码的脚本文件中提取得到规则化的SQL语句,完成对SQL语句的清洗,包括:3. The SQL statement parsing method according to claim 2, wherein said extracting and obtaining regularized SQL statements from a script file containing SQL codes to complete the cleaning of SQL statements comprises:
获取含有SQL代码的脚本文件,并寻找SQL代码的标志位;Obtain the script file containing the SQL code, and look for the flag bit of the SQL code;
利用标志位过滤脚本文件中的无关内容,保留得到规则化的SQL代码语句。Use the flag bit to filter the irrelevant content in the script file, and retain the regularized SQL code statements.
4、如权利要求1所述的SQL语句解析方法,其中,所述遍历有向无环图具有SQL子查询的图层,包括:从上到下遍历有向无环图具有SQL子查询的图层或从下到上遍历有向无环图具有SQL子查询的图层。4. The SQL statement parsing method according to claim 1, wherein the traversing the directed acyclic graph with SQL sub-query layers includes: traversing the directed acyclic graph with SQL sub-queries from top to bottom Layers or layers with SQL subqueries traversed from bottom to top in a directed acyclic graph.
5、如权利要求1所述的SQL语句解析方法,其中,所述根据当前层的前一图层的字段,获取当前层中替代符对应的字段,包括:所述当前层的前一图层为字段层,所述字段层可根据子查询别名将对应字段映射至当前层的子查询中,并采用字段层的字段来替代当前层的替代符。5. The SQL statement parsing method according to claim 1, wherein said obtaining the field corresponding to the substitute character in the current layer according to the field of the previous layer of the current layer comprises: the previous layer of the current layer It is the field layer, and the field layer can map the corresponding field to the subquery of the current layer according to the subquery alias, and use the field of the field layer to replace the substitute character of the current layer.
6、如权利要求1所述的SQL语句解析方法,其中,所述根据当前层的前一图层的字段,获取当前层中替代符对应的字段,包括:若当前层为遍历的第一层,则继续寻找没有替代符的图层,以最先出现的没有替代符的图层为字段层,以字段层的前一层为替代层,以根据字段层推断出替代层中的替代符对应的字段。6. The SQL statement parsing method according to claim 1, wherein said obtaining the field corresponding to the substitute character in the current layer according to the field of the previous layer of the current layer comprises: if the current layer is the first layer to be traversed , Then continue to look for the layer without the substitution symbol, the first layer without the substitution symbol is the field layer, and the layer before the field layer is the substitution layer, so as to infer the substitution symbol corresponding to the substitution layer according to the field layer Field.
7、如权利要求1所述的SQL语句解析方法,其中,所述有向无环图中所有的替代符完成解析后,将所述解析结果上传至区块链中,以使得所述区块链对所述解析结果进行加密存储。7. The SQL statement parsing method of claim 1, wherein after all the substitution symbols in the directed acyclic graph are resolved, the parsing result is uploaded to the blockchain, so that the block The chain encrypts and stores the analysis result.
8、一种基于数据血缘的SQL语句解析系统,其中,所述SQL语句解析系统包括:8. A SQL statement parsing system based on data blood relationship, wherein the SQL statement parsing system includes:
数据集模块,用于获取有向无环图,所述有向无环图通过梳理SQL语句得到;The data set module is used to obtain a directed acyclic graph, which is obtained by combing SQL statements;
遍历模块,用于遍历有向无环图具有SQL子查询的图层,获取出现替代符的图层,以遍历时最先出现替代符的图层为当前层;The traversal module is used to traverse the layers of the directed acyclic graph with SQL sub-queries, and obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer;
替代模块,用于根据当前层的前一图层的字段,获取当前层中替代符对应的字段;所述替代模块继续获取所述遍历模块标记的剩下的出现替代符的图层,并进行替代符的替换,直到有向无环图中所有的替代符完成解析。The substitution module is used to obtain the field corresponding to the substitute symbol in the current layer according to the field of the previous layer of the current layer; The substitution of substitute characters until all substitute characters in the directed acyclic graph are resolved.
9、一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如下步骤:9. A computer device comprising a memory and a processor, and computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processor, the processor is caused to perform the following steps:
步骤一、获取有向无环图,所述有向无环图通过梳理SQL语句得到;Step 1: Obtain a directed acyclic graph, which is obtained by combing SQL statements;
步骤二、遍历有向无环图具有SQL子查询的图层,获取出现替代符的图层,以遍历时最先出现替代符的图层为当前层;Step 2: Traverse the layers of the directed acyclic graph with SQL sub-query to obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer;
步骤三、根据当前层的前一图层的字段,获取当前层中替代符对应的字段;Step 3: Obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer in the current layer;
步骤四、继续获取剩下的出现替代符的图层,并进行替代符的替换,直到有向无环图中所有的替代符完成解析。Step 4: Continue to obtain the remaining layers where the substitute symbols appear, and perform the substitution of the substitute symbols, until all the substitute symbols in the directed acyclic graph are resolved.
10、如权利要求9所述的计算机设备,其中,所述有向无环图通过梳理SQL语句得到,包括:10. The computer device according to claim 9, wherein the directed acyclic graph is obtained by combing SQL statements, comprising:
从含有SQL代码的脚本文件中提取得到规则化的SQL语句,完成对SQL语句的清洗;Extract regularized SQL statements from script files containing SQL codes to complete the cleaning of SQL statements;
对规则化的SQL语句进行词法分析,生成抽象语法树,并根据抽象语法树生成有向无环图。Perform lexical analysis on regularized SQL statements to generate an abstract syntax tree, and generate a directed acyclic graph according to the abstract syntax tree.
11、如权利要求10所述的计算机设备,其中,所述从含有SQL代码的脚本文件中提取得到规则化的SQL语句,完成对SQL语句的清洗,包括:11. The computer device according to claim 10, wherein said extracting and obtaining regularized SQL statements from a script file containing SQL codes to complete the cleaning of SQL statements comprises:
获取含有SQL代码的脚本文件,并寻找SQL代码的标志位;Obtain the script file containing the SQL code, and look for the flag bit of the SQL code;
利用标志位过滤脚本文件中的无关内容,保留得到规则化的SQL代码语句。Use the flag bit to filter the irrelevant content in the script file, and retain the regularized SQL code statements.
12、如权利要求9所述的计算机设备,其中,所述遍历有向无环图具有SQL子查询的图层,包括:从上到下遍历有向无环图具有SQL子查询的图层或从下到上遍历有向无环图具有SQL子查询的图层。12. The computer device of claim 9, wherein the traversal of the directed acyclic graph layer with SQL subquery comprises: traversing the directed acyclic graph layer with SQL subquery from top to bottom or Traverse the layers of the directed acyclic graph with SQL subqueries from bottom to top.
13、如权利要求9所述的计算机设备,其中,所述根据当前层的前一图层的字段,获取当前层中替代符对应的字段,包括:所述当前层的前一图层为字段层,所述字段层可根据子查询别名将对应字段映射至当前层的子查询中,并采用字段层的字段来替代当前层的替代符。13. The computer device according to claim 9, wherein the obtaining the field corresponding to the substitute character in the current layer according to the field of the previous layer of the current layer comprises: the previous layer of the current layer is a field The field layer can map the corresponding field to the subquery of the current layer according to the subquery alias, and use the field of the field layer to replace the substitute character of the current layer.
14、如权利要求9所述的计算机设备,其中,所述根据当前层的前一图层的字段,获取当前层中替代符对应的字段,包括:若当前层为遍历的第一层,则继续寻找没有替代符的图层,以最先出现的没有替代符的图层为字段层,以字段层的前一层为替代层,以根据字段层推断出替代层中的替代符对应的字段。14. The computer device according to claim 9, wherein the obtaining the field corresponding to the substitute character in the current layer according to the field of the previous layer of the current layer comprises: if the current layer is the first layer to be traversed, then Continue to look for the layer without the substitution symbol, the first layer without the substitution symbol is the field layer, and the layer before the field layer is the substitution layer, so as to infer the field corresponding to the substitution symbol in the substitution layer according to the field layer .
15、一种存储介质,其中,存储有能够实现如下步骤的程序文件,所述步骤包括:15. A storage medium, wherein a program file capable of implementing the following steps is stored, the steps including:
步骤一、获取有向无环图,所述有向无环图通过梳理SQL语句得到;Step 1: Obtain a directed acyclic graph, which is obtained by combing SQL statements;
步骤二、遍历有向无环图具有SQL子查询的图层,获取出现替代符的图层,以遍历时最先出现替代符的图层为当前层;Step 2: Traverse the layers of the directed acyclic graph with SQL sub-query to obtain the layer where the substitution symbol appears, and the layer where the substitution symbol first appears during the traversal is the current layer;
步骤三、根据当前层的前一图层的字段,获取当前层中替代符对应的字段;Step 3: Obtain the field corresponding to the substitute character in the current layer according to the field of the previous layer in the current layer;
步骤四、继续获取剩下的出现替代符的图层,并进行替代符的替换,直到有向无环图中所有的替代符完成解析。Step 4: Continue to obtain the remaining layers where the substitute symbols appear, and perform the substitution of the substitute symbols, until all the substitute symbols in the directed acyclic graph are resolved.
16、如权利要求15所述的存储介质,其中,所述有向无环图通过梳理SQL语句得到,包括:16. The storage medium of claim 15, wherein the directed acyclic graph is obtained by combing SQL statements, comprising:
从含有SQL代码的脚本文件中提取得到规则化的SQL语句,完成对SQL语句的清洗;Extract regularized SQL statements from script files containing SQL codes to complete the cleaning of SQL statements;
对规则化的SQL语句进行词法分析,生成抽象语法树,并根据抽象语法树生成有向无环图。Perform lexical analysis on regularized SQL statements to generate an abstract syntax tree, and generate a directed acyclic graph according to the abstract syntax tree.
17、如权利要求16所述的存储介质,其中,所述从含有SQL代码的脚本文件中提取得到规则化的SQL语句,完成对SQL语句的清洗,包括:17. The storage medium of claim 16, wherein said extracting and obtaining regularized SQL statements from a script file containing SQL codes to complete the cleaning of SQL statements comprises:
获取含有SQL代码的脚本文件,并寻找SQL代码的标志位;Obtain the script file containing the SQL code, and look for the flag bit of the SQL code;
利用标志位过滤脚本文件中的无关内容,保留得到规则化的SQL代码语句。Use the flag bit to filter the irrelevant content in the script file, and retain the regularized SQL code statements.
18、如权利要求15所述的存储介质,其中,所述遍历有向无环图具有SQL子查询的图层,包括:从上到下遍历有向无环图具有SQL子查询的图层或从下到上遍历有向无环图具有SQL子查询的图层。18. The storage medium of claim 15, wherein the traversal of the directed acyclic graph layer with SQL subquery comprises: traversing the directed acyclic graph layer with SQL subquery from top to bottom or Traverse the layers of the directed acyclic graph with SQL subqueries from bottom to top.
19、如权利要求15所述的存储介质,其中,所述根据当前层的前一图层的字段,获取当前层中替代符对应的字段,包括:所述当前层的前一图层为字段层,所述字段层可根据子查询别名将对应字段映射至当前层的子查询中,并采用字段层的字段来替代当前层的替代符。19. The storage medium of claim 15, wherein the obtaining the field corresponding to the substitute character in the current layer according to the field of the previous layer of the current layer comprises: the previous layer of the current layer is a field The field layer can map the corresponding field to the subquery of the current layer according to the subquery alias, and use the field of the field layer to replace the substitute character of the current layer.
20、如权利要求15所述的存储介质,其中,所述根据当前层的前一图层的字段,获取当前层中替代符对应的字段,包括:若当前层为遍历的第一层,则继续寻找没有替代符的图层,以最先出现的没有替代符的图层为字段层,以字段层的前一层为替代层,以根据字段层推断出替代层中的替代符对应的字段。20. The storage medium of claim 15, wherein the obtaining the field corresponding to the substitute character in the current layer according to the field of the previous layer of the current layer comprises: if the current layer is the first layer to be traversed, then Continue to look for the layer without the substitution symbol, the first layer without the substitution symbol is the field layer, and the layer before the field layer is the substitution layer, so as to infer the field corresponding to the substitution symbol in the substitution layer according to the field layer .
PCT/CN2020/135735 2020-10-21 2020-12-11 Sql statement parsing method and system, and computer device and storage medium WO2021179722A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011134553.8 2020-10-21
CN202011134553.8A CN112256721B (en) 2020-10-21 2020-10-21 SQL statement parsing method, system, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2021179722A1 true WO2021179722A1 (en) 2021-09-16

Family

ID=74264521

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/135735 WO2021179722A1 (en) 2020-10-21 2020-12-11 Sql statement parsing method and system, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112256721B (en)
WO (1) WO2021179722A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868253A (en) * 2021-09-28 2021-12-31 中通服创立信息科技有限责任公司 Data relationship capturing and big data relationship tree construction method
CN114911785A (en) * 2022-05-16 2022-08-16 北京航空航天大学 Data blood reason management method and device and electronic equipment
CN115237936A (en) * 2022-09-14 2022-10-25 北京海致星图科技有限公司 Method, device, storage medium and equipment for detecting fields in SQL (structured query language) statement
CN115544065A (en) * 2022-11-28 2022-12-30 北京数语科技有限公司 Data blood relationship discovery method, system, equipment and storage medium
CN115563150A (en) * 2022-12-02 2023-01-03 浙江大华技术股份有限公司 Method, equipment and storage medium for mapping Hive SQL (structured query language) and execution engine DAG (direct Access computer)
CN116541887B (en) * 2023-07-07 2023-09-15 云启智慧科技有限公司 Data security protection method for big data platform

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080599B (en) * 2022-07-25 2022-11-25 成都烽顺科技有限公司 Database query SQL field blood relationship generation method
CN116303370B (en) * 2023-05-17 2023-08-15 建信金融科技有限责任公司 Script blood margin analysis method, script blood margin analysis device, storage medium, script blood margin analysis equipment and script blood margin analysis product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186541A (en) * 2011-12-27 2013-07-03 阿里巴巴集团控股有限公司 Generation method and device for mapping relationship
US20190026358A1 (en) * 2016-03-28 2019-01-24 Alibaba Group Holding Limited Big data-based method and device for calculating relationship between development objects
CN109325078A (en) * 2018-09-18 2019-02-12 拉扎斯网络科技(上海)有限公司 Method and device is determined based on the data blood relationship of structured data
CN111125758A (en) * 2019-12-19 2020-05-08 北京安华金和科技有限公司 Dynamic desensitization method based on full syntax tree analysis
CN111538743A (en) * 2020-04-22 2020-08-14 电子科技大学 SQL-based data blood relationship analysis method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130091266A1 (en) * 2011-10-05 2013-04-11 Ajit Bhave System for organizing and fast searching of massive amounts of data
CN109033109B (en) * 2017-06-09 2020-11-27 杭州海康威视数字技术股份有限公司 Data processing method and system
CN109582660B (en) * 2018-12-06 2021-08-10 深圳前海微众银行股份有限公司 Data blood margin analysis method, device, equipment, system and readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186541A (en) * 2011-12-27 2013-07-03 阿里巴巴集团控股有限公司 Generation method and device for mapping relationship
US20190026358A1 (en) * 2016-03-28 2019-01-24 Alibaba Group Holding Limited Big data-based method and device for calculating relationship between development objects
CN109325078A (en) * 2018-09-18 2019-02-12 拉扎斯网络科技(上海)有限公司 Method and device is determined based on the data blood relationship of structured data
CN111125758A (en) * 2019-12-19 2020-05-08 北京安华金和科技有限公司 Dynamic desensitization method based on full syntax tree analysis
CN111538743A (en) * 2020-04-22 2020-08-14 电子科技大学 SQL-based data blood relationship analysis method and system

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868253A (en) * 2021-09-28 2021-12-31 中通服创立信息科技有限责任公司 Data relationship capturing and big data relationship tree construction method
CN113868253B (en) * 2021-09-28 2024-04-23 中通服创立信息科技有限责任公司 Data relationship capturing and big data relationship tree construction method
CN114911785A (en) * 2022-05-16 2022-08-16 北京航空航天大学 Data blood reason management method and device and electronic equipment
CN115237936A (en) * 2022-09-14 2022-10-25 北京海致星图科技有限公司 Method, device, storage medium and equipment for detecting fields in SQL (structured query language) statement
CN115237936B (en) * 2022-09-14 2024-04-05 北京海致星图科技有限公司 Method, device, storage medium and equipment for detecting fields in SQL (structured query language) statement
CN115544065A (en) * 2022-11-28 2022-12-30 北京数语科技有限公司 Data blood relationship discovery method, system, equipment and storage medium
CN115544065B (en) * 2022-11-28 2023-02-28 北京数语科技有限公司 Data blood relationship discovery method, system, equipment and storage medium
CN115563150A (en) * 2022-12-02 2023-01-03 浙江大华技术股份有限公司 Method, equipment and storage medium for mapping Hive SQL (structured query language) and execution engine DAG (direct Access computer)
CN116541887B (en) * 2023-07-07 2023-09-15 云启智慧科技有限公司 Data security protection method for big data platform

Also Published As

Publication number Publication date
CN112256721B (en) 2021-08-17
CN112256721A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
WO2021179722A1 (en) Sql statement parsing method and system, and computer device and storage medium
CN107402988B (en) Distributed NewSQL database system and semi-structured data query method
CN107038207B (en) Data query method, data processing method and device
Li Transforming relational database into HBase: A case study
US9965641B2 (en) Policy-based data-centric access control in a sorted, distributed key-value data store
Leavitt Will NoSQL databases live up to their promise?
US8924373B2 (en) Query plans with parameter markers in place of object identifiers
US20140365527A1 (en) Secure access to hierarchical documents in a sorted, distributed key/value data store
US11354284B2 (en) System and method for migration of a legacy datastore
US8938456B2 (en) Data recovery system and method in a resource description framework environment
US11030242B1 (en) Indexing and querying semi-structured documents using a key-value store
US8812489B2 (en) Swapping expected and candidate affinities in a query plan cache
US20140172905A1 (en) Performing a function on rows of data determined from transitive relationships between columns
Li et al. An integration approach of hybrid databases based on SQL in cloud computing environment
US20210357461A1 (en) Method, apparatus and storage medium for searching blockchain data
CN113051268A (en) Data query method, data query device, electronic equipment and storage medium
CN102708203A (en) Database dynamic management method based on XML metadata
CN111078205B (en) Modularized programming method and device, storage medium and electronic equipment
US11907251B2 (en) Method and system for implementing distributed lobs
CN116383238B (en) Data virtualization system, method, device, equipment and medium based on graph structure
US7711675B2 (en) Database simulation of data types
US20230153455A1 (en) Query-based database redaction
US9002810B1 (en) Method and system for managing versioned structured documents in a database
US11366810B2 (en) Index contention under high concurrency in a database system
Engle et al. The Case for NoSQL on a Single Desktop

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20923746

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20923746

Country of ref document: EP

Kind code of ref document: A1