CN113032366A

CN113032366A - SQL syntax tree analysis method based on Flex and Bison

Info

Publication number: CN113032366A
Application number: CN202110312632.1A
Authority: CN
Inventors: 王斌; 赵智博; 曲彦秋; 杨晓春
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2021-03-24
Filing date: 2021-03-24
Publication date: 2021-06-25

Abstract

The invention provides a SQL syntax tree analysis method based on Flex and Bison, which comprises the steps of firstly obtaining a database historical query file, collecting SQL sentences corresponding to a specified database instance, then carrying out lexical analysis on each SQL sentence, dividing the SQL sentences into corresponding token sequences according to a word formation rule, then carrying out syntactic analysis on the token sequences obtained by the lexical analysis processing, converting the token sequences into a syntax tree form according to the SQL syntax rule, and finally converting the syntax tree structure into a JSON format for persistent storage.

Description

SQL syntax tree analysis method based on Flex and Bison

Technical Field

The invention relates to the technical field of database index tuning and syntax tree analysis, in particular to a SQL syntax tree analysis method based on Flex and Bison.

Background

The database index optimization technology is a technology for constructing a group of reasonable candidate index sets by analyzing query workload (query workload) in a database within a period of time, so that query optimization effect and index maintenance cost are balanced. In the database index tuning technique, the parsing of the structured query language SQL is not left.

Syntax trees (syntax trees), or Abstract Syntax Trees (AST), are concepts in the compilation principle, and their roles are to represent abstract syntax structures in the form of trees, where each non-leaf node on a tree is an operator in the abstract syntax, and a leaf node represents a concrete object of the abstract syntax tree. The use of syntax trees allows machines to read the inherent meaning of the abstract syntax as well as the order of execution of the abstract syntax. The syntax tree has wide application, for example, source code parsing of programming language, analysis of expressions, even natural language processing and machine translation technologies need syntax tree technology. Structured Query Language (SQL) is abbreviated as SQL, and the SQL syntax tree parsing has important meaning for the analysis of SQL statements and the definition of Query execution paths of SQL statements.

Generating syntax trees typically requires two processes: lexical analysis and syntactic analysis. The task of lexical analysis is to read in a file from left to right character by character, i.e. to scan the character stream in the file and then to identify words according to the word formation rules. Parsing is a logical phase of the compilation process. The task of syntactic analysis is to combine sequences of words into various types of grammatical phrases on the basis of lexical analysis. It is common practice for professional compiler programmers to write lexical and syntactic parsers to parse SQL statements to generate syntax trees. However, the lexical analyzer and the parser source code are too complex to be developed quickly, and the maintenance work at the later stage is very tedious.

Flex and Bison are tools used to generate lexical analyzer and parser source programs that can process structured input. Originally used to generate compilers, Flex and Bison have also been applied to many other fields. The lexical analysis is to divide the input into meaningful word blocks called tokens (tokens). The patterns of any flex lexical analyzer use a regular expression language that uses meta-language to describe the patterns that are desired to be matched. Meta languages use standard text characters, one part representing themselves and the other part representing patterns. The parser determines how the tokens relate to each other. Bison generates a parser that can recognize valid statements in this grammar based on a given bacause-form BNF grammar.

In the current database index tuning algorithm, the analysis of the SQL statement usually counts the attributes appearing in the SQL statement, and then combines them to obtain a candidate index set. The method has the disadvantages that the structure of the SQL statement is not considered, and the obtained candidate index is not the optimal solution.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a method for analyzing an SQL syntax tree based on Flex and Bison, which comprises the following steps:

step 1: acquiring a historical query file of a database, and collecting SQL statements corresponding to a specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements;

step 2: performing lexical analysis on each SQL statement, and dividing the SQL statement into corresponding marks according to word formation rules;

and step 3: carrying out syntactic analysis on the token sequence obtained by the lexical analysis processing, and converting the token sequence into a syntax tree form according to an SQL syntax rule;

and 4, step 4: converting the syntax tree structure into a JSON format;

and 5: and (4) repeatedly executing the step 2 to the step 4 until all SQL sentences are processed.

The step 1 comprises the following steps:

step 1.1: starting a historical query log function in a relational database, and collecting query workload information;

step 1.2: reading a historical query log file of the database, tracking the operation executed by a user through an ID (identity), and screening out a query log of a specified database instance;

step 1.3: and excluding the time information, the ID information and the command type information in the log to obtain SQL statements contained in the log of the specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements.

The step 2 comprises the following steps:

step 2.1: according to keywords, database naming rules and variable types in the SQL sentences, formulating regular expressions to match each word and convert the words into corresponding marks;

step 2.2: converting the regular expression and the conversion action of the corresponding mark into a C language form by using Flex software;

step 2.3: calling a yylex () function in the C language code file to read SQL sentences for word construction rule matching, and judging whether each SQL sentence has a lexical problem or not;

step 2.4: and (3) dividing the SQL sentences without the lexical problem into corresponding signs to obtain a sign sequence corresponding to each SQL sentence, and directly deleting the SQL sentences with the lexical problem.

The step 3 comprises the following steps:

step 3.1: designing a structure body type of the SQL syntax tree node, wherein the structure body type comprises the type of the node, a numerical value stored by the node, a brother node and a child node of the node;

step 3.2: designing a BNF grammar according to the SQL grammar;

step 3.3: and (3) converting the written BNF grammar into a C language form by using Bison software, reading the mark sequence generated in the step (2) and matching the BNF grammar to obtain a final SQL grammar tree.

The invention has the beneficial effects that:

the invention provides a SQL syntax tree analysis method based on Flex and Bison, which can be used for carrying out structured analysis on insert statements, delete statements, update statements and select statements in query workload in database index tuning. These statements are parsed into a syntax tree form and persisted using the JSON format. The candidate index set can be more accurately constructed, the execution plan of the SQL statement can be known through the syntax tree structure, and for any SQL statement, even the complex multi-table query or nested query statement can be converted into the SQL syntax tree through the BNF grammar and stored in the JSON format.

Drawings

FIG. 1 is a flow chart of a SQL syntax tree parsing method based on Flex and Bison in the invention.

FIG. 2 is a diagram of the SQL syntax tree structure of the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific examples. In order to analyze the SQL statement and convert the SQL statement into a syntax tree structure so as to further analyze the SQL statement, the invention provides an SQL syntax tree analysis algorithm based on Flex and Bison. The invention aims to solve the technical problem that any SQL statement, even complex statements such as multi-table query or nested query, can be converted into an SQL syntax tree through a BNF grammar and stored in a JSON format.

As shown in fig. 1, a method for parsing SQL syntax tree based on Flex and Bison includes:

step 1: acquiring a database historical query file, and collecting SQL statements corresponding to a specified database instance, wherein the SQL statements comprise:

step 1.2: the collected query workload information may include a plurality of database instances, and a screening work of specific information is required in the information extraction process. Reading a historical query log file of the database, tracking the operation executed by a user through an ID (identity), and screening out a query log of a specified database instance;

step 1.3: and (4) according to the historical query log file structure of the relational database, performing data cleaning operation. And excluding the time information, ID information for identifying the identity and command type information in the log to obtain SQL statements contained in the log of the specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements.

Step 2: performing lexical analysis on each SQL statement, and dividing the SQL statement into corresponding marks according to word formation rules; the method comprises the following steps:

the pseudo code written to implement step 2.1 is:

step 2.2: and converting the regular expression and the conversion action of the corresponding mark into a C language form by using Flex software. Because the internal format of the C language form uses a deterministic finite automaton, compared with the regular expression matching, the speed of the C language form can be increased by hundreds of times;

step 2.4: dividing SQL sentences without lexical problems into corresponding signs to obtain a sign sequence corresponding to each SQL sentence, and directly deleting the SQL sentences with the lexical problems;

the SQL statement is converted into a token sequence after lexical analysis, and the sequence needs to be parsed and converted into a structured syntax tree form.

And step 3: carrying out syntactic analysis on the token sequence obtained by the lexical analysis processing, and converting the token sequence into a syntax tree form according to an SQL syntax rule; the method comprises the following steps:

step 3.2: designing a BNF grammar according to SQL grammar, for example, for a BNF grammar corresponding to a simple SELECT statement:

select_stmt∶＝SELECT select_optss elect_expr_list FROM table opt_where

wherein, SELECT _ stmt represents a SELECT query statement, SELECT represents a SELECT keyword, SELECT _ ops represents a query limit condition, SELECT _ expr _ list represents an attribute list, FROM represents a FROM keyword, table represents a table name, and opt _ where represents a where clause, that is, a limit condition.

And then converting the clauses matched with the BNF grammar into node types, namely constructing an SQL syntax tree.

Step 3.3: and (2) converting the written BNF grammar into a C language form by using Bison software, reading the mark sequence generated in the step (2) to perform BNF grammar matching, and obtaining a final SQL grammar tree, wherein as shown in FIG. 2, the original SQL sentence is a select sentence, and aims to inquire the CUSTOMER number CID with the AGE being more than 20 and the gender SEX being male and the CUSTOMER name FNAME in a CUSTOMER table CUSTOMER, and sorting according to the CUSTOMER names after inquiry. After the parsing, a structure of a syntax tree is formed, and the SELECT clause, the FROM clause, the WHERE clause and the ORDER BY clause are structurally represented.

And 4, step 4: and the finally obtained SQL syntax tree form is embodied by the data type of the multi-branch tree, and in order to ensure that the syntax tree can be persisted for the use of subsequent recommended indexes, the syntax tree structure is converted into the JSON format for persisted storage.

And 5: and (4) repeatedly executing the step (2) to the step (4) until all SQL sentences are processed, and outputting the syntax tree persistent file.

Claims

1. A SQL syntax tree parsing method based on Flex and Bison is characterized by comprising the following steps:

and 4, step 4: converting the syntax tree structure into a JSON format;

2. The method according to claim 1, wherein the step 1 comprises:

3. The method according to claim 1, wherein the step 2 comprises:

4. The method of claim 1, wherein the step 3 comprises:

step 3.2: designing a BNF grammar according to the SQL grammar;