CN113032366A - SQL syntax tree analysis method based on Flex and Bison - Google Patents
SQL syntax tree analysis method based on Flex and Bison Download PDFInfo
- Publication number
- CN113032366A CN113032366A CN202110312632.1A CN202110312632A CN113032366A CN 113032366 A CN113032366 A CN 113032366A CN 202110312632 A CN202110312632 A CN 202110312632A CN 113032366 A CN113032366 A CN 113032366A
- Authority
- CN
- China
- Prior art keywords
- sql
- statements
- syntax tree
- node
- lexical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/217—Database tuning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a SQL syntax tree analysis method based on Flex and Bison, which comprises the steps of firstly obtaining a database historical query file, collecting SQL sentences corresponding to a specified database instance, then carrying out lexical analysis on each SQL sentence, dividing the SQL sentences into corresponding token sequences according to a word formation rule, then carrying out syntactic analysis on the token sequences obtained by the lexical analysis processing, converting the token sequences into a syntax tree form according to the SQL syntax rule, and finally converting the syntax tree structure into a JSON format for persistent storage.
Description
Technical Field
The invention relates to the technical field of database index tuning and syntax tree analysis, in particular to a SQL syntax tree analysis method based on Flex and Bison.
Background
The database index optimization technology is a technology for constructing a group of reasonable candidate index sets by analyzing query workload (query workload) in a database within a period of time, so that query optimization effect and index maintenance cost are balanced. In the database index tuning technique, the parsing of the structured query language SQL is not left.
Syntax trees (syntax trees), or Abstract Syntax Trees (AST), are concepts in the compilation principle, and their roles are to represent abstract syntax structures in the form of trees, where each non-leaf node on a tree is an operator in the abstract syntax, and a leaf node represents a concrete object of the abstract syntax tree. The use of syntax trees allows machines to read the inherent meaning of the abstract syntax as well as the order of execution of the abstract syntax. The syntax tree has wide application, for example, source code parsing of programming language, analysis of expressions, even natural language processing and machine translation technologies need syntax tree technology. Structured Query Language (SQL) is abbreviated as SQL, and the SQL syntax tree parsing has important meaning for the analysis of SQL statements and the definition of Query execution paths of SQL statements.
Generating syntax trees typically requires two processes: lexical analysis and syntactic analysis. The task of lexical analysis is to read in a file from left to right character by character, i.e. to scan the character stream in the file and then to identify words according to the word formation rules. Parsing is a logical phase of the compilation process. The task of syntactic analysis is to combine sequences of words into various types of grammatical phrases on the basis of lexical analysis. It is common practice for professional compiler programmers to write lexical and syntactic parsers to parse SQL statements to generate syntax trees. However, the lexical analyzer and the parser source code are too complex to be developed quickly, and the maintenance work at the later stage is very tedious.
Flex and Bison are tools used to generate lexical analyzer and parser source programs that can process structured input. Originally used to generate compilers, Flex and Bison have also been applied to many other fields. The lexical analysis is to divide the input into meaningful word blocks called tokens (tokens). The patterns of any flex lexical analyzer use a regular expression language that uses meta-language to describe the patterns that are desired to be matched. Meta languages use standard text characters, one part representing themselves and the other part representing patterns. The parser determines how the tokens relate to each other. Bison generates a parser that can recognize valid statements in this grammar based on a given bacause-form BNF grammar.
In the current database index tuning algorithm, the analysis of the SQL statement usually counts the attributes appearing in the SQL statement, and then combines them to obtain a candidate index set. The method has the disadvantages that the structure of the SQL statement is not considered, and the obtained candidate index is not the optimal solution.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method for analyzing an SQL syntax tree based on Flex and Bison, which comprises the following steps:
step 1: acquiring a historical query file of a database, and collecting SQL statements corresponding to a specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements;
step 2: performing lexical analysis on each SQL statement, and dividing the SQL statement into corresponding marks according to word formation rules;
and step 3: carrying out syntactic analysis on the token sequence obtained by the lexical analysis processing, and converting the token sequence into a syntax tree form according to an SQL syntax rule;
and 4, step 4: converting the syntax tree structure into a JSON format;
and 5: and (4) repeatedly executing the step 2 to the step 4 until all SQL sentences are processed.
The step 1 comprises the following steps:
step 1.1: starting a historical query log function in a relational database, and collecting query workload information;
step 1.2: reading a historical query log file of the database, tracking the operation executed by a user through an ID (identity), and screening out a query log of a specified database instance;
step 1.3: and excluding the time information, the ID information and the command type information in the log to obtain SQL statements contained in the log of the specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements.
The step 2 comprises the following steps:
step 2.1: according to keywords, database naming rules and variable types in the SQL sentences, formulating regular expressions to match each word and convert the words into corresponding marks;
step 2.2: converting the regular expression and the conversion action of the corresponding mark into a C language form by using Flex software;
step 2.3: calling a yylex () function in the C language code file to read SQL sentences for word construction rule matching, and judging whether each SQL sentence has a lexical problem or not;
step 2.4: and (3) dividing the SQL sentences without the lexical problem into corresponding signs to obtain a sign sequence corresponding to each SQL sentence, and directly deleting the SQL sentences with the lexical problem.
The step 3 comprises the following steps:
step 3.1: designing a structure body type of the SQL syntax tree node, wherein the structure body type comprises the type of the node, a numerical value stored by the node, a brother node and a child node of the node;
step 3.2: designing a BNF grammar according to the SQL grammar;
step 3.3: and (3) converting the written BNF grammar into a C language form by using Bison software, reading the mark sequence generated in the step (2) and matching the BNF grammar to obtain a final SQL grammar tree.
The invention has the beneficial effects that:
the invention provides a SQL syntax tree analysis method based on Flex and Bison, which can be used for carrying out structured analysis on insert statements, delete statements, update statements and select statements in query workload in database index tuning. These statements are parsed into a syntax tree form and persisted using the JSON format. The candidate index set can be more accurately constructed, the execution plan of the SQL statement can be known through the syntax tree structure, and for any SQL statement, even the complex multi-table query or nested query statement can be converted into the SQL syntax tree through the BNF grammar and stored in the JSON format.
Drawings
FIG. 1 is a flow chart of a SQL syntax tree parsing method based on Flex and Bison in the invention.
FIG. 2 is a diagram of the SQL syntax tree structure of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples. In order to analyze the SQL statement and convert the SQL statement into a syntax tree structure so as to further analyze the SQL statement, the invention provides an SQL syntax tree analysis algorithm based on Flex and Bison. The invention aims to solve the technical problem that any SQL statement, even complex statements such as multi-table query or nested query, can be converted into an SQL syntax tree through a BNF grammar and stored in a JSON format.
As shown in fig. 1, a method for parsing SQL syntax tree based on Flex and Bison includes:
step 1: acquiring a database historical query file, and collecting SQL statements corresponding to a specified database instance, wherein the SQL statements comprise:
step 1.1: starting a historical query log function in a relational database, and collecting query workload information;
step 1.2: the collected query workload information may include a plurality of database instances, and a screening work of specific information is required in the information extraction process. Reading a historical query log file of the database, tracking the operation executed by a user through an ID (identity), and screening out a query log of a specified database instance;
step 1.3: and (4) according to the historical query log file structure of the relational database, performing data cleaning operation. And excluding the time information, ID information for identifying the identity and command type information in the log to obtain SQL statements contained in the log of the specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements.
Step 2: performing lexical analysis on each SQL statement, and dividing the SQL statement into corresponding marks according to word formation rules; the method comprises the following steps:
step 2.1: according to keywords, database naming rules and variable types in the SQL sentences, formulating regular expressions to match each word and convert the words into corresponding marks;
the pseudo code written to implement step 2.1 is:
step 2.2: and converting the regular expression and the conversion action of the corresponding mark into a C language form by using Flex software. Because the internal format of the C language form uses a deterministic finite automaton, compared with the regular expression matching, the speed of the C language form can be increased by hundreds of times;
step 2.3: calling a yylex () function in the C language code file to read SQL sentences for word construction rule matching, and judging whether each SQL sentence has a lexical problem or not;
step 2.4: dividing SQL sentences without lexical problems into corresponding signs to obtain a sign sequence corresponding to each SQL sentence, and directly deleting the SQL sentences with the lexical problems;
the SQL statement is converted into a token sequence after lexical analysis, and the sequence needs to be parsed and converted into a structured syntax tree form.
And step 3: carrying out syntactic analysis on the token sequence obtained by the lexical analysis processing, and converting the token sequence into a syntax tree form according to an SQL syntax rule; the method comprises the following steps:
step 3.1: designing a structure body type of the SQL syntax tree node, wherein the structure body type comprises the type of the node, a numerical value stored by the node, a brother node and a child node of the node;
step 3.2: designing a BNF grammar according to SQL grammar, for example, for a BNF grammar corresponding to a simple SELECT statement:
select_stmt∶=SELECT select_optss elect_expr_list FROM table opt_where
wherein, SELECT _ stmt represents a SELECT query statement, SELECT represents a SELECT keyword, SELECT _ ops represents a query limit condition, SELECT _ expr _ list represents an attribute list, FROM represents a FROM keyword, table represents a table name, and opt _ where represents a where clause, that is, a limit condition.
And then converting the clauses matched with the BNF grammar into node types, namely constructing an SQL syntax tree.
Step 3.3: and (2) converting the written BNF grammar into a C language form by using Bison software, reading the mark sequence generated in the step (2) to perform BNF grammar matching, and obtaining a final SQL grammar tree, wherein as shown in FIG. 2, the original SQL sentence is a select sentence, and aims to inquire the CUSTOMER number CID with the AGE being more than 20 and the gender SEX being male and the CUSTOMER name FNAME in a CUSTOMER table CUSTOMER, and sorting according to the CUSTOMER names after inquiry. After the parsing, a structure of a syntax tree is formed, and the SELECT clause, the FROM clause, the WHERE clause and the ORDER BY clause are structurally represented.
And 4, step 4: and the finally obtained SQL syntax tree form is embodied by the data type of the multi-branch tree, and in order to ensure that the syntax tree can be persisted for the use of subsequent recommended indexes, the syntax tree structure is converted into the JSON format for persisted storage.
And 5: and (4) repeatedly executing the step (2) to the step (4) until all SQL sentences are processed, and outputting the syntax tree persistent file.
Claims (4)
1. A SQL syntax tree parsing method based on Flex and Bison is characterized by comprising the following steps:
step 1: acquiring a historical query file of a database, and collecting SQL statements corresponding to a specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements;
step 2: performing lexical analysis on each SQL statement, and dividing the SQL statement into corresponding marks according to word formation rules;
and step 3: carrying out syntactic analysis on the token sequence obtained by the lexical analysis processing, and converting the token sequence into a syntax tree form according to an SQL syntax rule;
and 4, step 4: converting the syntax tree structure into a JSON format;
and 5: and (4) repeatedly executing the step 2 to the step 4 until all SQL sentences are processed.
2. The method according to claim 1, wherein the step 1 comprises:
step 1.1: starting a historical query log function in a relational database, and collecting query workload information;
step 1.2: reading a historical query log file of the database, tracking the operation executed by a user through an ID (identity), and screening out a query log of a specified database instance;
step 1.3: and excluding the time information, the ID information and the command type information in the log to obtain SQL statements contained in the log of the specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements.
3. The method according to claim 1, wherein the step 2 comprises:
step 2.1: according to keywords, database naming rules and variable types in the SQL sentences, formulating regular expressions to match each word and convert the words into corresponding marks;
step 2.2: converting the regular expression and the conversion action of the corresponding mark into a C language form by using Flex software;
step 2.3: calling a yylex () function in the C language code file to read SQL sentences for word construction rule matching, and judging whether each SQL sentence has a lexical problem or not;
step 2.4: and (3) dividing the SQL sentences without the lexical problem into corresponding signs to obtain a sign sequence corresponding to each SQL sentence, and directly deleting the SQL sentences with the lexical problem.
4. The method of claim 1, wherein the step 3 comprises:
step 3.1: designing a structure body type of the SQL syntax tree node, wherein the structure body type comprises the type of the node, a numerical value stored by the node, a brother node and a child node of the node;
step 3.2: designing a BNF grammar according to the SQL grammar;
step 3.3: and (3) converting the written BNF grammar into a C language form by using Bison software, reading the mark sequence generated in the step (2) and matching the BNF grammar to obtain a final SQL grammar tree.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110312632.1A CN113032366A (en) | 2021-03-24 | 2021-03-24 | SQL syntax tree analysis method based on Flex and Bison |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110312632.1A CN113032366A (en) | 2021-03-24 | 2021-03-24 | SQL syntax tree analysis method based on Flex and Bison |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113032366A true CN113032366A (en) | 2021-06-25 |
Family
ID=76473584
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110312632.1A Pending CN113032366A (en) | 2021-03-24 | 2021-03-24 | SQL syntax tree analysis method based on Flex and Bison |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113032366A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113792027A (en) * | 2021-08-23 | 2021-12-14 | 浙江金惠科技有限公司 | Universal database conversion device |
-
2021
- 2021-03-24 CN CN202110312632.1A patent/CN113032366A/en active Pending
Non-Patent Citations (1)
Title |
---|
邱涛等: "面向关系数据库的智能索引调优方法", 《软件学报》, vol. 31, no. 3, pages 1 - 4 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113792027A (en) * | 2021-08-23 | 2021-12-14 | 浙江金惠科技有限公司 | Universal database conversion device |
CN113792027B (en) * | 2021-08-23 | 2023-06-16 | 浙江金惠科技有限公司 | General database conversion device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107515887B (en) | Interactive query method suitable for various big data management systems | |
US9122540B2 (en) | Transformation of computer programs and eliminating errors | |
CN107203468B (en) | AST-based software version evolution comparative analysis method | |
CN110909016B (en) | Repeated association detection method, device, equipment and storage medium based on database | |
CN115576984A (en) | Method for generating SQL (structured query language) statement and cross-database query by Chinese natural language | |
CN109241080B (en) | Construction and use method and system of FQL query language | |
CN112860727B (en) | Data query method, device, equipment and medium based on big data query engine | |
Guo et al. | Benchmarking meaning representations in neural semantic parsing | |
CN116450616A (en) | General heterogeneous relational database SQL migration method based on parse tree | |
CN110909126A (en) | Information query method and device | |
Peterfreund | Grammars for document spanners | |
CN113779062A (en) | SQL statement generation method and device, storage medium and electronic equipment | |
CN111831624A (en) | Data table creating method and device, computer equipment and storage medium | |
Bai et al. | Enhanced natural language interface for web-based information retrieval | |
Rodriguez-Cardenas et al. | Benchmarking causal study to interpret large language models for source code | |
CN113032366A (en) | SQL syntax tree analysis method based on Flex and Bison | |
CN113297251A (en) | Multi-source data retrieval method, device, equipment and storage medium | |
CN113032371A (en) | Database grammar analysis method and device and computer equipment | |
CN110716953B (en) | SQL sentence automatic generation method, device, equipment and readable storage medium | |
CN110008448B (en) | Method and device for automatically converting SQL code into Java code | |
WO2023138078A1 (en) | Method and apparatus for parsing programming language, and non-volatile storage medium | |
CN117390130A (en) | Code searching method based on multi-mode representation | |
CN115935943A (en) | Analysis framework supporting natural language structure calculation | |
CN115292347A (en) | Active SQL algorithm performance checking device and method based on rules | |
CN113868312A (en) | Multi-method fused mechanism matching method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |