CN113032366A - SQL syntax tree analysis method based on Flex and Bison - Google Patents

SQL syntax tree analysis method based on Flex and Bison Download PDF

Info

Publication number
CN113032366A
CN113032366A CN202110312632.1A CN202110312632A CN113032366A CN 113032366 A CN113032366 A CN 113032366A CN 202110312632 A CN202110312632 A CN 202110312632A CN 113032366 A CN113032366 A CN 113032366A
Authority
CN
China
Prior art keywords
sql
statements
syntax tree
node
lexical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110312632.1A
Other languages
Chinese (zh)
Inventor
王斌
赵智博
曲彦秋
杨晓春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202110312632.1A priority Critical patent/CN113032366A/en
Publication of CN113032366A publication Critical patent/CN113032366A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/217Database tuning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a SQL syntax tree analysis method based on Flex and Bison, which comprises the steps of firstly obtaining a database historical query file, collecting SQL sentences corresponding to a specified database instance, then carrying out lexical analysis on each SQL sentence, dividing the SQL sentences into corresponding token sequences according to a word formation rule, then carrying out syntactic analysis on the token sequences obtained by the lexical analysis processing, converting the token sequences into a syntax tree form according to the SQL syntax rule, and finally converting the syntax tree structure into a JSON format for persistent storage.

Description

SQL syntax tree analysis method based on Flex and Bison
Technical Field
The invention relates to the technical field of database index tuning and syntax tree analysis, in particular to a SQL syntax tree analysis method based on Flex and Bison.
Background
The database index optimization technology is a technology for constructing a group of reasonable candidate index sets by analyzing query workload (query workload) in a database within a period of time, so that query optimization effect and index maintenance cost are balanced. In the database index tuning technique, the parsing of the structured query language SQL is not left.
Syntax trees (syntax trees), or Abstract Syntax Trees (AST), are concepts in the compilation principle, and their roles are to represent abstract syntax structures in the form of trees, where each non-leaf node on a tree is an operator in the abstract syntax, and a leaf node represents a concrete object of the abstract syntax tree. The use of syntax trees allows machines to read the inherent meaning of the abstract syntax as well as the order of execution of the abstract syntax. The syntax tree has wide application, for example, source code parsing of programming language, analysis of expressions, even natural language processing and machine translation technologies need syntax tree technology. Structured Query Language (SQL) is abbreviated as SQL, and the SQL syntax tree parsing has important meaning for the analysis of SQL statements and the definition of Query execution paths of SQL statements.
Generating syntax trees typically requires two processes: lexical analysis and syntactic analysis. The task of lexical analysis is to read in a file from left to right character by character, i.e. to scan the character stream in the file and then to identify words according to the word formation rules. Parsing is a logical phase of the compilation process. The task of syntactic analysis is to combine sequences of words into various types of grammatical phrases on the basis of lexical analysis. It is common practice for professional compiler programmers to write lexical and syntactic parsers to parse SQL statements to generate syntax trees. However, the lexical analyzer and the parser source code are too complex to be developed quickly, and the maintenance work at the later stage is very tedious.
Flex and Bison are tools used to generate lexical analyzer and parser source programs that can process structured input. Originally used to generate compilers, Flex and Bison have also been applied to many other fields. The lexical analysis is to divide the input into meaningful word blocks called tokens (tokens). The patterns of any flex lexical analyzer use a regular expression language that uses meta-language to describe the patterns that are desired to be matched. Meta languages use standard text characters, one part representing themselves and the other part representing patterns. The parser determines how the tokens relate to each other. Bison generates a parser that can recognize valid statements in this grammar based on a given bacause-form BNF grammar.
In the current database index tuning algorithm, the analysis of the SQL statement usually counts the attributes appearing in the SQL statement, and then combines them to obtain a candidate index set. The method has the disadvantages that the structure of the SQL statement is not considered, and the obtained candidate index is not the optimal solution.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method for analyzing an SQL syntax tree based on Flex and Bison, which comprises the following steps:
step 1: acquiring a historical query file of a database, and collecting SQL statements corresponding to a specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements;
step 2: performing lexical analysis on each SQL statement, and dividing the SQL statement into corresponding marks according to word formation rules;
and step 3: carrying out syntactic analysis on the token sequence obtained by the lexical analysis processing, and converting the token sequence into a syntax tree form according to an SQL syntax rule;
and 4, step 4: converting the syntax tree structure into a JSON format;
and 5: and (4) repeatedly executing the step 2 to the step 4 until all SQL sentences are processed.
The step 1 comprises the following steps:
step 1.1: starting a historical query log function in a relational database, and collecting query workload information;
step 1.2: reading a historical query log file of the database, tracking the operation executed by a user through an ID (identity), and screening out a query log of a specified database instance;
step 1.3: and excluding the time information, the ID information and the command type information in the log to obtain SQL statements contained in the log of the specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements.
The step 2 comprises the following steps:
step 2.1: according to keywords, database naming rules and variable types in the SQL sentences, formulating regular expressions to match each word and convert the words into corresponding marks;
step 2.2: converting the regular expression and the conversion action of the corresponding mark into a C language form by using Flex software;
step 2.3: calling a yylex () function in the C language code file to read SQL sentences for word construction rule matching, and judging whether each SQL sentence has a lexical problem or not;
step 2.4: and (3) dividing the SQL sentences without the lexical problem into corresponding signs to obtain a sign sequence corresponding to each SQL sentence, and directly deleting the SQL sentences with the lexical problem.
The step 3 comprises the following steps:
step 3.1: designing a structure body type of the SQL syntax tree node, wherein the structure body type comprises the type of the node, a numerical value stored by the node, a brother node and a child node of the node;
step 3.2: designing a BNF grammar according to the SQL grammar;
step 3.3: and (3) converting the written BNF grammar into a C language form by using Bison software, reading the mark sequence generated in the step (2) and matching the BNF grammar to obtain a final SQL grammar tree.
The invention has the beneficial effects that:
the invention provides a SQL syntax tree analysis method based on Flex and Bison, which can be used for carrying out structured analysis on insert statements, delete statements, update statements and select statements in query workload in database index tuning. These statements are parsed into a syntax tree form and persisted using the JSON format. The candidate index set can be more accurately constructed, the execution plan of the SQL statement can be known through the syntax tree structure, and for any SQL statement, even the complex multi-table query or nested query statement can be converted into the SQL syntax tree through the BNF grammar and stored in the JSON format.
Drawings
FIG. 1 is a flow chart of a SQL syntax tree parsing method based on Flex and Bison in the invention.
FIG. 2 is a diagram of the SQL syntax tree structure of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples. In order to analyze the SQL statement and convert the SQL statement into a syntax tree structure so as to further analyze the SQL statement, the invention provides an SQL syntax tree analysis algorithm based on Flex and Bison. The invention aims to solve the technical problem that any SQL statement, even complex statements such as multi-table query or nested query, can be converted into an SQL syntax tree through a BNF grammar and stored in a JSON format.
As shown in fig. 1, a method for parsing SQL syntax tree based on Flex and Bison includes:
step 1: acquiring a database historical query file, and collecting SQL statements corresponding to a specified database instance, wherein the SQL statements comprise:
step 1.1: starting a historical query log function in a relational database, and collecting query workload information;
step 1.2: the collected query workload information may include a plurality of database instances, and a screening work of specific information is required in the information extraction process. Reading a historical query log file of the database, tracking the operation executed by a user through an ID (identity), and screening out a query log of a specified database instance;
step 1.3: and (4) according to the historical query log file structure of the relational database, performing data cleaning operation. And excluding the time information, ID information for identifying the identity and command type information in the log to obtain SQL statements contained in the log of the specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements.
Step 2: performing lexical analysis on each SQL statement, and dividing the SQL statement into corresponding marks according to word formation rules; the method comprises the following steps:
step 2.1: according to keywords, database naming rules and variable types in the SQL sentences, formulating regular expressions to match each word and convert the words into corresponding marks;
the pseudo code written to implement step 2.1 is:
Figure BDA0002989955480000041
step 2.2: and converting the regular expression and the conversion action of the corresponding mark into a C language form by using Flex software. Because the internal format of the C language form uses a deterministic finite automaton, compared with the regular expression matching, the speed of the C language form can be increased by hundreds of times;
step 2.3: calling a yylex () function in the C language code file to read SQL sentences for word construction rule matching, and judging whether each SQL sentence has a lexical problem or not;
step 2.4: dividing SQL sentences without lexical problems into corresponding signs to obtain a sign sequence corresponding to each SQL sentence, and directly deleting the SQL sentences with the lexical problems;
the SQL statement is converted into a token sequence after lexical analysis, and the sequence needs to be parsed and converted into a structured syntax tree form.
And step 3: carrying out syntactic analysis on the token sequence obtained by the lexical analysis processing, and converting the token sequence into a syntax tree form according to an SQL syntax rule; the method comprises the following steps:
step 3.1: designing a structure body type of the SQL syntax tree node, wherein the structure body type comprises the type of the node, a numerical value stored by the node, a brother node and a child node of the node;
step 3.2: designing a BNF grammar according to SQL grammar, for example, for a BNF grammar corresponding to a simple SELECT statement:
select_stmt∶=SELECT select_optss elect_expr_list FROM table opt_where
wherein, SELECT _ stmt represents a SELECT query statement, SELECT represents a SELECT keyword, SELECT _ ops represents a query limit condition, SELECT _ expr _ list represents an attribute list, FROM represents a FROM keyword, table represents a table name, and opt _ where represents a where clause, that is, a limit condition.
And then converting the clauses matched with the BNF grammar into node types, namely constructing an SQL syntax tree.
Step 3.3: and (2) converting the written BNF grammar into a C language form by using Bison software, reading the mark sequence generated in the step (2) to perform BNF grammar matching, and obtaining a final SQL grammar tree, wherein as shown in FIG. 2, the original SQL sentence is a select sentence, and aims to inquire the CUSTOMER number CID with the AGE being more than 20 and the gender SEX being male and the CUSTOMER name FNAME in a CUSTOMER table CUSTOMER, and sorting according to the CUSTOMER names after inquiry. After the parsing, a structure of a syntax tree is formed, and the SELECT clause, the FROM clause, the WHERE clause and the ORDER BY clause are structurally represented.
And 4, step 4: and the finally obtained SQL syntax tree form is embodied by the data type of the multi-branch tree, and in order to ensure that the syntax tree can be persisted for the use of subsequent recommended indexes, the syntax tree structure is converted into the JSON format for persisted storage.
And 5: and (4) repeatedly executing the step (2) to the step (4) until all SQL sentences are processed, and outputting the syntax tree persistent file.

Claims (4)

1. A SQL syntax tree parsing method based on Flex and Bison is characterized by comprising the following steps:
step 1: acquiring a historical query file of a database, and collecting SQL statements corresponding to a specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements;
step 2: performing lexical analysis on each SQL statement, and dividing the SQL statement into corresponding marks according to word formation rules;
and step 3: carrying out syntactic analysis on the token sequence obtained by the lexical analysis processing, and converting the token sequence into a syntax tree form according to an SQL syntax rule;
and 4, step 4: converting the syntax tree structure into a JSON format;
and 5: and (4) repeatedly executing the step 2 to the step 4 until all SQL sentences are processed.
2. The method according to claim 1, wherein the step 1 comprises:
step 1.1: starting a historical query log function in a relational database, and collecting query workload information;
step 1.2: reading a historical query log file of the database, tracking the operation executed by a user through an ID (identity), and screening out a query log of a specified database instance;
step 1.3: and excluding the time information, the ID information and the command type information in the log to obtain SQL statements contained in the log of the specified database instance, wherein the SQL statements comprise insert statements, delete statements, update statements and select statements.
3. The method according to claim 1, wherein the step 2 comprises:
step 2.1: according to keywords, database naming rules and variable types in the SQL sentences, formulating regular expressions to match each word and convert the words into corresponding marks;
step 2.2: converting the regular expression and the conversion action of the corresponding mark into a C language form by using Flex software;
step 2.3: calling a yylex () function in the C language code file to read SQL sentences for word construction rule matching, and judging whether each SQL sentence has a lexical problem or not;
step 2.4: and (3) dividing the SQL sentences without the lexical problem into corresponding signs to obtain a sign sequence corresponding to each SQL sentence, and directly deleting the SQL sentences with the lexical problem.
4. The method of claim 1, wherein the step 3 comprises:
step 3.1: designing a structure body type of the SQL syntax tree node, wherein the structure body type comprises the type of the node, a numerical value stored by the node, a brother node and a child node of the node;
step 3.2: designing a BNF grammar according to the SQL grammar;
step 3.3: and (3) converting the written BNF grammar into a C language form by using Bison software, reading the mark sequence generated in the step (2) and matching the BNF grammar to obtain a final SQL grammar tree.
CN202110312632.1A 2021-03-24 2021-03-24 SQL syntax tree analysis method based on Flex and Bison Pending CN113032366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110312632.1A CN113032366A (en) 2021-03-24 2021-03-24 SQL syntax tree analysis method based on Flex and Bison

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110312632.1A CN113032366A (en) 2021-03-24 2021-03-24 SQL syntax tree analysis method based on Flex and Bison

Publications (1)

Publication Number Publication Date
CN113032366A true CN113032366A (en) 2021-06-25

Family

ID=76473584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110312632.1A Pending CN113032366A (en) 2021-03-24 2021-03-24 SQL syntax tree analysis method based on Flex and Bison

Country Status (1)

Country Link
CN (1) CN113032366A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792027A (en) * 2021-08-23 2021-12-14 浙江金惠科技有限公司 Universal database conversion device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邱涛等: "面向关系数据库的智能索引调优方法", 《软件学报》, vol. 31, no. 3, pages 1 - 4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792027A (en) * 2021-08-23 2021-12-14 浙江金惠科技有限公司 Universal database conversion device
CN113792027B (en) * 2021-08-23 2023-06-16 浙江金惠科技有限公司 General database conversion device

Similar Documents

Publication Publication Date Title
CN107515887B (en) Interactive query method suitable for various big data management systems
US9122540B2 (en) Transformation of computer programs and eliminating errors
CN107203468B (en) AST-based software version evolution comparative analysis method
CN110909016B (en) Repeated association detection method, device, equipment and storage medium based on database
CN115576984A (en) Method for generating SQL (structured query language) statement and cross-database query by Chinese natural language
CN109241080B (en) Construction and use method and system of FQL query language
CN112860727B (en) Data query method, device, equipment and medium based on big data query engine
Guo et al. Benchmarking meaning representations in neural semantic parsing
CN116450616A (en) General heterogeneous relational database SQL migration method based on parse tree
CN110909126A (en) Information query method and device
Peterfreund Grammars for document spanners
CN113779062A (en) SQL statement generation method and device, storage medium and electronic equipment
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
Bai et al. Enhanced natural language interface for web-based information retrieval
Rodriguez-Cardenas et al. Benchmarking causal study to interpret large language models for source code
CN113032366A (en) SQL syntax tree analysis method based on Flex and Bison
CN113297251A (en) Multi-source data retrieval method, device, equipment and storage medium
CN113032371A (en) Database grammar analysis method and device and computer equipment
CN110716953B (en) SQL sentence automatic generation method, device, equipment and readable storage medium
CN110008448B (en) Method and device for automatically converting SQL code into Java code
WO2023138078A1 (en) Method and apparatus for parsing programming language, and non-volatile storage medium
CN117390130A (en) Code searching method based on multi-mode representation
CN115935943A (en) Analysis framework supporting natural language structure calculation
CN115292347A (en) Active SQL algorithm performance checking device and method based on rules
CN113868312A (en) Multi-method fused mechanism matching method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination