CN115033597A - Method and system for deep learning to participate in SQL optimization of HTAP database - Google Patents
Method and system for deep learning to participate in SQL optimization of HTAP database
- Publication number
- CN115033597A (application number CN202210752471.2A)
- Authority
- CN
- China
- Prior art keywords
- execution
- sql
- execution plan
- deep learning
- plan
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
- G06F16/2445—Data retrieval commands; View definitions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a method and a system for deep learning to participate in SQL optimization of an HTAP database, which avoid manual optimization of the search engine, improve query speed, reduce memory occupation and improve machine performance. The technical scheme is as follows: the SQL parser receives the SQL execution statement and checks whether it conforms to the specification; the execution plan module generates an execution plan through a deep learning model based on the SQL statements that pass the SQL parser's check and conform to the specification; the engine executes the generated execution plan and sends the SQL execution result and the detailed information of the SQL execution process to a training set of the deep learning model; the deep learning model learns and trains based on the training set and then feeds the results of the learning and training back to the execution plan module in the form of an optimal optimization scheme; after receiving the feedback of the deep learning model, the execution plan module automatically optimizes the execution plan and applies it to the next SQL query.
Description
Technical Field
The invention relates to database technology, and in particular to a method and a system for deep learning to participate in SQL optimization of an HTAP database.
Background
Before the internet era, enterprise data volumes were generally small, and in particular core business data could be stored in a single standalone database. Storage at that time did not require a complex architecture, and all online transaction processing (OLTP) requests and online analytical processing (OLAP) workloads ran on the same database instance.
With the development of the internet, enterprise business data volumes have grown continuously, and the capacity limits of a standalone database restrict its use in massive-data scenarios. Therefore, in practical applications, in order to meet different requirements, OLTP and OLAP have been split technically, and in many enterprise architectures the two types of task processing are handled by different teams. As the Internet of Things is adopted more deeply, massive sensor data must be updated and queried in real time, the performance requirements on the database keep rising, and new problems appear.
There is usually a delay of several minutes or even hours between the OLAP and OLTP systems, and consistency between the OLAP database and the OLTP database cannot be guaranteed, so it is difficult to satisfy business scenarios with high requirements on the real-time nature of analysis. In addition, enterprises need to maintain different databases to support the two different types of tasks, which is costly to manage and maintain.
Therefore, the ability to unify the databases that support transaction processing and analytical workloads has become a requirement for many enterprises. In this context, HTAP (Hybrid Transactional/Analytical Processing), as proposed by Gartner, has become desirable. Based on an innovative compute-storage architecture, an HTAP database can support both business system operation and OLAP scenarios on one copy of the data, avoiding the large amount of data movement between online and offline databases in a traditional architecture. In addition, HTAP supports elastic scaling based on a distributed architecture, can expand throughput or storage on demand, and easily copes with high-concurrency and massive-data scenarios.
At present, only a few databases implement HTAP, mainly including PingCAP's TiDB, Alibaba Cloud's HybridDB for MySQL, Baidu's BaikalDB, and the like.
In big-data scenarios, besides HTAP and the need for cloud-native technology (to ensure dynamic scalability), the search engine faces ever higher requirements on optimized SQL execution. Not only is speed required, but machine resources such as memory and I/O must also remain within load limits. Especially for OLAP, the data to be pulled is very large, often at the TB level, and the memory is frequently insufficient.
In summary, in the existing schemes, under HTAP and large-data-volume scenarios, memory usage is too high, the search engine has to be tuned manually, the time complexity and space complexity of each query differ little, and automatic optimization cannot be performed.
Disclosure of Invention
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
The invention aims to solve the above problems and provides a method and a system for deep learning to participate in SQL optimization of an HTAP database.
The technical scheme of the invention is as follows: the invention discloses a method for deep learning to participate in SQL optimization of an HTAP database, which comprises the following steps:
step 1: the SQL parser receives the SQL execution statement and checks whether the SQL execution statement conforms to the specification;
step 2: the execution plan module generates an execution plan through a deep learning model based on the SQL statements that are checked by the SQL parser and conform to the specification;
step 3: the engine executes the execution plan generated in step 2 and sends the SQL execution result and the detailed information of the SQL execution process to a training set of the deep learning model;
step 4: the deep learning model learns and trains based on the training set, then the results of the model learning and training are fed back to the execution plan module in the form of an optimal optimization scheme, and after the execution plan module receives the feedback of the deep learning model, the execution plan is automatically optimized and applied to the next SQL query.
According to an embodiment of the method for deep learning to participate in SQL optimization of the HTAP database of the present invention, the execution plan generated in step 2 is divided into a row-based plan and a column-based plan, and the engine in step 3 is divided into a row-based execution engine to execute the row-based plan and a vectorization execution engine to execute the column-based plan.
According to an embodiment of the method for deep learning to participate in SQL optimization of the HTAP database, step 2, in which the execution plan module generates the execution plan based on the deep learning model, further includes:
the execution plan module utilizes an input layer, a hidden layer and an output layer of a recurrent neural network structure, the input layer inputs the SQL execution result and the detailed information of the SQL execution process, the hidden layer performs the cost calculation, and the output layer outputs the SQL statement and an optimal execution plan.
According to an embodiment of the method for deep learning to participate in the SQL optimization of the HTAP database, in the cost calculation of the hidden layer, cost estimation is carried out on the SQL execution result and the detailed information of the SQL execution process, the loss function parameters are continuously adjusted in the process of loop iteration to reduce the cost, and when the reduced cost is better than the expected value of the cost estimation, the optimization result of the loop iteration is constructed into the predefined execution-plan data structure and fed back to the execution plan module.
According to an embodiment of the method for deep learning to participate in the SQL optimization of the HTAP database of the present invention, step 4 further includes:
a protocol for interaction is preset between the execution plan module and the deep learning model, and a data structure of the execution plan is defined in advance;
the deep learning model learns and trains on the training-set data sent by the engine in step 3, quantifies and designs the points that can be fed back, and finally feeds the learning and training results of the model back to the execution plan module, so that, after receiving the feedback of the deep learning model, the execution plan module directly modifies the data structure of the corresponding execution plan in the abstract syntax tree in the memory for optimization, and ensures that the next query can apply the optimization feedback.
The invention also discloses a system for deep learning to participate in SQL optimization of an HTAP database, which comprises:
the SQL parser is used for receiving the SQL execution statement and checking whether the SQL execution statement conforms to the specification;
the execution plan module is used for generating an execution plan through a deep learning model based on the SQL statements that are checked by the SQL parser and conform to the specification;
the engine module is used for executing the generated execution plan and sending the SQL execution result and the detailed information of the SQL execution process to the training set of the deep learning model;
and the deep learning model is used for learning and training based on the training set and then feeding the learning and training results of the model back to the execution plan module in the form of an optimal optimization scheme, so that, after receiving the feedback of the deep learning model, the execution plan module automatically optimizes the execution plan and applies it to the next SQL query.
According to an embodiment of the system for deep learning to participate in SQL optimization of an HTAP database of the present invention, the execution plan generated by the execution plan module is divided into a row-based plan and a column-based plan, and the engine module is divided into a row-based execution engine to execute the row-based plan and a vectorization execution engine to execute the column-based plan.
According to an embodiment of the system for deep learning to participate in SQL optimization of an HTAP database according to the invention, the generation of the execution plan by the execution plan module based on the deep learning model further comprises:
the execution plan module utilizes an input layer, a hidden layer and an output layer of a recurrent neural network structure, the input layer inputs the SQL execution result and the detailed information of the SQL execution process, the hidden layer performs the cost calculation, and the output layer outputs the SQL statement and an optimal execution plan.
According to an embodiment of the system for deep learning to participate in SQL optimization of an HTAP database, in the cost calculation of the hidden layer, the execution plan module performs cost estimation on the SQL execution result and the detailed information of the SQL execution process, continuously adjusts the parameters of the loss function during loop iteration to reduce the cost, and, when the reduced cost is better than the expected value of the cost estimation, constructs the optimization result of the loop iteration into the predefined execution-plan data structure and feeds it back to the execution plan module.
According to an embodiment of the system for deep learning to participate in SQL optimization of an HTAP database according to the invention, the deep learning model is further configured such that:
a protocol for interaction is preset between the execution plan module and the deep learning model, and a data structure of the execution plan is defined in advance;
the deep learning model learns and trains on the training-set data sent by the engine module, quantifies and designs the points that can be fed back, and finally feeds the model learning and training results back to the execution plan module, so that, after receiving the feedback of the deep learning model, the execution plan module directly modifies the data structure of the corresponding execution plan in the abstract syntax tree in the memory for optimization, and ensures that the next query can apply the optimization feedback.
Compared with the prior art, the invention has the following beneficial effects: in the scheme of the invention, for the HTAP database, after each SQL execution the execution result and the execution process are sent to the deep learning model for training, an optimization scheme is predicted, and the prediction result is fed back to the execution plan module; the execution plan module automatically optimizes the generated plan according to this feedback, and the generated plan is applied to the next query. After the feedback takes effect, the whole SQL execution flow is more efficient and occupies less memory.
Drawings
The above features and advantages of the present disclosure will be better understood upon reading the detailed description of embodiments of the disclosure in conjunction with the following drawings. In the drawings, components are not necessarily drawn to scale, and components having similar relative characteristics or features may have the same or similar reference numerals.
FIG. 1 illustrates a flow diagram of an embodiment of the method for deep learning to participate in SQL optimization of an HTAP database according to the present invention.
FIG. 2 illustrates a schematic diagram of an embodiment of the system for deep learning to participate in SQL optimization of an HTAP database according to the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is noted that the aspects described below in connection with the figures and the specific embodiments are only exemplary and should not be construed as imposing any limitation on the scope of the present invention.
FIG. 1 illustrates the flow of an embodiment of the method for deep learning to participate in SQL optimization of an HTAP database according to the present invention. Referring to fig. 1, each implementation step of the method of this embodiment is described in detail below.
Step 1: the SQL parser (SQL Parser) receives the SQL execution statement and checks whether it conforms to the specification, which typically includes checking the SQL format, whether the syntax is correct, and so on.
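Purely for illustration, the sketch below shows one way such a specification check could be organized; the rules, the function name and the return convention are hypothetical assumptions of this sketch and are not the parser claimed by the invention.

```python
import re

# Hypothetical specification rules: accepted statement types and basic well-formedness checks.
ALLOWED_PREFIXES = ("SELECT", "INSERT", "UPDATE", "DELETE")

def check_sql_specification(sql: str) -> list[str]:
    """Return a list of violations; an empty list means the statement passes the check."""
    violations = []
    stripped = sql.strip().rstrip(";")
    if not stripped:
        return ["empty statement"]
    if not stripped.upper().startswith(ALLOWED_PREFIXES):
        violations.append("unsupported statement type")
    if stripped.count("(") != stripped.count(")"):
        violations.append("unbalanced parentheses")
    if re.search(r"--|/\*", stripped):
        violations.append("inline comments rejected by this toy checker")
    return violations

print(check_sql_specification("SELECT id, name FROM users WHERE age > 18"))  # -> []
```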
Step 2: the execution plan module generates an execution plan through a deep learning model based on the SQL statements that pass the SQL parser's check and conform to the specification.
As shown in fig. 1, the execution plan module first generates a logical layer plan and then generates a physical layer plan according to the logical layer plan. Since the HTAP database is a hybrid row-column database, the physical layer plan is divided into a row-based plan and a column-based plan.
The specific process by which the execution plan module generates the execution plan based on the deep learning model is as follows.
The execution plan module utilizes a recurrent neural network structure (comprising an input layer, a hidden layer and an output layer): the input layer takes as input the SQL execution result and the detailed information of the SQL execution process, the hidden layer performs the cost calculation, and the output layer outputs the SQL statement together with an optimal execution plan.
In the cost calculation of the hidden layer, cost estimation is carried out on the SQL execution result and the detailed information of the SQL execution process, the loss function parameters are continuously adjusted during loop iteration (the number of iterations is preset in advance) so as to reduce the cost, and when the reduced cost is better than the expected value of the cost estimation, the optimization result of the loop iteration is constructed into the predefined execution-plan data structure and fed back to the execution plan module.
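For illustration only, the following minimal sketch shows the shape such a recurrent cost model could take; the feature layout, dimensions and training loop are assumptions of this sketch, not the claimed design. A sequence of per-operator execution features enters the input layer, the hidden state carries the cost calculation, and the output head produces a cost estimate whose error drives the iterative adjustment of the loss-function parameters.

```python
import torch
import torch.nn as nn

class PlanCostRNN(nn.Module):
    """Hypothetical recurrent cost model: operator-feature sequence -> estimated execution cost."""
    def __init__(self, feature_dim: int = 8, hidden_dim: int = 32):
        super().__init__()
        self.rnn = nn.RNN(feature_dim, hidden_dim, batch_first=True)  # hidden layer
        self.head = nn.Linear(hidden_dim, 1)                          # output layer

    def forward(self, plan_features: torch.Tensor) -> torch.Tensor:
        # plan_features: (batch, num_operators, feature_dim), e.g. rows scanned, I/O, memory
        _, last_hidden = self.rnn(plan_features)
        return self.head(last_hidden.squeeze(0)).squeeze(-1)          # (batch,) cost estimates

model = PlanCostRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

features = torch.rand(16, 5, 8)   # 16 executed plans with 5 operators each (toy data)
observed_cost = torch.rand(16)    # measured cost reported by the execution engine

for _ in range(100):              # preset number of loop iterations
    optimizer.zero_grad()
    loss = loss_fn(model(features), observed_cost)
    loss.backward()               # adjust parameters so the predicted cost error shrinks
    optimizer.step()
```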
The cost estimation is divided into two stages: a cost estimation and analysis stage and a learning optimization stage.
The cost estimation and analysis stage covers a number of execution cases, such as full-table scan estimation, ordinary index estimation and multi-table join queries; the deep learning model performs an overall estimation over these different execution cases to obtain an expected value of the cost estimation, which comprises I/O cost, CPU calculation cost, memory occupation cost and so on.
In the subsequent learning optimization stage, the weight parameters of the deep learning model are continuously optimized with the recurrent neural network according to the model's training-set data and compared against the expected value from the cost estimation and analysis stage; if the learning optimization result is better than that expected value, the result is fed back to the execution plan module after a certain number of learning rounds, and iterative optimization then continues.
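As a rough illustration of how such an expected value can be assembled from individual execution cases, the sketch below combines I/O, CPU and memory components; all constants and formulas are invented placeholders rather than the cost model of the invention.

```python
# Hypothetical per-unit weights for the three cost components (placeholder values).
IO_COST_PER_PAGE = 1.0
CPU_COST_PER_ROW = 0.01
MEM_COST_PER_MB = 0.1

def full_table_scan_cost(pages: int, rows: int, mem_mb: float) -> float:
    return pages * IO_COST_PER_PAGE + rows * CPU_COST_PER_ROW + mem_mb * MEM_COST_PER_MB

def index_lookup_cost(index_depth: int, matching_rows: int, mem_mb: float) -> float:
    return (index_depth + matching_rows) * IO_COST_PER_PAGE \
        + matching_rows * CPU_COST_PER_ROW + mem_mb * MEM_COST_PER_MB

# Expected value for a candidate plan: full scan of the outer table,
# plus one index probe into the inner table per outer row (multi-table join case).
expected_cost = full_table_scan_cost(pages=10_000, rows=500_000, mem_mb=256) \
    + 500_000 * index_lookup_cost(index_depth=3, matching_rows=2, mem_mb=1)
print(f"expected cost: {expected_cost:.1f}")
```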
The learning optimization stage is decoupled from SQL execution: each SQL execution result and the detailed information of the SQL execution process are sent to the training set of the deep learning model asynchronously, and the recurrent neural network feeds back to the execution plan module after a certain number of optimization rounds.
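A minimal sketch of this decoupling, assuming a simple in-process queue and a background thread; the transport and the names are illustrative choices of this sketch, not something the invention prescribes.

```python
import queue
import threading

training_queue: "queue.Queue[dict]" = queue.Queue()

def on_sql_executed(sql: str, result_rows: int, exec_ms: float, peak_mem_mb: float) -> None:
    """Called by the execution engine after each statement; never blocks the SQL path."""
    training_queue.put({"sql": sql, "rows": result_rows,
                        "exec_ms": exec_ms, "peak_mem_mb": peak_mem_mb})

def training_worker() -> None:
    """Background consumer that feeds the deep learning model's training set."""
    training_set = []
    while True:
        training_set.append(training_queue.get())
        if len(training_set) % 100 == 0:
            pass  # after a certain number of samples: retrain and send feedback (omitted)

threading.Thread(target=training_worker, daemon=True).start()
on_sql_executed("SELECT * FROM orders WHERE id = 42",
                result_rows=1, exec_ms=3.2, peak_mem_mb=0.5)
```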
The generated execution plan is generally divided into the following cases:
1. Trivial plan matching: sometimes there may be only one reasonable way to execute a query. For example, a heap table can only obtain data through a table scan. To avoid wasting time optimizing such queries, SQL Server maintains a list of trivial plans to select from; if the optimization stage finds a plan in that list matching the query, a similar plan is generated without any further optimization.
2. Multi-stage optimization: for complex queries, the number of alternative processing strategies that need to be analyzed may be large, and evaluating every choice may take a long time. The optimization stage therefore does not analyze all possible processing strategies, but divides them into several configurations, each containing different index and join techniques.
The index variants take into account different index characteristics: single-column index, compound index, index column order, index density and so on. Similarly, the join variants consider the different join techniques available in the engine: nested loop join, merge join and hash match.
Learning optimization considers the statistics of the columns referenced in the WHERE clause to evaluate the effectiveness of the index and join policies, and evaluates the overhead of each configuration over multiple optimization stages, taking into account factors such as the CPU, memory usage and disk I/O needed to execute the query. After each optimization stage, the overhead of the processing strategy is evaluated; if the overhead is cheap enough, cycling through further optimization stages stops and the optimization process exits; otherwise, cycling through the optimization stages continues in order to determine a cost-effective processing strategy, as sketched below.
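Illustratively, the stop-when-cheap-enough loop over optimization stages can be pictured as follows; the threshold, the `estimate_overhead` callback and the configuration dictionaries are hypothetical stand-ins.

```python
from typing import Callable

def multi_stage_optimize(stages: list[list[dict]],
                         estimate_overhead: Callable[[dict], float],
                         cheap_enough: float = 100.0) -> dict:
    """Each stage contributes a batch of candidate configurations; stop once a plan is cheap enough."""
    best, best_cost, stages_used = None, float("inf"), 0
    for configs in stages:
        stages_used += 1
        for config in configs:
            cost = estimate_overhead(config)   # CPU + memory + disk I/O estimate
            if cost < best_cost:
                best, best_cost = config, cost
        if best_cost <= cheap_enough:          # economical enough: exit the optimization loop
            break
    return {"plan": best, "estimated_cost": best_cost, "stages_used": stages_used}

stages = [
    [{"index": "single-column", "join": "nested-loop"}],
    [{"index": "compound", "join": "hash-match"}, {"index": "compound", "join": "merge"}],
]
result = multi_stage_optimize(stages, lambda c: 80.0 if c["join"] == "hash-match" else 250.0)
print(result)
```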
Step 3: the engine executes the execution plan generated in step 2. The engine comprises a row-based execution engine (used for executing the row-based plan) and a vectorized execution engine (used for executing the column-based plan), and the SQL execution result and the detailed information of the SQL execution process are sent to the training set of the deep learning model.
During engine execution, the data generated while executing the SQL (data pages, index pages, redo log, etc.) is put into a Row Buffer Pool and a Column Index Buffer Pool in the storage engine. After each SQL execution finishes, the SQL execution result and the detailed information of the SQL execution process are sent to the training set of the deep learning model.
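A toy sketch of such buffer pools follows, assuming a plain least-recently-used eviction policy; the class name, capacities and page identifiers are illustrative only, and real pools additionally handle dirty pages, latching and flushing.

```python
from collections import OrderedDict
from typing import Optional

class BufferPool:
    """Minimal LRU buffer pool sketch."""
    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.pages: OrderedDict[str, bytes] = OrderedDict()

    def get(self, page_id: str) -> Optional[bytes]:
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # mark as most recently used
            return self.pages[page_id]
        return None                               # miss: caller reads the page from disk

    def put(self, page_id: str, data: bytes) -> None:
        self.pages[page_id] = data
        self.pages.move_to_end(page_id)
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)        # evict the least recently used page

row_buffer_pool = BufferPool(capacity_pages=1024)          # serves the row-based engine
column_index_buffer_pool = BufferPool(capacity_pages=512)  # serves the vectorized engine
row_buffer_pool.put("t_orders:page:17", b"...row data...")
```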
The row-based execution engine mainly processes execution plans related to transactions; according to the execution plan, it selects the operators of different modules (such as operators for full table scan, nested loop join and the like) and determines which tables need to be operated on and which APIs need to be called during execution.
Full table scan in the row-based execution engine: the table specified by the execution plan is scanned, the expression framework generates the corresponding instructions, and the corresponding processing API is called until the required data is found.
Nested loop join in the row-based execution engine: a nested loop joins an outer data set to an inner data set; for each row of the outer data set, the database matches all rows in the inner data set that satisfy the predicate conditions. If an index is available on the inner data set or inner table, the database uses it to locate the rowid and retrieve the data.
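The outer/inner loop and the optional index lookup can be sketched as below; the toy tables, the join key and the dictionary "index" are made-up illustrations rather than the engine's actual operators.

```python
# Toy tables: each row is a dict; the "index" maps a join key to a row position.
orders = [{"order_id": 1, "user_id": 10}, {"order_id": 2, "user_id": 11}]
users = [{"user_id": 10, "name": "alice"}, {"user_id": 11, "name": "bob"}]
users_index = {row["user_id"]: i for i, row in enumerate(users)}  # hypothetical index on user_id

def nested_loop_join(outer, inner, inner_index=None, key="user_id"):
    """For each outer row, find matching inner rows; use the index to locate them when present."""
    for outer_row in outer:
        if inner_index is not None:                        # index lookup path (locate by position)
            pos = inner_index.get(outer_row[key])
            matches = [inner[pos]] if pos is not None else []
        else:                                              # otherwise scan the whole inner set
            matches = [r for r in inner if r[key] == outer_row[key]]
        for inner_row in matches:
            yield {**outer_row, **inner_row}

print(list(nested_loop_join(orders, users, users_index)))
```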
The vectorized execution engine mainly processes execution plans related to data analysis; most such requests concern query and analysis, and the data storage is multidimensional, which makes data analysis very convenient. The conventional transactional model is a two-dimensional data model, which is inconvenient to operate on and analyze when many table sets must be processed together. In addition, dedicated SIMD (single instruction, multiple data) instructions are generated through the expression framework and the corresponding column index access API is called, which improves overall analysis efficiency.
Full table scan in the vectorized execution engine: because a columnar store is used, even a full-table scan can extract only the required column data for analysis, which greatly reduces the data volume of the scan.
Nested loop join in the vectorized execution engine: according to the execution plan, the characteristics of the columnar store are fully utilized; for each row of the outer data set, only the required columns are joined, and the corresponding column data are then reassembled.
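A minimal columnar sketch with NumPy, illustrating why only the needed columns are touched; the column names and data volumes are invented for the example.

```python
import numpy as np

# Columnar layout: each column is its own contiguous array.
table = {
    "order_id": np.arange(1, 1_000_001),
    "user_id":  np.random.randint(0, 50_000, size=1_000_000),
    "amount":   np.random.rand(1_000_000) * 100,
}

# A "full table scan" in a columnar engine only reads the columns the query needs;
# the comparison below is one vectorized (SIMD-friendly) pass over a single column.
mask = table["amount"] > 90.0
result_ids = table["order_id"][mask]     # reassemble only the projected column
print(len(result_ids), "rows selected without touching user_id at all")
```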
Hash join in the vectorized execution engine is mainly divided into two phases: a build phase and a probe phase. Build phase: one table is selected (usually the smaller one, to reduce the time and space needed to build the hash table), and a hash function is applied to the join attribute of each tuple to obtain a hash value and build a hash table. Probe phase: the other table is scanned row by row, the hash value of its join attribute is computed and compared against the hash table built in the build phase; if the values fall into the same bucket and the join predicate is satisfied, the rows are joined into a new table. When memory is large enough, the whole table stays in memory and is written to disk after the join operation finishes; this process can also incur many I/O operations.
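The two phases can be sketched as follows with toy in-memory tables; the data and names are illustrative assumptions, and spilling to disk is omitted.

```python
from collections import defaultdict

small_table = [{"user_id": 10, "name": "alice"}, {"user_id": 11, "name": "bob"}]
big_table = [{"order_id": 1, "user_id": 10}, {"order_id": 2, "user_id": 11},
             {"order_id": 3, "user_id": 99}]

# Build phase: hash the smaller table on the join attribute.
hash_table = defaultdict(list)
for row in small_table:
    hash_table[hash(row["user_id"])].append(row)

# Probe phase: scan the larger table, probe the hash table, emit joined rows.
joined = []
for row in big_table:
    for candidate in hash_table.get(hash(row["user_id"]), []):
        if candidate["user_id"] == row["user_id"]:    # re-check the join predicate
            joined.append({**row, **candidate})

print(joined)  # order 3 has no matching user and is dropped
```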
Aggregate grouping in the vectorized execution engine: aggregate grouping (GROUP BY) is common, and GROUP BY combined with ALL, CUBE and ROLLUP forms the common OLAP (online analysis) functions, which are very convenient to use. This series of OLAP functions can implement many capabilities in OLAP scenarios such as database reports, statistical analysis and warehouse processing, and plays an even greater role when combined with window functions.
Step 4: the deep learning model learns and trains on the training set, then the training results (the optimal optimization scheme) are fed back to the execution plan module; after receiving the feedback of the deep learning model, the execution plan module automatically optimizes the execution plan and applies the optimized plan to the next SQL query.
A protocol for interaction is preset between the execution plan module and the deep learning model, and the data structures of the execution plan are defined in advance, including the abstract syntax tree, the access type (type), the table join matching condition (ref) and the like.
The deep learning model learns and trains on the training-set data sent by the engine in step 3, quantifies and designs the points that can be fed back (such as the height estimate of the abstract syntax tree, the access type and so on), and finally feeds the training results back to the execution plan module. After receiving the feedback of the deep learning model, the execution plan module directly modifies the data structure of the corresponding execution plan in the abstract syntax tree in memory for optimization, ensures that the next query can apply the optimization feedback, and records the optimization information in a timely manner for developers to analyze.
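As a purely illustrative sketch of such a pre-agreed data structure and of applying feedback to the in-memory plan tree, the fields `access_type` and `ref` follow the description above, while every other name and value is a hypothetical stand-in.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PlanNode:
    """Node of the in-memory abstract syntax tree for an execution plan."""
    operator: str                       # e.g. "full_scan", "index_lookup", "nested_loop_join"
    table: Optional[str] = None
    access_type: str = "ALL"            # access type, e.g. ALL / ref / range
    ref: Optional[str] = None           # table join matching condition
    children: list["PlanNode"] = field(default_factory=list)

@dataclass
class Feedback:
    """Pre-agreed feedback message from the deep learning model."""
    table: str
    suggested_access_type: str
    suggested_ref: Optional[str] = None

def apply_feedback(node: PlanNode, fb: Feedback) -> None:
    """Directly modify the matching plan node in memory so the next query uses the new plan."""
    if node.table == fb.table:
        node.access_type = fb.suggested_access_type
        node.ref = fb.suggested_ref or node.ref
    for child in node.children:
        apply_feedback(child, fb)

plan = PlanNode("nested_loop_join", children=[
    PlanNode("full_scan", table="orders"),
    PlanNode("full_scan", table="users"),
])
apply_feedback(plan, Feedback(table="users", suggested_access_type="ref",
                              suggested_ref="users.user_id"))
```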
FIG. 2 illustrates the principle of an embodiment of the system for deep learning to participate in SQL optimization of an HTAP database according to the present invention. Referring to fig. 2, the system of this embodiment comprises: an SQL parser, an execution plan module, an engine module and a deep learning model.
The SQL parser is used to receive the SQL execution statement and check whether it conforms to the specification, which generally includes checking the SQL format and whether the syntax is correct.
The execution plan module is used for generating an execution plan through a deep learning model based on the SQL statements that pass the SQL parser's check and conform to the specification.
The execution plan module first generates a logical layer plan and then generates a physical layer plan according to the logical layer plan. Since the HTAP database is a hybrid row-column database, the physical layer plan is divided into a row-based plan and a column-based plan.
The specific process by which the execution plan module generates the execution plan based on the deep learning model is as follows.
The execution plan module utilizes a recurrent neural network structure (comprising an input layer, a hidden layer and an output layer): the input layer takes as input the SQL execution result and the detailed information of the SQL execution process, the hidden layer performs the cost calculation, and the output layer outputs the SQL statement together with an optimal execution plan.
In the cost calculation of the hidden layer, cost estimation is carried out on the SQL execution result and the detailed information of the SQL execution process, the loss function parameters are continuously adjusted during loop iteration (the number of iterations is preset in advance) so as to reduce the cost, and when the reduced cost is better than the expected value of the cost estimation, the optimization result of the loop iteration is constructed into the predefined execution-plan data structure and fed back to the execution plan module.
The cost estimation is divided into two stages: a cost estimation and analysis stage and a learning optimization stage.
The cost estimation and analysis stage covers a number of execution cases, such as full-table scan estimation, ordinary index estimation and multi-table join queries; the deep learning model performs an overall estimation over these different execution cases to obtain an expected value of the cost estimation, which comprises I/O cost, CPU calculation cost, memory occupation cost and so on.
In the subsequent learning optimization stage, the weight parameters of the deep learning model are continuously optimized with the recurrent neural network according to the model's training-set data and compared against the expected value from the cost estimation and analysis stage; if the learning optimization result is better than that expected value, the result is fed back to the execution plan module after a certain number of learning rounds, and iterative optimization then continues.
The learning optimization stage is decoupled from SQL execution: each SQL execution result and the detailed information of the SQL execution process are sent to the training set of the deep learning model asynchronously, and the recurrent neural network feeds back to the execution plan module after a certain number of optimization rounds.
The generated execution plan is generally divided into the following cases:
1. Trivial plan matching: sometimes there may be only one reasonable way to execute a query. For example, a heap table can only acquire data through a table scan. To avoid wasting time optimizing such queries, SQL Server maintains a list of trivial plans to select from; if the optimization stage finds a plan in that list matching the query, a similar plan is generated without any further optimization.
2. Multi-stage optimization: for complex queries, the number of alternative processing strategies that need to be analyzed may be large, and evaluating every choice may take a long time. The optimization stage therefore does not analyze all possible processing strategies, but divides them into several configurations, each containing different index and join techniques.
The index variants take into account different index characteristics: single-column index, compound index, index column order, index density and so on. Similarly, the join variants consider the different join techniques available in the engine: nested loop join, merge join and hash match.
Learning optimization considers the statistics of the columns referenced in the WHERE clause to evaluate the effectiveness of the index and join policies, and evaluates the overhead of each configuration over multiple optimization stages, taking into account factors such as the CPU, memory usage and disk I/O needed to execute the query. After each optimization stage, the overhead of the processing strategy is evaluated; if the overhead is cheap enough, cycling through further optimization stages stops and the optimization process exits; otherwise, cycling through the optimization stages continues in order to determine a cost-effective processing strategy.
The engine module is used for executing the generated execution plan and sending the SQL execution result and the detailed information of the SQL execution process to the training set of the deep learning model.
The engine module comprises a row-based execution engine (used for executing the row-based plan) and a vectorized execution engine (used for executing the column-based plan), and each SQL execution result and the detailed information of the SQL execution process are sent to the training set of the deep learning model.
During engine execution, the data generated while executing the SQL (data pages, index pages, redo log, etc.) is put into a Row Buffer Pool and a Column Index Buffer Pool in the storage engine. After each SQL execution finishes, the SQL execution result and the detailed information of the SQL execution process are sent to the training set of the deep learning model.
The row-based execution engine mainly processes execution plans related to transactions; according to the execution plan, it selects the operators of different modules (such as operators for full table scan, nested loop join and the like) and determines which tables need to be operated on and which APIs need to be called during execution.
Full table scan in the row-based execution engine: the table specified by the execution plan is scanned, the expression framework generates the corresponding instructions, and the corresponding processing API is called until the required data is found.
Nested loop join in the row-based execution engine: a nested loop joins an outer data set to an inner data set; for each row of the outer data set, the database matches all rows in the inner data set that satisfy the predicate conditions. If an index is available on the inner data set or inner table, the database uses it to locate the rowid and retrieve the data.
The vectorized execution engine mainly processes execution plans related to data analysis; most such requests concern query and analysis, and the data storage is multidimensional, which makes data analysis very convenient. The conventional transactional model is a two-dimensional data model, which is inconvenient to operate on and analyze when many table sets must be processed together. In addition, dedicated SIMD (single instruction, multiple data) instructions are generated through the expression framework and the corresponding column index access API is called, which improves overall analysis efficiency.
Full table scan in the vectorized execution engine: because a columnar store is used, even a full-table scan can extract only the required column data for analysis, which greatly reduces the data volume of the scan.
Nested loop join in the vectorized execution engine: according to the execution plan, the characteristics of the columnar store are fully utilized; for each row of the outer data set, only the required columns are joined, and the corresponding column data are then reassembled.
Hash join in the vectorized execution engine is mainly divided into two phases: a build phase and a probe phase. Build phase: one table is selected (usually the smaller one, to reduce the time and space needed to build the hash table), and a hash function is applied to the join attribute of each tuple to obtain a hash value and build a hash table. Probe phase: the other table is scanned row by row, the hash value of its join attribute is computed and compared against the hash table built in the build phase; if the values fall into the same bucket and the join predicate is satisfied, the rows are joined into a new table. When memory is large enough, the whole table stays in memory and is written to disk after the join operation finishes; this process can also incur many I/O operations.
Aggregate grouping in the vectorized execution engine: aggregate grouping (GROUP BY) is common, and GROUP BY combined with ALL, CUBE and ROLLUP forms the common OLAP (online analysis) functions, which are very convenient to use. This series of OLAP functions can implement many capabilities in OLAP scenarios such as database reports, statistical analysis and warehouse processing, and plays an even greater role when combined with window functions.
The deep learning model is used for learning and training based on the training set and then feeding the training results back to the execution plan module in the form of an optimal optimization scheme, so that, after receiving the feedback of the deep learning model, the execution plan module automatically optimizes the execution plan and applies it to the next SQL query.
A protocol for interaction is preset between the execution plan module and the deep learning model, and the data structures of the execution plan are defined in advance, including the abstract syntax tree, the access type (type), the table join matching condition (ref) and the like.
The deep learning model learns and trains on the training-set data sent by the engine module, quantifies and designs the points that can be fed back (such as the height estimate of the abstract syntax tree, the access type and so on), and finally feeds the training results back to the execution plan module. After receiving the feedback of the deep learning model, the execution plan module directly modifies the data structure of the corresponding execution plan in the abstract syntax tree in memory for optimization, ensures that the next query can apply the optimization feedback, and records the optimization information in a timely manner for developers to analyze.
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood by one skilled in the art.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disks) usually reproduce data magnetically, while discs (discs) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for deep learning to participate in SQL optimization of an HTAP database is characterized by comprising the following steps:
step 1: the SQL parser receives the SQL execution statement and checks whether the SQL execution statement conforms to the specification;
step 2: the execution plan module generates an execution plan through a deep learning model based on the SQL statements that are checked by the SQL parser and conform to the specification;
step 3: the engine executes the execution plan generated in step 2, and sends the SQL execution result and the detailed information of the SQL execution process to a training set of the deep learning model;
step 4: the deep learning model learns and trains based on the training set, then the results of the model learning and training are fed back to the execution plan module in the form of an optimal optimization scheme, and after the execution plan module receives the feedback of the deep learning model, the execution plan is automatically optimized and applied to the next SQL query.
2. The method of deep learning to participate in HTAP database execution SQL optimization according to claim 1, wherein the execution plan generated in step 2 is divided into a row-based plan and a column-based plan, and the engine in step 3 is divided into a row-based execution engine to execute the row-based plan and a vectorized execution engine to execute the column-based plan.
3. The method of claim 1, wherein the generation of the execution plan by the execution plan module in step 2 based on the deep learning model further comprises:
the execution plan module utilizes an input layer, a hidden layer and an output layer of a recurrent neural network structure, the input layer inputs the SQL execution result and the detailed information of the SQL execution process, the hidden layer performs the cost calculation, and the output layer outputs the SQL statement and an optimal execution plan.
4. The method of claim 3, wherein, in the cost calculation of the hidden layer, cost estimation is performed on the SQL execution result and the detailed information of the SQL execution process, the parameters of the loss function are continuously adjusted in the loop iteration process to reduce the cost, and when the reduced cost is better than the expected value of the cost estimation, the optimization result of the loop iteration is constructed into the predefined execution-plan data structure and fed back to the execution plan module.
5. The method of claim 1 in which step 4 further comprises:
a protocol for interaction is preset between the execution plan module and the deep learning model, and a data structure of the execution plan is defined in advance;
the deep learning model learns and trains on the training-set data sent by the engine in step 3, quantifies and designs the points that can be fed back, and finally feeds the learning and training results of the model back to the execution plan module, so that, after receiving the feedback of the deep learning model, the execution plan module directly modifies the data structure of the corresponding execution plan in the abstract syntax tree in the memory for optimization, and ensures that the next query can apply the optimization feedback.
6. A system for deep learning to participate in SQL optimization of an HTAP database, the system comprising:
the SQL parser is used for receiving the SQL execution statement and checking whether the SQL execution statement conforms to the specification;
the execution plan module is used for generating an execution plan through a deep learning model based on the SQL statements that are checked by the SQL parser and conform to the specification;
the engine module is used for executing the generated execution plan and sending the SQL execution result and the detailed information of the SQL execution process to the training set of the deep learning model;
and the deep learning model is used for learning and training based on the training set, and then the learning and training results of the model are fed back to the execution plan module in the form of an optimal optimization scheme, so that the execution plan module automatically optimizes the execution plan and applies it to the next SQL query after receiving the feedback of the deep learning model.
7. The system of claim 6, wherein the execution plan generated by the execution plan module is divided into a row-based plan and a column-based plan, and the engine module is divided into a row-based execution engine to execute the row-based plan and a vectorization execution engine to execute the column-based plan.
8. The system for deep learning to participate in SQL optimization of an HTAP database according to claim 6, wherein the generation of the execution plan by the execution plan module based on the deep learning model further comprises:
the execution plan module utilizes an input layer, a hidden layer and an output layer of a recurrent neural network structure, the input layer inputs the SQL execution result and the detailed information of the SQL execution process, the hidden layer performs the cost calculation, and the output layer outputs the SQL statement and an optimal execution plan.
9. The system of claim 8, wherein, in the cost calculation of the hidden layer, the execution plan module performs cost estimation on the SQL execution result and the detailed information of the SQL execution process, continuously adjusts the parameters of the loss function during loop iteration to reduce the cost, and, when the reduced cost is better than the expected value of the cost estimation, constructs the optimization result of the loop iteration into the predefined execution-plan data structure and feeds it back to the execution plan module.
10. The system for deep learning to participate in SQL optimization of an HTAP database according to claim 6, wherein the deep learning model is further configured such that:
a protocol for interaction is preset between the execution plan module and the deep learning model, and a data structure of the execution plan is defined in advance;
the deep learning model learns and trains on the training-set data sent by the engine module, quantifies and designs the points that can be fed back, and finally feeds the model learning and training results back to the execution plan module, so that, after receiving the feedback of the deep learning model, the execution plan module directly modifies the data structure of the corresponding execution plan in the abstract syntax tree in the memory for optimization, and ensures that the next query can apply the optimization feedback.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210752471.2A CN115033597A (en) | 2022-06-29 | 2022-06-29 | Method and system for deep learning to participate in SQL optimization of HTAP database |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210752471.2A CN115033597A (en) | 2022-06-29 | 2022-06-29 | Method and system for deep learning to participate in SQL optimization of HTAP database |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115033597A true CN115033597A (en) | 2022-09-09 |
Family
ID=83126429
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210752471.2A Pending CN115033597A (en) | 2022-06-29 | 2022-06-29 | Method and system for deep learning to participate in SQL optimization of HTAP database |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115033597A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101059810A (en) * | 2007-03-16 | 2007-10-24 | 华为技术有限公司 | System and method for implementing automatic optimization of data base system |
US20080270368A1 (en) * | 2007-04-26 | 2008-10-30 | International Business Machines Corporation | Apparatus, system, and method for efficiently supporting generic sql data manipulation statements |
US20120005189A1 (en) * | 2010-06-30 | 2012-01-05 | Oracle International Corporation | Techniques for recommending alternative sql execution plans |
US20170124151A1 (en) * | 2015-11-03 | 2017-05-04 | Sap Se | Optimization of continuous queries in hybrid database and stream processing systems |
CN108763573A (en) * | 2018-06-06 | 2018-11-06 | 众安信息技术服务有限公司 | A kind of OLAP engines method for routing and system based on machine learning |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101059810A (en) * | 2007-03-16 | 2007-10-24 | 华为技术有限公司 | System and method for implementing automatic optimization of data base system |
US20080270368A1 (en) * | 2007-04-26 | 2008-10-30 | International Business Machines Corporation | Apparatus, system, and method for efficiently supporting generic sql data manipulation statements |
US20120005189A1 (en) * | 2010-06-30 | 2012-01-05 | Oracle International Corporation | Techniques for recommending alternative sql execution plans |
US20170124151A1 (en) * | 2015-11-03 | 2017-05-04 | Sap Se | Optimization of continuous queries in hybrid database and stream processing systems |
CN108763573A (en) * | 2018-06-06 | 2018-11-06 | 众安信息技术服务有限公司 | A kind of OLAP engines method for routing and system based on machine learning |
Non-Patent Citations (3)
Title |
---|
Li Guoliang et al.: "XuanYuan: An AI-Native Database System", Journal of Software, 15 March 2020 (2020-03-15), pages 831-843 *
Li Sha; Wang Xiaoqiong: "Optimization strategies based on the SQL Server database", Technology Wind, no. 12, 30 June 2016 (2016-06-30) *
Wang Zhenhui; Zhang Min; Yang Xiaoming; Ma Yonghui: "Research on performance optimization of SQL-based database application systems", Computer Era, no. 03, 2 March 2008 (2008-03-02) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200278966A1 (en) | Optimization of database queries for database systems and environments | |
US8935232B2 (en) | Query execution systems and methods | |
US10585887B2 (en) | Multi-system query execution plan | |
CN113711197B (en) | Placement of adaptive aggregation operators and attributes in query plans | |
US6789071B1 (en) | Method for efficient query execution using dynamic queries in database environments | |
US10146837B1 (en) | RLE-aware optimization of SQL queries | |
US7536380B2 (en) | Dynamic look ahead predicate generation | |
US20110137890A1 (en) | Join Order for a Database Query | |
US20180121563A1 (en) | Skew detection and handling in a parallel processing relational database system | |
CN111367954A (en) | Data query processing method, device and system and computer readable storage medium | |
Giannakouris et al. | MuSQLE: Distributed SQL query execution over multiple engine environments | |
CN109885585B (en) | Distributed database system and method supporting stored procedures, triggers and views | |
Vogt et al. | Icarus: Towards a multistore database system | |
Vakharia et al. | Shared foundations: Modernizing meta’s data lakehouse | |
Ding et al. | Sagedb: An instance-optimized data analytics system | |
US10997175B2 (en) | Method for predicate evaluation in relational database systems | |
US11620287B2 (en) | Framework for providing intermediate aggregation operators in a query plan | |
US8832157B1 (en) | System, method, and computer-readable medium that facilitates efficient processing of distinct counts on several columns in a parallel processing system | |
Pal | SQL on Big Data: Technology, Architecture, and Innovation | |
CN115033597A (en) | Method and system for deep learning to participate in SQL optimization of HTAP database | |
Salama | Scalable data analytics and machine learning on the cloud | |
US11977582B2 (en) | Global index with repartitioning operator | |
US12105709B2 (en) | Blocked index join | |
Zhu et al. | Hydb: Access optimization for data-intensive service | |
US20100121836A1 (en) | System, method, and computer-readable medium for eliminating unnecessary self-joins in a database system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||