CN116521719A

CN116521719A - Query optimization system based on cost estimation

Info

Publication number: CN116521719A
Application number: CN202310401083.4A
Authority: CN
Inventors: 荆一楠; 王嵩立; 张寒冰; 徐伟; 陈振强; 何震瀛; 王晓阳
Original assignee: Fudan University; Transwarp Technology Shanghai Co Ltd
Current assignee: Fudan University; Transwarp Technology Shanghai Co Ltd
Priority date: 2023-04-15
Filing date: 2023-04-15
Publication date: 2023-08-01

Abstract

The invention belongs to the technical field of database query, and particularly relates to a query optimization system based on cost estimation. The invention comprises a system information extractor and a cost estimation model based on deep learning. The system information extractor processes information such as the storage and execution model of the database management system into structured data for the model to use. The deep-learning-based cost estimation model establishes a mapping from queries to costs from historical execution records under different system information, and thereby estimates the cost of unseen queries. The cost estimation model is trained with a layered training strategy, which helps the model learn from batched training data, improves memory utilization during training, reduces training oscillation, and accelerates model convergence. The invention helps the database optimizer select the correct execution plan and ultimately improves the overall query execution efficiency of the database.

Description

Query optimization system based on cost estimation

Technical Field

The invention belongs to the technical field of database query, and particularly relates to a query optimization system based on cost estimation.

Background

Today, the requirements on the response latency of database systems are becoming ever stricter, and whether a query executes efficiently depends largely on the performance of the query optimizer. In query optimization, cost-based execution plan optimization is an important step, and a good cost estimation model helps the optimizer select a suitable execution plan. Most existing cost-based query optimization techniques rely heavily on estimating the cardinality of sub-queries: a relatively accurate cardinality estimate is obtained first, and the cardinalities of all operators are then combined by weighted summation to obtain the cost estimate.

In recent years, cardinality estimation methods have advanced considerably. Besides traditional statistics-based histogram methods, HyperLogLog methods and the like, many machine-learning-based methods also achieve good results. Machine-learning-based cardinality estimation methods fall largely into two categories: query-based and data-based. Query-based methods use a deep neural network to establish a mapping from queries to cardinalities; they are highly accurate and adapt quickly to data changes, but adapt poorly to changes in the workload. Data-based methods use sum-product networks, autoregressive models and the like to obtain the joint probability distribution of the data and then compute the cardinality of a query from it; they are robust to changes in the workload, but their accuracy drops markedly when the data distribution changes.

In today's increasingly complex database systems, the relationship between query cardinality and query cost is no longer a simple linear one. From a storage perspective, relational data is stored by rows in transaction-oriented systems and by columns in analytical data warehouses, and some hybrid transactional databases also adopt multi-copy row-column hybrid storage. The cost of reading the same amount of data differs considerably across these underlying storage layouts. On a distributed system, each operator can also execute in parallel over multiple data copies and data shards. The accuracy of existing cost estimation models in such complex execution scenarios is often unsatisfactory.

Disclosure of Invention

The invention aims to provide a query optimization system based on cost estimation, which overcomes the defects of the prior art.

The query optimization system based on cost estimation provided by the invention is aware of the storage information and the execution information of the system. As a core component of the optimizer in the database management system, it combines these kinds of information to estimate the cost of the enumerated execution plans accurately, which helps the optimizer select the final execution plan. The system comprises a system information extractor and a cost estimation model based on deep learning:

(I) System information extractor

The system information extractor processes the storage and execution model information of the database management system into structured data for use by the model. It comprises two modules, namely a storage information extraction module and an execution information extraction module; the information extracted by the system information extractor is used as the input of the cost estimation model. Wherein:

the storage information extraction module extracts information related to the storage of data from the database management system, specifically including: the storage model, the number of copies, and the compression strategy of the data;

the execution information extraction module extracts information related to the physical execution of a query from the database management system, specifically including: operator parallelism and cache size;

the system information extractor integrates the results of the two modules and inputs them into the cost estimation model.

Further specifically:

the system information extractor obtains and updates the storage information $I_s$ and the execution information $I_e$ of the system by periodic extraction, and combines them into the system information $I_{system}$, which serves as an input feature for constructing the cost estimation model:

$$I_{system} = (I_s, I_e) \tag{1}$$

The storage information $I_s$ considers three dimensions of the data, namely the storage model, the number of copies, and the compression strategy, expressed as:

$$I_s = (\text{storage model},\ \text{number of copies},\ \text{compression strategy})$$

The first term, the storage model, is divided into two classes, row store $S_{row}$ and column store $S_{column}$:

$$\text{storage model} \in \{S_{row},\ S_{column}\}$$

The second term, the number of copies, is defined as the number of copies of a piece of data on multiple clusters, specifically the sum of the number of row-store copies $C_{row}$ and the number of column-store copies $C_{column}$, namely:

$$\text{number of copies} = C_{row} + C_{column}$$

The third term, the compression strategy, is defined as the compression policy that the column-store copy, if any, applies to the data; this information is extensible.

The execution information $I_e$ considers two dimensions, operator parallelism and cache size, expressed as:

$$I_e = (\text{operator parallelism},\ \text{cache size})$$

Operator parallelism is defined as the parallelism of each SQL operator in the execution plan. The supported settings include the parallelism $P_{index\_scan}$ of the index scan operator, the parallelism $P_{join}$ of the join operator, the parallelism $P_{aggr}$ of the aggregation operator, and the parallelism $P_{proj}$ of the projection operator. The parallelism of each of the four operators takes values in the positive integers $\mathbb{Z}^{+}$. The cache size also takes values in the positive integers $\mathbb{Z}^{+}$, with MB as the default unit.
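The system information described above can be flattened into a numeric feature vector before it is fed to a model. The following is a minimal sketch of one way to do this, not the patent's actual encoding: the class names, the `encode_system_info` function, the list of compression strategies, and the one-hot layout are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical category lists. The text only says the compression
# strategy set is extensible; these concrete names are assumptions.
STORAGE_MODELS = ["row", "column"]
COMPRESSION_STRATEGIES = ["none", "snappy", "zlib"]
OPERATORS = ["index_scan", "join", "aggregate", "project"]


def one_hot(value: str, choices: List[str]) -> List[float]:
    """Return a one-hot vector marking the position of value in choices."""
    if value not in choices:
        raise ValueError(f"unknown category: {value}")
    return [1.0 if value == c else 0.0 for c in choices]


@dataclass
class StorageInfo:
    """Storage information I_s: storage model, copies, compression."""
    storage_model: str
    row_copies: int
    column_copies: int
    compression: str

    @property
    def copies(self) -> int:
        # Number of copies = row-store copies + column-store copies.
        return self.row_copies + self.column_copies


@dataclass
class ExecutionInfo:
    """Execution information I_e: per-operator parallelism and cache size."""
    parallelism: Dict[str, int]
    cache_mb: int


def encode_system_info(s: StorageInfo, e: ExecutionInfo) -> List[float]:
    """Concatenate I_s and I_e into one flat feature vector (I_system)."""
    return (
        one_hot(s.storage_model, STORAGE_MODELS)
        + [float(s.copies)]
        + one_hot(s.compression, COMPRESSION_STRATEGIES)
        + [float(e.parallelism[op]) for op in OPERATORS]
        + [float(e.cache_mb)]
    )


storage = StorageInfo("column", row_copies=1, column_copies=2, compression="snappy")
execution = ExecutionInfo(
    {"index_scan": 4, "join": 8, "aggregate": 2, "project": 1}, cache_mb=512
)
features = encode_system_info(storage, execution)
```

In practice the raw integers (copies, parallelism, cache size) would probably be scaled before training; the sketch leaves them unscaled to keep the mapping to the definitions visible.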

(II) Cost estimation model based on deep learning

The cost estimation model is constructed as follows: the features of the historical execution plans and the system information extracted by the system information extractor are used as the input features of the cost estimation model; a neural network composed of small sub-models is then fitted to the historical query information and the real execution results, thereby establishing the mapping between execution plans and costs. The specific steps are as follows:

(1) Encoding the input features;

(2) Building different sub-models $M_{node}$ according to operator type; specifically, sub-models $M_{aggr}$, $M_{join}$, $M_{filter}$, $M_{scan}$ and so on are built for aggregation, join, filter, scan and other operations, respectively, and the sub-models are then assembled into the overall cost estimation model according to the tree structure of the execution plan;

(3) Adjusting the cost estimation model: the predicted result is compared with the real cost, a quantified estimation error is calculated, and the neural network model is adjusted based on this error using the back-propagation mechanism.

In step (1), the input features are encoded. The input features mainly consist of two parts: the features $I_{plan}$ of the historical execution plans and the system information $I_{system}$ extracted by the system information extractor. The input features $I_{input}$ are formally expressed as:

$$I_{input} = (I_{plan},\ I_{system})$$

A historical execution plan is a tree-shaped combination of operators, so the features $I_{plan}$ of a historical execution plan are the set of the features of all operators in the plan, namely: $I_{plan} = \bigcup I_{node}$;

The features of an operator consist of two parts: the operator type and the meta information of the operator. Operator types include full table scan, merge join, filter, hash aggregation, and so on. The meta information of an operator includes predicate information, the estimated cardinality, and so on. Predicate values are either numeric or string typed: numeric values are normalized into the (0, 1) range using the maximum and minimum values, and string values are encoded using a word vector model;

The encoding of the system information $I_{system}$ is as shown in formula (1).
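The operator encoding in step (1) can be illustrated with a short sketch. It covers only the one-hot operator type and the min-max normalization of numeric values; the word vector encoding of string predicates is omitted because it needs a trained embedding model. The function names, the operator list, and the choice to normalize the estimated cardinality by a maximum cardinality are assumptions made for illustration.

```python
from typing import List

# Operator types named in the text; the identifiers are illustrative.
OPERATOR_TYPES = ["full_table_scan", "merge_join", "filter", "hash_aggregate"]


def one_hot_operator(op_type: str) -> List[float]:
    """One-hot encode the operator type."""
    if op_type not in OPERATOR_TYPES:
        raise ValueError(f"unknown operator type: {op_type}")
    return [1.0 if op_type == t else 0.0 for t in OPERATOR_TYPES]


def min_max_normalize(value: float, lo: float, hi: float) -> float:
    """Map value into [0, 1] using the column minimum and maximum."""
    if hi == lo:
        return 0.0  # degenerate column: every value is identical
    return (value - lo) / (hi - lo)


def encode_operator(
    op_type: str,
    predicate_value: float,
    column_min: float,
    column_max: float,
    estimated_cardinality: float,
    max_cardinality: float,
) -> List[float]:
    """Build I_node as: operator type + predicate value + cardinality."""
    return one_hot_operator(op_type) + [
        min_max_normalize(predicate_value, column_min, column_max),
        # Assumption: cardinality is scaled by a known upper bound.
        min_max_normalize(estimated_cardinality, 0.0, max_cardinality),
    ]


# A filter on a year column spanning 1900..2000, with predicate value 1995.
node_features = encode_operator("filter", 1995, 1900, 2000, 5000, 20000)
```

The features of a whole plan, $I_{plan}$, would then be the collection of such vectors, one per operator in the tree.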

In step (2), the sub-models $M_{aggr}$, $M_{join}$, $M_{filter}$, $M_{scan}$ are built respectively, specifically as follows:

For the $N$ different operators in a database management system, each operator $O_i$, $i \in [1, N]$, has its own independent small neural network, denoted $NN_i$. Inside each small neural network is a multi-layer recurrent neural network, and the parameters of different neural networks are independent of one another. This design is lightweight and extracts the high-dimensional features of operators more deeply. Each small neural network $NN_i$ is dedicated to the cost estimation of the corresponding class of operators. Finally, the sub-models are assembled into the overall cost estimation model according to the tree structure of the execution plan.
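The composition of per-operator sub-models along the plan tree can be sketched as a recursive evaluation. In this minimal sketch each sub-model is a tiny linear function standing in for the small neural network $NN_i$; the weights, the `PlanNode` class, and the way child outputs are summed into the parent input are assumptions for illustration, not the patent's network design.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

SubModel = Callable[[List[float], List[float]], float]


@dataclass
class PlanNode:
    """One operator in an execution plan tree."""
    op_type: str
    features: List[float]
    children: List["PlanNode"] = field(default_factory=list)


def make_linear_submodel(weights: List[float], child_weight: float, bias: float) -> SubModel:
    """Stand-in for a small per-operator network with its own parameters."""
    def submodel(features: List[float], child_outputs: List[float]) -> float:
        own = sum(w * x for w, x in zip(weights, features))
        return own + child_weight * sum(child_outputs) + bias
    return submodel


# One independent sub-model per operator type (parameters are made up).
SUBMODELS: Dict[str, SubModel] = {
    "scan": make_linear_submodel([2.0], child_weight=0.0, bias=1.0),
    "filter": make_linear_submodel([0.5], child_weight=1.0, bias=0.0),
    "join": make_linear_submodel([1.0], child_weight=1.0, bias=2.0),
    "aggr": make_linear_submodel([1.5], child_weight=1.0, bias=0.5),
}


def estimate_cost(node: PlanNode) -> float:
    """Evaluate children first, then the sub-model matching this operator."""
    child_outputs = [estimate_cost(child) for child in node.children]
    return SUBMODELS[node.op_type](node.features, child_outputs)


plan = PlanNode("join", [3.0], [
    PlanNode("scan", [1.0]),
    PlanNode("filter", [4.0], [PlanNode("scan", [2.0])]),
])
total_cost = estimate_cost(plan)
```

Because the model's shape follows the plan's shape, two plans with different trees are scored by differently wired combinations of the same sub-models.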

In step (3), the cost estimation model is adjusted. Specifically, the estimated cost is obtained layer by layer, upward from the leaf nodes; the estimate is compared with the real cost, and the estimation error is taken as the model loss $q$.

according to the error, parameters of the neural network are adjusted by using a back propagation mechanism, and finally the neural network converges to obtain a better model.
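As an illustration of the error computation in step (3), the sketch below assumes that the loss $q$ is the widely used q-error, that is, the larger of estimated/real and real/estimated. That definition and the nearest-rank percentile used in `summarize` are assumptions; the function names are hypothetical.

```python
from typing import List, Tuple


def q_error(estimated: float, actual: float) -> float:
    """Symmetric ratio error; 1.0 means a perfect estimate."""
    if estimated <= 0 or actual <= 0:
        raise ValueError("costs must be positive")
    return max(estimated / actual, actual / estimated)


def summarize(q_errors: List[float]) -> Tuple[float, float]:
    """Return (mean, 95th percentile) using the nearest-rank method."""
    if not q_errors:
        raise ValueError("no q-errors to summarize")
    ordered = sorted(q_errors)
    mean = sum(ordered) / len(ordered)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return mean, ordered[rank - 1]


pairs = [(200.0, 100.0), (50.0, 100.0), (100.0, 100.0), (100.0, 100.0)]
errors = [q_error(est, act) for est, act in pairs]
mean_q, p95_q = summarize(errors)
```

Over- and under-estimation by the same factor produce the same q-error, which is why the metric is never below 1.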

In actual use, the cost estimation model receives the candidate execution plans from the optimizer and returns the estimated cost of each plan to the optimizer.

In the invention, the cost estimation model is trained with a layered training strategy. This strategy helps the model learn from batched training data, improves memory utilization during training, reduces training oscillation, and accelerates model convergence.

Specifically, the nodes of multiple tree-shaped execution plans are organized by tree level, and the nodes belonging to the same level form a sub-batch. During training, the sub-batches are input into the neural network of the cost estimation model in reverse level order, and the outputs of each node's child nodes are located with the aid of an auxiliary index structure, as shown in fig. 2. The specific flow is as follows:

(1) For a batch of data of size n, the n execution plans are first split by level according to their tree structure into k layers (k is the maximum depth of the n execution plan trees), and the nodes of the same layer form a sub-batch; the layer containing the root operators of the execution plans is taken as layer 1, and the i-th sub-batch contains all operators at the i-th layer of the n execution plans;

(2) The child operators of the operators in each layer are indexed; a two-dimensional tensor is introduced as an auxiliary data structure that stores, for each operator in the i-th layer, the positions of its child operators in the (i+1)-th layer;

(3) The data of each layer in the batch is organized into tensors and used as model input for training; the layers are input into the model starting from the k-th layer, and the output of each layer is added to the input data of the layer above it; during training, the mean loss over the batch is used as the training loss.
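The three steps above can be sketched with plain Python lists. This is a simplified illustration under stated assumptions: the auxiliary index is a ragged list of child positions rather than a padded two-dimensional tensor, and `node_fn` is a hypothetical stand-in for the per-operator network. The names `build_level_batches` and `evaluate_bottom_up` are not from the source.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def build_level_batches(roots: List[Node]) -> Tuple[List[List[Node]], List[List[List[int]]]]:
    """Group the nodes of all plans by depth (layer 1 = roots).

    child_index[i][j] lists the positions, within layer i+1, of the
    children of the j-th node in layer i.
    """
    levels: List[List[Node]] = []
    child_index: List[List[List[int]]] = []
    current = list(roots)
    while current:
        levels.append(current)
        next_level: List[Node] = []
        positions_per_node: List[List[int]] = []
        for node in current:
            positions = []
            for child in node.children:
                positions.append(len(next_level))
                next_level.append(child)
            positions_per_node.append(positions)
        child_index.append(positions_per_node)
        current = next_level
    return levels, child_index


def evaluate_bottom_up(
    levels: List[List[Node]],
    child_index: List[List[List[int]]],
    node_fn: Callable[[Node, List[float]], float],
) -> List[float]:
    """Process sub-batches from the deepest layer up to the roots."""
    outputs_below: List[float] = []
    for depth in range(len(levels) - 1, -1, -1):
        outputs = []
        for j, node in enumerate(levels[depth]):
            child_outputs = [outputs_below[p] for p in child_index[depth][j]]
            outputs.append(node_fn(node, child_outputs))
        outputs_below = outputs
    return outputs_below  # one value per plan root


plan_a = Node("join", [Node("scan_a"), Node("filter", [Node("scan_b")])])
plan_b = Node("aggr", [Node("scan_c")])
levels, child_index = build_level_batches([plan_a, plan_b])
# Toy node function: counts the operators in each subtree.
root_outputs = evaluate_bottom_up(levels, child_index, lambda n, kids: 1.0 + sum(kids))
```

With real tensors, each layer would be evaluated in a single batched call instead of the inner Python loop, which is where the gain in memory utilization would come from.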

Compared with traditional cost estimation models, the invention trains the model end to end, which improves the model's ability to capture the overall information of an execution plan. In addition, system information, namely storage information and execution information, is introduced, which further improves adaptability to complex systems such as distributed high-concurrency systems and row-column hybrid storage systems.

Drawings

FIG. 1 is a block diagram of a cost estimation based query optimization system for sensing system information in accordance with the present invention.

FIG. 2 shows the batch structure used by the cost estimation model.

FIG. 3 is a flow chart of the use of the present invention.

Detailed Description

The following is a specific embodiment of the present invention, as shown in fig. 3.

Experimental setup:

Dataset: The IMDb dataset is the dataset of the Internet Movie Database website. It contains millions of movies, television programs, and television movies, together with related personal data, reviews, ratings, awards, and so on for actors, directors, producers, and other staff. The dataset contains a number of tables, mainly the following seven:

title.basics: basic information about movies, television programs, television movies, and so on, such as title, genre, release date, and country/region of production.

title.ratings: user rating data for movies, television programs, and television movies on the IMDb website, such as the average rating and the number of ratings.

title.crew: director and writer information for each movie, as well as other producer information.

name.basics: personal information about staff such as actors, directors, and producers, for example date of birth and occupation.

title.principals: the roles and positions of actors and other staff in each movie.

title.episode: episode and season information for television programs and television movies.

title.akas: information such as alternative titles, translated titles, and international standard film codes.

Because real-world data is correlated and skewed, estimating cardinality and cost on the IMDb dataset is much more difficult than on TPC-H. The IMDb dataset includes 22 tables that are connected by primary and foreign keys. We build an index on the primary key.

Workload: JOB-light is a workload built on the IMDb dataset. It contains 70 queries in total, each with 1 to 4 joins.

The implementation process comprises the following steps: (1) First, training data is created from the original dataset. The user extracts query information from the historical query records and organizes it into pairs of the form <execution plan, real cost>. If the historical information is insufficient, query statements that simulate the real workload can be generated and submitted to the system for execution to supplement the training data. The specific steps are as follows: query templates are generated first; taking JOB-light as an example, 22 query templates covering various tasks are generated, and 100,000 SQL queries are then generated from the templates uniformly at random. Of these, 90,000 are used as training data and 10,000 as test data.

(2) In the model training phase, the neural network parameters are first initialized randomly: for each parameter, the initial value is drawn from a normal distribution with mean 0 and standard deviation 1. The cost estimation model extracts the latest system information $I_{system}$ from the system, including the storage information $I_s$ and the execution information $I_e$, and extracts $I_{plan}$ from the execution plan as input. The storage information is obtained by reading the database metadata. The number of training epochs is set to 500. In each epoch, the cost estimation model uses the neural network to estimate the cost of each execution plan, the loss between the estimated cost and the real cost is then calculated, and the model is trained using the back-propagation mechanism of the neural network. After 500 epochs, the cost estimation model converges to a more accurate state. The estimation error is measured by the q-error:

$$q\text{-}error = \max\left(\frac{cost}{cost_{real}},\ \frac{cost_{real}}{cost}\right)$$

where $cost$ is the estimated cost and $cost_{real}$ is the real cost. In the experiments, the mean q-error is 1.17 and the 95th-percentile q-error is 1.36, which is 2 to 3 orders of magnitude better than the traditional cost estimation model.

(3) In the use phase, when the target system receives a new query request $Q$, the features of each plan are extracted from the candidate plans $P_{candidate} = \{P_1, P_2, \ldots, P_n\}$ generated by the optimizer and are then input into the cost estimation model. At this time, the cost estimation model maintains the latest system information $I_{system}$. The cost model then feeds the plan features together with the system information into the neural network as input features, obtains the estimated cost of each plan, and returns it to the optimizer.

(4) The optimizer uses the costs estimated by the cost estimation model to select the execution plan with the minimum estimated cost as the optimal plan, and submits that plan for execution. Because the cost estimates are more accurate, the selection is more reliable.
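Steps (3) and (4) amount to scoring every candidate plan and keeping the cheapest one. A minimal sketch follows, in which `choose_plan` and the lookup-table cost function are hypothetical stand-ins for the optimizer hook and the trained cost estimation model:

```python
from typing import Callable, List, Tuple, TypeVar

Plan = TypeVar("Plan")


def choose_plan(
    candidates: List[Plan],
    estimate_cost: Callable[[Plan], float],
) -> Tuple[Plan, float]:
    """Return the candidate with the smallest estimated cost and that cost."""
    if not candidates:
        raise ValueError("the optimizer produced no candidate plans")
    costs = [estimate_cost(plan) for plan in candidates]
    best = min(range(len(candidates)), key=costs.__getitem__)
    return candidates[best], costs[best]


# Stand-in for the trained model: a fixed lookup table of estimated costs.
fake_costs = {"hash_join_plan": 120.0, "merge_join_plan": 95.5, "nested_loop_plan": 410.0}
best_plan, best_cost = choose_plan(list(fake_costs), fake_costs.__getitem__)
```

If two plans tie, this sketch keeps the first one in the optimizer's enumeration order.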

Claims

1. A query optimization system based on cost estimation, which is aware of the storage information and the execution information of the system, characterized by comprising a system information extractor and a cost estimation model based on deep learning; wherein:

the system information extractor processes the storage and execution model information of the database management system into structured data for use by a model; the system comprises two modules, namely a storage information extraction module and an execution information extraction module of a database management system; the information extracted by the system information extractor is used as the input of a cost estimation model; wherein:

the storage information extraction module extracts information related to the storage of data from a database management system, and specifically comprises the following steps: storage model, copy number and compression strategy of data;

the system information extractor integrates the results of the two modules and inputs the results into a cost estimation model;

the cost estimation model based on deep learning is constructed as follows: the features of the historical execution plans and the system information extracted by the system information extractor are used as the input features of the cost estimation model; a neural network composed of small sub-models is then fitted to the historical query information and the real execution results, thereby establishing the mapping between execution plans and costs.

2. The cost estimation based query optimization system of claim 1, wherein the system information extractor obtains and updates the storage information $I_s$ and the execution information $I_e$ of the system by periodic extraction, and combines them into the system information $I_{system}$:

$$I_{system} = (I_s, I_e)$$

as an input feature for constructing the cost estimation model;

operator parallelism is defined as the parallelism of each SQL operator in the execution plan; the supported settings include the parallelism $P_{index\_scan}$ of the index scan operator, the parallelism $P_{join}$ of the join operator, the parallelism $P_{aggr}$ of the aggregation operator, and the parallelism $P_{proj}$ of the projection operator; the parallelism of each of the four operators takes values in the positive integers $\mathbb{Z}^{+}$; the cache size takes values in the positive integers $\mathbb{Z}^{+}$, with MB as the default unit.

3. The cost estimation-based query optimization system of claim 2, wherein the specific steps of constructing the cost estimation model are as follows:

(1) Encoding the input features;

(2) Building different sub-models $M_{node}$ according to operator type; specifically, sub-models $M_{aggr}$, $M_{join}$, $M_{filter}$, $M_{scan}$ are built for aggregation, join, filter, and scan operations, respectively;

4. A cost estimation based query optimization system as claimed in claim 3, wherein:

in the encoding of the input features described in step (1), the input features comprise two parts: the features $I_{plan}$ of the historical execution plans and the system information $I_{system}$ extracted by the system information extractor; the input features $I_{input}$ are formally expressed as:

$$I_{input} = (I_{plan},\ I_{system})$$

a historical execution plan is a tree-shaped combination of operators, so the features $I_{plan}$ of a historical execution plan are the set of the features of all operators in the plan, namely: $I_{plan} = \bigcup I_{node}$;

the operator features consist of two parts, namely the operator type and the meta information of the operator; operator types include full table scan, merge join, filter, and hash aggregation; the meta information of the operator comprises predicate information and the estimated cardinality; predicate values are either numeric or string typed, numeric values being normalized into the (0, 1) range using the maximum and minimum values, and string values being encoded using a word vector model;

the sub-models $M_{aggr}$, $M_{join}$, $M_{filter}$, $M_{scan}$ are built respectively in step (2), specifically as follows:

for the $N$ different operators in a database management system, each operator $O_i$, $i \in [1, N]$, has its own independent small neural network, denoted $NN_i$; inside each small neural network is a multi-layer fully-connected neural network, and the parameters of different neural networks are independent of one another; each small neural network $NN_i$ is used for the cost estimation of the corresponding class of operators; finally, the sub-models are assembled into the overall cost estimation model according to the tree structure of the execution plan;

according to the error, parameters of the neural network are adjusted by using a back propagation mechanism, and finally, the parameters are converged to obtain a better model;

5. The query optimization system based on cost estimation according to claim 4, wherein a layered training strategy is adopted for the cost estimation model; specifically, the nodes of multiple tree-shaped execution plans are organized by tree level, and the nodes belonging to the same layer form a sub-batch; during training, the sub-batches are input into the neural network in reverse level order, and the outputs of each node's child nodes are located with the aid of an auxiliary index structure; the specific flow is as follows:

(1) For a batch of data of size n, the n execution plans are first split by level according to their tree structure into k layers, where k is the maximum depth of the n execution plan trees, and the nodes of the same layer form a sub-batch; the layer containing the root operators of the execution plans is taken as layer 1, and the i-th sub-batch contains all operators at the i-th layer of the n execution plans;

(3) The data of each layer in the batch is organized into tensors and used as model input for training; the layers are input into the model from bottom to top, starting from the k-th layer, and the output of each layer is added to the input data of the layer above it; the mean loss over the batch is used as the training loss.