CN112380243B

CN112380243B - SQL query selectivity estimation method based on machine learning

Info

Publication number: CN112380243B
Application number: CN202011303961.1A
Authority: CN
Inventors: 杨晓春; 舒昭维; 马红; 王斌
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2020-11-19
Filing date: 2020-11-19
Publication date: 2021-11-09
Anticipated expiration: 2040-11-19
Also published as: CN112380243A

Abstract

The invention provides a machine learning-based SQL query selectivity estimation method, and relates to the technical field of databases. The invention provides reliable data support for the query optimization process by predicting the SQL query selectivity, so that a final accurate optimal execution plan is determined by a cost-based estimation prediction (CBO) algorithm, for example, a driving table is determined when merging connections (merge joins). Meanwhile, in the index selection process, the accurate selection degree can provide income prediction for the establishment of the index.

Description

SQL query selectivity estimation method based on machine learning

Technical Field

The invention relates to the technical field of databases, in particular to a SQL query selectivity estimation method based on machine learning.

Background

The field of query optimization has been a hot research direction for decades. Query optimization uses the selectivity prediction to distinguish the quality of a query execution plan, and the execution plan directly influences the execution efficiency of the query. At present, most mainstream databases adopt a method of combining a histogram and sampling to estimate the selectivity, but the estimation by using limited information brings a large amount of errors. Although the mainstream databases currently support limited multi-column statistics, this information is still insufficient to support the prediction of the degree of correlation between a large number of table attributes.

A great deal of research shows that predicate selectivity can be accurately estimated by monitoring the query execution process, and the method can be used for enhancing future selectivity estimation. However, current database systems do not fully exploit this opportunity. As early as 20 years ago, learners used neural network methods to predict UDF predicates. In recent years, with the re-rise and breakthrough development of machine learning methods, optimization of the artificial intelligence method fused with the database field is an emerging research trend. In combination with a machine learning method, quick and accurate selectivity estimation is realized, and new technical innovation in the field of query optimization is brought.

Query optimization relies on a prediction of the selectivity of query predicates to produce a good query execution plan. When a query contains multiple predicates, or join operations, most selectors currently use many assumptions, such as independence between predicates. Although such methods can estimate the selectivity quickly and using a small memory, such methods often have wrong selectivity estimation, resulting in slow queries and unpredictable performance. The method of using histogram to add samples often faces very much problems, for example, there are not enough samples with quality, or there is no suitable index structure to support rapid sampling, and at this time, the method often falls back to a preset method to estimate the degree of selection, for example, a predicate with the lowest degree of selection in predicates is used as the estimated degree of selection, the method obviously has great limitation, and meanwhile, as the table attributes increase, the association degree between the attributes increases, the number of predicates increases, and the like, the effect of predicting the degree of selection of the histogram and sampling method is greatly restricted.

In the query optimization cost estimation algorithm, the current mainstream selectivity estimation method can be used for estimating the selectivity by using an accurate model to quickly aim at corresponding query and obtain the estimation of the selectivity, regardless of histogram statistics, sampling statistics or machine learning model estimation selectivity. In other words, the input is the key factor related to the degree of selection, and the estimated degree of selection is obtained through the calculation model. In the whole process, the key links are input selection and model construction, and then the size of a storage space occupied by the model and the estimated speed of the selectivity are further considered.

With the development and popularization of a machine learning algorithm, various elicitations are brought to the performance optimization of the database. The problem in the process of estimating the selectivity can be well solved in a machine learning model. The whole selectivity prediction model can be used as a supervision type learning model, the input is the characteristics of query conversion, and the output is the predicted selectivity. By using the method, the problem that no proper index structure is available to accelerate sampling can be perfectly solved, and the relevance between table attributes is also taken into consideration. Under the current situation, the machine learning algorithm does not need to perfectly and accurately predict the actual and real selectivity, and only needs to be more accurate than the existing inaccurate estimation method. The output selectivity prediction value of the machine learning model can be directly used by the existing perfect cost prediction model, and meanwhile, the actual structure of the database cannot be influenced.

In the process of optimizing an SQL (Structured Query Language) Query execution plan, a mainstream database basically selects a final execution plan based on a cost estimation (CBO) algorithm, and accurate selectivity prediction can provide reliable data for the parsing process to support the algorithm, such as determining a driving table when merging connections (mergejoin). Meanwhile, in the index selection process, the accurate selection degree can provide income prediction for the establishment of the index (when the index structure is a B tree, the index can bring higher promotion in the query with low selection degree).

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a SQL query selectivity estimation method based on machine learning;

in order to solve the technical problems, the technical scheme adopted by the invention is as follows:

a SQL query selectivity estimation method based on machine learning comprises the following steps:

step 1: acquiring real-time SQL query information of a database, recording the real selection degree corresponding to the SQL query information, and making an initial data set;

the SQL query information comprises the attribute, the operator, the constant, the table name and the join field in the query statement of the condition predicate, and the table row number and the unique value number counted by the database;

step 2: selecting an encoding method of the SQL query information in the step 1, converting the SQL query information in the initial data set in the step 1 into a vector form by using the encoding method, and obtaining an SQL query information vector and a database statistical information vector as the input of a machine learning model;

and step 3: defining SQL query q, attribute set A contained by condition predicate_qLet A be_q＝{A₁,A₂,…,A_kThen in SQL query q, attribute A_kIs lb_k≤A_k≤ub_kWherein lb_k，ub_kRespectively represent the attribute A in the condition predicates_kIs defined by a constant value of (a), then for any SQL query q, denoted as F_qIn the form shown below: f_q:(lb_k1≤A_k1≤ub_k1…lb_kn≤A_kn≤ub_kn) Where k is the sequence number of the query and n represents the sequence number of the constant value defined boundary;

if the true selectivity of the SQL query q is acl (q), the set S { (q) is set₁:acl(q₁)),…,(q_k:acl(q_k) ) }, as follows: (F)₁:acl(q₁)),…,(F_k:acl(q_k))}

And 4, step 4: and (3) selecting a machine learning model for the query q in a given set S, outputting the estimated selectivity est (q) of the query q, enabling the est (q) to be closest to the corresponding acl (q), taking the SQL query information vector and the database statistical information vector calculated in the step 2 as input, performing feature dimension reduction on the vector through a regression model M, and finally obtaining the estimated selectivity of the SQL query through the output of the regression model N.

And 5: and continuously monitoring the information acquired by the database query optimizer and the actual selectivity feedback of the database statistical information to the SQL query, repeating the steps 1-4 to train the regression model, and finally achieving the stability of the regression model to obtain the selectivity prediction of the SQL query.

Adopt the produced beneficial effect of above-mentioned technical scheme to lie in:

the invention provides a machine learning-based SQL query selectivity estimation method, which can be used for providing reliable data support for a query optimization process, so that a cost-based estimation (CBO) algorithm is used for determining a more accurate final optimal execution plan, such as a driving table during merge join (mergejoin). Meanwhile, in the index selection process, the accurate selection degree can provide income prediction for the establishment of the index.

Drawings

FIG. 1 is a frame diagram of a core regression prediction model according to an embodiment of the present invention;

FIG. 2 is a block diagram of an overall method of an embodiment of the present invention;

FIG. 3 is a diagram illustrating feature vector encoding according to an embodiment of the present invention;

FIG. 4 is a frame diagram of a query selectivity prediction method according to an embodiment of the present invention;

FIG. 5 is a flowchart of a method for estimating query selectivity according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

The invention provides a method for estimating the selectivity of SQL query based on machine learning, which designs an input characteristic for converting the SQL query into a machine learning model, and designs a structure of a machine learning regression model aiming at the input characteristic, thereby accurately and quickly estimating the selectivity. The core regression model is shown in FIG. 1. Thereby achieving the effect of optimizing the query of the whole database, and the framework is shown in figure 2;

based on the above framework, the invention starts from conversion of query into features, firstly converts factors influencing the selectivity in the query into input acceptable by a machine learning regression model, then converts the features of the query sentence level into features of the same magnitude as the features of the query sentence level by combining the features which can be directly obtained in a database, such as the number of unique values in a column and other information, and jointly uses the features as input features, and then the estimated selectivity is finally obtained through the regression model processing of machine learning. And when the selection degree can be accurately estimated, the storage occupation space of the optimization model and the estimated speed are further considered.

The operating environment in this embodiment:

CPU：Inter(R)Core(TM)I7-8700 CPU@3.20GHz

GPU：RTX 2070

memory: 16.00GB

An operating system platform: ubuntu 16.04 LTS

And (3) developing a language: c + +, Python

Deep learning development framework: pythrch

Developing a tool: VIM, Pycharm, CLion

Querying the data set: data set with high attribute correlation such as IMDb

A method for estimating the selectivity of SQL query based on machine learning is shown in FIG. 5, and includes the following steps:

step 2: the encoding method of the SQL query information in step 1 is selected, such as the property name of the conditional predicate, the operator, the constant value (requiring normalization processing), the join field, the table row number, and the unique value number, may be encoded by the currently popular one-hot, wherein the property column for the string type also needs to adopt possible hash encoding, and the string is mapped into an input vector that can be accepted by the subsequent machine learning model. By using the encoding method, the SQL query information in the initial data set in the step 1 is converted into a vector form, and an SQL query information vector and a database statistical information vector are obtained and used as the input of a machine learning model, as shown in FIG. 3;

and step 3: defining SQL query q, attribute set A contained by condition predicate_qLet A be_q＝{A₁,A₂,…,A_kThen in SQL query q, attribute A_kIs lb_k≤A_k≤ub_kWherein lb_k，ub_kRespectively represent the attribute A in the condition predicates_kThe left and right boundaries of the constant value limit, then for any SQL queryQuery q, denoted F_qIn the form shown below: f_q:(lb_k1≤A_k1≤ub_k1…lb_kn≤A_kn≤ub_kn) Where k is the sequence number of the query and n represents the sequence number of the constant value defined boundary;

And 4, step 4: for a given query q in the set S, selecting a machine learning model regression model M, and outputting the estimated selectivity est (q) of the query q so that est (q) is closest to the corresponding acl (q). The regression models selected in the patent are DNN and SVR, the algorithm flow chart is shown in fig. 4, the SQL query information vector and the database statistical information vector calculated in step 2 are used as input, feature dimension reduction is performed on the vector through the regression model M, overfitting due to excessive data dimensions when the input vector of SQL query information conversion is processed in a high-dimensional space is prevented, meanwhile, the selected regression model has high robustness and fault-tolerant capability (such as DNN), the nonlinear relation can be well solved, then, the regression model M is output as the input of the regression model N (such as SVR), and finally, the pre-estimated value of the selectivity of SQL query is obtained through the output of the regression model N.

And 5: continuously monitoring information obtained by the database query optimizer, such as database statistical table information, unique value number, SQL query information and the like, and the actual selectivity feedback of the database statistical information on the SQL query, repeating the steps 1-4 to train the regression model, and finally achieving the stability of the regression model to obtain the selectivity estimation of the SQL query.

In this embodiment, a specific DNN neural network is constructed, the neurons of the input layer of the neural network are determined according to the input feature vector, the hidden layer is 2, the number of the neuron nodes of each layer is determined according to the number of the neuron nodes of the input layer, and the number of the neuron nodes of the output layer is determined according to the SVR model. The specific parameters are set by data set and are conventional parameters.

A specific SVR model is constructed, the kernel parameter uses 'rbf', because the training result only needs to be relatively accurate in size, for example, if act (q1) is 200 and act (q2) is 300, then the predicted value est (q1) > est (q2) is sufficient, so the following error formula is adopted,

the penalty parameter C of the error term can use pow (10, n), wherein n is the value of [ -5, inf), the larger C, the greater the penalty for misclassification, which tends to be the case of totally pairing the training set, so that the accuracy is very high during the test of the training set, but the generalization capability is weak. The C value is small, the punishment on misclassification is reduced, the fault-tolerant capability is enhanced, and the generalization capability is strong. The iteration times are determined according to the data set, and the rest parameters are default parameters.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, the above features and (but not limited to) technical features with similar functions disclosed in the embodiments of the present disclosure are mutually replaced to form the technical solution.

Claims

1. A SQL query selectivity estimation method based on machine learning is characterized in that: the method comprises the following steps:

and step 3: defining SQL query q, attribute set A contained by condition predicate_qLet A be_q＝{A₁，A₂，...，A_kThen in SQL query q, attribute A_kIs lb_k≤A_k≤ub_kWherein lb_k，ub_kRespectively represent the attribute A in the condition predicates_kIs defined by a constant value of (a), then for any SQL query q, denoted as F_qIn the form shown below: f_q：(lb_k1≤A_k1≤ub_k1...lb_kn≤A_kn≤ub_kn) Where k is the sequence number of the query and n represents the sequence number of the constant value defined boundary;

if the true selectivity of the SQL query q is acl (q), the set S { (q) is set₁：acl(q₁))，...，(q_k：acl(q_k) ) }, as follows: (F)₁：acl(q₁))，...，(F_k：acl(q_k))}

2. The method according to claim 1, wherein the SQL query information in step 1 includes attributes of conditional predicates, operators, constants, table names in query statements, join fields, table row numbers of database statistics, and unique value numbers.