CN112380243B - SQL query selectivity estimation method based on machine learning - Google Patents

SQL query selectivity estimation method based on machine learning Download PDF

Info

Publication number
CN112380243B
CN112380243B CN202011303961.1A CN202011303961A CN112380243B CN 112380243 B CN112380243 B CN 112380243B CN 202011303961 A CN202011303961 A CN 202011303961A CN 112380243 B CN112380243 B CN 112380243B
Authority
CN
China
Prior art keywords
sql query
query
selectivity
machine learning
acl
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011303961.1A
Other languages
Chinese (zh)
Other versions
CN112380243A (en
Inventor
杨晓春
舒昭维
马红
王斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202011303961.1A priority Critical patent/CN112380243B/en
Publication of CN112380243A publication Critical patent/CN112380243A/en
Application granted granted Critical
Publication of CN112380243B publication Critical patent/CN112380243B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

The invention provides a machine learning-based SQL query selectivity estimation method, and relates to the technical field of databases. The invention provides reliable data support for the query optimization process by predicting the SQL query selectivity, so that a final accurate optimal execution plan is determined by a cost-based estimation prediction (CBO) algorithm, for example, a driving table is determined when merging connections (merge joins). Meanwhile, in the index selection process, the accurate selection degree can provide income prediction for the establishment of the index.

Description

SQL query selectivity estimation method based on machine learning
Technical Field
The invention relates to the technical field of databases, in particular to a SQL query selectivity estimation method based on machine learning.
Background
The field of query optimization has been a hot research direction for decades. Query optimization uses the selectivity prediction to distinguish the quality of a query execution plan, and the execution plan directly influences the execution efficiency of the query. At present, most mainstream databases adopt a method of combining a histogram and sampling to estimate the selectivity, but the estimation by using limited information brings a large amount of errors. Although the mainstream databases currently support limited multi-column statistics, this information is still insufficient to support the prediction of the degree of correlation between a large number of table attributes.
A great deal of research shows that predicate selectivity can be accurately estimated by monitoring the query execution process, and the method can be used for enhancing future selectivity estimation. However, current database systems do not fully exploit this opportunity. As early as 20 years ago, learners used neural network methods to predict UDF predicates. In recent years, with the re-rise and breakthrough development of machine learning methods, optimization of the artificial intelligence method fused with the database field is an emerging research trend. In combination with a machine learning method, quick and accurate selectivity estimation is realized, and new technical innovation in the field of query optimization is brought.
Query optimization relies on a prediction of the selectivity of query predicates to produce a good query execution plan. When a query contains multiple predicates, or join operations, most selectors currently use many assumptions, such as independence between predicates. Although such methods can estimate the selectivity quickly and using a small memory, such methods often have wrong selectivity estimation, resulting in slow queries and unpredictable performance. The method of using histogram to add samples often faces very much problems, for example, there are not enough samples with quality, or there is no suitable index structure to support rapid sampling, and at this time, the method often falls back to a preset method to estimate the degree of selection, for example, a predicate with the lowest degree of selection in predicates is used as the estimated degree of selection, the method obviously has great limitation, and meanwhile, as the table attributes increase, the association degree between the attributes increases, the number of predicates increases, and the like, the effect of predicting the degree of selection of the histogram and sampling method is greatly restricted.
In the query optimization cost estimation algorithm, the current mainstream selectivity estimation method can be used for estimating the selectivity by using an accurate model to quickly aim at corresponding query and obtain the estimation of the selectivity, regardless of histogram statistics, sampling statistics or machine learning model estimation selectivity. In other words, the input is the key factor related to the degree of selection, and the estimated degree of selection is obtained through the calculation model. In the whole process, the key links are input selection and model construction, and then the size of a storage space occupied by the model and the estimated speed of the selectivity are further considered.
With the development and popularization of a machine learning algorithm, various elicitations are brought to the performance optimization of the database. The problem in the process of estimating the selectivity can be well solved in a machine learning model. The whole selectivity prediction model can be used as a supervision type learning model, the input is the characteristics of query conversion, and the output is the predicted selectivity. By using the method, the problem that no proper index structure is available to accelerate sampling can be perfectly solved, and the relevance between table attributes is also taken into consideration. Under the current situation, the machine learning algorithm does not need to perfectly and accurately predict the actual and real selectivity, and only needs to be more accurate than the existing inaccurate estimation method. The output selectivity prediction value of the machine learning model can be directly used by the existing perfect cost prediction model, and meanwhile, the actual structure of the database cannot be influenced.
In the process of optimizing an SQL (Structured Query Language) Query execution plan, a mainstream database basically selects a final execution plan based on a cost estimation (CBO) algorithm, and accurate selectivity prediction can provide reliable data for the parsing process to support the algorithm, such as determining a driving table when merging connections (mergejoin). Meanwhile, in the index selection process, the accurate selection degree can provide income prediction for the establishment of the index (when the index structure is a B tree, the index can bring higher promotion in the query with low selection degree).
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a SQL query selectivity estimation method based on machine learning;
in order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a SQL query selectivity estimation method based on machine learning comprises the following steps:
step 1: acquiring real-time SQL query information of a database, recording the real selection degree corresponding to the SQL query information, and making an initial data set;
the SQL query information comprises the attribute, the operator, the constant, the table name and the join field in the query statement of the condition predicate, and the table row number and the unique value number counted by the database;
step 2: selecting an encoding method of the SQL query information in the step 1, converting the SQL query information in the initial data set in the step 1 into a vector form by using the encoding method, and obtaining an SQL query information vector and a database statistical information vector as the input of a machine learning model;
and step 3: defining SQL query q, attribute set A contained by condition predicateqLet A beq={A1,A2,…,AkThen in SQL query q, attribute AkIs lbk≤Ak≤ubkWherein lbk,ubkRespectively represent the attribute A in the condition predicateskIs defined by a constant value of (a), then for any SQL query q, denoted as FqIn the form shown below: fq:(lbk1≤Ak1≤ubk1…lbkn≤Akn≤ubkn) Where k is the sequence number of the query and n represents the sequence number of the constant value defined boundary;
if the true selectivity of the SQL query q is acl (q), the set S { (q) is set1:acl(q1)),…,(qk:acl(qk) ) }, as follows: (F)1:acl(q1)),…,(Fk:acl(qk))}
And 4, step 4: and (3) selecting a machine learning model for the query q in a given set S, outputting the estimated selectivity est (q) of the query q, enabling the est (q) to be closest to the corresponding acl (q), taking the SQL query information vector and the database statistical information vector calculated in the step 2 as input, performing feature dimension reduction on the vector through a regression model M, and finally obtaining the estimated selectivity of the SQL query through the output of the regression model N.
And 5: and continuously monitoring the information acquired by the database query optimizer and the actual selectivity feedback of the database statistical information to the SQL query, repeating the steps 1-4 to train the regression model, and finally achieving the stability of the regression model to obtain the selectivity prediction of the SQL query.
Adopt the produced beneficial effect of above-mentioned technical scheme to lie in:
the invention provides a machine learning-based SQL query selectivity estimation method, which can be used for providing reliable data support for a query optimization process, so that a cost-based estimation (CBO) algorithm is used for determining a more accurate final optimal execution plan, such as a driving table during merge join (mergejoin). Meanwhile, in the index selection process, the accurate selection degree can provide income prediction for the establishment of the index.
Drawings
FIG. 1 is a frame diagram of a core regression prediction model according to an embodiment of the present invention;
FIG. 2 is a block diagram of an overall method of an embodiment of the present invention;
FIG. 3 is a diagram illustrating feature vector encoding according to an embodiment of the present invention;
FIG. 4 is a frame diagram of a query selectivity prediction method according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for estimating query selectivity according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
The invention provides a method for estimating the selectivity of SQL query based on machine learning, which designs an input characteristic for converting the SQL query into a machine learning model, and designs a structure of a machine learning regression model aiming at the input characteristic, thereby accurately and quickly estimating the selectivity. The core regression model is shown in FIG. 1. Thereby achieving the effect of optimizing the query of the whole database, and the framework is shown in figure 2;
based on the above framework, the invention starts from conversion of query into features, firstly converts factors influencing the selectivity in the query into input acceptable by a machine learning regression model, then converts the features of the query sentence level into features of the same magnitude as the features of the query sentence level by combining the features which can be directly obtained in a database, such as the number of unique values in a column and other information, and jointly uses the features as input features, and then the estimated selectivity is finally obtained through the regression model processing of machine learning. And when the selection degree can be accurately estimated, the storage occupation space of the optimization model and the estimated speed are further considered.
The operating environment in this embodiment:
CPU:Inter(R)Core(TM)I7-8700 CPU@3.20GHz
GPU:RTX 2070
memory: 16.00GB
An operating system platform: ubuntu 16.04 LTS
And (3) developing a language: c + +, Python
Deep learning development framework: pythrch
Developing a tool: VIM, Pycharm, CLion
Querying the data set: data set with high attribute correlation such as IMDb
A method for estimating the selectivity of SQL query based on machine learning is shown in FIG. 5, and includes the following steps:
step 1: acquiring real-time SQL query information of a database, recording the real selection degree corresponding to the SQL query information, and making an initial data set;
the SQL query information comprises the attribute, the operator, the constant, the table name and the join field in the query statement of the condition predicate, and the table row number and the unique value number counted by the database;
step 2: the encoding method of the SQL query information in step 1 is selected, such as the property name of the conditional predicate, the operator, the constant value (requiring normalization processing), the join field, the table row number, and the unique value number, may be encoded by the currently popular one-hot, wherein the property column for the string type also needs to adopt possible hash encoding, and the string is mapped into an input vector that can be accepted by the subsequent machine learning model. By using the encoding method, the SQL query information in the initial data set in the step 1 is converted into a vector form, and an SQL query information vector and a database statistical information vector are obtained and used as the input of a machine learning model, as shown in FIG. 3;
and step 3: defining SQL query q, attribute set A contained by condition predicateqLet A beq={A1,A2,…,AkThen in SQL query q, attribute AkIs lbk≤Ak≤ubkWherein lbk,ubkRespectively represent the attribute A in the condition predicateskThe left and right boundaries of the constant value limit, then for any SQL queryQuery q, denoted FqIn the form shown below: fq:(lbk1≤Ak1≤ubk1…lbkn≤Akn≤ubkn) Where k is the sequence number of the query and n represents the sequence number of the constant value defined boundary;
if the true selectivity of the SQL query q is acl (q), the set S { (q) is set1:acl(q1)),…,(qk:acl(qk) ) }, as follows: (F)1:acl(q1)),…,(Fk:acl(qk))}
And 4, step 4: for a given query q in the set S, selecting a machine learning model regression model M, and outputting the estimated selectivity est (q) of the query q so that est (q) is closest to the corresponding acl (q). The regression models selected in the patent are DNN and SVR, the algorithm flow chart is shown in fig. 4, the SQL query information vector and the database statistical information vector calculated in step 2 are used as input, feature dimension reduction is performed on the vector through the regression model M, overfitting due to excessive data dimensions when the input vector of SQL query information conversion is processed in a high-dimensional space is prevented, meanwhile, the selected regression model has high robustness and fault-tolerant capability (such as DNN), the nonlinear relation can be well solved, then, the regression model M is output as the input of the regression model N (such as SVR), and finally, the pre-estimated value of the selectivity of SQL query is obtained through the output of the regression model N.
And 5: continuously monitoring information obtained by the database query optimizer, such as database statistical table information, unique value number, SQL query information and the like, and the actual selectivity feedback of the database statistical information on the SQL query, repeating the steps 1-4 to train the regression model, and finally achieving the stability of the regression model to obtain the selectivity estimation of the SQL query.
In this embodiment, a specific DNN neural network is constructed, the neurons of the input layer of the neural network are determined according to the input feature vector, the hidden layer is 2, the number of the neuron nodes of each layer is determined according to the number of the neuron nodes of the input layer, and the number of the neuron nodes of the output layer is determined according to the SVR model. The specific parameters are set by data set and are conventional parameters.
A specific SVR model is constructed, the kernel parameter uses 'rbf', because the training result only needs to be relatively accurate in size, for example, if act (q1) is 200 and act (q2) is 300, then the predicted value est (q1) > est (q2) is sufficient, so the following error formula is adopted,
Figure BDA0002787722010000051
the penalty parameter C of the error term can use pow (10, n), wherein n is the value of [ -5, inf), the larger C, the greater the penalty for misclassification, which tends to be the case of totally pairing the training set, so that the accuracy is very high during the test of the training set, but the generalization capability is weak. The C value is small, the punishment on misclassification is reduced, the fault-tolerant capability is enhanced, and the generalization capability is strong. The iteration times are determined according to the data set, and the rest parameters are default parameters.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, the above features and (but not limited to) technical features with similar functions disclosed in the embodiments of the present disclosure are mutually replaced to form the technical solution.

Claims (2)

1. A SQL query selectivity estimation method based on machine learning is characterized in that: the method comprises the following steps:
step 1: acquiring real-time SQL query information of a database, recording the real selection degree corresponding to the SQL query information, and making an initial data set;
step 2: selecting an encoding method of the SQL query information in the step 1, converting the SQL query information in the initial data set in the step 1 into a vector form by using the encoding method, and obtaining an SQL query information vector and a database statistical information vector as the input of a machine learning model;
and step 3: defining SQL query q, attribute set A contained by condition predicateqLet A beq={A1,A2,...,AkThen in SQL query q, attribute AkIs lbk≤Ak≤ubkWherein lbk,ubkRespectively represent the attribute A in the condition predicateskIs defined by a constant value of (a), then for any SQL query q, denoted as FqIn the form shown below: fq:(lbk1≤Ak1≤ubk1...lbkn≤Akn≤ubkn) Where k is the sequence number of the query and n represents the sequence number of the constant value defined boundary;
if the true selectivity of the SQL query q is acl (q), the set S { (q) is set1:acl(q1)),...,(qk:acl(qk) ) }, as follows: (F)1:acl(q1)),...,(Fk:acl(qk))}
And 4, step 4: and (3) selecting a machine learning model for the query q in a given set S, outputting the estimated selectivity est (q) of the query q, enabling the est (q) to be closest to the corresponding acl (q), taking the SQL query information vector and the database statistical information vector calculated in the step 2 as input, performing feature dimension reduction on the vector through a regression model M, and finally obtaining the estimated selectivity of the SQL query through the output of the regression model N.
2. The method according to claim 1, wherein the SQL query information in step 1 includes attributes of conditional predicates, operators, constants, table names in query statements, join fields, table row numbers of database statistics, and unique value numbers.
CN202011303961.1A 2020-11-19 2020-11-19 SQL query selectivity estimation method based on machine learning Active CN112380243B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011303961.1A CN112380243B (en) 2020-11-19 2020-11-19 SQL query selectivity estimation method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011303961.1A CN112380243B (en) 2020-11-19 2020-11-19 SQL query selectivity estimation method based on machine learning

Publications (2)

Publication Number Publication Date
CN112380243A CN112380243A (en) 2021-02-19
CN112380243B true CN112380243B (en) 2021-11-09

Family

ID=74584606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011303961.1A Active CN112380243B (en) 2020-11-19 2020-11-19 SQL query selectivity estimation method based on machine learning

Country Status (1)

Country Link
CN (1) CN112380243B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200320436A1 (en) * 2019-04-08 2020-10-08 Google Llc Transformation for machine learning pre-processing

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656437B (en) * 2021-07-02 2023-10-03 阿里巴巴新加坡控股有限公司 Model construction method for predicting execution cost stability of reference

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354255A (en) * 2015-10-21 2016-02-24 华为技术有限公司 Data query method and apparatus
CN111274270A (en) * 2020-02-24 2020-06-12 北京东方金信科技有限公司 Statistical information processing and using method of database optimizer and storage device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140244641A1 (en) * 2013-02-27 2014-08-28 Wal-Mart Stores, Inc. Holistic customer record linkage via profile fingerprints

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354255A (en) * 2015-10-21 2016-02-24 华为技术有限公司 Data query method and apparatus
CN111274270A (en) * 2020-02-24 2020-06-12 北京东方金信科技有限公司 Statistical information processing and using method of database optimizer and storage device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Estimating a1Phanumeric seleetivity in the Presenee of wildcards;Krishnan P等;《In Proeeedings of SIGMOD》;19961231;全文 *
面向关系数据库的智能索引调优方法;邱涛等;《软件学报》;20200110;全文 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200320436A1 (en) * 2019-04-08 2020-10-08 Google Llc Transformation for machine learning pre-processing

Also Published As

Publication number Publication date
CN112380243A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN111310438B (en) Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN111753101B (en) Knowledge graph representation learning method integrating entity description and type
CN110674850A (en) Image description generation method based on attention mechanism
CN112380243B (en) SQL query selectivity estimation method based on machine learning
CN112000772B (en) Sentence-to-semantic matching method based on semantic feature cube and oriented to intelligent question and answer
CN105740984A (en) Product concept performance evaluation method based on performance prediction
CN110888859B (en) Connection cardinality estimation method based on combined deep neural network
EP4075292A1 (en) Method and apparatus for processing database
CN112000770A (en) Intelligent question and answer oriented sentence-to-sentence matching method based on semantic feature map
Zhao et al. Queryformer: A tree transformer model for query plan representation
CN114911844B (en) Approximate query optimization system based on machine learning
CN115617830A (en) Data query optimization processing method and device based on machine learning
CN114547230A (en) Intelligent administrative law enforcement case information extraction and case law identification method
CN112749191A (en) Intelligent cost estimation method and system applied to database and electronic equipment
CN112883066B (en) Method for estimating multi-dimensional range query cardinality on database
CN116643759A (en) Code pre-training model training method based on program dependency graph prediction
Li et al. GLDH: Toward more efficient global low-density locality-sensitive hashing for high dimensions
CN114564410A (en) Software defect prediction method based on class level source code similarity
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
Gao et al. Automatic index selection with learned cost estimator
CN111897932A (en) Query processing method and system for text big data
CN116383239B (en) Mixed evidence-based fact verification method, system and storage medium
CN117909363A (en) Radix estimation method and system based on dynamic sample recommendation
Su World CO2 Emissions: Simple Analysis and its Relationship with Global Temperature Change
CN117390063A (en) Listwise ordering learning-based database querier optimization method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant