CN115576970A - Database cost model parameter optimization method based on decision tree and query method thereof - Google Patents

Database cost model parameter optimization method based on decision tree and query method thereof Download PDF

Info

Publication number
CN115576970A
CN115576970A CN202211054493.8A CN202211054493A CN115576970A CN 115576970 A CN115576970 A CN 115576970A CN 202211054493 A CN202211054493 A CN 202211054493A CN 115576970 A CN115576970 A CN 115576970A
Authority
CN
China
Prior art keywords
cost
cost model
database
query
execution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211054493.8A
Other languages
Chinese (zh)
Inventor
伍赛
贺婷婷
杨佳妮
陈刚
陈珂
寿黎但
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202211054493.8A priority Critical patent/CN115576970A/en
Publication of CN115576970A publication Critical patent/CN115576970A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation

Abstract

The invention discloses a database cost model parameter optimization method based on a decision tree and a query method thereof. Aiming at a database example under specific software and hardware settings, the invention establishes a database cost model parameter tree, the parameter space is partitioned by using database configuration parameters and query statement characteristics as splitting dimensions, and the optimal cost model parameters are solved by linear fitting of training samples in each partition. In the operation process, the parameter tree distributes different cost model parameters for query statements under different parameter configurations and data distribution, so that accurate cost prediction is performed. Experiments show that the method improves the prediction accuracy of the traditional rule-based estimation model and optimizes the query performance of the database.

Description

Database cost model parameter optimization method based on decision tree and query method thereof
Technical Field
The invention designs a database cost model parameter optimization method based on a decision tree, and belongs to the field of query optimization in database management system software.
Background
Database management system software (hereinafter, referred to as a database) is responsible for completing storage and reading of data, and after a user inputs a query statement, the database acquires the data from the storage according to the generated execution plan and performs corresponding processing, and returns the data meeting the requirements. The same query statement may generate multiple possible execution plans in the database, and the differences include data scanning modes, table connection methods, table connection orders, and the like. Different execution plans may result in different execution times and even an order of magnitude gap. To improve the performance of the database, the database should be able to select an execution plan with a shorter execution time in the execution plan search space.
At present, most of databases are implemented by using a cost-based query optimization method, a cost model is set to perform cost estimation on different query execution plans, and the execution plan with the minimum cost is selected for execution. A good cost model needs to provide accurate cost estimation, so that the database is helped to optimize the execution time of query sentences, and the performance of the database is improved; meanwhile, the cost model should be able to quickly estimate the cost of the execution plan because the database query optimizer module may call the cost model multiple times in selecting the execution plan to evaluate the performance of the execution plan.
The traditional cost-based query optimization method uses a rule-based cost model, estimates the execution cost of operation by reading the number of pages and processing the number of tuples, and allows database management personnel to perform parameter optimization according to the performance of a physical machine through adjustable cost parameters. However, the cost model which is simplified excessively ignores the influence of some hardware and software configurations, database parameter configurations and data distribution of a data set, so that the cost estimation result of the query statement deviates from the real execution cost in the actual operation process of the database. Therefore, a cost model based on learning appears, the execution cost of the query plan is predicted by extracting the features of the data operation or the query plan, and the prediction accuracy of the cost model is improved. However, the cost model based on learning requires long training, inference time, and retraining when the database configuration or data distribution changes, and the application range is narrow.
Disclosure of Invention
The cost model based on the rules considers a plurality of fine-grained characteristics in the implementation process, has better interpretability and universality compared with the cost model based on learning, but the effect depends on the selection of parameters, and the parameters are influenced by hardware and software configuration, database parameter configuration and data distribution of a data set. Aiming at the defects of the prior art, the invention aims to provide a database cost model parameter optimization method based on a decision tree, which selects cost model parameters most conforming to the actual execution environment for a cost model based on rules according to the parameter configuration condition of a database and a data set when an inquiry statement runs. According to the method, firstly, variables influencing cost model parameters based on rules are determined, then, the root nodes of the cost model parameter tree are obtained through the training of a predefined query statement set, and the cost model parameter tree is continuously trained in an incremental learning mode when a database executes a work load query statement set. In actual use, given a configuration combination, the model returns suggested settings of cost model parameters that the cost model uses to make a cost estimate of the execution plan.
The invention adopts the following specific technical scheme:
in a first aspect, the present invention provides a database cost model parameter optimization method based on a decision tree, which includes the following steps:
s1, running a predefined query statement set in a database instance to obtain a query statement execution result corresponding to each query statement in the set; respectively extracting key features and execution time of the data operation from each query statement execution result to serve as a first training sample; fitting a linear model type cost model by using all the first training samples as fitting data, and using initial values of cost model parameters obtained by fitting as root nodes of a cost model parameter tree;
s2, executing different query sentences in the database workload, navigating to corresponding nodes in the decision tree by using the latest cost model parameter tree as the decision tree for each query sentence through database configuration parameters and query sentence characteristics, thereby determining cost model parameters for calculating different execution plan costs, and extracting key characteristics and execution time of data operation from the query sentence execution result as second training samples associated with the corresponding nodes; when a second training sample associated with a node in the decision tree reaches a node splitting condition, taking database configuration parameters and query statement characteristics as splittable dimensions, performing node splitting on the cost Model parameter tree by adopting a Model-based Recursive Partitioning (MOB) method so as to form child nodes corresponding to different subspaces, correspondingly dividing the second training sample on the father node to each child node, and respectively fitting to obtain cost Model parameters corresponding to each child node;
s3, continuously executing S2, and performing iterative training on the decision tree in an incremental learning mode so as to continuously perform node splitting, so that different leaf nodes on the trained decision tree respectively have cost model parameters corresponding to different database configuration parameters and query statement characteristics; and finally, determining cost model parameters for the query statement under the given database configuration parameters by using the trained decision tree, wherein the cost model parameters are used for the database instance to estimate the cost of the execution plan.
The present invention can further provide the following preferred embodiments based on the above technical means. And the technical characteristics in each preferred mode of the invention can be correspondingly combined on the premise of no conflict.
As a preference of the first aspect, in S1, the predefined query statement set is a set of single table scan statements generated for a test data set, and each single table scan statement includes two data operations of index scan and sequential scan.
As the first partyPreferably, in S1, the key feature of the data operation includes the number N of times of sequentially reading a disk page s And the number N of times of randomly reading the disk page r Number of executions N of operator or function o Number of data lines processed in query N t The number N of index items processed in index scanning i
As a preferable aspect of the first aspect, in S1, the cost model is in the form of
Cost=N t ×r t +N o ×r o +N i ×r i +N s ×r s +N r ×r r
In the formula: cost denotes the Cost, r t Cost estimate, r, representing one sequential disk page fetch r Representing an estimate of the cost, r, of a random read of a disk page at a time o Representing the cost estimate, r, for processing each operator or function in a query t Representing the cost estimate, r, for each line processed in a query i Representing the cost estimate for processing each index entry in one index scan.
Preferably, in S1, when fitting the cost model in the form of a linear model, the objective function is minimized
Figure BDA0003824431590000031
Obtaining optimal cost model parameters
Figure BDA0003824431590000032
The cost model parameters are combined
Figure BDA0003824431590000033
As root node element of cost model parameter tree; wherein, I is a training sample set composed of all the first training samples, and the error term of the ith training sample
Figure BDA0003824431590000034
r=[r t ,r o ,r i ,r s ,r r ] T Is to standOptimized database cost model parameter, x i =[N t ,N o ,N i ,N s ,N r ] T Is a key feature of data manipulation in the result of query statement execution in the ith training sample,
Figure BDA0003824431590000035
is the cost estimation result of the ith training sample,
Figure BDA0003824431590000036
is the actual execution time of the ith training sample.
Preferably, the specific process of S2 is as follows:
s21, the database executes the workload, records an execution plan of the query statement and database configuration parameters during the execution of the query statement during the execution, searches corresponding leaf nodes from a decision tree according to the characteristics of the query statement and the database configuration parameters during the execution of the query statement, inputs cost model parameters owned by the leaf nodes into the cost model, calculates the execution cost of each execution plan, and selects the execution plan with the minimum cost for execution; after the execution is finished, the database extracts key features and execution time of data operation in the execution plan from recorded query statement execution results to form second training samples under corresponding leaf nodes in the cost model parameter tree;
and S22, continuously repeating S21, accumulating second training samples on leaf nodes of the cost model parameter tree, selecting splitting dimensionality and splitting value to form a plurality of child nodes by adopting a model-based recursive partitioning method aiming at one leaf node after a second training sample set under the leaf node reaches a node splitting condition, correspondingly dividing the second training samples on the father node to each child node, and fitting the cost model again by taking all the second training samples on each child node as fitting data to obtain cost model parameters corresponding to each child node.
As a preferable aspect of the first aspect, in the divisible dimensions of the cost model parameter tree, the dimensions of the database configuration parameters select configuration parameters that are adjustable when the database executes the query statement, and the dimensions of the query statement features select information related to data distribution in the query statement.
Preferably, the dimension of the database configuration parameter includes, but is not limited to, work _ mem and temp _ buffer.
Preferably, the dimensions of the query statement feature include, but are not limited to, offset of columns, inter-column relevance, and data type.
In a second aspect, the present invention provides a database query method, which comprises: after the trained decision tree is obtained according to the optimization method of the first aspect, a database analyzes a target query statement to generate a logic execution plan consisting of a plurality of operations, and then a plurality of feasible physical execution plans are generated according to the logic execution plan; inputting the database configuration parameters and the query sentence characteristics when the target query sentence is executed into the trained decision tree, navigating to the corresponding leaf node in the decision tree and obtaining the cost model parameters corresponding to the leaf node; and substituting the obtained cost model parameters into the cost model, carrying out cost estimation on each physical execution plan of the target query statement, selecting the physical execution plan with the minimum cost according to the estimation result to execute data query, and returning the query result.
Compared with the prior art, the invention has the following beneficial effects:
for one query statement, the invention can match the subspace in the cost model parameter tree according to the configuration parameters of the database and the information provided by the query statement to obtain the recommended cost model parameters in the space, thereby providing accurate cost estimation. Compared with a cost model based on rules under default parameters, the estimated cost provided by the invention is more consistent with the real running time of the execution plan, and the database can be assisted to select the execution plan with shorter running time, so that the query performance of the database is improved. Compared with a cost model based on learning, the method has the advantages that the required training time is short, and an available model can be quickly trained to deal with the change of the software and hardware configuration of the database and the data distribution of the data set. In conclusion, the invention provides an accurate and efficient cost model optimization method.
Drawings
FIG. 1 is a flow chart of implementation steps of a decision tree-based database cost model parameter optimization method of the present invention.
FIG. 2 is a schematic diagram of a database query statement execution flow.
FIG. 3 is a schematic diagram of the database cost model parameter tree usage of the present invention.
FIG. 4 is a schematic diagram of node splitting of a database cost model parameter tree according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, as those skilled in the art will recognize without departing from the spirit and scope of the present invention.
In the description of the present invention, it is to be understood that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of the feature.
As shown in fig. 1, in a preferred embodiment of the present invention, a method for optimizing database cost model parameters based on a decision tree is provided, which specifically includes the following steps:
step 1: aiming at a database instance under specific hardware and software configuration, executing a predefined query statement set to obtain a query statement execution result corresponding to each query statement in the set; respectively extracting key features and execution time of the data operation from each query statement execution result to serve as a first training sample; all the first training samples are used as fitting data, the cost model in the form of the linear model is fitted through a fitting method based on the linear model, the optimal initial value of the cost model parameter under software and hardware configuration of the database instance is obtained, and the initial value of the cost model parameter obtained through fitting is used as the root node of the cost model parameter tree. Therefore, the operation that the database manager adjusts the cost model parameters according to the hardware and software configuration under the traditional situation can be considered.
It should be noted that the software and hardware configuration of the database instance in the present invention includes hardware configuration (hard disk type, memory size, CPU frequency, etc.), software configuration (system version, database software version, etc.), and does not change during the operation of the database instance. Since the hardware and software configuration may affect the selection of the optimal parameters of the cost model, the specific database instance software and hardware configuration needs to be set according to the database instance that is finally required to be applied.
It should be noted that, the data operation in the present invention refers to a scanning operation, a connection operation, and the like, and the key features thereof include five parameters: n is a radical of hydrogen s Indicating the number of sequential reads of a disk page, N r Representing the number of random reads of a disk page, N o Representing the number of executions, N, of an operator or function t Representing the number of data lines processed in a query, N i Indicating the number of index entries processed in the index scan. The execution time of the data operation is composed of IO time and CPU time. In actual application, the data can be acquired according to functions provided by database software.
It should be noted that, in the database adopting the cost-based optimization method, the specific execution process of the query statement is as follows: the database analyzes the statement after receiving the query statement to generate a logic execution plan composed of a plurality of operations, and the same logic operation can have different physical implementations, for example, the implementation of the scanning operation includes sequential scanning, index scanning, bitmap scanning, and the like. Therefore, the logic execution plan can further generate various feasible physical execution plans, the database operation cost model carries out cost estimation on each physical execution plan, the physical execution plan with lower cost is finally selected for execution, the query result is returned, and the key characteristics and the execution time of data operation in the process are recorded. As shown in fig. 2, for the execution flow of the query statement in an example, for the four physical execution plans a to D, the cost of the B plan among the costs of the execution plans obtained by the final cost model estimation is the minimum, which is 150, so that the physical execution plan is selected as the final execution plan, and the execution result including the query result, the operation feature and the execution time is finally obtained.
Traditional rule-based cost models provide the database administrator with five adjustable parameters, each data operation having a different cost formula depending on the type of operation. One particular query plan is a tree structure of operations whose costs are to be combined from the costs of the corresponding operations. When calculating initial values of parameters of a cost model of a database instance, a predefined query statement set refers to a single-table scan statement generated for a test data set, and includes two data operations of index scan and sequential scan, and the cost model can be unified into a linear model involving five adjustable parameters:
Cost=N t ×r t +N o ×r o +N i ×r i +N s ×r s +N r ×r r
wherein r is t Cost estimate, r, representing one sequential disk page fetch r Representing an estimate of the cost of acquiring a disk page non-sequentially (i.e., reading a disk page randomly), r o Representing an estimate of the cost of processing each operator or function in a query, r t Represents the cost estimate, r, for each line processed in a query i Representing the cost estimate for processing each index entry in one index scan. Cost represents the Cost, and is equivalent to the execution time of the database query to be predicted.
Step 1-3: in calculating the initial value of the cost model of the database instance, a fitting method based on a linear model is used. The linear model-based fitting method is described in detail below.
Recording a training sample set consisting of all first training samples obtained after executing a predefined query statement set as I, wherein the cost is in a linear model formWhen the model is fitted, the objective function can be minimized
Figure BDA0003824431590000071
Obtaining optimal cost model parameters
Figure BDA0003824431590000072
Namely, the initial value of the cost model parameter is obtained, and the cost model parameter is used
Figure BDA0003824431590000073
As a root node element of the cost model parameter tree. Wherein, the error term of the ith training sample
Figure BDA0003824431590000074
r=[r t ,r o ,r i ,r s ,r r ] T Is the database cost model parameter to be optimized, x i =[N t ,N o ,N i ,N s ,N r ] T Is a key feature of data manipulation in the query sentence execution result in the ith training sample,
Figure BDA0003824431590000075
is the cost estimation result of the ith training sample,
Figure BDA0003824431590000076
is the actual execution time of the ith training sample.
And 2, step: and executing different query sentences in the database workload, and navigating to corresponding nodes in the decision tree by using the latest cost model parameter tree as the decision tree for each query sentence through the database configuration parameters and the query sentence characteristics, so as to determine the cost model parameters for calculating different execution plan costs, and extracting the key characteristics and the execution time of data operation from the query sentence execution result as second training samples associated with the corresponding nodes. When a second training sample associated with a node in the decision tree reaches a node splitting condition, taking the database configuration parameters and the query statement characteristics as splittable dimensions, performing node splitting on the cost Model parameter tree by adopting a Model-based Recursive Partitioning (MOB) method so as to form child nodes corresponding to different subspaces, correspondingly dividing the second training sample on the father node to each child node, and respectively fitting to obtain cost Model parameters corresponding to each child node.
As a preferred implementation manner of the embodiment of the present invention, the step 2 can be implemented by the following steps 2-1 and 2-2.
Step 2-1: the database executes the workload, records the execution plan of the query statement and the database configuration parameters when the query statement is executed, searches the corresponding leaf nodes from the decision tree according to the characteristics of the query statement and the database configuration parameters when the query statement is executed, inputs the cost model parameters owned by the leaf nodes into the cost model, calculates the execution cost of each execution plan, and selects the execution plan with the minimum cost for execution; and after the execution is finished, the database extracts key features and execution time of data operation in the execution plan from the recorded query statement execution result to form a second training sample under the corresponding leaf node in the cost model parameter tree.
It should be noted that, when executing the query statement in the database workload, various data can be recorded in the database history query record. The query plan is extracted from the historical query records of the database and can be used for constructing training data.
It should be noted that the splittable dimension of the cost model parameter tree includes two aspects, namely, configuring parameters and query statement features for the database respectively. The dimension of the database configuration parameters may select configuration parameters (such as work _ mem, temp _ buffer, and other configuration parameters) that can be adjusted when the database executes the query statement, and the dimension of the query statement feature may select information (such as offset of a column, correlation between columns, and data type) related to data distribution in the query statement. According to the query statement characteristics and the database configuration parameters during query statement operation, the database finds corresponding leaf nodes for data operation, so that the recommended cost model parameters can be obtained and input into the cost model calculation formula to obtain the execution cost of the data operation, and the execution plan with the minimum cost is selected for execution. And then, recording the operation characteristics and the execution time in the query sentence execution result by the database to form a training sample under the corresponding leaf node of the cost model parameter tree.
In one specific example of the present invention, the structure and usage of the cost model parameter tree are shown in fig. 3. And the root node of the cost model parameter tree is a cost model parameter initial value formed by fitting all the first training samples obtained by running the test data set. The root node subordination is split according to the scanning operation mode, and comprises child nodes such as sequential scanning SeqScan and Index scanning IndexScan, and the node subordination of IndexScan is divided into different child nodes according to the inter-column correlation Index _ correlation. The illustrated query statement generates two types of execution plans, a sequential scan SeqScan and an Index scan IndexScan, during the running of the workload, so that the execution plan navigation of SeqScan locates to SeqScan leaf nodes and computes the cost of the execution plan with the cost model parameters on the nodes, while the execution plan navigation of IndexScan locates to leaf nodes with Index _ correlation ≦ 0.5 and computes the cost of the execution plan with the cost model parameters on the nodes. And finally selecting the execution plan of the IndexScan for execution after the cost comparison of the two execution plans, obtaining an execution result containing the operation characteristics and the execution time, and associating the execution result to a leaf node with the Index _ correlation being less than or equal to 0.5.
It should be particularly noted that only a root node exists after the initial construction of the cost model parameter tree, and at this time, a leaf node, i.e., a root node, found from the decision tree is associated with the second training sample generated according to the query statement feature and the database configuration parameter during the execution of the query statement.
Step 2-2: and (3) continuously repeating the step (2-1), accumulating second training samples on leaf nodes of the cost model parameter tree, selecting splitting dimensions and splitting values to form a plurality of child nodes by adopting a model-based recursive partitioning method aiming at the leaf nodes after a second training sample set under one leaf node reaches a node splitting condition, correspondingly dividing the second training samples on father nodes to the child nodes, and fitting the cost model again by taking all the second training samples on each child node as fitting data to obtain cost model parameters corresponding to the child nodes.
It should be noted that the MOB method, which is the recursive partitioning method based on the model, belongs to the prior art, and specific reference may be made to the prior art documents: zeileis A, hothorn T, hornik K.model-based reflective characterization [ J ]. Journal of Computational and Graphical Statistics,2008,17 (2): 492-514. For convenience of understanding, the following detailed description is provided for a specific process of splitting sub-nodes of a cost model parameter tree by using the MOB method in an embodiment:
step 2-2-1: and defining an optimization target of the node after the child node is split. The partition mode is recorded as
Figure BDA0003824431590000091
I b To represent
Figure BDA0003824431590000092
Sample set of partitions, for each partition
Figure BDA0003824431590000093
I.e. the split child nodes, by means of a linear model
Figure BDA0003824431590000094
To fit a cost model. Parent node needs to minimize global objective function
Figure BDA0003824431590000095
Obtaining
Figure BDA0003824431590000096
For a given set of partitions
Figure BDA0003824431590000097
By greedily calculating the optimal parameter estimate for each partition
Figure BDA0003824431590000098
To obtain a globally optimal parameter estimate
Figure BDA0003824431590000099
Step 2-2-2: and determining a value range H of the splittable dimension variable of the cost model parameter tree, setting a confidence level alpha, and testing the parameter instability of all the splittable dimension variables H belonging to H according to the MOB partitioning method. By adopting supLM test on numerical variables and chi-square test on category variables, each splittable dimension variable obtains a p value to represent the stability of the splittable dimension variable, the minimum p value is selected to represent the variable with the highest instability, and if the value is smaller than the confidence level alpha, splitting operation can be carried out on the original node according to the corresponding splittable dimension variable.
Step 2-2-3: for the selected splitting dimension variable h, the value range of the candidate splitting value is u h . Splitting value s ∈ u for each candidate h The original node is split into two sub-nodes with x being less than or equal to s and x being more than s according to the split value s, the training samples in the original node are distributed to the corresponding sub-nodes, and then the optimal global objective function value G(s) = minJ of the original node at the moment is calculated according to the global objective function given in the step 2-2-1 s (I, r) splitting the original node by selecting a split value which minimizes the global objective function of the original node after splitting, i.e. splitting the original node
Figure BDA00038244315900000910
In a specific example of the present invention, the process of splitting nodes of the cost model parameter tree in fig. 3 is shown in fig. 4.
It should be noted that, because the cost model parameter tree of the present invention is continuously updated iteratively in the training process, the decision tree relied on when the database executes the query statement needs to adopt the cost model parameter tree whose execution time is the latest.
And 3, step 3: and (3) repeating the step (2), and continuously dividing nodes to partition a parameter subspace according to the database configuration parameters and the query statement characteristics, so that query statements with different configurations and different data distributions have independent cost model parameters.
And (3) continuously and repeatedly executing the step (2), and performing iterative training on the decision tree in an incremental learning mode, so that the decision tree can continuously split nodes according to the configuration parameters of the database and the characteristics of the query statement to partition a parameter subspace, and finally different leaf nodes on the trained decision tree respectively have cost model parameters corresponding to different configuration parameters of the database and the characteristics of the query statement.
The training process of the decision tree belongs to the prior art, and is not repeated, and the final training termination condition can refer to the existing decision tree construction algorithm framework.
And after the decision tree is trained, determining cost model parameters for the query statement under the given database configuration parameters by using the trained decision tree, wherein the cost model parameters are used for the database instance to estimate the cost of the execution plan.
Further, in another preferred embodiment of the present invention, a database query method is provided, which specifically includes obtaining a trained decision tree according to the method for optimizing database cost model parameters based on a decision tree provided in the above embodiment, analyzing a target query statement by a database to generate a logic execution plan composed of a plurality of operations, and generating a plurality of feasible physical execution plans according to the logic execution plan; inputting the database configuration parameters and the query sentence characteristics when the target query sentence is executed into the trained decision tree, navigating to the corresponding leaf node in the decision tree and obtaining the cost model parameters corresponding to the leaf node; and substituting the obtained cost model parameters into the cost model, carrying out cost estimation on each physical execution plan of the target query statement, selecting the physical execution plan with the minimum cost according to the estimation result to execute data query, and returning a query result.
In summary, for a database example under specific software and hardware settings, the invention establishes a database cost model parameter tree, partitions the parameter space by using the database configuration parameters and the query sentence characteristics as the splitting dimensions, and solves the optimal cost model parameters through linear fitting of training samples in each partition. In practical application, the cost model parameter tree can allocate different cost model parameters to query statements under different parameter configurations and data distributions, so that accurate cost prediction is performed. Experiments show that the method improves the prediction accuracy of the traditional rule-based estimation model and optimizes the query performance of the database.
The above-described embodiments are merely preferred embodiments of the present invention, and are not intended to limit the present invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (10)

1. A database cost model parameter optimization method based on a decision tree is characterized by comprising the following steps:
s1, running a predefined query statement set in a database instance to obtain a query statement execution result corresponding to each query statement in the set; extracting key features and execution time of the data operation from each query statement execution result respectively to serve as a first training sample; fitting a linear model type cost model by using all the first training samples as fitting data, and using initial values of cost model parameters obtained by fitting as root nodes of a cost model parameter tree;
s2, executing different query sentences in the database workload, navigating to corresponding nodes in the decision tree by using the latest cost model parameter tree as the decision tree for each query sentence through database configuration parameters and query sentence characteristics, thereby determining cost model parameters for calculating different execution plan costs, and extracting key characteristics and execution time of data operation from the query sentence execution result as second training samples associated with the corresponding nodes; when a second training sample associated with one node in the decision tree reaches a node splitting condition, taking database configuration parameters and query statement characteristics as splittable dimensions, performing node splitting on the cost Model parameter tree by adopting a Model-based Recursive Partitioning (MOB) method so as to form child nodes corresponding to different subspaces, correspondingly dividing the second training sample on the father node to each child node, and respectively fitting to obtain cost Model parameters corresponding to each child node;
s3, continuously executing S2, and performing iterative training on the decision tree in an incremental learning mode so as to continuously perform node splitting, so that different leaf nodes on the trained decision tree respectively have cost model parameters corresponding to different database configuration parameters and query statement characteristics; and finally, determining cost model parameters for the query statement under the given database configuration parameters by using the trained decision tree, wherein the cost model parameters are used for the database instance to estimate the cost of the execution plan.
2. The decision tree-based database cost model parameter optimization method according to claim 1, wherein in S1, the predefined query statement set is a set of single table scan statements generated for a test data set, each single table scan statement containing both index scan and sequential scan data operations.
3. The decision tree-based database cost model parameter optimization method of claim 2, wherein in S1, the key features of the data operation include the number N of times of sequentially reading disk pages s Number of times N of random reading of disk page r N number of executions of an operator or function o Number of data lines processed in query N t And the number N of index items processed in the index scanning i
4. The decision tree-based database Cost model parameter optimization method of claim 3, wherein in S1, the Cost model is in the form of Cost = N t ×r t +N o ×r o +N i ×r i +N s ×r s +N r ×r r
In the formula: cost denotes the Cost, r t Cost estimate, r, representing one sequential disk page fetch r Represents once for one timeEstimate of the cost of machine-reading a disk page, r o Representing an estimate of the cost of processing each operator or function in a query, r t Represents the cost estimate, r, for each line processed in a query i Representing the cost estimate for processing each index entry in one index scan.
5. The decision tree-based database cost model parameter optimization method of claim 1, wherein in S1, the cost model in the form of a linear model is fitted by minimizing an objective function
Figure FDA0003824431580000021
Obtaining optimal cost model parameters
Figure FDA0003824431580000022
The cost model parameters are combined
Figure FDA0003824431580000023
As root node element of cost model parameter tree; wherein, I is a training sample set composed of all the first training samples, and the error term of the ith training sample
Figure FDA0003824431580000024
r=[r t ,r o ,r i ,r s ,r r ] T Is the database cost model parameter to be optimized, x i =[N t ,N o ,N i ,N s ,N r ] T Is a key feature of data manipulation in the query sentence execution result in the ith training sample,
Figure FDA0003824431580000025
is the cost estimation result of the ith training sample,
Figure FDA0003824431580000026
is the actual execution time of the ith training sample.
6. The decision tree-based database cost model parameter optimization method according to claim 1, wherein the specific process of S2 is as follows:
s21, the database executes the workload, records an execution plan of the query statement and database configuration parameters during the execution of the query statement during execution, searches corresponding leaf nodes from the decision tree according to the characteristics of the query statement and the database configuration parameters during the execution of the query statement, inputs cost model parameters owned by the leaf nodes into the cost model, calculates the execution cost of each execution plan, and selects the execution plan with the minimum cost for execution; after the execution is finished, the database extracts key features and execution time of data operation in the execution plan from recorded query statement execution results to form second training samples under corresponding leaf nodes in the cost model parameter tree;
and S22, continuously repeating S21, accumulating second training samples on leaf nodes of the cost model parameter tree, selecting splitting dimensionality and splitting value to form a plurality of child nodes by adopting a model-based recursive partitioning method aiming at one leaf node after a second training sample set under the leaf node reaches a node splitting condition, correspondingly dividing the second training samples on the father node to each child node, and fitting the cost model again by taking all the second training samples on each child node as fitting data to obtain cost model parameters corresponding to each child node.
7. The decision tree-based database cost model parameter optimization method according to claim 6, wherein in the splittable dimensions of the cost model parameter tree, the dimensions of the database configuration parameters select configuration parameters that can be adjusted when a query statement is executed, and the dimensions of the query statement features select information in the query statement related to data distribution.
8. The decision tree-based database cost model parameter optimization method of claim 6, wherein the dimensions of the database configuration parameters include, but are not limited to, work mem and temp buffer.
9. The decision tree-based database cost model parameter optimization method of claim 6, wherein the dimensions of the query statement feature include, but are not limited to, offset of column, inter-column correlation, and data type.
10. A database query method is characterized in that after a trained decision tree is obtained according to the optimization method of claim 1, a database analyzes a target query statement to generate a logic execution plan consisting of a plurality of operations, and then a plurality of feasible physical execution plans are generated according to the logic execution plan; inputting the database configuration parameters and the query sentence characteristics when the target query sentence is executed into the trained decision tree, navigating to the corresponding leaf node in the decision tree and obtaining the cost model parameters corresponding to the leaf node; and substituting the obtained cost model parameters into the cost model, carrying out cost estimation on each physical execution plan of the target query statement, selecting the physical execution plan with the minimum cost according to the estimation result to execute data query, and returning a query result.
CN202211054493.8A 2022-08-31 2022-08-31 Database cost model parameter optimization method based on decision tree and query method thereof Pending CN115576970A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211054493.8A CN115576970A (en) 2022-08-31 2022-08-31 Database cost model parameter optimization method based on decision tree and query method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211054493.8A CN115576970A (en) 2022-08-31 2022-08-31 Database cost model parameter optimization method based on decision tree and query method thereof

Publications (1)

Publication Number Publication Date
CN115576970A true CN115576970A (en) 2023-01-06

Family

ID=84580184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211054493.8A Pending CN115576970A (en) 2022-08-31 2022-08-31 Database cost model parameter optimization method based on decision tree and query method thereof

Country Status (1)

Country Link
CN (1) CN115576970A (en)

Similar Documents

Publication Publication Date Title
US6108648A (en) Optimizer with neural network estimator
US8386463B2 (en) Method and apparatus for dynamically associating different query execution strategies with selective portions of a database table
EP2369506B1 (en) System and method of optimizing performance of schema matching
US20190034485A1 (en) System and method for optimizing large database management systems with multiple optimizers
US8108399B2 (en) Filtering of multi attribute data via on-demand indexing
Wolf et al. Query processing over incomplete autonomous databases: query rewriting using learned data dependencies
EP3531314B1 (en) Optimal ranges for relational query execution plans
EP4075292A1 (en) Method and apparatus for processing database
CN115617830A (en) Data query optimization processing method and device based on machine learning
CN114637760A (en) Intelligent question and answer method and system
CN110990580A (en) Knowledge graph construction method and device, computer equipment and storage medium
CN116830097A (en) Automatic linear clustering recommendation for database region maps
US20230126509A1 (en) Database management system and method for graph view selection for a relational-graph database
CN111723076A (en) Method and device for generating database index
Feldman et al. A knowledge-based approach for index selection in relational databases
CN111597400A (en) Computer retrieval system and method based on way-finding algorithm
CN115576970A (en) Database cost model parameter optimization method based on decision tree and query method thereof
Ameri et al. On a new approach to the index selection problem using mining algorithms
CN112948357A (en) Tuning mechanism facing multimode database OrientDB and construction method thereof
Gibas et al. Online index recommendations for high-dimensional databases using query workloads
CN117390064B (en) Database query optimization method based on embeddable subgraph
Akash et al. Online aggregation based approximate query processing: A literature survey
Mishra et al. Localization of Data Sets in Distributed Database Systems Using Slope-Based Vertical Fragmentation
Arsov et al. Prediction of Horizontal Data Partitioning Through Query Execution Cost Estimation
Huang et al. DITune: A Reinforcement Learning Based Framework for Automated Database Index Selection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination