Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
The basic statistical information mainly refers to a type of information describing tables, sizes and scales of indexes, distribution conditions of the database, and the like, for example, the number of rows of the tables, the number of blocks, the size of each row on average, the number of rows of the index field, the sizes of different values, and the like. The basic statistical information can be obtained only by traversing each table, and the tables are necessarily accessed by frequently calling an I/O process during traversal. However, in practical applications, when data is queried, related calculation is not the most time-consuming, and it is only a time-consuming part when an I/O process is called to access each table for I/O, and it is obvious that resources of the I/O process are precious, so that in order to ensure efficient external access to the database, resources are not wasted for statistics of basic statistical information when the database is frequently accessed externally (when the database is busy).
On one hand, the optimizer engine carries out cost evaluation on the query plan by means of basic statistical information, the accuracy of the basic statistical information influences the accuracy of optimization, but when the database is busy, the basic statistical information is difficult to update, and the accuracy of the basic statistical information is difficult to ensure; on the other hand, the basic statistical information is the macroscopic statistical information belonging to the database, and the estimated execution cost is the macroscopic execution cost.
The inventor finds in practice that in some scenarios where there are a large number of access requests, the query data and the query type have temporal locality and spatial locality, for example, in an analytic database application scenario, a large number of select statements (query statements) are encountered, and in a period of time, frequent accesses may be made to some banks or tables of the database, in another period of time, frequent accesses may be made to other banks or tables, and the queries may be the same or similar. The query step is composed of a plurality of query steps, for example, the table a and the table B are connected in the first step, the row with a keyword not being empty is selected in the second step, and the executor engine completes the query step by step when executing the query plan, and accordingly, after executing a step, the relevant execution process information of the step can be obtained, such as how many rows are read, how many rows are output, how long the time is spent, and the like. Therefore, if the database optimizer engine evaluates the execution cost of a certain query step and the same or similar query step is executed and completed in the recent period of time, the optimizer engine can evaluate the execution cost of completing the query step by directly using the actual execution cost of the same or similar query step in the recent period of time after the query step is executed and completed, which is obviously more accurate.
Based on this, the present specification provides a method for updating the reference execution cost of the features of the query step, and the basic information of the same or similar query steps has the same features, so that the corresponding relationship between the features of the query step and the reference execution cost is constructed by using the features of the query step as identifiers. And updating the reference execution cost of the query step features: the method comprises the steps of obtaining execution result information after a database execution engine finishes any query step, determining the characteristics of the query step according to basic information of the query step, determining the actual execution cost of the query step information according to the execution result information, and taking the actual execution cost as the reference execution cost corresponding to the characteristics of the query step. Therefore, when the execution cost of a certain query step needs to be evaluated, the corresponding reference execution cost can be queried through the characteristics of the query step.
In one or more embodiments of the present disclosure, when performing the execution cost evaluation for one query step, the optimizer engine has a more specific reference execution cost than the basic statistical information in the prior art, and has a more reference meaning, and when updating the reference execution cost, the optimizer engine does not need to occupy the query resource of the database, thereby reducing the query load.
Based on the above thought, the present specification provides a method for optimizing a query plan, a method for determining an optimal query plan, a model construction method for predicting the stability of the reference execution cost, and a method for predicting the stability of the reference execution cost.
The method for optimizing the query plan utilizes the corresponding relation between the constructed query step characteristics and the reference execution cost, and the corresponding relation referred by other methods for evaluating the cost of any query step is updated according to the method.
The reference execution cost determination method is to determine the reference execution cost of the query step by querying the corresponding relationship updated by the updating method according to the characteristics of the query step.
In the model construction method, when constructing the training sample, the correspondence as described above needs to be queried according to the characteristics of the sample query step, and then the model executing the cost stability from the prediction reference is trained by using the constructed training sample set.
The method for predicting the stability of the reference execution cost uses the model constructed by the model construction method, before the reference execution cost of the query step is determined, the stability of the current reference execution cost can be predicted, and the higher the stability is, the more accurate the current reference execution cost is.
These methods will be described in detail below.
First, an optimized query plan method is explained, as shown in fig. 1, which is a flow diagram of the optimized query plan method illustrated in this description.
Before the method is executed, the characteristics of the query step are used as identifiers, and the corresponding relation between the characteristics of the query step and the reference execution cost is constructed. And evaluating the cost of any query step when the corresponding relation between the query step characteristics and the reference execution cost is used for optimizing the query plan.
Each query step has its basic information, including the query plan to which the query step belongs, the query type, the expression of the query step (including the involved operators, databases, tables, columns), the submission time of the query step, etc., for example, the query step is "table a inner JOIN table B on a. key ═ b.key", the query plan to which the query step belongs has ID 1 (each query plan has its unique identifier), the query type is JOIN type, the involved operators are "═ c", the key column to which a table and a table are involved, and the key column to which B table and B table are involved.
In which the same or similar query steps will have the same or similar query statements, i.e. the same references to databases, tables and columns, and the same or similar expressions, so that in practical applications, the database, tables, columns and expressions, which are referred to in each query step, can be used as the characteristics of the query step, such as List < expressontype, List < Symbol > >.
The reference execution cost is used to evaluate the execution cost for executing the query step, such as how many rows need to be input, how many rows need to be output, how much memory is consumed, and the like.
And 102, acquiring the execution result information of any query step.
After receiving a query plan (also called as an execution plan), the executor engine executes the query plan according to steps in the query plan, wherein each query step has a plurality of corresponding query steps, each query step is executed and completed by an operator in the executor engine (one operator is responsible for completing a basic data processing logic, one group of operators completes a group of query steps of data according to the query plan, in other words, one query step is executed and completed by at least one operator to obtain a result), after each operator is executed, how many rows (rows) are read by itself, how many rows are output, how many data (KB) of how many sizes are read in total, how many data of how many sizes are output in total, how many times of total consumption are consumed, and how many total memory usage sums are reported. The execution engine generally encapsulates each query step belonging to one query plan as the execution result information of one query plan, that is, the execution result information of one query plan includes the execution result information of a plurality of query steps. Therefore, in practical applications, the execution result information may be acquired from the execution engine in units of one query plan.
For example, a database has three tables, nation, region, customer, and a query statement:
SELECT count(*)
FROM
nation,region,customer
WHERE
c_nationkey=n_nationkey
AND n_regionkey=r_regionkey
AND r_name='ASIA';
after the optimization engine completes optimization, the query plan sent to the execution engine is as follows:
1. scan the table nation to find the column n _ nationkey, n _ regionkey.
2. Scan the table region, find the column r _ region key, and screen out the r _ name ═ ASIA' row.
3. And internally connecting the tables obtained in the first step and the second step by c _ nationkey ═ n _ nationkey (inner join).
4. The table customer is scanned for column c _ nationkey.
5. And internally connecting the tables obtained in the fourth step and the fifth step by n _ regionkey-r _ regionkey (inner join).
6. And calculating the number of rows in the table obtained in the fifth step.
As shown in fig. 2, the execution relationship of each query step executed for the above query plan. For example, steps 1 and 2 may be performed simultaneously, and the two operators perform the operation to obtain results respectively, and then perform the interconnection according to the results obtained in steps 1 and 2, and the interconnection step is performed by the other operator, and so on.
In practical applications, the order of magnitude of query plans completed by the database executor engine may be large, and therefore, in one or more embodiments of the present description, the obtaining condition may be set, for example, the sampling ratio may be set, and the database executor engine obtains the execution result information of one query plan every time5 query plans are completed; or setting an acquisition period, acquiring the execution result information of a query plan and the like every 1 ms; or receiving a specified command, and starting to acquire the execution result information of each query plan.
And 104, determining the characteristics and the actual execution cost of the query step.
As described above, each query plan has a plurality of query steps, and after each query step is executed, an execution result message is obtained, including the execution process of the query step, including how many rows (rows) are read during execution, how many rows are output, how many data sizes (KB) are read in total, how many data sizes are output in total, how many total time consumption is, how many total memory usage is, and the like.
The execution process of the query step can reflect how much cost is consumed for executing the query step, and therefore, one or more execution process data are selected as the actual execution cost for executing the query step by acquiring the execution result information of the query step.
And step 106, updating the reference execution cost corresponding to the characteristics of the query step.
Wherein, according to the actual execution cost determined in step 104, the reference execution cost corresponding to the query step feature is updated.
In practical application, the corresponding relationship between the query step features and the reference execution cost may be updated only, and the following table is the corresponding relationship before updating:
characteristics of query steps
|
Reference execution cost
|
Characteristic A
|
Cost A1
|
Characteristic B
|
Cost B1 |
TABLE 1
When a query step is received, the query step being characterized as feature a, and the actual execution cost of the query step is cost a2, then after updating, the correspondence becomes:
characteristics of query steps
|
Reference execution cost
|
Characteristic A
|
Cost A2
|
Characteristic B
|
Cost B1 |
TABLE 2
The corresponding relationship may also be constructed by constructing an actual execution cost and a corresponding relationship between the timestamp and the query step feature, at this time, the actual execution cost of the query step feature obtained each time may be sequentially saved according to the time sequence of the timestamp (i.e., the actual execution cost and the corresponding relationship between the timestamp and the query step feature are also updated), and the corresponding actual execution cost with the latest time is used as the reference actual execution cost of the query step, which is the corresponding relationship before updating as shown in the following table:
TABLE 3
After receiving a query step with a query step characteristic of feature a, the actual execution cost of the query step is actual cost a3, the timestamp is TimeA3, and then the actual execution cost and the completion timestamp are added to the identifier corresponding to the query step characteristic, then after updating, the correspondence relationship becomes:
TABLE 4
The reference execution cost value corresponding to each query step feature is a value of an actual execution cost corresponding to a latest timestamp of the timestamp, as in table 3, the actual execution cost corresponding to the latest timestamp of the feature a is a2, so the reference execution cost corresponding to the feature a is a reference cost a2 (that is, the reference cost a2 is an actual cost a2), and the actual execution cost corresponding to a latest timestamp of the feature B is 2, so the reference execution cost corresponding to the feature B is a reference cost B2, as in table 4, the actual reference cost corresponding to the latest timestamp of the feature a is a reference cost A3, so the reference execution cost corresponding to the feature a is a reference cost A3.
In practical application, if the actual execution cost and the timestamp after the execution of each query step is completed are stored without limitation, and the data amount is accumulated for a long time, therefore, in one or more embodiments of the present specification, a storage condition may be set, for example, if the actual execution cost corresponding to any feature exceeds 20, the previous 10 are deleted and then stored; or if the time length of the time stamp from the current time stamp exceeds 4h, deleting the corresponding actual execution cost and the corresponding time stamp.
The present specification shows that the query step features, the reference execution cost, the actual execution cost, and the timestamp are stored in one table, and in fact, a corresponding relationship table may be established between the query step features and the reference execution cost, and a corresponding relationship table may be established between the query step features and the actual execution cost, and the timestamp.
In practical applications, in some cases, the query step feature corresponding to a certain query step is determined not to be found in the corresponding relationship (for example, when the corresponding relationship is constructed, no query step feature is available, or similar query steps have not been executed before), at this time, the identifier of the query step feature may be added to the corresponding relationship, and the actual execution cost corresponding to the query step is taken as the reference execution cost.
In practical application, the corresponding relationship may be stored in a table form, or may be stored in an index method, and only the corresponding reference execution cost may be found according to the query step. During storage, the system table may be stored in a database, as may the basic statistical information, may also be stored in a memory, and may also be stored in a hard disk.
The above is a description of the method for optimizing query planning, and the corresponding relationship between the query step features and the reference execution cost is constructed and updated, and the reference meaning is more specific with respect to the basic statistical information, for example, when table join is performed, table a has 20 rows, table B has 10 rows, so that after the table join is performed, there are 10 rows at most, and the optimizer engine will only evaluate as 10 rows or a rough number when evaluating. The reference execution cost is obtained by using an actual execution result (i.e., an actual execution cost) after the table a and the table B are connected in the row, for example, 4 rows are obtained after the table a and the table B are actually connected in the row, and then after the execution cost of the table a and the table B is evaluated, it can be determined that 4 rows are output after the table a and the table B are connected in the row. Moreover, since the execution result information of any query step executed by the executor engine is obtained (that is, the execution result information of the query step that the execution engine originally needs to execute is obtained), the query resource of the database is not occupied, and the query load of the database is not caused (the query irrelevant to the service is not performed).
The following describes a method for determining an optimal query plan, which uses the correspondence between the query step characteristics updated by the above-described updating method and the reference execution costs. The updating method improves the accuracy of the reference execution cost of the characteristics of the query steps (compared with the basic statistical information, the reference meaning is more specific, the updating speed is high, and therefore the updating method is more accurate), therefore, no matter technicians obtain the reference execution cost of each query step to determine the operation state of the database, or an optimizer engine generates an optimal query plan by using the reference execution cost, the reference meaning of the reference execution cost is higher than that of the basic statistical information.
As shown in fig. 3, a flowchart of a method for determining an optimal query plan, applied to an optimizer engine, is shown in this specification, and includes the following steps:
step 302, analyzing each query step in the query plan.
After receiving a query plan, the optimizer engine parses out each query step in the query plan, which is in the form of a tree.
Step 304, for any query step, determining an alternative query step to the query step.
Some query steps are provided with alternative query steps, such as left-hand or right-hand connection when two tables are connected, as described above.
Step 306, for any alternative query step, determining the characteristics of the alternative query step, and querying the reference execution cost corresponding to the alternative query step according to the correspondence.
The optimizer engine optimizes each query step based on cost, and therefore, the cost of executing each query step needs to be evaluated to select the alternative query step with the lowest execution cost.
In this specification, the optimizer engine optimizes and evaluates the cost of executing each query step, and obtains the execution cost of each query step by querying the corresponding relationship between the query step characteristics and the reference execution cost.
Wherein, the corresponding relationship between the query step and the reference execution cost is the corresponding relationship obtained by the above updating method.
And 308, selecting the alternative query step with the minimum execution cost of each query step, and determining the optimal query plan.
In practical application, when the reference execution cost of a certain query step needs to be determined, it is found that the query step features of the query step are not in the corresponding relationship (that is, no similar query step has been executed before), at this time, the basic statistical information can be referred to as an evaluation basis, so that after the query step is executed, the actual execution cost can be obtained and used as the reference execution cost of the subsequent same or similar query step.
By using the optimization query planning method and the determination method, the accuracy of the optimizer engine in evaluating each query step can be greatly improved, but in some cases, the data change is fast, so that the difference between the actual execution cost and the actual execution cost is large, and based on the method, the model construction method for predicting the stability of the reference execution cost is provided.
The following describes in detail a model construction method for predicting a reference execution cost stability, as shown in fig. 4, which is a flow diagram of the model construction method for predicting a reference execution cost stability shown in this specification, and includes the following steps:
step 402, obtaining a data sample of the query step.
When acquiring data samples, acquiring a set of query step data samples executed within a history time period of a time length T1 according to a preset time length T1, wherein any query step data sample includes: basic information of the query step and actual execution cost corresponding to the query step.
Since the data changes are temporal locality and spatial locality, the data samples of the query step selected by the building model need to be samples within a near time period, such as within 30 mis. For query data samples long before, the significance of model construction is not great, for example, some data modification operations in the period of 6:00-6:10 cause the reference execution cost of partial query steps to be very unstable, and if the query data samples in the period of 6:00-6:10 are used for constructing the model, unstable prediction results can be obtained when the reference execution cost of some query steps is predicted. If the data changes are not much during 16:00-16:10, i.e., the reference execution cost of each query step is stable, a wrong prediction result may be obtained if the model constructed by 6:00-6:10 is used. Thus, in building the model, it is necessary to obtain a set of data samples of query steps that are performed to completion within a historical period of time of duration T1.
And step 404, constructing a training sample set.
For any query step data sample, converting the query step data sample into a training sample in the following way:
determining the characteristics and the completion timestamp of the query step according to the basic information of the query step, and obtaining the reference execution cost when the query step is evaluated by querying the corresponding relation shown in tables 3 and 4;
calculating to obtain the stability of the reference execution cost corresponding to the query step according to the actual execution cost and the reference execution cost;
obtaining a training sample corresponding to the query step by taking the characteristic of the query step as a characteristic value and taking the reference execution cost stability of the query step as a label;
and traversing the data sample set and executing the conversion step to obtain a training sample set.
As shown in table 4, if the completion timestamp of a certain query step is Time5, the query step uses data set 2 corresponding to Time2 when the execution cost is evaluated.
As shown in fig. 5, which is a schematic flow chart of a predictive model training method shown in this specification, a data sample of any query step includes basic information of the query step and an actual execution cost after the query step is executed, a feature of the query step is extracted through the basic information of the query step, and the actual execution cost for actually executing the query step is determined through execution result information of the query step.
Taking the execution cost as the input line number and the output line number as an example, let statsfiff (place node states-operators)/Max (place node states, operators), where statsfiff is used to represent the difference between the actual execution cost and the reference execution cost, and place node states represents the obtained reference execution cost, and if 1-statsfiff is used as a label, the closer the label is to 1, the more stable the reference execution cost is, the closer to 0, the more unstable the reference execution cost is.
In addition, in one or more embodiments of the present description, the reference execution cost of each query step in the evaluation and the actual execution cost of the query step may be saved as a query step data sample when the query step is completed.
Step 406, training a model for predicting the stability of the reference execution cost of any query step.
And training a model for predicting the stability of the reference execution cost of any query step by using the constructed training sample set. The model for predicting the execution cost stability of the reference can be constructed based on different algorithms, for example, the model can be constructed based on a support vector machine, the model can be constructed based on a neural network, the model can be constructed based on a random forest, and the constructed model can achieve the purpose shown in the specification.
Taking a random forest as an example:
assuming that the number of samples of the constructed training sample set is N, for any tree, N (N < N) training samples are randomly and replaceably extracted from the training sample set as the training sample set of the tree.
If the feature dimension of each sample is M, a constant M < < M is assigned, M feature subsets are randomly selected from the M feature monkeys, and each time the tree is split, the optimal feature subset is selected from the M feature collections.
Each tree grows to the greatest extent possible without pruning.
In practical applications, the model may be constructed under a preset condition, for example, the preset condition may be that the model is constructed at a time point of a preset period, and the prediction model is constructed once every time period passes. The model prediction method can also be used for constructing a model, and when the model is determined not to be updated for a long time, the model needs to be updated, namely reconstructed, in order to ensure the accuracy of the model prediction.
The above is a description of a model construction method for predicting the stability of the reference execution cost, and based on the above model construction method, the present description also provides a corresponding model using method, that is, a method for predicting the stability of the reference execution cost. Next, a method of predicting the execution cost stability of the reference will be described in detail.
In practical applications, the data of some tables and columns of the database may change dramatically, and the reference execution cost of the query steps related to the tables and columns may also change greatly. Due to rapid and sharp change of data, the actual execution cost after the execution is finished is far from the actual execution cost after the last execution is finished.
As shown in fig. 6, a flowchart of a method for predicting the execution cost stability of a reference is shown in the present specification, which includes the following steps:
step 602, determine the characteristics of the query step.
The received query steps can be sent by the optimizer engine, or can be input by a tester for testing, and are determined according to the actual application scenario.
The characteristics of the inquiry step are determined based on the basic information of the inquiry step, and the same method as the above-mentioned method for determining the characteristics of the inquiry step is used.
Step 604, inputting the characteristics of the query step into the model.
The prediction model uses a model construction method for predicting the stability of the reference execution cost as described above, and will not be described in detail here.
And 606, predicting the reference execution cost stability of the query step according to the output of the model.
Because the data sample for constructing the prediction model is the query step within the preset historical time, if the data in the database in the recent historical time is stable, the reference execution cost of each query step is also stable, and the reference execution cost predicted by using the model in the recent time in the future is also stable. If the data in the database in the past time is not stable, the reference execution cost of each query step is also not stable, and the reference execution cost in the future time predicted by using the model is also not stable.
In one or more embodiments of the present description, when a prediction model is used for prediction, a construction time of the prediction model may be determined, and if the construction time of the prediction model is long, it indicates that the model is not updated for a long time, and at this time, the model is used to predict the execution cost stability, and an obtained result does not have a great reference meaning, so that when it is determined that the construction time of the prediction model exceeds the preset time, the prediction model may be updated.
In practical applications, before the reference execution cost is determined by using the reference execution cost determining method, whether the reference execution cost is stable or not can be predicted by using the method for predicting the reference execution cost stability.
As shown in fig. 7, for another method for determining an optimal query plan, which is applied to an optimizer execution engine, the method includes the following steps:
step 702, analyzing each query step in the query plan.
Step 704, for any query step, determines an alternative query step to the query step.
Step 706, for any alternative query step, predicting the reference execution cost stability corresponding to the alternative query step by using the method for predicting the reference execution cost stability as described above; and if the stability of the reference execution cost corresponding to the alternative query step is greater than a preset value, determining the characteristics of the alternative query step, and querying the reference execution cost corresponding to the alternative query step according to the corresponding relation.
Step 708, selecting the alternative query step with the minimum execution cost of each query step, and determining the optimal query plan.
If the stability is greater than the preset value, it indicates that the reference execution cost corresponding to the characteristics of the query step is relatively stable in a recent period of historical time, and therefore, the stored reference execution cost has high reliability (high accuracy) and can be directly queried and obtained.
Wherein, obtaining the reference execution cost of the query step is by using the reference execution cost determination method as described above.
If the stability is smaller than the predicted value, it is indicated that the reference execution cost corresponding to the feature of the query step is relatively unstable in a recent period of historical time, so that the reliability of the stored reference execution cost is not high (the accuracy is low) and the reference meaning is not large.
In practical application, if the stability is smaller than a preset value, the model for predicting the stability of the reference execution cost is updated.
In one or more embodiments of the present specification, a prediction result of a model for predicting the stability of the reference execution cost is directly related to sample data used for constructing the model, and if a training sample for constructing the prediction model is unstable, the obtained prediction result is also unstable, and if the training sample for constructing the prediction model is stable, the obtained prediction result is also stable. Therefore, when the stability is determined to be less than, it is indicated that the data related to the query step is unstable, and the data related to the query step is stable, and then the model is constructed by using the stable data sample of the query step, then the cost of predicting the reference execution of the query step by using the updated model is also stable.
In addition, in practical application, the optimizer engine determines that the order of the reference execution cost and the stability of the predicted reference execution cost may not be fixed for any alternative query step, and may obtain the reference execution cost first, then predict the stability of the reference execution cost, and determine whether the reference execution cost needs to be obtained by the optimizer engine.
Fig. 8 is a schematic block diagram of an embodiment of a method for determining a cost with reference to execution, which is described in detail below.
The data acquisition module is responsible for acquiring data in the database executor engine, and includes execution result information of any query step, which may be organized in the form of [ QueryId, TimeStamp, PlanInfo, and ExecutionInfo ], where QueryId is a unique identifier of each query plan, TimeStamp is a query submission TimeStamp, PlanInfo is plan information of the query plan, which includes multiple PlanNodeInfo (query step information), and ExecutionInfo is execution result information, which includes execution result information after execution of each query step is completed.
Wherein, each PlanNodeInfo uniquely identifies the PlanNodeId of the query step, the type PlanNodeType of the query step, and the reference execution cost PlanNodeStats of the query step, and the expression List List < expression type, List < Symbol > >, of the query step. Taking the reference execution cost as the input line number and the output line number as an example, plant nodestats [ InputRowCount, OutputRowCount ], where the input line number InputRowCount and the output line number OutputRowCount are taken as examples.
The Symbol may include schema id, TableId, column id, which are unique identifiers of the database, the table, and the column, respectively, and the expresstype is an expression operator of each type supported by the database.
The ExecutionInfo includes an actual execution cost after each query step is completed, corresponding to the reference execution cost, and the actual execution cost OperatorStats ═ InputRowCount, OutputRowCount.
The data storage module stores the data collected by the data collection module, the PlanNodeId of each query step is used as a unique identifier, and [ QueryId, Timestamp, PlanNodeType, List < ExpressionType, List < Symbol > >, PlanNodeStats, OperatorStats ] is used as a structure for storage.
The intelligent decision-making module can control data acquisition (whether to acquire and acquisition sampling ratio) of the data acquisition module, and can also perform model training by using data stored in the data storage module (wherein, the model needs to be updated all the time to ensure the accuracy of prediction).
The present specification also provides apparatuses, electronic devices, and computer-readable storage media corresponding to the above methods, and the following are descriptions of the apparatuses, devices, and storage media corresponding to the above methods.
The present specification provides an optimized query plan apparatus, as shown in fig. 9, which uses the features of the query step as identifiers to construct a corresponding relationship between the features of the query step and the reference execution cost, where the method includes:
an execution result information obtaining model 902, configured to obtain execution result information after the database execution engine completes any query step;
a characteristic and actual execution cost determining module 904, configured to determine the characteristic of the query step according to the basic information of the query step, and determine the actual execution cost of the query step information according to the execution result information;
an updating module 906, configured to update the reference execution cost corresponding to the feature of the query step according to the actual execution cost of the query step; and the corresponding relation between the query step characteristics and the reference execution cost is used for evaluating the cost of any query step when the query plan is optimized.
Wherein the execution cost is execution process information which can be obtained from the execution result information.
The correspondence further includes: the optimized query planning apparatus may further include a completion timestamp determining apparatus (not shown) for determining a completion timestamp of the query step according to the basic information of the query step; at this time, the updating module 906 is configured to add the actual execution cost and the completion timestamp to the identifier corresponding to the query step feature, and update the reference execution cost value corresponding to the query step feature to the actual execution cost value.
The present specification provides a reference execution cost determination apparatus, as shown in fig. 10, for determining a reference execution cost of any query step, including:
an analysis module 1002, configured to analyze each query step in the query plan;
a determination module 1004 for determining, for any query step, alternative query steps to the query step;
a reference execution cost determination module 1006, for any alternative query step, determining the characteristics of the alternative query step, and querying the reference execution cost corresponding to the alternative query step according to the correspondence relationship of claim 1;
the optimal query plan determining module 1108 selects an alternative query step with the minimum execution cost for each query step to obtain an optimal query plan.
The present specification provides a model building apparatus for predicting a stability of a reference execution cost, as shown in fig. 11, including:
a data sample obtaining module 1102, configured to obtain, according to a preset time duration T1, a set of query step data samples executed within a history time period of a time duration T1, where any query step data sample includes: basic information of the query step and actual execution cost corresponding to the query step;
the training sample determination module 1104 includes three subunits, which is used to convert any query step data sample into a training sample by the following method:
the query subunit 1114 is configured to determine, according to the basic information of the query step, characteristics and a completion timestamp of the query step, and obtain, by querying the correspondence as described above, a reference execution cost when the query step is evaluated;
the stability calculation subunit 1124, configured to calculate, according to the actual execution cost and the reference execution cost, a stability of the reference execution cost corresponding to the query step;
a training sample determination subunit 1134, configured to obtain a training sample corresponding to the query step by using the feature of the query step as a feature value and using the reference execution cost stability of the query step as a label;
a training sample set determining module 1106, configured to perform the conversion step on the data sample set in a traversal manner, so as to obtain a training sample set;
a model training module 1108, configured to train with the training sample set to obtain a model, where the model is configured to: and predicting the stability of the reference execution cost of any query step according to the characteristics of the query step.
The present specification provides an apparatus for predicting a stability of a reference execution cost, as shown in fig. 12, the apparatus for predicting a stability of a reference execution cost corresponding to any query step includes:
a feature determination module 1202 for receiving a query step and determining a feature of the query step;
an input module 1204, configured to input the features of the query step into a model for predicting the stability of the execution cost of the reference; the model is constructed by using the model construction device;
and a prediction result obtaining module 1206, configured to predict, according to the output of the model, a reference execution cost stability corresponding to the query step.
The present specification also provides another reference execution cost determination apparatus, as shown in fig. 13, including:
an analysis module 1302, configured to analyze each query step in the query plan;
a determining module 1304 for determining, for any query step, an alternative query step to the query step;
a reference execution cost determining module 1306, configured to predict, for any one alternative query step, a reference execution cost stability corresponding to the alternative query step by using the above apparatus for predicting a reference execution cost stability; if the stability of the reference execution cost corresponding to the alternative query step is greater than a preset value, determining the characteristics of the alternative query step, and querying the reference execution cost corresponding to the alternative query step according to the corresponding relation;
an optimal query plan determining module 1308, configured to select an alternative query step with the smallest execution cost in each query step, to obtain an optimal query plan.
The apparatus may further include an updating module (not shown) configured to update the model for predicting the reference execution cost stability if the reference execution cost stability corresponding to the querying step is smaller than a preset value.
The apparatuses, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or implemented by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
This specification provides an electronic device including:
a processor;
a memory for storing processor-executable instructions;
the processor implements a method for optimizing a query plan by running the executable instructions, and constructs a corresponding relationship between the query step features and a reference execution cost by using the features of the query step as identifiers, wherein the method at least comprises the following steps:
acquiring execution result information of the database execution engine after completing any query step;
determining the characteristics of the query step according to the basic information of the query step, and determining the actual execution cost of the query step information according to the execution result information;
updating the reference execution cost corresponding to the characteristics of the query step according to the actual execution cost of the query step; and the corresponding relation between the query step characteristics and the reference execution cost is used for evaluating the cost of any query step when the query plan is optimized.
This specification also provides an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements a method of determining an optimal query plan by executing the executable instructions, the method comprising at least:
analyzing each query step in the query plan;
for any query step, determining an alternative query step to the query step;
determining the characteristics of the alternative query steps aiming at any alternative query step, and querying the reference execution cost corresponding to the alternative query steps according to the corresponding relation between the characteristics of the query steps and the reference execution cost;
and selecting the alternative query step with the minimum execution cost of each query step to obtain the optimal query plan.
This specification also provides an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements a model building method for predicting a stability of a reference execution cost by executing the executable instructions, the method comprising at least:
according to the preset duration T1, acquiring a set of query step data samples finished in the historical time period of the duration T1, wherein any query step data sample comprises: basic information of the query step and actual execution cost corresponding to the query step;
for any query step data sample, converting the query step data sample into a training sample by:
determining the characteristics and the completion timestamp of the query step according to the basic information of the query step, and obtaining the reference execution cost when the query step is evaluated through the corresponding relation between the characteristics of the query step and the reference execution cost;
calculating to obtain the stability of the reference execution cost corresponding to the query step according to the actual execution cost and the reference execution cost;
obtaining a training sample corresponding to the query step by taking the characteristic of the query step as a characteristic value and taking the reference execution cost stability of the query step as a label;
traversing the data sample set and executing the conversion step to obtain a training sample set;
training by using the training sample set to obtain a model, wherein the model is used for: and predicting the stability of the reference execution cost of any query step according to the characteristics of the query step.
This specification also provides an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
the processor executes the executable instructions to realize a method for predicting the stability of the reference execution cost, which is used for predicting the stability of the reference execution cost corresponding to any query step, and the method at least comprises the following steps:
receiving a query step and determining characteristics of the query step;
inputting the characteristics of the query step into a model for predicting the stability of the execution cost of the reference; the model is constructed by using the construction method;
and predicting the stability of the reference execution cost corresponding to the query step according to the output of the model.
This specification also provides an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements a method of determining an optimal query plan by executing the executable instructions, the method comprising at least:
analyzing each query step in the query plan;
for any query step, determining an alternative query step to the query step;
for any one alternative query step, predicting the stability of the reference execution cost corresponding to the alternative query step; if the stability of the reference execution cost corresponding to the alternative query step is greater than a preset value, determining the characteristics of the alternative query step, and querying the reference execution cost corresponding to the alternative query step according to the corresponding relation between the characteristics of the query step and the reference execution cost;
and selecting the alternative query step with the minimum execution cost of each query step to obtain the optimal query plan.
FIG. 14 is a schematic block diagram of an apparatus provided in an exemplary embodiment. Referring to FIG. 14, at the hardware level, the device includes a processor 1402, an internal bus 1404, a network interface 1406, a memory 1408, and a non-volatile storage 1410, although other hardware required for service may be included. One or more embodiments of the present description can be implemented in software, such as by processor 1402 reading corresponding computer programs from non-volatile storage 1410 into memory 1408 and then running. Of course, besides software implementation, the one or more embodiments in this specification do not exclude other implementations, such as logic devices or combinations of software and hardware, and so on, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or logic devices.
The apparatuses shown in fig. 9-13 can be applied to the device shown in fig. 14 to implement the technical solution of the present specification.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
The present specification provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a method of optimizing a query plan, identifying characteristics of query steps, and constructing a correspondence between the characteristics of the query steps and a reference execution cost, the method comprising:
acquiring execution result information of the database execution engine after completing any query step;
determining the characteristics of the query step according to the basic information of the query step, and determining the actual execution cost of the query step information according to the execution result information;
updating the reference execution cost corresponding to the characteristics of the query step according to the actual execution cost of the query step; and the corresponding relation between the query step characteristics and the reference execution cost is used for evaluating the cost of any query step when the query plan is optimized.
The present specification also provides a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a method of determining an optimal query plan, the method comprising at least:
analyzing each query step in the query plan;
for any query step, determining an alternative query step to the query step;
determining the characteristics of the alternative query steps aiming at any alternative query step, and querying the reference execution cost corresponding to the alternative query steps according to the corresponding relation between the characteristics of the query steps and the reference execution cost;
and selecting the alternative query step with the minimum execution cost of each query step to obtain the optimal query plan.
The present specification also provides a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a model building method for predicting a stability of a reference execution cost, the method comprising at least:
according to the preset duration T1, acquiring a set of query step data samples finished in the historical time period of the duration T1, wherein any query step data sample comprises: basic information of the query step and actual execution cost corresponding to the query step;
for any query step data sample, converting the query step data sample into a training sample by:
determining the characteristics and the completion timestamp of the query step according to the basic information of the query step, and obtaining the reference execution cost when the query step is evaluated through the corresponding relation between the characteristics of the query step and the reference execution cost;
calculating to obtain the stability of the reference execution cost corresponding to the query step according to the actual execution cost and the reference execution cost;
obtaining a training sample corresponding to the query step by taking the characteristic of the query step as a characteristic value and taking the reference execution cost stability of the query step as a label;
traversing the data sample set and executing the conversion step to obtain a training sample set;
training by using the training sample set to obtain a model, wherein the model is used for: and predicting the stability of the reference execution cost of any query step according to the characteristics of the query step.
The present specification also provides a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a method of predicting a stability of a reference execution cost for predicting a stability of a reference execution cost corresponding to any one of query steps, the method comprising at least:
receiving a query step and determining characteristics of the query step;
inputting the characteristics of the query step into a model for predicting the stability of the execution cost of the reference; the model is constructed by using the construction method;
and predicting the stability of the reference execution cost corresponding to the query step according to the output of the model.
The present specification also provides a computer-readable storage medium having stored thereon computer instructions that, when executed by a processor, are executable to implement a method of determining an optimal query plan, the method comprising at least:
analyzing each query step in the query plan;
for any query step, determining an alternative query step to the query step;
for any one alternative query step, predicting the stability of the reference execution cost corresponding to the alternative query step; if the stability of the reference execution cost corresponding to the alternative query step is greater than a preset value, determining the characteristics of the alternative query step, and querying the reference execution cost corresponding to the alternative query step according to the corresponding relation between the characteristics of the query step and the reference execution cost;
and selecting the alternative query step with the minimum execution cost of each query step to obtain the optimal query plan.
Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
Furthermore, the present description also provides a computer program for implementing the method as described in any one of the above when executed by one or more processors.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.