WO2019101119A1 - 代价优化器与代价估计的方法及其设备 - Google Patents

代价优化器与代价估计的方法及其设备 Download PDF

Info

Publication number
WO2019101119A1
WO2019101119A1 PCT/CN2018/116874 CN2018116874W WO2019101119A1 WO 2019101119 A1 WO2019101119 A1 WO 2019101119A1 CN 2018116874 W CN2018116874 W CN 2018116874W WO 2019101119 A1 WO2019101119 A1 WO 2019101119A1
Authority
WO
WIPO (PCT)
Prior art keywords
type
connection
predicate
cost
records
Prior art date
Application number
PCT/CN2018/116874
Other languages
English (en)
French (fr)
Inventor
夏立
陈振强
Original Assignee
星环信息科技(上海)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 星环信息科技(上海)有限公司 filed Critical 星环信息科技(上海)有限公司
Priority to SG11202004818SA priority Critical patent/SG11202004818SA/en
Priority to EP18880194.8A priority patent/EP3716093B1/en
Priority to US16/766,480 priority patent/US11023466B2/en
Priority to CA3083148A priority patent/CA3083148C/en
Priority to JP2020538120A priority patent/JP2021504852A/ja
Publication of WO2019101119A1 publication Critical patent/WO2019101119A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24542Plan optimisation
    • G06F16/24544Join order optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • G06F16/244Grouping and aggregation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24542Plan optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Definitions

  • the present application relates to the field of computers, such as methods and apparatus for cost optimizer and cost estimation.
  • Cost Based Optimizer is a core component of the database system. Due to its significant impact on the performance of the database system, it plays an important role in modern database systems.
  • the core of the cost optimizer is the cost estimation model, which is used to estimate the cost of the execution plan generated by the database system to select the optimal execution plan. Estimate the quality of the model affects the pros and cons of the final implementation plan.
  • the cost optimizer optimizes the execution plan based on the statistical data of the target data. The integrity and accuracy of the statistical information directly affect the estimation model's estimation of the execution plan.
  • the cost optimizer is optimized to generate an optimal execution plan.
  • the execution plan is generally represented by an operation tree, and the operation tree is composed of different types of operations.
  • the operation type of an operation tree may be scan table, selection, filtering, aggregation, connection, projection, and the like.
  • the cost optimizer can use the cost estimation model to estimate the cost of each step of the execution plan, and select the smallest overall cost from all possible execution plans as the final execution plan, thus completing the entire cost.
  • the process of optimization It can be seen that the core of the cost optimizer is the cost estimation based on statistical information, which is the basis of the cost optimizer.
  • the cost optimizer needs statistics and cannot complete the cost optimization in the absence of the necessary statistics. For example, a temporary table created at runtime or a scenario with subqueries cannot determine its statistics during the compilation phase, so cost optimization cannot be completed.
  • a temporary table created at runtime or a scenario with subqueries cannot determine its statistics during the compilation phase, so cost optimization cannot be completed.
  • the cost optimizer is applied to the big data system to process massive data, because the statistical information collection of massive data is costly, collecting the statistical information of the massive data will become the bottleneck of the application cost optimizer. For a given data set, it is not necessary to collect statistics for the full amount of data, and based on some statistical information, it is also possible to get an optimal plan. In the case where the statistical information is incomplete or unavailable, the cost optimizer in the related art fails.
  • the present application provides a cost optimizer and cost estimation method and a device thereof, which can solve the problem that the cost estimation cannot be performed when the statistical information is incomplete.
  • a method for estimating a cost comprising: determining whether the acquired statistical information is complete, and determining a cost estimation manner of the first operation type if the statistical information is incomplete; and estimating the cost based on the cost Means determining a first cost estimate of the first type of operation, the first type of operation being an operation type of an operation tree dependent on the statistical information; and determining a second cost estimate based on the first cost estimate and the second type of operation The cumulative cost estimate of the operation tree, the second type of operation being an operation type of an operation tree that is not dependent on the statistical information.
  • an apparatus for cost estimation comprising: a judging means, a determining means and an estimating means.
  • the determining device is configured to determine whether the obtained statistical information is complete. If the statistical information is incomplete, the corresponding cost estimating manner is determined according to the operation type of the operation tree that depends on the statistical information.
  • the determining means is arranged to determine a cost estimate for the corresponding type of operation based on the cost estimate.
  • the estimating means is arranged to determine a cumulative cost estimate of the operating tree based on a cost estimate corresponding to the cost estimate corresponding to the statistical information operation type and the operation type not dependent on the statistical information.
  • a cost optimizer is further provided, the cost optimizer is configured to: generate an original execution plan; determine whether the obtained statistical information is complete, and if the statistical information is complete, according to the first based on the statistical information
  • the cost estimation model estimates the cost. If the statistical information is incomplete, the cost is estimated according to the second cost estimation model; and the optimal execution plan is generated according to the cost estimated by the first cost estimation model or the cost estimated by the second cost estimation model.
  • a computing-based device comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to: determine Whether the obtained statistical information is complete, and if the statistical information is incomplete, determining a corresponding cost estimation manner according to an operation type of the operation tree that depends on the statistical information; determining a cost estimate of the corresponding operation type based on the cost estimation manner And determining a cumulative cost estimate of the operation tree according to a cost estimate corresponding to the cost estimate corresponding to the statistical information operation type and an operation type not dependent on the statistical information.
  • a computer readable storage medium storing computer executable instructions for performing the method of cost estimation described above.
  • the corresponding cost estimation manner is determined according to the operation type of the operation tree that depends on the statistical information; and the cost estimation of the corresponding operation type is determined based on the cost estimation manner; A cumulative cost estimate for the operational tree is determined dependent on a cost estimate corresponding to the statistical information operation type and a cost estimate corresponding to the operation type not dependent on the statistical information. Therefore, the traditional cost optimizer can not estimate the cost of creating temporary tables and subqueries at runtime, and the scenario of massive data is limited by the data size.
  • the execution plan corresponding to the statement of the Structured Query Language is estimated according to the cumulative cost estimation, and can be applied to the optimization of the SQL in the database system, thereby improving the accuracy of the cost estimate cost estimation of the SQL, thereby Generate a better performing execution plan.
  • SQL Structured Query Language
  • FIG. 1 is a flow chart showing a method for estimating a cost according to an aspect of the present application.
  • FIG. 2 is a schematic diagram showing an operation tree of a tree in an embodiment of the present application.
  • FIG. 3 is a block diagram showing the structure of a device for estimating a cost according to another aspect of the present application.
  • FIG. 4 shows a schematic diagram of an improved cost optimizer provided in accordance with yet another aspect of the present application.
  • the terminal, the device of the service network, and the trusted party each include one or more processors (eg, a central processing unit (CPU), an input/output interface, a network interface, and a memory.
  • processors eg, a central processing unit (CPU), an input/output interface, a network interface, and a memory.
  • the memory may include non-persistent memory, random access memory (RAM) and/or non-volatile memory in a computer readable storage medium, such as Read-Only Memory (ROM). Or flash memory (flash RAM). Memory is an example of a computer readable storage medium.
  • RAM random access memory
  • ROM Read-Only Memory
  • flash RAM flash RAM
  • Computer readable storage media including both permanent and non-persistent, removable and non-removable media may be implemented by any method or technology.
  • the information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory.
  • RAM Phase change memory
  • SRAM static random access memory
  • DRAM dynamic random access memory
  • RAM Random Access Memory
  • ROM Read Only Memory
  • EEPROM Electrically-Erasable Programmable ROM
  • Flash Memory or other memory technology
  • Read-Only Memory (CD-ROM) ROM Read-Only Memory
  • CD-ROM Compact Disc
  • DVD Digital Versatile Disc
  • Computer-readable storage media does not include non-transitory computer readable media, such as modulated data signals and carrier waves.
  • FIG. 1 is a schematic flowchart diagram of a method for estimating a cost according to an aspect of the present application, where the method includes: S11 to S13.
  • An operation tree can include multiple types of operations, such as including types of operations that rely on statistics and types of operations that do not rely on statistics.
  • the operation type of the operation tree depending on the statistical information is referred to as a first operation type
  • the cost estimate of the first operation type is referred to as a first cost estimate
  • the operation type of the operation tree not dependent on the statistical information is called
  • the cost estimate for the second type of operation is referred to as the second cost estimate.
  • a first cost estimate of the first type of operation is determined based on the cost estimation manner.
  • a cost estimate corresponding to the operation type corresponding to the statistical information operation type ie, the first cost estimate of the first operation type
  • an operation type not dependent on the statistical information ie, the second operation type
  • Two cost estimates A cumulative cost estimate for the operational tree is determined.
  • the above method can solve the scenario that the temporary table and the subquery cannot be estimated at runtime, or the scenario of the massive data is limited by the data size, and the traditional cost optimizer cannot perform the cost optimization problem.
  • the method of the present application includes estimating an execution plan corresponding to the optimized structured query language statement based on the cumulative cost estimate.
  • the cost estimation method described in the present application is applied to the optimization of SQL in the database system, and the accuracy of the cost estimate cost estimation of the SQL is improved, thereby generating a better performance execution plan.
  • the execution plan may be cost optimized based on incomplete statistical information.
  • the cost estimation model in the related technology is used for estimation, and the optimal execution plan is generated based on the estimation result; when the statistical information is incomplete, the cost estimation of the corresponding operation is performed by the method of cost estimation described in the present application. To obtain an estimate and generate an optimal execution plan based on the estimate.
  • the corresponding cost estimation manner is determined according to the operation type (ie, the first operation type) of the operation tree that depends on the statistical information.
  • the operation type of the operation tree that depends on the statistical information may include: a scan operation, a filtering operation, a connection operation, and an aggregation operation.
  • an operation tree is composed of different operators, each of which represents an operation type, which may include scanning, filtering, connecting, projecting, aggregating, selecting, and the like.
  • an operation tree diagram adjusts the operation type on the operation tree, estimates the cost of each operation type, and finally generates an operation tree with the smallest cumulative cost to generate an execution plan.
  • the cost of different operation types depends on the statistical information.
  • the cost of the scan operation, the filtering operation, the connection operation, and the aggregation operation depend on the statistical information, that is, only the scan operation, the filtering operation, the connection operation, and the aggregation operation are obtained.
  • the complete statistical information can determine the cost estimate of the sweep operation, the filtering operation, the connection operation, and the aggregation operation; and the cost of the projection operation and the selection operation does not depend on the statistical information, that is, it is not necessary to obtain the projection operation and the selection operation.
  • Complete statistical information can also be used to determine the cost estimate for projection operations and selection operations.
  • the present application judges whether the obtained statistical information is complete, and if the statistical information is incomplete, the cost estimation method of the operation type depending on the statistical information is improved.
  • a cost estimation manner of the operation type depending on the statistical information is determined, thereby determining a cost estimate of the operation type depending on the statistical information according to the cost estimation manner.
  • a cost estimate corresponding to the type of operation of the statistical information operation type ie, the first cost estimate of the first operation type
  • an operation type not dependent on the statistical information the second cost of the second operation type
  • the cost estimation of the scan operation, the filtering operation, the connection operation, and the aggregation operation depending on the statistical information is accumulated with the cost estimation of the projection operation and the selection operation not dependent on the statistical information to obtain the cost estimate of the operation tree, thereby generating the most Excellent implementation plan.
  • a cost estimate of the scan operation is determined according to the number of records of the data set; a selection rate of the filter condition is determined according to a type of the filter predicate, and a cost estimate of the filtering operation is determined according to the selection rate; Determining a cost estimate of the connection operation according to the determined number of records of the connection result set; determining an aggregation rate of the aggregation field according to the aggregation field and the aggregation function, and determining a cost estimate of the aggregation operation according to the aggregation rate.
  • the cost of the scan operation is related to the size of the data set (also known as the number of records in the data set).
  • the cost of the filtering operation is related to the selection rate of the filtering conditions, and the selection rate is calculated according to the filtering conditions.
  • the selection rate of the filtering conditions is determined differently.
  • the cost of the join operation is related to the size of the data set participating in the join, and also the size of the join result set after the join. Therefore, first determine the number of records in the join result set, and then determine the cost estimate of the join operation based on the number of records in the join result set. .
  • the cost of the aggregation operation depends on the amount of data involved in the aggregation and the aggregation rate of the aggregation field.
  • the aggregation rate of the aggregation field needs to be calculated according to the number of different values of the aggregation field. Therefore, when determining the cost of the aggregation operation, the aggregation field needs to be calculated. The number of different values.
  • the number of records in the data set is recorded as RC (Row Count), and the size of the data set (denoted as A) is in a proportional relationship with the RC, that is, A ⁇ RC, so in the embodiment of the present application, You can use RC to represent the size of the data set, then the cost of the scan operation is estimated to be:
  • the formula indicates that the cost estimate of the scan table operation can be determined based on the record number RC of the data set, wherein the Op represents the target operation for which the cost is to be estimated, and the getOriginalCost represents the method using the correlation technique and the RC to estimate the cost.
  • the cost estimate of the filtering operation is related to the selection rate of the filtering predicate, and the estimation algorithms of the selection rates of the different types of filtering predicates are different.
  • the cost of the filtering operation is related to the selection rate of the filtering condition, and the selection rate of the filtering condition is calculated according to the filtering condition.
  • the identity predicate and the unequal predicate are a pair of complementary predicates, that is, the sum of the selection conditions of the filter conditions of the constant predicate and the unequal predicate is 1, therefore,
  • the type of the filtering predicate is a range predicate
  • the filter condition is between two values.
  • v 1 , v 2 , . . . , v n are filter values
  • C is a set of filter values
  • n is the number of filter values in set C.
  • the selection rate of the filtering condition S 1/specified value.
  • the selection rate needs to be estimated according to the null value of the null predicate field.
  • the filtering predicates of the cascading are, in order to prevent the estimation error from being cascading and amplifying, in the embodiment of the present application, the selection rate of the filtering conditions is estimated by the following method:
  • defines the minimum value of the filter filter rate of And cascade filter predicate, and can adjust the value of ⁇ according to the actual situation; selectivity is the selection rate, the above formula represents the cascaded selection rate value of And cascade filter predicate and And cascade filter The maximum value of the minimum value of the predicate filter rate is used as the selection rate of the filter condition.
  • the selection rate of the filtering condition is determined according to the selection rate of the concatenated predicate corresponding to the or concatenated predicate.
  • the filter or predicate filter predicate is:
  • the selection rate of the filter condition is determined as follows:
  • the selection rate of the corresponding filtering conditions is discussed for different types of filtering predicates, and the cost of the filtering operation can be calculated according to the determined selection rate:
  • the number of records of the connection result set is determined according to the number of different values of the connection fields of the connected left and right tables.
  • the left table of the join is denoted as T left
  • the number of records is RC left
  • the number of odd values is NDV left
  • the right table of the join join is T right
  • the number of records is RC right
  • the number of distinct values is NDV right
  • the number of records in the result set is recorded as RC result .
  • the left and right tables participating in the connection have the following operations: selection operation, filtering operation, scan table operation, etc.
  • the left and right tables respectively constitute the left operation tree and the right operation tree, and the cost estimate of each operation type on the operation tree. It can be estimated, therefore, the data set sizes of the left and right tables participating in the join (ie, RC left and RC right ) can be calculated from the estimation of the types of operations contained in the left and right tables and the existing calculation methods.
  • the number of records of the connection result set is determined based on the connection type of the connection field.
  • the following discussion discusses how to determine the number of records for a join result set based on the connection type of the join field.
  • connection type of the connection field of the left table and the right table is connected as a primary key (PK)-foreign key (FK) form
  • the record of the connection result set is determined according to the number of records of the foreign key field and the selection rate of the filtering condition of the primary key field. number.
  • the connection type of the connection field of the left table and the right table is a primary key - the foreign key form connection can mean that the connection type of the connection field of the left table is a primary key form connection, and the connection type of the connection field of the right table is a foreign key form. Connection; or the connection type of the connection field of the left table is a foreign key connection, and the connection type of the connection field of the right table is a primary key connection.
  • connection type of the connection field of the left table is a primary key form connection
  • connection type of the connection field of the right table is a foreign key form connection
  • connection type of the connection fields of the left table and the right table is a non-primary key-foreign key connection, it can be further divided into the following cases 1 - 4.
  • Case 1 When the connection type of the connection fields of the left table and the right table is an inner connection, the maximum number of records of the left table and the number of records of the right table are used as the number of records of the connection result set;
  • connection type of the connection field of the left table and the right table is a cross-multiply connection
  • the product of the number of records of the connected left table and the number of records of the right table is used as the number of records of the connection result set
  • connection type of the connection field of the left table is a left outer connection
  • connection type of the connection field of the right table is a right outer connection
  • the corresponding number of records of the left table or the record of the right table will be The number is the number of records that are connected to the result set.
  • connection type of the connection fields of the left table and the right table is full connection
  • the number of records of the connected left table and the number of records of the right table are counted as the number of records of the connection result set.
  • the connection type of the connection field is a full outer join
  • cost Op.getOriginalCost(RC left , RC right , RC result ).
  • the cost estimate of the aggregation operation is related to the size of the data volume participating in the aggregation and the aggregation rate of the aggregation field, and the aggregation rate R agg may be defined as:
  • the aggregation rate of the aggregated field needs to be calculated based on the number of unique values of the aggregated field.
  • the number of unique values of the aggregated field cannot be obtained, the following discussion is made.
  • the aggregation rate R agg of the aggregate field is a piecewise function as follows:
  • n is the number of aggregated fields in the set of aggregated fields.
  • the aggregation rate R agg of the aggregate field is:
  • n is the number of aggregated fields in the set of aggregated fields
  • k is a positive integer
  • the aggregation rate R agg of the aggregate field is:
  • n is the number of aggregated fields in the set of aggregated fields
  • k is a positive integer
  • the cost estimation of the operation tree may be performed by the method of cost estimation described in the foregoing embodiment of the present application, and the method for estimating the cost described in the present application may be applied.
  • the cost optimizer in the related art is improved based on the method of cost estimation described in the present application, and the cost of the execution plan can be optimized based on incomplete statistical information.
  • the cost estimation model in the related art can estimate the cost of the corresponding operation, and generate an optimal execution plan based on the estimation result.
  • the cost optimizer in the related art cannot make a cost estimation for the corresponding operation, and the cost estimation model described in the present application (ie, the cost estimation method for the operation operation of the operation tree is improved)
  • the operation performs a cost estimate and generates an optimal execution plan based on the estimation results.
  • FIG. 3 is a block diagram showing the structure of a device for cost estimation according to another aspect of the present application, the device comprising: a judging device 11, a determining device 12, and an estimating device 13.
  • the judging device 11 is configured to determine whether the obtained statistical information is complete. If the statistical information is incomplete, the corresponding cost estimation manner is determined according to the operation type of the operation tree depending on the statistical information.
  • the determining means 12 is arranged to determine a cost estimate for the corresponding type of operation based on the cost estimate.
  • the estimating means 13 is arranged to determine a cumulative cost estimate of the operating tree based on a cost estimate corresponding to the cost estimate corresponding to the statistical information operation type and the operation type not dependent on the statistical information.
  • the above device can solve the problem that the cost optimizer in the related art cannot estimate the cost for creating the temporary table and the subquery at the runtime, and the scenario of the massive data is limited by the data size.
  • the apparatus further comprises: an execution device configured to estimate an execution plan corresponding to the statement of the optimized structured query language based on the cumulative cost estimate.
  • the cost estimation method described in the present application is applied to the optimization of SQL in the database system, and the accuracy of the cost estimate cost estimation of the SQL is improved, thereby generating a better performance execution plan.
  • the cost of the execution plan can be optimized based on incomplete statistical information.
  • the cost estimation model in the related technology is used for estimation, and the optimal execution plan is generated based on the estimation result; when the statistical information is incomplete, the cost estimation of the corresponding operation is performed by the method of cost estimation described in the present application. And generate an optimal execution plan based on the estimation results.
  • the judging means 11 is arranged to determine the corresponding cost estimation mode according to the operation type of the operation tree depending on the statistical information when it is judged that the acquired statistical information is incomplete.
  • the operation types of the operation tree that depend on the statistical information include a scan operation, a filtering operation, a connection operation, and an aggregation operation.
  • an operation tree is composed of different operators, each of which represents an operation type, and an operation type of an operation tree may include operations such as scanning, filtering, connecting, projecting, aggregating, and selecting.
  • an operation tree diagram adjusts the operation type on the operation tree, estimates the cost of each operation type, and finally generates an operation tree with the smallest cumulative cost to generate an execution plan, and the cost of different operation types The dependence of statistical information is different.
  • the cost estimation of the sweep operation, the filtering operation, the connection operation, and the aggregation operation need to rely on statistical information, and the cost estimation of the projection operation and the selection operation does not depend on the statistical information. Therefore, it is necessary to determine the type of operation and use the corresponding cost estimation method to estimate the cost of each type of operation. When it is judged that the obtained statistical information is incomplete, the cost estimation method of the operation type depending on the statistical information is improved.
  • the determining means 12 is arranged to determine a cost estimate of the type of operation dependent on the statistical information, thereby determining a cost estimate for the type of operation.
  • the estimating means 13 is arranged to determine a cumulative cost estimate of the operating tree based on a cost estimate corresponding to the cost estimate corresponding to the statistical information operation type and the operation type not dependent on the statistical information. For example, the cost estimation of the scan operation, the filtering operation, the connection operation, and the aggregation operation depending on the statistical information is accumulated with the cost estimation of the projection operation and the selection operation not dependent on the statistical information to obtain the cost estimate of the operation tree, thereby generating the most Excellent implementation plan.
  • the determining device 12 is configured to determine a cost estimate of the scan table operation according to the number of records of the data set; determine a selection rate of the filter condition according to the type of the filter predicate, and determine a cost estimate of the filter operation according to the selection rate. And determining a cost estimate of the connection operation according to the determined number of records of the connection result set; determining an aggregation rate of the aggregation field according to the aggregation field and the aggregation function, and determining a cost estimate of the aggregation operation according to the aggregation rate.
  • the cost of the scan operation is related to the size of the data set (also referred to as the number of records in the data set).
  • the cost of the filtering operation is related to the selection rate of the filtering conditions, and the selection rate is calculated according to the filtering conditions.
  • the selection rate is determined differently.
  • the cost of the join operation is related to the size of the data set participating in the join, and also the size of the result set after the join. Therefore, first determine the number of records in the join result set, and then determine the cost estimate of the join operation based on the number of records in the join result set.
  • the cost of the aggregation operation is related to the amount of data involved in the aggregation and the aggregation rate of the aggregation field.
  • the aggregation rate of the aggregation field needs to be calculated according to the number of unique values of the aggregation field. Therefore, the number of heterogeneous values of the aggregation field needs to be calculated.
  • the number of records in the data set is recorded as RC (Row Count), and the size of the data set (denoted as A) is in a proportional relationship with the RC, that is, A ⁇ RC, so in the embodiment of the present application, You can use RC to represent the size of the data set, then the cost of the scan operation is estimated to be:
  • the formula indicates that the cost estimate of the scan table operation can be determined based on the record number RC of the data set, wherein the Op represents the target operation for which the cost is to be estimated, and the getOriginalCost represents the method using the correlation technique and the RC to estimate the cost.
  • the cost estimate of the filtering operation is related to the selection rate of the filtering predicate, and the estimation algorithms of the selection rates of the different types of filtering predicates are different.
  • the cost of the filtering operation is related to the selection rate of the filtering condition, and the selection rate of the filtering condition is calculated according to the filtering condition.
  • the identity predicate and the unequal predicate are a pair of complementary predicates, that is, the sum of the selection conditions of the filter conditions of the constant predicate and the unequal predicate should be 1, therefore
  • the filter condition is between two values.
  • v 1 , v 2 , . . . , v n are filter values
  • C is a set of filter values
  • n is the number of filter values in set C.
  • the selection rate of the filtering condition S 1/specified value.
  • the selection rate needs to be estimated according to the null value of the null predicate field.
  • defines the minimum value of the filter filter rate of And cascade filter predicate, and can adjust the value of ⁇ according to the actual situation; selectivity is the selection rate, the above formula represents the cascaded selection rate value of And cascade filter predicate and And cascade filter The maximum value of the minimum value of the predicate filter rate is used as the selection rate of the filter condition.
  • the selection rate of the filtering condition is determined according to the selection rate of the concatenated predicate corresponding to the or concatenated predicate.
  • the filter or predicate filter predicate is:
  • the selection rate of the filter condition is determined as follows:
  • the selection rate of the corresponding filter condition is discussed for different types of filter predicates, and the cost of the filtering operation can be calculated according to the determined selection rate:
  • the number of records of the connection result set is determined according to the number of different values of the connection fields of the connected left and right tables.
  • the left table of the join connection is T left
  • the number of records is RC left
  • the number of odd values is NDV left
  • the right table of the join join is T right
  • the number of records is RC right
  • the number of distinct values is NDV right
  • the number of records in the result set is recorded as RC result .
  • the left and right tables participating in the connection have the following operations: selection operation, filtering operation, scan table operation, etc.
  • the left and right tables respectively constitute the left operation tree and the right operation tree, and the cost estimates of each operation type on the operation tree are estimated. It can be estimated, therefore, the data set sizes of the left and right tables participating in the join (ie, RC left and RC right ) can be calculated from the estimation of the operation types of the left and right tables and the existing calculation methods.
  • the number of records of the connection result set is determined based on the connection type of the connection field.
  • the following discussion discusses how to determine the number of records for a join result set based on the connection type of the join field.
  • connection type of the connection field is a primary key-foreign key form connection
  • number of records of the connection result set is determined according to the number of records of the foreign key field and the selection rate of the filtering condition of the primary key field. If the connection type of the connection field of the left table and the connection type of the connection field of the right table are the primary key-foreign key form connection, respectively, the number of records of the connection result set RC result is:
  • connection type of the connection field is non-primary key-foreign key connection, it can be further divided into the following cases 1 - case 4.
  • connection result When the connection type of the connection field is an inner connection, the maximum number of records of the connected left table and the number of records of the right table is used as the number of records of the connection result set; here, in the implementation of the present application
  • connection type of the connection field is a cross-multiply connection
  • the product of the number of records of the connected left table and the number of records of the right table is used as the number of records of the connection result set
  • connection type of the connection field When the connection type of the connection field is left outer join or right outer join, the number of records of the corresponding connected left table or the number of records of the right table is used as the number of records of the connection result set.
  • connection type of the connection field When the connection type of the connection field is full connection, the number of records of the connected left table and the number of records of the right table are counted as the number of records of the connection result set.
  • the connection type of the connection field is a full outer join
  • cost Op.getOriginalCost(RC left , RC right , RC result ).
  • the cost estimate of the aggregation operation is related to the size of the data volume participating in the aggregation and the aggregation rate of the aggregation field, and the aggregation rate R agg may be defined as:
  • the aggregation rate of the aggregated field needs to be calculated based on the number of unique values of the aggregated field.
  • the number of unique values of the aggregated field cannot be obtained, the following discussion is made.
  • the aggregation rate R agg of the aggregate field is a piecewise function as follows:
  • n is the number of aggregated fields in the set of aggregated fields
  • the aggregation rate R agg of the aggregate field is:
  • n is the number of aggregated fields in the set of aggregated fields, and k is a positive integer
  • the aggregation rate R agg of the aggregate field is:
  • n is the number of aggregated fields in the set of aggregated fields
  • k is a positive integer
  • a cost optimizer is further provided, wherein the cost optimizer is configured to: generate an original execution plan; determine whether the obtained statistical information is complete, and if the statistical information is complete, according to the statistical information
  • the first cost estimation model ie, the cost estimation model in the related art
  • the second cost estimation model based on the first cost estimation model is optimized (ie, the cost estimation model in the present application) Estimating a cost; generating an optimal execution plan based on the cost estimated by the first cost estimation model or the cost estimated by the second cost estimation model.
  • the cost optimizer shown in FIG. 4 can optimize the execution plan based on incomplete statistical information.
  • the cost is estimated using the first cost estimation model, wherein the first cost estimation model is a cost estimation model in the related art, and the cost estimation model in the related technology requires statistical information when performing cost optimization, and therefore, Collect complete and accurate statistics.
  • an execution plan is generally represented by an operation tree, which is composed of different operation types, which may include scanning, selecting, filtering, aggregating, connecting, projecting, and the like.
  • the cost optimizer can use the cost estimation model to estimate the cost of each type of operation of the execution plan, and select the smallest overall cost from all possible execution plans as the final The execution plan, thus completing the entire cost optimization process.
  • the cost is estimated using the second cost estimation model optimized for the first cost estimation model, that is, the cost is estimated using the cost estimation model of the present application, and an optimal execution plan is generated based on the estimation result.
  • the cost estimation model of the present application is configured to determine a corresponding cost estimation manner according to an operation type (ie, a first operation type) of an operation tree that depends on the statistical information; and determine the corresponding operation type based on the cost estimation manner.
  • a cost estimate (ie, a first cost estimate of the first type of operation); a cost estimate corresponding to the cost estimate corresponding to the type of operation of the statistical information and an operation type not dependent on the statistical information (ie, the cost corresponding to the second type of operation) Estimating) determining a cumulative cost estimate for the operational tree.
  • the cost estimation of the operation tree may be performed by the method performed in the device of the cost estimation described in the above embodiment of the present application, and the cost estimate described in the present application.
  • Devices can be applied to temporary tables and subqueries created at runtime, as well as scenarios for massive amounts of data, which can be quickly estimated at a cost without being limited by the size of the data.
  • the cost optimizer obtained by the device based on the cost estimation described in the present application can optimize the execution plan based on incomplete statistical information.
  • the cost estimation model in the related art can estimate the cost of the corresponding operation, and generate an optimal execution plan based on the estimation result.
  • the cost optimizer in the related art cannot make a cost estimation for the corresponding operation, and the cost estimation model of the present application (ie, the cost estimation method for the related operation of the operation tree is improved) costs the corresponding operation. Estimate and generate an optimal execution plan based on the estimates.
  • a computing-based device comprising: a processor; and a memory arranged to store computer executable instructions.
  • the executable instruction when executed, causes the processor to: determine whether the acquired statistical information is complete, and if the statistical information is incomplete, according to an operation type (first operation type) of the operation tree depending on the statistical information. Determining a corresponding cost estimation manner; determining a cost estimate of the corresponding operation type (ie, a first cost estimate of the first operation type) based on the cost estimation manner; and estimating a cost corresponding to the operation type depending on the statistical information ( The first cost estimate of the first type of operation and the cost estimate corresponding to the type of operation not dependent on the statistical information (ie, the second cost estimate corresponding to the second type of operation) determine a cumulative cost estimate for the operational tree.
  • the present application can be implemented in software and/or a combination of software and hardware, for example, using an application specific integrated circuit (ASIC), a general purpose computer, or any other similar hardware device.
  • the software program of the present application can be executed by a processor to implement the steps or functions described above.
  • the software programs (including related data structures) of the present application can be stored in a computer readable recording medium such as a RAM memory, a magnetic or optical drive or a floppy disk and the like.
  • some of the steps or functions of the present application may be implemented in hardware, for example, as a circuit that cooperates with a processor to perform various steps or functions.
  • a portion of the present application can be applied as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide a method and/or technical solution in accordance with the present application.
  • the program instructions for invoking the method of the present application may be stored in a fixed or removable recording medium, and/or transmitted by a data stream in a broadcast or other signal bearing medium, and/or stored in a The working memory of the computer device in which the program instructions are run.
  • an embodiment in accordance with the present application includes a device including a memory for storing computer program instructions and a processor for executing program instructions, wherein when the computer program instructions are executed by the processor, triggering
  • the apparatus operates based on the aforementioned methods and/or technical solutions in accordance with various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Operations Research (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

一种代价优化器与代价估计的方法及其设备,当判断获取到的统计信息不完备时,则根据依赖于所述统计信息的操作树的操作类型确定对应的代价估计方式(S11);基于所述代价估计方式确定所述对应操作类型的代价估计(S12);根据依赖于所述统计信息操作类型对应的代价估计及未依赖于统计信息的操作类型对应的代价估计确定所述操作树的累积代价估计(S13)。从而对于运行时创建临时表和子查询可以进行代价估算,实现对海量数据的场景不受数据规模限制。

Description

代价优化器与代价估计的方法及其设备
本申请要求在2017年11月22日提交中国专利局、申请号为201711175349.9的中国专利申请的优先权,该申请的全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机领域,例如涉及代价优化器与代价估计的方法及其设备。
背景技术
代价优化器(CBO,Cost Based Optimizer)是数据库系统中的核心部件,由于对数据库系统的性能影响显著,因而在现代数据库系统中占据重要地位。代价优化器的核心是代价估计模型,用于对数据库系统生成的执行计划进行代价估计,从而选择最优的执行计划。估计模型的好坏影响最终执行计划的优劣。代价优化器基于目标数据的统计信息对执行计划进行优化,统计信息的完整性和精确性直接影响到估计模型对执行计划的估计。
进行代价优化需要统计信息,因此收集完整而精确的统计信息是所有代价优化器必不可少的环节。优化代价优化器是为了生成最优的执行计划。在数据库系统中,执行计划一般用操作树表示,操作树由不同类型的操作构成,一棵操作树包括的操作类型可以为扫表、选择、过滤、聚合、连接、投影等。基于统计信息,代价优化器可以借助代价估算模型,对执行计划的每一步操作进行代价估算,并从所有可能的执行计划中选取整体代价最小者作为最终的执行计划,由此就完成了整个代价优化的过程。从中可知,代价优化器的核心在于基于统计信息的代价估算,统计信息是代价优化器的基础。
代价优化器需要统计信息,在缺少必要统计信息的情况下无法完成代价优化。例如,在运行时创建的临时表或者存在子查询的场景,在编译阶段不能确定其统计信息,因此无法完成代价优化。另一方面,当代价优化器被应用到大数据系统中处理海量数据时,由于海量数据的统计信息收集代价巨大,收集该 海量数据的统计信息将成为应用代价优化器的瓶颈。对于给定的数据集,并非需要收集全量数据的统计信息,基于部分统计信息也可能得到最优计划。在统计信息不完整或者不可获取的情况下,相关技术中的代价优化器失效。
发明内容
本申请提供一种代价优化器与代价估计的方法及其设备,可以解决统计信息不完备时无法进行代价估计的问题。
根据本申请的一个方面,提供了一种代价估计的方法,该方法包括:判断获取到的统计信息是否完备,若统计信息不完备,确定第一操作类型的代价估计方式;基于所述代价估计方式确定所述第一操作类型的第一代价估计,该第一操作类型为依赖于所述统计信息的操作树的操作类型;以及根据第一代价估计及第二操作类型的第二代价估计确定所述操作树的累积代价估计,第二操作类型为不依赖于所述统计信息的操作树的操作类型。
根据本申请的另一方面,还提供了一种代价估计的设备,所述设备包括:判断装置、确定装置和估算装置。
判断装置设置为判断获取到的统计信息是否完备,若统计信息不完备,则根据依赖于所述统计信息的操作树的操作类型确定对应的代价估计方式。
确定装置设置为基于所述代价估计方式确定所述对应操作类型的代价估计。
估算装置设置为根据依赖于所述统计信息操作类型对应的代价估计及未依赖于统计信息的操作类型对应的代价估计确定所述操作树的累积代价估计。
根据本申请再一个方面,还提供了一种代价优化器,该代价优化器设置为:生成原始执行计划;判断获取到的统计信息是否完备,若统计信息完备,则根据基于统计信息的第一代价估算模型估算代价,若统计信息不完备,则根据第二代价估算模型估算代价;根据所述第一代价估算模型估算的代价或所述第二代价估算模型估算的代价生成最优执行计划。
根据本申请再一个方面,还提供了一种基于计算的设备,包括:处理器;以及被安排成存储计算机可执行指令的存储器,所述可执行指令在被执行时使 所述处理器:判断获取到的统计信息是否完备,若统计信息不完备,则根据依赖于所述统计信息的操作树的操作类型确定对应的代价估计方式;基于所述代价估计方式确定所述对应操作类型的代价估计;根据依赖于所述统计信息操作类型对应的代价估计及未依赖于统计信息的操作类型对应的代价估计确定所述操作树的累积代价估计。
根据本申请再一个方面,还提供了一种计算机可读存储介质,存储有计算机可执行指令,所述计算机可执行指令用于执行上述代价估计的方法。
本申请判断获取到的统计信息不完备时,则根据依赖于所述统计信息的操作树的操作类型确定对应的代价估计方式;基于所述代价估计方式确定所述对应操作类型的代价估计;根据依赖于所述统计信息操作类型对应的代价估计及未依赖于统计信息的操作类型对应的代价估计确定所述操作树的累积代价估计。从而解决传统代价优化器对于运行时创建临时表和子查询无法进行代价估算,对海量数据的场景受数据规模限制的问题。此外,根据所述累积代价估计优化结构化查询语言(Structured Query Language,SQL)的语句对应的执行计划,可应用于数据库系统中SQL的优化,提高SQL的代价优化器代价估计的准确性,从而生成性能更好的执行计划。
附图说明
图1示出根据本申请一个方面提供的一种代价估计的方法流程示意图。
图2示出本申请中的一实施例的一棵的操作树示意图。
图3示出根据本申请另一个方面提供的一种代价估计的设备结构示意图。
图4示出根据本申请再一个方面提供的一种改进的代价优化器示意图。
附图中相同或相似的附图标记代表相同或相似的部件。
具体实施方式
在本申请一个实施例中,终端、服务网络的设备和可信方均包括一个或多个处理器(例如,中央处理器(Central Processing Unit,CPU)、输入/输出接口、 网络接口和内存。
内存可能包括计算机可读存储介质中的非永久性存储器,随机存取存储器(Random-Access Memory,RAM)和/或非易失性内存等形式,如只读存储器(Read-Only Memory,ROM)或闪存(flash RAM)。内存是计算机可读存储介质的示例。
计算机可读存储介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机存储介质的例子包括但不限于相变内存(Phase RAM,PRAM)、静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(Electrically-Erasable Programmable ROM,EEPROM)、快闪记忆体或其他内存技术、只读型存储器(Compact Disc ROM,CD-ROM)、数字多功能光盘(Digital Versatile Disc,DVD)或其他光学存储、磁盒式磁带,磁带磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读存储介质不包括非暂存电脑可读媒体(transitory media),如调制的数据信号和载波。
图1示出根据本申请一个方面提供的一种代价估计的方法流程示意图,所述方法包括:S11~S13。
在S11中,判断获取到的统计信息是否完备,若获取到的统计信息不完备,则根据依赖于所述统计信息的操作树的操作类型确定对应的代价估计方式。
一棵操作树可以包括多种操作类型,如包括依赖于统计信息的操作类型和不依赖于统计信息的操作类型。在本申请中,将依赖于统计信息的操作树的操作类型称为第一操作类型,第一操作类型的代价估计称为第一代价估计;将不依赖于统计信息的操作树的操作类型称为第二操作类型,第二操作类型的代价估计称为第二代价估计。
在S12中,基于所述代价估计方式确定第一操作类型的第一代价估计。
在S13中,根据依赖于所述统计信息操作类型对应的代价估计(即第一操 作类型的第一代价估计)及不依赖于统计信息的操作类型对应的代价估计(即第二操作类型的第二代价估计)确定所述操作树的累积代价估计。
上述方法可以解决在运行时创建临时表和子查询无法进行代价估算的场景,或者对海量数据的场景受数据规模限制,传统代价优化器无法进行代价优化的问题。
在一个可选示例中,本申请所述方法包括:根据所述累积代价估计优化结构化查询语言语句对应的执行计划。本申请所述的代价估计的方法应用于数据库系统中SQL的优化,提高SQL的代价优化器代价估计的准确性,从而生成性能更好的执行计划。
在本申请一实施例中,可以基于不完备的统计信息对执行计划进行代价优化。当统计信息完备时,利用相关技术中的代价估算模型进行估算,基于估算结果生成最优执行计划;当统计信息不完备时,通过本申请所述的代价估计的方法,对相应操作进行代价估算以得到估算结果,并基于估算结果生成最优执行计划。
在S11中,判断获取到的统计信息是否完备,若统计信息不完备,则根据依赖于所述统计信息的操作树的操作类型(即第一操作类型)确定对应的代价估计方式。所述依赖于所述统计信息的操作树的操作类型可以包括:扫表操作、过滤操作、连接操作和聚合操作。在此,一棵操作树由不同的操作符构成,每一种操作符代表一种操作类型,该操作类型可以包括扫表、过滤、连接、投影、聚合、选择等。如图2所示的一棵操作树示意图,调整操作树上的操作类型,估算每一种操作类型的代价,最终生成一棵累积代价最小的操作树,生成执行计划。不同操作类型的代价对统计信息的依赖不同,比如,扫表操作、过滤操作、连接操作和聚合操作的代价需要依赖统计信息,即只有获得了扫表操作、过滤操作、连接操作和聚合操作对应的完整的统计信息,才可确定扫表操作、过滤操作、连接操作、和聚合操作的代价估计;而投影操作和选择操作的代价不依赖于统计信息,即无需获取投影操作和选择操作对应的完整的统计信息,也可以确定投影操作和选择操作的代价估计。因此需要确定操作树的操作类型, 采用与操作类型对应的代价估计方式确定每一种操作类型的代价估计。本申请判断获取到的统计信息是否完备,若统计信息不完备,则对依赖于统计信息的操作类型的代价估算方法进行改进。
在S12中,确定依赖于统计信息的操作类型的代价估计方式,从而根据代价估计方式确定该依赖于统计信息的操作类型的代价估计。
在S13中,根据依赖于所述统计信息操作类型的代价估计(即第一操作类型的第一代价估计)及不依赖于统计信息的操作类型对应的代价估计(第二操作类型的第二代价估计)确定所述操作树的累积代价估计。
比如,将依赖于统计信息的扫表操作、过滤操作、连接操作和聚合操作的代价估计与不依赖于统计信息的投影操作和选择操作的代价估计进行累积得到操作树的代价估计,进而生成最优执行计划。
在本申请一实施例中,在S12中,根据数据集的记录数确定扫表操作的代价估计;根据过滤谓词的类型确定过滤条件的选择率,根据所述选择率确定过滤操作的代价估计;根据确定的连接结果集的记录数确定连接操作的代价估计;根据聚合字段及聚合函数确定聚合字段的聚合率,根据所述聚合率确定聚合操作的代价估计。
扫表操作的代价与数据集的大小(又可称为数据集的记录数)相关,数据集越大,扫表代价越高,因此可以根据数据集的记录数确定扫表操作的代价估计。
过滤操作的代价与过滤条件的选择率相关,选择率又根据过滤条件计算得到,而过滤谓词类型不同时,过滤条件的选择率的确定方式也不同。
连接操作的代价与参与连接的数据集大小有关,也与连接后的连接结果集的大小有关,因此,先确定连接结果集的记录数,再根据连接结果集的记录数确定连接操作的代价估计。
聚合操作的代价与参与聚合的数据量大小和聚合字段的聚合率有关,聚合字段的聚合率需要根据聚合字段的异值数计算得到,因此,在确定聚合操作的代价时,需要计算聚合字段的异值数。
在本申请一实施例中,将数据集的记录数记为RC(Row Count),数据集的大小(记为A)与RC成正比例关系,即:A∝RC,因此在本申请实施例中,可以使用RC代表数据集的大小,则扫表操作的代价估计cost为:
cost=Op.getOriginalCost(RC);
该公式表示可以根据数据集的记录数RC确定扫表操作的代价估计,其中,所述Op表示想要估计代价的目标操作,getOriginalCost表示使用相关技术的方法和RC来估计代价。
在本申请一实施例中,在过滤操作中,过滤操作的代价估计与过滤谓词的选择率相关,不同类型的过滤谓词的选择率的估算算法不同。可以通过判断过滤谓词的类型对应的计算选择率的对象是否可获取,从而确定过滤条件的选择率。若过滤谓词的类型对应的计算选择率的对象不可获取,则过滤条件的选择率S=1/指定值。在此,过滤谓词的类型对应的计算选择率的对象包括异值数、谓词字段的极值和谓词字段的空值数,指定值根据具体的实际应用而定,为大于1的正数,在不同过滤谓词的类型中,该指定值可能相同也可能不同,例如,S=1/5,S=1/9等。
在一个可选示例中,过滤操作的代价与过滤条件的选择率有关,过滤条件的选择率又根据过滤条件计算得到,在本申请实施例中,定义过滤条件的选择率为S=结果集记录总数/参与过滤操作记录总数。当过滤谓词的类型对应的计算选择率的对象不可获取时,根据过滤谓词的类型确定过滤条件的选择率作以下讨论。
当过滤谓词的类型为恒等谓词时,判断所述过滤操作的异值数是否可获取,若过滤操作的异值数不可获取,则过滤条件的选择率S=1/指定值;在此,恒等谓词(Equal、=)时,若异值数(NDV)可获取,则过滤条件的选择率S=1/NDV,当异值数不可获取时,S (恒等谓词)=1/10,其中,S (恒等谓词)=1/10是根据经验值确定,也可以为其他指定值。
当过滤谓词是不等谓词(Non-Equal、!=)时,恒等谓词与不等谓词是一对互补谓词,即恒定谓词和不等谓词的过滤条件的选择率之和为1,因此,不等谓 词的选择率为S (不等谓词)=1-S (恒等谓词)
当过滤谓词的类型为范围谓词时,其中,范围谓词包括一边区间范围(形如>、<、>=、<=),两值区间范围(形如between),是否在集合内(形如In)。若范围谓词的字段的极值或异值数未能获取,则过滤条件的选择率S=1/指定值。范围谓词形如>、<、>=、<=时,过滤条件的选择率根据范围谓词字段的极值确定,当极值不可获取时,定义过滤条件的选择率S=1/3。范围谓词形如between时,即过滤条件在两值之间,当极值不可获取时,定义过滤条件的选择率S=1/9。当范围谓词形如In时,C={v 1,v 2,...,v n},即过滤条件col IN为:
col IN(v 1,v 2,...,v n),n=|C|
其中,v 1,v 2,...,v n为过滤值,C为过滤值的集合,n为集合C中过滤值的个数,过滤时判断待过滤的数据是否属于该集合C内的值。当异值数不可获取时,定义过滤条件的选择率S=1/5。
当过滤谓词的类型为判空谓词时,判断判空谓词字段的空值数是否可获取,若判空谓词字段的空值数不可获取,则确定所述过滤条件的选择率S=1/指定值。当过滤谓词的类型为判空谓词时,选择率需要根据判空谓词字段的空值数估算,当空值数不可获取时,定义过滤条件的选择率S=1/10。当过滤谓词的类型为非空谓词时,由于非空谓词与判空谓词为互补谓词,则过滤条件的选择率S 非空谓词=1-S 判空谓词
过滤谓词的类型还包括以下几种情况:当过滤谓词的类型为like时,所述过滤条件的选择率为S=1/指定值。当过滤谓词的类型为like时,定义选择率S=1/5。当过滤谓词的类型为和级联谓词时,根据级联谓词的选择率的乘积与和级联谓词中过滤率最小值确定过滤条件的选择率。
And级联的过滤谓词,记AND级联的过滤谓词为,为防止估算误差被级联放大,在本申请实施例中,通过以下方式估算过滤条件的选择率:
Figure PCTCN2018116874-appb-000001
其中,α定义了And级联过滤谓词过滤率的最小值,可以根据实际情况调 整α的值;selectivity为选择率,上述公式表示And级联过滤谓词的级联的选择率值与And级联过滤谓词过滤率的最小值中的最大值作为该过滤条件的选择率。
当过滤谓词的类型为或级联谓词时,根据所述或级联谓词对应的级联谓词的选择率确定过滤条件的选择率。在此,记或级联的过滤谓词为:
P=P 1OR P 2OR...P n
则过滤条件的选择率按照如下方式确定:
Figure PCTCN2018116874-appb-000002
对不同类型的过滤谓词讨论其对应的过滤条件的选择率,根据确定的选择率可计算过滤操作的代价:
cost=Op.getOriginalCost(selectivity)
需要说明的是,本领域技术人员应能理解,上述实施例中出现的S=1/指定值中的数值仅为举例,且在上述实施例中每种过滤谓词对应的过滤条件选择率可以根据实际情况做调整,也可以通过参数传递等方式动态改变。
在本申请一实施例中,根据确定的连接结果集的记录数确定连接操作的代价估计之前,根据相连接的左表和右表的连接字段的异值数确定连接结果集的记录数。在此,参与连接的左表记为T left,其记录数为RC left,异值数为NDV left;参与连接的右表记为T right,其记录数为RC right,异值数为NDV right;结果集的记录数记为RC result。参与连接的左右表均有以下操作构成:选择操作、过滤操作、扫表操作等,左表和右表分别构成了左边操作树和右边操作树,操作树上的每一种操作类型的代价估计均可以估算得到,因此,参与连接的左表和右表的的数据集大小(即RC left和RC right)可由对左表和右表包含的操作类型的估算和已有的计算方法计算得到。
在本申请一实施例中,当相连接的左表和右表的连接字段的异值数不可获取时,基于连接字段的连接类型确定连接结果集的记录数。下文讨论如何基于连接字段的连接类型确定连接结果集的记录数。
当左表和右表的连接字段的连接类型为主键(PK)-外键(FK)形式连接时,则根据外键字段的记录数与主键字段的过滤条件的选择率确定连接结果集的记录 数。此处的左表和右表的连接字段的连接类型为主键-外键形式连接可以是指:左表的连接字段的连接类型为主键形式连接,右表的连接字段的连接类型为外键形式连接;或者左表的连接字段的连接类型为外键形式连接,右表的连接字段的连接类型为主键形式连接。
如果左表的连接字段的连接类型是主键形式连接,右表的连接字段的连接类型是外键形式连接,则连接结果集的记录数RC result为:
RC result=RC fk×selectivity(PK)
其中,RC fk为RC left和RC right之一,表示外键字段记录数;selectivity(PK)表示主键字段的的过滤条件的选择率。如果主键字段存在过滤条件,则最终连接的结果也会受该过滤条件的影响,因此,在计算连接结果集的记录数时,外键字段记录数要乘以主键字段过滤条件的选择率。当主键字段不存在过滤条件时,则主键字段的过滤条件的选择率selectivity(PK)=1。
当左表和右表的连接字段的连接类型为非主键-外键连接时,则又可分为以下情况1-情况4。
情况1:当左表和右表的连接字段的连接类型为内连接时,则将相连接的左表的记录数及右表的记录数中的最大值作为连接结果集的记录数;在此,在本申请一实施例中,将连接结果的记录数估算为左表连接记录数和右表连接记录数的最大值,即RC result=max(RC left,RC right)。即结果记录数为RC left和RC right中的最大值。
情况2:当左表和右表的连接字段的连接类型为叉乘连接时,则将相连接的左表的记录数及右表的记录数的乘积作为连接结果集的记录数;对于叉乘连接,其连接结果集的记录数为连接记录数之乘积,即:RC result=RC left×RC right
情况3:当左表的连接字段的连接类型为左外连接,或当右表的连接字段的连接类型为右外连接时,则将对应的相连接的左表的记录数或右表的记录数作为连接结果集的记录数。当连接字段的连接类型为左外连接时,连接结果集的记录数为T left的记录数R left,即RC result=RC left。当连接字段的连接类型为右外连接时,连接结果集的记录数为T right的记录数R right,即RC result=RC right
情况4:当左表和右表的连接字段的连接类型为全连接时,则将相连接的左表的记录数及右表的记录数的累加作为连接结果集的记录数。在此,当连接字段的连接类型为全外连接时,连接结果集的记录数为T left和T right的记录数之和,即:RC result=RC left+RC right
最终,连接操作的代价估计为:cost=Op.getOriginalCost(RC left,RC right,RC result)。
在本申请一实施例中,对于聚合操作过程,聚合操作的代价估计与参与聚合的数据量大小和聚合字段的聚合率有关,可以定义聚合率R agg为:
Figure PCTCN2018116874-appb-000003
聚合字段的聚合率需要根据聚合字段的异值数计算得到,当聚合字段的异值数无法获取时,进行如下讨论。
记聚合字段的集合C为C={c 1,c 2,...,c n},n=|C|,其中,c 1、c 2和c n为聚合字段,n为聚合字段的集合中聚合字段的个数。
当聚合字段的集合C中至少存在一个主键时,聚合字段的聚合率R agg为1,即R agg=1。
当聚合字段对应的聚合函数为简单聚合函数Group By时,聚合字段的聚合率R agg为如下的分段函数:
Figure PCTCN2018116874-appb-000004
其中,n为聚合字段的集合中聚合字段的个数。
当聚合字段对应的聚合函数为Rollup时,聚合字段的聚合率R agg为:
Figure PCTCN2018116874-appb-000005
其中,n为聚合字段的集合中聚合字段的个数,k为正整数。
当聚合字段对应的聚合函数为Cube时,聚合字段的聚合率R agg为:
Figure PCTCN2018116874-appb-000006
其中,n为聚合字段的集合中聚合字段的个数,k为正整数。
综上所述,在统计信息不准确或不完整的情况下,可以通过本申请上述实 施例中所述的代价估计的方法进行操作树的代价估算,本申请所述的代价估计的方法可以应用于运行时创建的临时表和子查询,以及海量数据的场景。可以快速地进行代价估计而不受数据规模的限制。另一方面,基于本申请所述的代价估计的方法对相关技术中的代价优化器进行改进,可以基于不完备的统计信息对执行计划进行代价优化。当统计信息完备时,相关技术中的代价估算模型可以估算相应操作的代价,基于估算结果生成最优执行计划。当统计信息不完备时,相关技术中的代价优化器无法对相应操作进行代价估算,则可以利用本申请所述的代价估算模型(即对操作树的相关操作的代价估计方法进行改进)对相应操作进行代价估算,并基于估算结果生成最优执行计划。
图3示出根据本申请另一个方面提供的一种代价估计的设备结构示意图,所述设备包括:判断装置11、确定装置12和估算装置13。
判断装置11设置为判断获取到的统计信息是否完备,若统计信息不完备,则根据依赖于所述统计信息的操作树的操作类型确定对应的代价估计方式。
确定装置12设置为基于所述代价估计方式确定所述对应操作类型的代价估计。
估算装置13设置为根据依赖于所述统计信息操作类型对应的代价估计及未依赖于统计信息的操作类型对应的代价估计确定所述操作树的累积代价估计。
利用上述装置可以解决相关技术中的代价优化器对于运行时创建临时表和子查询无法进行代价估算,对海量数据的场景受数据规模限制的问题。
在一个可选示例中,所述设备还包括:执行装置,该执行装置设置为根据所述累积代价估计优化结构化查询语言的语句对应的执行计划。本申请所述的代价估计的方法应用于数据库系统中SQL的优化,提高SQL的代价优化器代价估计的准确性,从而生成性能更好的执行计划。
在本申请一实施例中,利用本申请代价优化器,可以基于不完备的统计信息对执行计划进行代价优化。当统计信息完备时,利用相关技术中的代价估算模型进行估算,基于估算结果生成最优执行计划;当统计信息不完备时,通过 本申请所述的代价估计的方法,对相应操作进行代价估算,并基于估算结果生成最优执行计划。
判断装置11设置为在判断获取到的统计信息不完备时,则根据依赖于所述统计信息的操作树的操作类型确定对应的代价估计方式。所述依赖于所述统计信息的操作树的操作类型包括扫表操作、过滤操作、连接操作和聚合操作。在此,一棵操作树由不同的操作符构成,每一种操作符代表一种操作类型,一棵操作树的操作类型可以包括扫表、过滤、连接、投影、聚合、选择等操作。如图2所示的一棵操作树示意图,调整操作树上操作类型,估算每一种操作类型的代价,最终生成一棵累积代价最小的操作树,生成执行计划,而不同操作类型的代价对统计信息的依赖不同。比如,扫表操作、过滤操作、连接操作、聚合操作的代价估计需要依赖统计信息,而投影操作、选择操作的代价估计不依赖于统计信息。因此需要确定操作类型,采用对应的代价估计方式,进行每一种操作类型的代价估计。当判断获取到的统计信息不完备时,则对依赖于统计信息的操作类型的代价估算方法进行改进。
确定装置12设置为确定依赖于统计信息的操作类型的代价估计方式,进而确定操作类型的代价估计。
估算装置13设置为根据依赖于所述统计信息操作类型对应的代价估计及未依赖于统计信息的操作类型对应的代价估计确定所述操作树的累积代价估计。比如,将依赖于统计信息的扫表操作、过滤操作、连接操作和聚合操作的代价估计与不依赖于统计信息的投影操作和选择操作的代价估计进行累积得到操作树的代价估计,进而生成最优执行计划。
在本申请一实施例中,确定装置12设置为根据数据集的记录数确定扫表操作的代价估计;根据过滤谓词的类型确定过滤条件的选择率,根据所述选择率确定过滤操作的代价估计;根据确定的连接结果集的记录数确定连接操作的代价估计;根据聚合字段及聚合函数确定聚合字段的聚合率,根据所述聚合率确定聚合操作的代价估计。
在此,扫表操作的代价与数据集的大小(又可称为数据集的记录数)相关, 数据集越大,其扫表代价越高,因此可以根据数据集的记录数确定扫表操作的代价估计。
过滤操作的代价与过滤条件的选择率相关,选择率又根据过滤条件计算得到,而过滤谓词类型不同时,选择率确定的方式也不同。
连接操作的代价与参与连接的数据集大小有关,也与连接后结果集的大小有关,因此,先确定连接结果集的记录数,再根据连接结果集的记录数确定连接操作的代价估计。
聚合操作的代价与参与聚合的数据量大小和聚合字段的聚合率有关,聚合字段的聚合率需要根据聚合字段的异值数计算得到,因此,需要计算聚合字段的异值数。
在本申请一实施例中,将数据集的记录数记为RC(Row Count),数据集的大小(记为A)与RC成正比例关系,即:A∝RC,因此在本申请实施例中,可以使用RC代表数据集的大小,则扫表操作的代价估计cost为:
cost=Op.getOriginalCost(RC);
该公式表示可以根据数据集的记录数RC确定扫表操作的代价估计,其中,所述Op表示想要估计代价的目标操作,getOriginalCost表示使用相关技术的方法和RC来估计代价。
在本申请一实施例中,在过滤操作中,过滤操作的代价估计与过滤谓词的选择率相关,不同类型的过滤谓词的选择率的估算算法不同。可以通过判断过滤谓词的类型对应的计算选择率的对象是否可获取,从而确定过滤条件的选择率。若过滤谓词的类型对应的计算选择率的对象不可获取,则过滤条件的选择率S=1/指定值。在此,过滤谓词的类型对应的计算选择率的对象包括异值数、谓词字段的极值和谓词字段的空值数,指定值根据具体的实际应用而定,为大于1的正数,在不同过滤谓词的类型中,该指定值可能相同也可能不同,例如,S=1/5,S=1/9等。
在一个可选示例中,过滤操作的代价与过滤条件的选择率有关,过滤条件的选择率又根据过滤条件计算得到,在本申请实施例中,定义过滤条件的选择 率为S=结果集记录总数/参与过滤操作记录总数。当过滤谓词的类型对应的计算选择率的对象不可获取时,根据过滤谓词的类型确定过滤条件的选择率作以下讨论:
当过滤谓词的类型为恒等谓词时,判断所述过滤操作的异值数是否可获取,若过滤操作的异值数不可获取,则过滤条件的选择率S=1/指定值;在此,恒等谓词(Equal、=)时,若异值数(NDV)可获取,则过滤条件的选择率S=1/NDV,当异值数不可获取时,S (恒等谓词)=1/10,其中,S (恒等谓词)=1/10是根据经验值确定,也可以为其他指定值。
当过滤谓词是不等谓词(Non-Equal、!=)时,恒等谓词与不等谓词是一对互补谓词,即恒定谓词和不等谓词的过滤条件的选择率之和应该为1,因此,不等谓词的选择率为S (不等谓词)=1-S (恒等谓词)
当过滤谓词的类型为范围谓词时,其中,范围谓词包括一边区间范围(形如>、<、>=、<=),两值区间范围(形如between),是否在集合内(形如In)。若范围谓词的字段的极值或异值数未能获取,则过滤条件的选择率S=1/指定值;范围谓词形如>、<、>=、<=时,过滤条件的选择率根据范围谓词字段的极值确定,当极值不可获取时,定义过滤条件的选择率S=1/3。范围谓词形如between时,即过滤条件在两值之间,当极值不可获取时,定义过滤条件的选择率S=1/9。当范围谓词形如In时,C={v 1,v 2,...,v n},即过滤条件col IN为:
col IN(v 1,v 2,...,v n),n=|C|
其中,v 1,v 2,...,v n为过滤值,C为过滤值的集合,n为集合C中过滤值的个数,过滤时判断待过滤的数据是否属于该集合C内的值。当异值数不可获取时,定义过滤条件的选择率S=1/5。
当过滤谓词的类型为判空谓词时,判断判空谓词字段的空值数是否可获取,若判空谓词字段的空值数不可获取,则确定所述过滤条件的选择率S=1/指定值。当过滤谓词的类型为判空谓词时,选择率需要根据判空谓词字段的空值数估算,当空值数不可获取时,定义过滤条件的选择率S=1/10。当过滤谓词的类型为非空谓词时,由于非空谓词与判空谓词为互补谓词,则过滤条件的选择率S 非空谓词 =1-S 判空谓词
过滤谓词的类型还包括以下几种情况:当过滤谓词的类型为like时,所述过滤条件的选择率S=1/指定值。当过滤谓词的类型为like时,定义选择率=1/5。当过滤谓词的类型为和级联谓词时,根据级联谓词的选择率的乘积与和级联谓词中过滤率最小值确定过滤条件的选择率。
And级联的过滤谓词,记AND级联的过滤谓词为P=P 1AND P 2AND...P n,为防止估算误差被级联放大,在本申请实施例中,通过以下方式估算过滤条件的选择率:
Figure PCTCN2018116874-appb-000007
其中,α定义了And级联过滤谓词过滤率的最小值,可以根据实际情况调整α的值;selectivity为选择率,上述公式表示And级联过滤谓词的级联的选择率值与And级联过滤谓词过滤率的最小值中的最大值作为该过滤条件的选择率。
当过滤谓词的类型为或级联谓词时,根据所述或级联谓词对应的级联谓词的选择率确定过滤条件的选择率。在此,记或级联的过滤谓词为:
P=P 1OR P 2OR...P n
则过滤条件的选择率按照如下方式确定:
Figure PCTCN2018116874-appb-000008
对于不同类型的滤谓词讨论其对应的过滤条件的选择率,根据确定的选择率可计算过滤操作的代价:
cost=Op.getOriginalCost(selectivity)
需要说明的是,本领域技术人员应能理解,上述实施例中出现的S=1/指定值中的数值仅为举例,且在上述实施例中每种过滤谓词的过滤条件选择率可以根据实际情况做调整,也可以通过参数传递等方式动态改变。
在本申请一实施例中,根据确定的连接结果集的记录数确定连接操作的代价估计之前,根据相连接的左表和右表的连接字段的异值数确定连接结果集的记录数。在此,参与连接的左表记为T left,其记录数为RC left,异值数为NDV left; 参与连接的右表记为T right,其记录数为RC right,异值数为NDV right;结果集的记录数记为RC result。参与连接的左右表均有以下操作:选择操作、过滤操作、扫表操作等,左表和右表分别构成了左边操作树和右边操作树,操作树上的每一种操作类型的代价估计均可以估算得到,因此,参与连接的左表和右表的数据集大小(即RC left和RC right)可由对左表和右表的操作类型的估算和已有的计算方法计算得到。
在本申请一实施例中,当相连接的左表和右表的连接字段的异值数不可获取时,基于连接字段的连接类型确定连接结果集的记录数。下文讨论如何基于连接字段的连接类型确定连接结果集的记录数。
当连接字段的连接类型为主键-外键形式连接时,则根据外键字段的记录数与主键字段的过滤条件的选择率确定连接结果集的记录数。如果左表的连接字段的连接类型和右表的连接字段的连接类型分别是主键-外键形式连接,则连接结果集的记录数RC result为:
RC result=RC fk×selectivity(PK)
其中,RC fk为RC left和RC right之一,表示外键字段记录数;selectivity(PK)表示主键字段的过滤条件的选择率。如果主键字段存在过滤条件,则最终连接的结果也会受该过滤条件的影响,因此,在计算连接结果集的记录数时,外键字段记录数要乘以主键字段过滤条件的选择率,当主键字段不存在过滤条件时,则主键字段的过滤条件的选择率selectivity(PK)=1。
当连接字段的连接类型为非主键-外键连接时,则又可分为以下情况1-情况4。
情况1:当连接字段的连接类型为内连接时,则将相连接的左表的记录数及右表的记录数中的最大值作为连接结果集的记录数;在此,在本申请一实施例中,将连接结果的记录数估算为左表连接记录数和右表连接记录数的最大值,即RC result=max(RC left,RC right)。即结果记录数为RC left和RC right中的最大值。
情况2:当连接字段的连接类型为叉乘连接时,则将相连接的左表的记录数及右表的记录数的乘积作为连接结果集的记录数;对于叉乘连接,其连接结果 的记录数为连接记录数之乘积,即:RC result=RC left×RC right
情况3:当连接字段的连接类型为左外连接或右外连接时,则将对应的相连接的左表的记录数或右表的记录数作为连接结果集的记录数。当是连接字段的连接类型为左外连接时,连接结果集的记录数为T left的记录数R left,即RC result=RC left。当连接字段的连接类型是右外连接时,连接结果集的记录数为T right的记录数R right,即RC result=RC right
情况4:当连接字段的连接类型为全连接时,则将相连接的左表的记录数及右表的记录数的累加作为连接结果集的记录数。在此,当连接字段的连接类型为全外连接时,连接结果集的记录数为T left和T right的记录数之和,即:RC result=RC left+RC right
最终,连接操作的代价估计为:cost=Op.getOriginalCost(RC left,RC right,RC result)。
在本申请一实施例中,对于聚合操作过程,聚合操作的代价估计与参与聚合的数据量大小和聚合字段的聚合率有关,可以定义聚合率R agg为:
Figure PCTCN2018116874-appb-000009
聚合字段的聚合率需要根据聚合字段的异值数计算得到,当聚合字段的异值数无法获取时,进行如下讨论。
记聚合字段的集合C为C={c 1,c 2,...,c n},n=|C|,其中,c 1、c 2和c n为聚合字段,n为聚合字段的集合中聚合字段的个数。
当聚合字段的集合C中至少存在一个主键时,聚合字段的聚合率R agg为1,即R agg=1。
当聚合字段对应的聚合函数为简单聚合函数Group By时,聚合字段的聚合率R agg为如下的分段函数:
Figure PCTCN2018116874-appb-000010
其中,n为聚合字段的集合中聚合字段的个数;
当聚合字段对应的聚合函数为Rollup时,聚合字段的聚合率R agg为:
Figure PCTCN2018116874-appb-000011
其中,n为聚合字段的集合中聚合字段的个数,k为正整数;
当聚合字段对应的聚合函数为Cube时,聚合字段的聚合率R agg为:
Figure PCTCN2018116874-appb-000012
其中,n为聚合字段的集合中聚合字段的个数,k为正整数。
根据本申请再一个方面,还提供了一种代价优化器,其中,所述代价优化器设置为:生成原始执行计划;判断获取到的统计信息是否完备,若统计信息完备,则根据基于统计信息的第一代价估算模型(即相关技术中的代价估算模型)估算代价,若统计信息不完备,则根据基于第一代价估算模型进行优化的第二代价估算模型(即本申请中的代价估算模型)估算代价;根据所述第一代价估算模型估算的代价或所述第二代价估算模型估算的代价生成最优执行计划。
在本申请一实施例中,如图4所示的代价优化器,可以基于不完备的统计信息对执行计划进行代价优化。当存在完备统计信息时,使用第一代价估算模型估算代价,其中,第一代价估算模型为相关技术中的代价估算模型,相关技术中的代价估算模型进行代价优化时需要统计信息,因此,需要收集完整而精确的统计信息。在数据库系统中,执行计划一般用操作树表示,操作树由不同的操作类型构成,该操作可以包括扫表、选择、过滤、聚合、连接、投影等。当存在完整的统计信息时,基于该统计信息,代价优化器可以借助代价估算模型,对执行计划的每一种操作类型进行代价估算,并从所有可能的执行计划中选取整体代价最小者作为最终的执行计划,由此就完成了整个代价优化的过程。而当不存在完备统计信息时,使用对第一代价估算模型进行优化的第二代价估算模型估算代价,即使用本申请的代价估算模型估算代价,并基于估算结果生成最优执行计划。其中,本申请的代价估算模型设置为根据依赖于所述统计信息的操作树的操作类型(即第一操作类型)确定对应的代价估计方式;基于所述代价估计方式确定所述对应操作类型的代价估计(即第一操作类型的第一代 价估计);根据依赖于所述统计信息操作类型对应的代价估计及未依赖于统计信息的操作类型对应的代价估计(即第二操作类型对应的代价估计)确定所述操作树的累积代价估计。
综上所述,在统计信息不准确或不完整的情况下,可以通过本申请上述实施例中所述的代价估计的设备中执行的方法进行操作树的代价估算,本申请所述的代价估计的设备可以应用于运行时创建的临时表和子查询,以及对于海量数据的场景,可以快速地代价估计而不受数据规模的限制。另一方面,基于本申请所述的代价估计的设备得到的代价优化器,可以基于不完备的统计信息对执行计划进行代价优化。当统计信息完备时,相关技术中的代价估算模型可以估算相应操作的代价,基于估算结果生成最优执行计划。当统计信息不完备时,相关技术中的代价优化器无法对相应操作进行代价估算,则利用本申请的代价估算模型(即对操作树的相关操作的代价估计方法进行改进)对相应操作进行代价估算,并基于估算结果生成最优执行计划。
在本申请一实施例中,还提供了一种基于计算的设备,包括:处理器;以及被安排成存储计算机可执行指令的存储器。
所述可执行指令在被执行时使所述处理器:判断获取到的统计信息是否完备,若统计信息不完备,则根据依赖于所述统计信息的操作树的操作类型(第一操作类型)确定对应的代价估计方式;基于所述代价估计方式确定所述对应操作类型的代价估计(即第一操作类型的第一代价估计);以及根据依赖于所述统计信息操作类型对应的代价估计(第一操作类型的第一代价估计)及未依赖于统计信息的操作类型对应的代价估计(即第二操作类型对应的第二代价估计)确定所述操作树的累积代价估计。
需要注意的是,本申请可在软件和/或软件与硬件的组合体中被实施,例如,可采用专用集成电路(ASIC)、通用目的计算机或任何其他类似硬件设备来实现。在一个实施例中,本申请的软件程序可以通过处理器执行以实现上文所述步骤或功能。同样地,本申请的软件程序(包括相关的数据结构)可以被存储到计 算机可读记录介质中,例如,RAM存储器,磁或光驱动器或软磁盘及类似设备。另外,本申请的一些步骤或功能可采用硬件来实现,例如,作为与处理器配合从而执行各个步骤或功能的电路。
另外,本申请的一部分可被应用为计算机程序产品,例如计算机程序指令,当其被计算机执行时,通过该计算机的操作,可以调用或提供根据本申请的方法和/或技术方案。而调用本申请的方法的程序指令,可能被存储在固定的或可移动的记录介质中,和/或通过广播或其他信号承载媒体中的数据流而被传输,和/或被存储在根据所述程序指令运行的计算机设备的工作存储器中。在此,根据本申请的一个实施例包括一个装置,该装置包括用于存储计算机程序指令的存储器和用于执行程序指令的处理器,其中,当该计算机程序指令被该处理器执行时,触发该装置运行基于前述根据本申请的多个实施例的方法和/或技术方案。
对于本领域技术人员而言,本申请不限于上述示范性实施例的细节。因此,应将实施例看作是示范性的,而且是非限制性的。不应将权利要求中的任何附图标记视为限制所涉及的权利要求。此外,“包括”一词不排除其他单元或步骤,单数不排除复数。装置权利要求中陈述的多个单元或装置也可以由一个单元或装置通过软件或者硬件来实现。第一,第二等词语用来表示名称,而并不表示任何特定的顺序。

Claims (24)

  1. 一种代价估计的方法,包括:
    判断获取到的统计信息是否完备,若所述统计信息不完备,确定第一操作类型的代价估计方式,所述第一操作类型为依赖于所述统计信息的操作树的操作类型;
    基于所述代价估计方式确定所述第一操作类型的第一代价估计;以及
    根据所述第一代价估计及第二操作类型的第二代价估计确定所述操作树的累积代价估计,所述第二操作类型为不依赖于所述统计信息的操作树的操作类型。
  2. 根据权利要求1所述的方法,其中,所述第一操作类型包括:扫表操作、过滤操作、连接操作和聚合操作。
  3. 根据权利要求1所述的方法,还包括:
    根据所述累积代价估计优化结构化查询语言(SQL)的语句对应的执行计划。
  4. 根据权利要求2所述的方法,其中,基于所述代价估计方式确定所述第一操作类型的第一代价估计,包括:
    根据数据集的记录数确定扫表操作的代价估计;
    根据过滤谓词的类型确定过滤条件的选择率,根据所述选择率确定过滤操作的代价估计;
    根据确定的连接结果集的记录数确定连接操作的代价估计;以及
    根据聚合字段及聚合函数确定聚合字段的聚合率,根据所述聚合率确定聚合操作的代价估计。
  5. 根据权利要求4所述的方法,其中,根据过滤谓词的类型确定过滤条件的选择率,包括:
    判断过滤谓词的类型对应的计算选择率的对象是否可获取,若所述计算选择率的对象不可获取,则所述过滤条件的选择率S=1/指定值。
  6. 根据权利要求5所述的方法,其中,判断过滤谓词的类型对应的计算选择率的对象是否可获取,若所述计算选择率的对象不可获取,则所述过滤条件的选择率,则所述过滤条件的选择率S=1/指定值,包括:
    当过滤谓词的类型为恒等谓词时,判断所述过滤操作的异值数是否可获取,若所述过滤操作的异值数不可获取,则所述过滤条件的选择率S=1/指定值;
    当过滤谓词的类型为范围谓词时,若谓词的字段的极值或异值数未能获取,则所述过滤条件的选择率S=1/指定值;以及
    当过滤谓词的类型为判空谓词时,判断谓词字段的空值数是否可获取,若所述谓词字段的空值数不可获取,则所述过滤条件的选择率S=1/指定值。
  7. 根据权利要求4所述的方法,其中,根据过滤谓词的类型确定过滤条件的选择率,包括:
    当过滤谓词的类型为like时,所述过滤条件的选择率S=1/指定值;
    当过滤谓词的类型为和级联谓词时,根据级联谓词的选择率的乘积与和级联谓词中过滤率最小值确定所述过滤条件的选择率;以及
    当过滤谓词的类型为或级联谓词时,根据所述或级联谓词对应的级联谓词的选择率确定所述过滤条件的选择率。
  8. 根据权利要求4所述的方法,根据确定的连接结果集的记录数确定连接操作的代价估计之前,包括:
    当相连接的左表和右表的连接字段的异值数不可获取时,基于连接字段的连接类型确定连接结果集的记录数。
  9. 根据权利要求8所述的方法,其中,基于连接字段的连接类型确定连接结果集的记录数,包括:
    当连接字段的连接类型为主键-外键形式连接时,则根据外键字段的记录数与主键字段的过滤条件的选择率确定连接结果集的记录数;
    当连接字段的连接类型为内连接时,则将相连接的左表的记录数及右表的记录数中的最大值作为连接结果集的记录数;
    当连接字段的连接类型为叉乘连接时,则将相连接的左表的记录数及右表的记录数的乘积作为连接结果集的记录数;
    当连接字段的连接类型为左外连接时,将相连接的左表的记录数作为连接结果集的记录数,当连接字段的连接类型为右外连接时,将相连接的右表的记 录数作为连接结果集的记录数;以及
    当连接字段的连接类型为全连接时,将相连接的左表的记录数及右表的记录数的累加作为连接结果集的记录数。
  10. 根据权利要求4所述的方法,其中,根据聚合字段及聚合函数确定聚合字段的聚合率,包括:
    当聚合字段的集合中至少存在一个主键时,聚合字段的聚合率R agg=1;
    当聚合字段对应的聚合函数为简单聚合函数时,聚合字段的聚合率R agg为:
    Figure PCTCN2018116874-appb-100001
    其中,n为聚合字段的集合中聚合字段的个数;
    当聚合字段对应的聚合函数为Rollup时,聚合字段的聚合率R agg为:
    Figure PCTCN2018116874-appb-100002
    其中,n为聚合字段的集合中聚合字段的个数,k为正整数;
    当聚合字段对应的聚合函数为Cube时,聚合字段的聚合率R agg为:
    Figure PCTCN2018116874-appb-100003
    其中,n为聚合字段的集合中聚合字段的个数,k为正整数。
  11. 一种代价估计的设备,包括:
    判断装置,设置为判断获取到的统计信息是否完备,若所述统计信息不完备,确定第一操作类型的代价估计方式,所述第一操作类型为依赖于所述统计信息的操作树的操作类型;
    确定装置,设置为基于所述代价估计方式确定所述第一操作类型的第一代价估计;
    估算装置,设置为根据所述第一代价估计及第二操作类型对应的第二代价估计确定所述操作树的累积代价估计,所述第二操作类型为不依赖于所述统计信息的操作树的操作类型。
  12. 根据权利要求11所述的设备,其中,所述第一操作类型包括:扫表操 作、过滤操作、连接操作和聚合操作。
  13. 根据权利要求11所述的设备,还包括:
    执行装置,设置为根据所述累积代价估计优化结构化查询语言(SQL)的语句对应的执行计划。
  14. 根据权利要求12所述的设备,其中,所述确定装置设置为:
    根据数据集的记录数确定扫表操作的代价估计;
    根据过滤谓词的类型确定过滤条件的选择率,根据所述选择率确定过滤操作的代价估计;
    根据确定的连接结果集的记录数确定连接操作的代价估计;以及
    根据聚合字段及聚合函数确定聚合字段的聚合率,根据所述聚合率确定聚合操作的代价估计。
  15. 根据权利要求14所述的设备,其中,所述确定装置设置为:
    判断过滤谓词的类型对应的计算选择率的对象是否可获取,若所述计算选择率的对象不可获取,则所述过滤条件的选择率S=1/指定值。
  16. 根据权利要求15所述的设备,其中,所述确定装置设置为:
    当过滤谓词的类型为恒等谓词时,判断所述过滤操作的异值数是否可获取,若所述过滤操作的异值数不可获取,则所述过滤条件的选择率S=1/指定值;
    当过滤谓词的类型为范围谓词时,若谓词的字段的极值或异值数未能获取,则所述过滤条件的选择率S=1/指定值;以及
    当过滤谓词的类型为判空谓词时,判断谓词字段的空值数是否可获取,若所述谓词字段的空值数不可获取,则所述过滤条件的选择率S=1/指定值。
  17. 根据权利要求14所述的设备,其中,所述确定装置设置为:
    当过滤谓词的类型为like时,所述过滤条件的选择率S=1/指定值;
    当过滤谓词的类型为和级联谓词时,根据级联谓词的选择率的乘积与和级联谓词中过滤率最小值确定所述过滤条件的选择率;
    当过滤谓词的类型为或级联谓词时,根据所述或级联谓词对应的级联谓词的选择率确定所述过滤条件的选择率。
  18. 根据权利要求14所述的设备,其中,所述确定装置设置为:
    当相连接的左表和右表的连接字段的异值数不可获取时,基于连接字段的连接类型确定连接结果集的记录数。
  19. 根据权利要求18所述的设备,其中,所述确定装置设置为:
    当连接字段的连接类型为主键-外键形式连接时,则根据外键字段的记录数与主键字段的过滤条件的选择率确定连接结果集的记录数;
    当连接字段的连接类型为内连接时,则将相连接的左表的记录数及右表的记录数中的最大值作为连接结果集的记录数;
    当连接字段的连接类型为叉乘连接时,则将相连接的左表的记录数及右表的记录数的乘积作为连接结果集的记录数;
    当连接字段的连接类型为左外连接时,将相连接的左表的记录数作为所述连接结果集的记录数,当连接字段的连接类型为右外连接,则将相连接的右表的记录数作为连接结果集的记录数;以及
    当连接字段的连接类型为全连接时,则将相连接的左表的记录数及右表的记录数的累加作为连接结果集的记录数。
  20. 根据权利要求14所述的设备,其中,所述确定装置设置为:
    当聚合字段的集合中至少存在一个主键时,聚合字段的聚合率R agg=1;
    当聚合字段对应的聚合函数为简单聚合函数时,聚合字段的聚合率R agg为:
    Figure PCTCN2018116874-appb-100004
    其中,n为聚合字段的集合中聚合字段的个数;
    当聚合字段对应的聚合函数为Rollup时,聚合字段的聚合率R agg为:
    Figure PCTCN2018116874-appb-100005
    其中,n为聚合字段的集合中聚合字段的个数,k为正整数;
    当聚合字段对应的聚合函数为Cube时,聚合字段的聚合率R agg为:
    Figure PCTCN2018116874-appb-100006
    其中,n为聚合字段的集合中聚合字段的个数,k为正整数。
  21. 一种代价优化器,包括:
    生成原始执行计划;
    判断获取到的统计信息是否完备,若所述统计信息完备,则根据所述统计信息的第一代价估算模型估算代价,若所述统计信息不完备,则根据第二代价估算模型估算代价;以及
    根据所述第一代价估算模型估算的代价或所述第二代价估算模型估算的代价对所述原始执行计划进行优化,生成最优执行计划。
  22. 根据权利要求21所述的代价优化器,其中,所述第二代价估算模型,包括:
    确定第一操作类型的代价估计方式,所述第一操作类型为依赖于所述统计信息的操作树的操作类型;
    基于所述代价估计方式确定所述第一操作类型的第一代价估计;以及
    根据所述第一代价估计及第二代价估计确定所述操作树的累积代价估计,所述第二操作类型为不依赖于所述统计信息的操作树的操作类型。
  23. 一种基于计算的设备,包括:
    处理器;以及
    被安排成存储计算机可执行指令的存储器,所述可执行指令在被执行时使所述处理器:
    判断获取到的统计信息是否完备,若所述统计信息不完备,确定第一操作类型的代价估计方式,所述第一操作类型为依赖于所述统计信息的操作树的操作类型;
    基于所述代价估计方式确定所述第一操作类型的第一代价估计;
    根据所述第一代价估计及第二操作类型的第二代价估计确定所述操作树的累积代价估计,所述第二操作类型为依赖于所述统计信息的操作树的操作类型。
  24. 一种计算机可读存储介质,存储有计算机可执行指令,所述计算机可执行指令用于执行权利要求1-10任一项的方法。
PCT/CN2018/116874 2017-11-22 2018-11-22 代价优化器与代价估计的方法及其设备 WO2019101119A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
SG11202004818SA SG11202004818SA (en) 2017-11-22 2018-11-22 Cost-based optimizer, and cost estimation method and device thereof
EP18880194.8A EP3716093B1 (en) 2017-11-22 2018-11-22 Cost-based optimizer, and cost estimation method and device thereof
US16/766,480 US11023466B2 (en) 2017-11-22 2018-11-22 Cost-based optimizer, and cost estimation method and device thereof
CA3083148A CA3083148C (en) 2017-11-22 2018-11-22 Cost-based optimizer, and cost estimation method and device thereof
JP2020538120A JP2021504852A (ja) 2017-11-22 2018-11-22 コスト最適化装置、コスト推定の方法およびそのデバイス

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711175349.9A CN107885865B (zh) 2017-11-22 2017-11-22 一种代价优化器与代价估计的方法及其设备
CN201711175349.9 2017-11-22

Publications (1)

Publication Number Publication Date
WO2019101119A1 true WO2019101119A1 (zh) 2019-05-31

Family

ID=61778174

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/116874 WO2019101119A1 (zh) 2017-11-22 2018-11-22 代价优化器与代价估计的方法及其设备

Country Status (7)

Country Link
US (1) US11023466B2 (zh)
EP (1) EP3716093B1 (zh)
JP (1) JP2021504852A (zh)
CN (1) CN107885865B (zh)
CA (1) CA3083148C (zh)
SG (1) SG11202004818SA (zh)
WO (1) WO2019101119A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11023466B2 (en) 2017-11-22 2021-06-01 Transwarp Technology (Shanghai) Co., Ltd. Cost-based optimizer, and cost estimation method and device thereof

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297858B (zh) * 2019-05-27 2021-11-09 苏宁云计算有限公司 执行计划的优化方法、装置、计算机设备和存储介质
US11200230B2 (en) 2019-08-09 2021-12-14 Couchbase, Inc. Cost-based optimization for document-oriented database queries
US11681687B2 (en) 2020-08-31 2023-06-20 Couchbase, Inc. Executing transactions on distributed databases
CN114328606B (zh) * 2021-12-30 2022-11-29 星环信息科技(上海)股份有限公司 Sql执行效率的提高方法、设备及存储介质
CN116932580A (zh) * 2022-03-31 2023-10-24 华为技术有限公司 一种基数估计方法及装置
CN116821193B (zh) * 2023-08-30 2024-01-09 之江实验室 一种基于代理模型近似处理的推理查询优化方法及装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136260A (zh) * 2011-11-30 2013-06-05 国际商业机器公司 评估过滤因子用于数据库中访问路径优化的方法和装置
CN107168977A (zh) * 2016-03-08 2017-09-15 阿里巴巴集团控股有限公司 一种数据查询的优化方法及装置
CN107239541A (zh) * 2017-06-02 2017-10-10 星环信息科技(上海)有限公司 一种代价估计的方法及设备
CN107885865A (zh) * 2017-11-22 2018-04-06 星环信息科技(上海)有限公司 一种代价优化器与代价估计的方法及其设备

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5903888A (en) * 1997-02-28 1999-05-11 Oracle Corporation Method and apparatus for using incompatible types of indexes to process a single query
US7596560B2 (en) * 2004-12-23 2009-09-29 Raytheon Company System and method for adaptive query identification and acceleration
US8447754B2 (en) * 2010-04-19 2013-05-21 Salesforce.Com, Inc. Methods and systems for optimizing queries in a multi-tenant store
US9372890B2 (en) * 2011-11-23 2016-06-21 Infosys Technologies, Ltd. Methods, systems, and computer-readable media for providing a query layer for cloud databases
US9471631B2 (en) * 2012-09-28 2016-10-18 Oracle International Corporation Creating and using data that indicates misestimates of actual costs
CN107193813B (zh) 2016-03-14 2021-05-14 阿里巴巴集团控股有限公司 数据表连接方式处理方法及装置
CN106446134B (zh) 2016-09-20 2019-07-09 浙江大学 基于谓词规约和代价估算的局部多查询优化方法
CN107025273A (zh) * 2017-03-17 2017-08-08 南方电网科学研究院有限责任公司 一种数据查询的优化方法和装置
US20190236188A1 (en) * 2018-01-31 2019-08-01 Salesforce.Com, Inc. Query optimizer constraints

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136260A (zh) * 2011-11-30 2013-06-05 国际商业机器公司 评估过滤因子用于数据库中访问路径优化的方法和装置
CN107168977A (zh) * 2016-03-08 2017-09-15 阿里巴巴集团控股有限公司 一种数据查询的优化方法及装置
CN107239541A (zh) * 2017-06-02 2017-10-10 星环信息科技(上海)有限公司 一种代价估计的方法及设备
CN107885865A (zh) * 2017-11-22 2018-04-06 星环信息科技(上海)有限公司 一种代价优化器与代价估计的方法及其设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3716093A4

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11023466B2 (en) 2017-11-22 2021-06-01 Transwarp Technology (Shanghai) Co., Ltd. Cost-based optimizer, and cost estimation method and device thereof

Also Published As

Publication number Publication date
CA3083148A1 (en) 2019-05-31
CN107885865A (zh) 2018-04-06
JP2021504852A (ja) 2021-02-15
US20200379999A1 (en) 2020-12-03
EP3716093A1 (en) 2020-09-30
US11023466B2 (en) 2021-06-01
CA3083148C (en) 2021-10-05
EP3716093B1 (en) 2021-09-15
CN107885865B (zh) 2019-12-10
EP3716093A4 (en) 2020-10-28
SG11202004818SA (en) 2020-06-29

Similar Documents

Publication Publication Date Title
WO2019101119A1 (zh) 代价优化器与代价估计的方法及其设备
US20160275150A1 (en) Lightweight table comparison
US11734234B1 (en) Data architecture for supporting multiple search models
WO2020042804A1 (zh) 一种数据库查询优化方法、装置、及计算机设备
CN107239541B (zh) 一种代价估计的方法及设备
US11893011B1 (en) Data query method and system, heterogeneous acceleration platform, and storage medium
US10726006B2 (en) Query optimization using propagated data distinctness
US10409813B2 (en) Imputing data for temporal data store joins
WO2022121098A1 (zh) 适用于olap查询引擎的查询方法及装置
WO2020199832A1 (zh) 用于预计算系统中查询引擎的动态路由方法及装置
CN111159219B (zh) 一种数据管理方法、装置、服务器及存储介质
WO2019120093A1 (en) Cardinality estimation in databases
CN111723161A (zh) 一种数据处理方法、装置及设备
CN109344169B (zh) 数据处理方法及装置
CN107735781A (zh) 存储查询结果的方法和装置、计算设备
CN111857981A (zh) 一种数据处理方法以及装置
JPH11353331A (ja) デ―タベ―スの問合せに応答する方法
US7613682B2 (en) Statistics collection using path-identifiers for relational databases
WO2020224242A1 (zh) 区块链数据处理方法、装置、服务器及存储介质
WO2022267183A1 (zh) 预计算模型的评分方法、装置、设备和存储介质
CN114328606B (zh) Sql执行效率的提高方法、设备及存储介质
US20190362254A1 (en) Systems and methods for estimating validation time for fraud detection rules
CN114238500A (zh) 区块链交易的存储方法、装置、电子设备及可读存储介质
CN113918771A (zh) 批流融合的信息处理方法和装置、存储介质
WO2024041221A1 (zh) 一种选择率估算方法及估算装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18880194

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3083148

Country of ref document: CA

ENP Entry into the national phase

Ref document number: 2020538120

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018880194

Country of ref document: EP

Effective date: 20200622