CN114398378B

CN114398378B - Method and device for determining index cost

Info

Publication number: CN114398378B
Application number: CN202210298016.XA
Authority: CN
Inventors: 王国平; 朱涛; 陈萌萌
Original assignee: Beijing Oceanbase Technology Co Ltd
Current assignee: Beijing Oceanbase Technology Co Ltd
Priority date: 2022-03-25
Filing date: 2022-03-25
Publication date: 2022-11-01
Anticipated expiration: 2042-03-25
Also published as: CN114398378A

Abstract

A method and apparatus for determining an index cost are provided. The method is applied to a LSM tree storage engine-based database in which baseline data and incremental data are stored, the baseline data and the incremental data containing data of an index, the index having physical rows, the number of rows of the index being the sum of the number of rows of the index contained in the baseline data and the number of rows of the index contained in the incremental data, the method comprising: receiving a first database query statement; determining the number of rows of physical rows to be scanned for the index according to the first database query statement; determining a cost of scanning the index according to the number of rows of the physical rows.

Description

Method and device for determining index cost

Technical Field

The present disclosure relates to the field of databases, and more particularly, to a method and apparatus for determining index costs.

Background

In a relational database, the base table access path selection method refers to a method in which an optimizer of the database selects an index or an index path. The index path selection method requires evaluating the cost of each index and selecting the index with the lowest cost to access the table in the database.

Generally, the cost evaluation of an index depends on many factors. For example, the cost evaluation of the index may take into account one or more of the following factors: the number of scanned rows, the number of return table rows, the number of projected columns, the number of predicates, etc. Most of these factors need to be calculated from statistical information maintained by the database system.

Among the factors affecting the index cost, the line number information is an important factor affecting the index cost. In a relational database, the cost of scanning the index and the cost of returning to the table are both determined based on the row number information.

Conventional relational databases typically manage data based on a buffer-pool storage engine. In such a storage engine, the line number information calculated based on the statistical information is generally matched with the line number information actually stored in the database, so that the calculation of the index cost based on the line number information does not usually have a great problem.

However, the index cost calculation method of the LSM tree storage engine is significantly different from the conventional Buffer-Pool-based storage engine. The traditional line number information calculation method has no means for accurately evaluating the index cost of the LSM tree storage engine, so that an optimizer of a database system may have no means for selecting a better base table access path.

Disclosure of Invention

The present disclosure provides a method and an apparatus for determining an index cost, so as to improve accuracy of index cost evaluation in an LSM tree storage engine.

In a first aspect, a method for determining an index cost is provided, where the method is applied to a database based on an LSM tree storage engine, where the database stores baseline data and incremental data, where the baseline data and the incremental data include data of an index, the index has a physical row, and a row number of the physical row of the index is a sum of a row number of the index included in the baseline data and a row number of the index included in the incremental data, and the method includes: receiving a first database query statement; determining the number of rows of physical rows to be scanned for the index according to the first database query statement; determining a cost of scanning the index according to the number of rows of the physical rows.

Optionally, as a possible implementation manner, the index further has a logical row, and the number of rows of the logical row of the index is the number of rows of the index included in the baseline data and the number of rows of the remaining rows of the index included in the incremental data after merging, the method further includes: determining the number of rows of the logic row of which the index needs to return to the table according to the first database query statement; and determining the table returning cost of the index according to the row number of the logic row.

Optionally, as a possible implementation manner, the number of rows of the physical row that the index needs to be scanned is determined based on first row number information and second row number information, where the first row number information is used to indicate the number of rows included in the baseline data, and the second row number information is used to indicate one or more of the following number of rows included in the incremental data: the number of added lines, the number of deleted lines and the number of changed lines.

Optionally, as a possible implementation manner, the number of rows of the logical rows that the index needs to be returned to the table is determined based on first row number information and second row number information, where the first row number information is used to indicate the number of rows included in the baseline data, and the second row number information is used to indicate one or more of the following number of rows included in the incremental data: the number of added lines, the number of deleted lines and the number of changed lines.

Optionally, as a possible implementation manner, the first line number information is obtained from metadata of the baseline data.

Optionally, as a possible implementation manner, the second line number information is obtained by dynamically sampling the incremental data.

In a second aspect, an apparatus for determining an index cost is provided, the apparatus including a LSM tree storage engine-based database having stored therein baseline data and incremental data, the baseline data and the incremental data including data of an index, the index having physical rows, the number of rows of the index being the sum of the number of rows of the index included in the baseline data and the number of rows of the index included in the incremental data, the apparatus including: the receiving module is used for receiving a first database query statement; a first determining module, configured to determine, according to the first database query statement, a number of lines of a physical line that needs to be scanned in the index; and the second determining module is used for determining the cost of scanning the index according to the number of the rows of the physical rows.

Optionally, as a possible implementation manner, the index further has a logical row, and the number of rows of the logical row of the index is the number of rows of the row of the index included in the baseline data and the number of rows of the remaining rows of the index included in the incremental data after merging, and the apparatus further includes: a third determining module, configured to determine, according to the first database query statement, a number of rows of a logical row of which the index needs to return to the table; and the fourth determining module is used for determining the table returning cost of the index according to the row number of the logic row.

Optionally, as a possible implementation manner, the number of rows of the physical row that the index needs to be scanned is determined based on first row number information and second row number information, where the first row number information is used to indicate the number of rows included in the baseline data, and the second row number information is used to indicate one or more of the following number of rows included in the incremental data: the number of newly added lines, the number of deleted lines and the number of changed lines.

In a third aspect, an apparatus for determining an index cost is provided, including: a memory to store instructions; a processor configured to execute instructions stored in the memory to perform a method as described in the first aspect or any one of the possible implementations of the first aspect.

In a fourth aspect, a computer-readable storage medium is provided, on which instructions for performing the method of the first aspect or any one of the possible implementations of the first aspect are stored.

In a fifth aspect, a computer program product is provided, which comprises instructions for carrying out the method of the first aspect or any one of the possible implementations of the first aspect.

On the premise of considering the data storage characteristics of the LSM tree storage engine, the embodiment of the disclosure introduces the concept of the physical row, which is different from the row number counted by the traditional statistical information, and the physical row is the row number actually required to be accessed by the LSM tree storage engine in the process of scanning the index, so that the introduction of the concept of the physical row can improve the accuracy of the index path selection.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments or the background art of the present disclosure, the drawings used in the embodiments or the background art of the present disclosure will be described below.

FIG. 1 is an exemplary diagram of a data storage manner of an LSM tree storage engine.

FIG. 2 is an exemplary diagram of row number information for an LSM tree storage engine.

Fig. 3 is a diagram illustrating an example manner of determining physical rows and logical rows provided by an embodiment of the present disclosure.

Fig. 4 is a flowchart illustrating a method for determining an index cost according to an embodiment of the present disclosure.

Fig. 5 is a flowchart illustrating a method for determining an index cost according to another embodiment of the present disclosure.

Fig. 6 is a flowchart illustrating an apparatus for determining an index cost according to an embodiment of the present disclosure.

Fig. 7 is a flowchart illustrating an apparatus for determining an index cost according to another embodiment of the present disclosure.

Detailed Description

The embodiments of the present disclosure are described below with reference to the drawings. In the following description, reference is made to the accompanying drawings which form a part hereof and in which is shown by way of illustration specific aspects of embodiments of the disclosure or in which aspects of embodiments of the disclosure may be practiced. It should be understood that the disclosed embodiments may be used in other respects, and may include structural or logical changes not depicted in the drawings. The following detailed description is, therefore, not to be taken in a limiting sense. For example, it should be understood that the disclosure in connection with the described methods may equally apply to the corresponding apparatus or system for performing the methods, and vice versa. For example, if one or more particular method steps are described, the corresponding apparatus may comprise one or more units, such as functional units, to perform the described one or more method steps (e.g., a unit performs one or more steps, or multiple units, each of which performs one or more of the multiple steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a particular apparatus is described based on one or more units, such as functional units, the corresponding method may comprise one step to perform the functionality of the one or more units (e.g., one step performs the functionality of the one or more units, or multiple steps, each of which performs the functionality of one or more of the units), even if such one or more steps are not explicitly described or illustrated in the figures. Further, it should be understood that features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless explicitly stated otherwise.

In a relational database, the base table access path selection method refers to a method in which an optimizer of the database selects an index. Therefore, the base table access path selection method may also be sometimes referred to as an index path selection method. The index path selection method requires evaluating the cost of each index and selecting the index with the lowest cost to access the table in the database.

Generally speaking, the cost evaluation of an index depends on many factors. For example, the cost evaluation of the index may take into account one or more of the following factors: the number of scanned rows, the number of return rows, the number of projected columns, the number of predicates, and the like. Most of these factors need to be calculated from statistical information maintained by the database system.

Among the factors affecting the index cost, the line number information is an important factor affecting the index cost. In a conventional relational database, the cost of the index and the cost of the look-back table are scanned. Both of the above costs are determined based on the line number information.

As an example at the cost of scanning the index. In general, the cost of scanning an index is proportional to the number of rows that the index needs to be scanned, which is determined by a portion of the query predicates in the database query statement that define the index scan start and end positions (i.e., the predicates define the scan range (query range) of the index). Theoretically, the more rows that are scanned, the longer the database query statement will execute.

Take the cost of indexing back to the table as an example. The return table cost of the index is positively correlated with the number of rows of the return table required by the index, and the number of rows of the return table is determined by the query predicate in the database query statement. Theoretically, the more rows that are returned to the table, the longer the execution time of the database query statement will be.

In a conventional relational database, row number information is typically obtained by calculating predicate selection rates from statistical information maintained in an optimizer. Of course, in some databases, the row number information may also be obtained in other more advanced ways, such as by dynamic sampling. As a simple example, given a joint index (a, b) and query predicates a >1 and a-5 and b-Ap-5, then the predicates a >1 and a-Ap-5 define the location of the start and end of the index scan, and if there are 1 ten thousand rows that satisfy both conditions, then the cost of scanning the index is 1 ten thousand rows of sequential scan. If the field selected by the select statement is missing in the join index (a, b), then a look-back table is required, which costs a random scan of 5 kilo-rows assuming a predicate selectivity of 0.5 for predicate b < 5.

In recent years, many database systems have used LSM TREEs (LSM-TREEs) as their storage engines. The LSM tree storage engine splits the data into two parts as shown in fig. 1: baseline data and delta data. The baseline data may also be referred to as static data and the incremental data may also be referred to as dynamic data. The baseline data is not modified, is read only and is stored in a magnetic disk; all modification operations (addition, deletion, modification) are recorded in the incremental data, the incremental data is usually stored in the memory first, and if the data amount stored in the memory exceeds a certain threshold, the incremental data is dumped to the disk. The baseline data and incremental data are periodically merged to form new baseline data. In the LSM tree storage engine, for one query operation, the LSM trees need to be merged to form the final query result.

Conventional relational databases typically manage data based on a buffer-pool storage engine. In such a storage engine, the line number information calculated based on the above statistical information is generally matched with the line number information actually stored in the database, so that the calculation of the index cost based on the line number information does not usually have a great problem.

However, the index cost calculation method of the LSM tree storage engine is significantly different from the conventional Buffer-Pool-based storage engine. The conventional row number information calculated based on the selection rate is not accurate to evaluate the index cost in the LSM tree storage engine, so that the optimizer may not select a better base table access path.

Consider an example where baseline data in the LSM tree storage engine shown in fig. 2 is deleted. In fig. 2, the baseline data contains 10 ten thousand rows of data, and the deletion operation for these 10 ten thousand rows of data is maintained in the incremental data. In this scenario, the total number of rows in the table is 0 rows, and in the conventional Buffer-Pool-based storage engine, the cost of the scan index matches the number of rows of the scan index, so that the scan can be completed quickly. However, in an LSM tree storage engine, scanning can be slow (the scanning process involves the merging of 10 ten thousand lines of baseline data and 10 ten thousand lines of delta data). It can be seen that in the LSM tree storage engine, the number of rows of the index and the cost of scanning the index are not matched. The reason for this problem is that: on the storage engine based on the LSM tree, the traditional line number information calculated based on dynamic sampling and predicate selection rate is not enough to reflect the line number required to be scanned in the actual calculation cost process. As a simple example, in a conventional relational database, if 1 ten thousand rows of data are inserted and 1 thousand rows of data are deleted, 9 thousand rows are used for calculation when calculating the index cost. However, in the LSM tree storage engine scenario, if the aforementioned 1 ten thousand rows of data are stored in the baseline data and the 1k rows of data are incremental data stored in the memory, it is only accurate to calculate with ten thousand rows when calculating the index cost.

In order to solve the problem that when the LSM tree storage engine calculates the cost of the index, the number of rows information is inconsistent with the actual number of rows in the index, the embodiments of the present disclosure provide a new number of rows calculation method suitable for the LSM tree storage engine. The row number calculation may be applied to a LSM tree storage engine based database, such as an OceanBase database.

First, the embodiments of the present disclosure introduce the concepts of "physical row" and "logical row", and propose an index cost calculation method based on "physical row" and "logical row".

The number of rows of the physical row of an index may be equal to the sum of the number of rows of the index contained in the baseline data and the number of rows of the index contained in the delta data. The physical rows are mainly used to characterize the number of rows that the storage engine, LSM tree, needs to really access when calculating the cost of scanning the index. The number of rows of logical rows of an index is the number of rows of index rows contained in the baseline data and the remaining rows of the index rows after the merging of the index rows contained in the delta data. Logical rows are to be understood as rows in the conventional sense. Referring back to fig. 2, in the example of fig. 2, the logical rows are 0 rows and the physical rows are 20w rows.

The physical rows may be used to determine the cost of scanning the index. That is, the cost of scanning an index may be determined based on the number of rows of physical rows that the index needs to be scanned.

In some embodiments, the number of rows of the physical row that the index needs to be scanned may be determined based on the first row number information and the second row number information. The first line number information refers to line number information associated with the baseline data; the second line number information refers to line number information associated with the incremental data.

The first row number information may be used to indicate the number of rows contained in the baseline data. For example, the scan range of the index may be determined first, and then the number of rows in the baseline data that lie within the scan range of the index may be determined.

The first line number information may be acquired in various ways. For example, it may be obtained by statistical information of a database. As another example, the metadata may be obtained from the baseline data. Some databases, such as the OceanBase database, maintain block-level statistics for the baseline data, and thus are able to quickly calculate the first row information.

The second line number information may be used to indicate one or more of the following line numbers contained in the delta data: the number of added lines, the number of deleted lines and the number of changed lines. For example, the scan range of the index may be determined, and then the number of new lines, deleted lines, and modified lines in the incremental data that are within the scan range of the index may be determined.

The embodiment of the present disclosure considers both incremental data and baseline data, which means that the statistical information is real-time, while the conventional statistical information collection is delayed (usually, the addition/deletion/modification operation of a table is performed to a certain extent, and then the statistical information is collected again).

The logical rows may be used to determine the cost of the index needing to go back to the table. That is, the cost of an index requiring a table return may be determined based on the number of rows of logical rows for which the index requires a table return. The cost of the index needing to return to the table can be determined by adopting a logic row, because the baseline data and the incremental data are merged in the process of scanning the index, the return to the table cost of the index can be calculated by directly utilizing the merged data of the baseline data and the incremental data.

In some embodiments, the number of rows of logical rows to index back to the table may be determined based on the first row number information and the second row number information. The first line number information refers to line number information associated with the baseline data; the second line number information refers to line number information associated with the delta data.

The first line number information may be acquired in various ways. For example, it may be obtained by statistical information of a database. As another example, the metadata may be obtained from the metadata of the baseline data. Some databases, such as the OceanBase database, maintain block-level statistics for the baseline data, and therefore can quickly calculate the first row count information.

The second line number information may be used to indicate one or more of the following line numbers contained in the delta data: the number of newly added lines, the number of deleted lines and the number of changed lines. For example, the scan range of the index may be determined, and then the number of new lines, deleted lines, and modified lines in the incremental data that are within the scan range of the index may be determined.

Referring to fig. 3, the baseline data contains metadata. Thus, the metadata may be used to calculate the number of lines in the baseline data (corresponding to the first line number information above), and dynamically sample the incremental data by means of dynamic sampling, resulting in the new, deleted, and modified number of lines (corresponding to the second line number information above). Combining the two, a logical row and a physical row can be obtained.

Fig. 4 illustrates a method for determining an index cost according to an embodiment of the present disclosure. The method 400 of fig. 4 may be performed by a database, such as an optimizer in a database. The database may be a LSM tree storage engine based database. For example, the database may be an OceanBase database. One of the advantages of the OceanBase database is that predicate dependencies on index columns are resolved. For example, considering indexes (a, b) and query conditions a =1 and b =1, one problem that must be considered when a conventional database calculates the predicate selection rate of the query condition is whether a and b have a dependency relationship, and then uses corresponding statistical information (multi-column histogram or dynamic sampling) to estimate row number information, thereby improving the accuracy of predicate selection rate calculation. The current evaluation mode of the OceanBase database takes the dependency relationship between different predicate conditions into account, so the scenes with dependency relationship between a and b can be solved by default by using the OceanBase database. The database may select a base table access path (i.e., select an index to access a table in the database) based on the method shown in fig. 4. For example, the database may calculate the cost of each index based on the method shown in fig. 4, and then determine an index path for accessing a table in the database according to the cost of each index. As shown in fig. 4, the method 400 may include steps S410, S420, and S430, which are described in detail below.

In step S410, a first database query statement is received. As one example, the database query statement may be an SQL statement.

In step S420, the number of rows of the physical row that the index needs to be scanned is determined according to the first database query statement. For example, a scan range of the index may be determined according to a predicate in a database query statement, and then a sum of the number of rows in the baseline data and the incremental data that are within the scan range is calculated, so as to obtain the number of rows of the physical row that the index needs to be scanned.

In step S430, the cost of scanning the index is determined according to the number of rows of the physical row. Taking the example that the number of rows of the physical row that needs to be scanned for the index is 1 ten thousand rows, the cost of scanning the index can be determined as a sequential scan of 1 ten thousand rows.

Fig. 5 illustrates a method for determining an index cost according to another embodiment of the present disclosure. The method 500 of fig. 5 may be performed by a database, such as an optimizer in a database. The database may be a LSM tree storage engine based database. For example, the database may be an OceanBase database. The database may select a base table access path (i.e., select an index to access a table in the database) based on the method shown in fig. 5. For example, the database may calculate the cost of each index based on the method shown in fig. 5, and then determine an index path for accessing a table in the database according to the cost of each index. As shown in fig. 5, the method 500 may include step S510, step S520, and step S530. These steps are described in detail below.

At step S510, a first database query statement is received. As one example, the database query statement may be an SQL statement.

In step S520, the number of rows of the physical row that the index needs to be scanned and the number of rows of the logical row that the index needs to return to the table are determined according to the first database query statement. For example, a scan range of the index may be determined based on predicates in the database query statement, and then the number of rows in the baseline data and the incremental data that lie within the scan range may be calculated to obtain the number of rows of physical and logical rows that the index needs to be scanned.

In step S530, the cost of scanning the index and the cost of returning the index to the table are determined according to the number of rows of the physical rows and the number of rows of the logical rows, respectively. Taking the number of rows of the physical row to be scanned by the index as 1 ten thousand rows and the logical row as 5 thousand rows as an example, the cost of scanning the index can be determined as sequential scanning of 1 ten thousand rows, and the return cost of the index can be determined as random scanning of 5 thousand rows.

The embodiment of the disclosure can improve the accuracy of estimating the line by the LSM tree storage engine by introducing two concepts of the logical line and the physical line, thereby improving the accuracy of selecting the index path.

Method embodiments of the present disclosure are described in detail above in conjunction with fig. 1-5, and apparatus embodiments of the present disclosure are described in detail below in conjunction with fig. 6-7. It is to be understood that the description of the method embodiments corresponds to the description of the apparatus embodiments, and therefore reference may be made to the preceding method embodiments for parts not described in detail.

Fig. 6 is an apparatus for determining an index cost according to an embodiment of the present disclosure. The apparatus 600 shown in fig. 6 includes a LSM tree storage engine-based database having stored therein baseline data and delta data, the baseline data and the delta data comprising data of an index, the index having a physical row with a row number that is the sum of the row number of the index contained in the baseline data and the row number of the index contained in the delta data.

The apparatus 600 comprises: a receiving module 610, a first determining module 620, and a second determining module 630.

The receiving module 610 may be configured to receive a first database query statement.

The first determination module 620 may be configured to determine the number of rows of physical rows that the index needs to be scanned according to the first database query statement.

The second determining module 630 may be configured to determine a cost of scanning the index according to the number of rows of the physical row.

Optionally, in some embodiments, the index further has a logical row, and the number of rows of the logical row of the index is the number of rows of the index included in the baseline data and the remaining rows of the index included in the delta data after the rows of the index are merged, and the apparatus further includes: a third determining module, configured to determine, according to the first database query statement, a row number of a logical row of the index that needs to be returned to the table; and the fourth determining module is used for determining the table returning cost of the index according to the row number of the logic row.

Optionally, in some embodiments, the number of rows of the physical row that the index needs to be scanned is determined based on first row number information indicating the number of rows included in the baseline data and second row number information indicating one or more of the following number of rows included in the incremental data: the number of added lines, the number of deleted lines and the number of changed lines.

Optionally, in some embodiments, the number of rows of the logical rows that the index needs to be returned to the table is determined based on first row number information and second row number information, the first row number information indicating the number of rows included in the baseline data, and the second row number information indicating one or more of the following rows included in the delta data: the number of newly added lines, the number of deleted lines and the number of changed lines.

Optionally, in some embodiments, the first line number information is obtained from metadata of the baseline data.

Optionally, in some embodiments, the second row number information is obtained by dynamically sampling the incremental data.

Fig. 7 is a schematic structural diagram of an apparatus for determining an index cost according to another embodiment of the present disclosure. The apparatus 700 for determining an index cost as described in fig. 7 may include a memory 710 and a processor 720, and the memory 710 may be configured to store instructions. The processor 720 may be configured to execute instructions stored in the memory 710 to implement the steps of the various methods described previously. In some embodiments, the apparatus 700 may further include a network interface 730, and data exchange between the processor 720 and an external device may be implemented through the network interface 730.

In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware or any other combination. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the disclosure are, in whole or in part, generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disk (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily think of the changes or substitutions within the technical scope of the present disclosure, and shall cover the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims

1. A method of determining an index cost, the method being applied to a database based on an LSM tree storage engine, the database having stored therein baseline data and delta data in which modification operations for data in the database are recorded, the baseline data and the delta data containing data of an index, the index having a physical row for characterizing a number of rows that the LSM tree storage engine needs to access when calculating a cost of scanning the index, the number of rows of the index being a sum of a number of rows of the index contained in the baseline data and a number of rows of the index contained in the delta data,

the method comprises the following steps:

receiving a first database query statement, a query result of which is formed based on the LSM tree storage engine after merging the baseline data and the incremental data;

determining a scanning range of the index according to a predicate in the first database query statement, and determining the number of rows of physical rows of the index to be scanned in the scanning range;

determining a cost of scanning the index according to the number of rows of the physical rows.

2. The method of claim 1, the index further having logical rows with a number of rows being the number of rows of the index contained in the baseline data and the number of rows remaining after the rows of the index contained in the delta data have been merged,

the method further comprises the following steps:

determining the number of rows of the logic row of which the index needs to return to the table according to the first database query statement;

and determining the table returning cost of the index according to the row number of the logic row.

3. The method of claim 1, the number of rows of physical rows for which the index needs to be scanned being determined based on first row number information indicating the number of rows contained in the baseline data and second row number information indicating one or more of the following number of rows contained in the delta data: the number of added lines, the number of deleted lines and the number of changed lines.

4. The method of claim 2, the number of rows of logical rows for which the index requires a return table being determined based on first row number information indicating the number of rows contained in the baseline data and second row number information indicating one or more of the following number of rows contained in the delta data: the number of added lines, the number of deleted lines and the number of changed lines.

5. The method of claim 3 or 4, the first line number information being obtained from metadata of the baseline data.

6. The method of claim 3 or 4, wherein the second row number information is obtained by dynamically sampling the incremental data.

7. An apparatus for determining an index cost, the apparatus comprising a LSM tree storage engine-based database having stored therein baseline data and delta data in which modification operations for data in the database are recorded, the baseline data and the delta data containing data of an index, the index having a physical row for characterizing a number of rows that the LSM tree storage engine needs to access when calculating a cost for scanning the index, the number of rows of the index being a sum of a number of rows of the index contained in the baseline data and a number of rows of the index contained in the delta data,

the device comprises:

a receiving module, configured to receive a first database query statement, where a query result of the first database query statement is formed by combining the baseline data and the incremental data based on the LSM tree storage engine;

a first determining module, configured to determine a scanning range of the index according to a predicate in the first database query statement, and determine, in the scanning range, a number of rows of physical rows that the index needs to be scanned;

and the second determining module is used for determining the cost of scanning the index according to the number of the rows of the physical rows.

8. The apparatus of claim 7, the index further having logical rows with a number of rows that is the number of rows of the index contained in the baseline data and the remaining rows of the index contained in the delta data after the rows of the index have been merged,

the device further comprises:

a third determining module, configured to determine, according to the first database query statement, a number of rows of a logical row of which the index needs to return to the table;

and the fourth determining module is used for determining the table returning cost of the index according to the row number of the logic row.

9. The apparatus of claim 7, the number of rows of physical rows for which the index needs to be scanned is determined based on first row number information indicating the number of rows contained in the baseline data and second row number information indicating one or more of the following number of rows contained in the delta data: the number of added lines, the number of deleted lines and the number of changed lines.

10. The apparatus of claim 8, the number of rows of logical rows for which the index requires a table back to be determined based on first row number information indicating a number of rows included in the baseline data and second row number information indicating one or more of the following number of rows included in the delta data: the number of added lines, the number of deleted lines and the number of changed lines.

11. The apparatus of claim 9 or 10, the first line number information being obtained from metadata of the baseline data.

12. The apparatus of claim 9 or 10, wherein the second row number information is obtained by dynamically sampling the incremental data.

13. An apparatus for determining an index cost, comprising:

a memory to store instructions;

a processor for executing instructions stored in the memory to perform the method of any of claims 1-6.