CN115391424A - Database query processing method, storage medium and computer equipment - Google Patents

Database query processing method, storage medium and computer equipment Download PDF

Info

Publication number
CN115391424A
CN115391424A CN202211037293.1A CN202211037293A CN115391424A CN 115391424 A CN115391424 A CN 115391424A CN 202211037293 A CN202211037293 A CN 202211037293A CN 115391424 A CN115391424 A CN 115391424A
Authority
CN
China
Prior art keywords
query
index
tree
data
aggregation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211037293.1A
Other languages
Chinese (zh)
Inventor
范国腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingbase Information Technologies Co Ltd
Original Assignee
Beijing Kingbase Information Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingbase Information Technologies Co Ltd filed Critical Beijing Kingbase Information Technologies Co Ltd
Priority to CN202211037293.1A priority Critical patent/CN115391424A/en
Publication of CN115391424A publication Critical patent/CN115391424A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24542Plan optimisation

Abstract

The invention provides a processing method, a storage medium and computer equipment for database query, wherein the processing method comprises the following steps: acquiring a query statement, and analyzing the query prediction to obtain a query tree of the query statement; traversing the tables related to the query tree to obtain an aggregation index generated in advance on the related tables; replacing a matching sub-tree on the query tree, which is matched with the query sub-tree, by using the query sub-tree of the aggregated index, so as to obtain an optimized query tree; and generating an execution plan of the query statement according to the optimized query tree. By applying the scheme of the invention, when statistical query is processed, the aggregation index is used for query, and the data table does not need to be scanned and complex aggregation operation is carried out any more, so that the query execution time of the query statement is greatly shortened, and the occupation of hardware resources is reduced.

Description

Database query processing method, storage medium and computer equipment
Technical Field
The present invention relates to database technologies, and in particular, to a method, a storage medium, and a computer device for processing a database query.
Background
There are typically a large number of statistical query statements present in analytical database applications. The processing of these statements is time consuming and requires significant hardware resources. A number of aggregation operations are used in these statements to generate statistical information such as grouping, deduplication, summing, counting, and the like.
In order to improve the processing efficiency of the statistical queries, some solutions proposed in the existing database technologies, such as materialized views, column storage, and calculation of approximate values by using sampled statistical information, etc., but the technologies all have certain limitations. For example, materialized views cannot reflect data updates in real time, data changes need to be reflected by refreshes, and frequent refreshes also consume a large amount of resources. Column storage to get a column statistic, it is inefficient to import all the data for a column. And the sampling statistics can only obtain approximate values and is not accurate enough.
In order to solve the above problem, an ideal processing method is to store some common statistical data which is time-consuming for query, so as to be directly used in the next query. However, in the actual implementation process, due to the diversity and complexity of the statistical mode and the limitation of the updating efficiency, some existing schemes have certain limitations, and the prior art lacks a solution for breaking through the limitations.
Disclosure of Invention
It is an object of the invention to improve the efficiency of processing statistical query statements.
It is a further object of the invention to reduce the occupation of hardware resources in processing statistical query statements.
Particularly, the invention provides a processing method of database query, comprising the following steps:
acquiring a query statement, and analyzing the query prediction to obtain a query tree of the query statement;
traversing the tables related to the query tree to obtain an aggregation index generated in advance on the related tables;
replacing a matching sub-tree on the query tree, which is matched with the query sub-tree, by using the query sub-tree of the aggregated index, so as to obtain an optimized query tree;
and generating an execution plan of the query statement according to the optimized query tree.
Optionally, the step of replacing a matching sub-tree on the query tree that matches the query sub-tree with the query sub-tree of the aggregated index comprises:
compiling all the aggregation indexes obtained by traversal into aggregation index subtrees respectively;
querying a matching subtree matched with the aggregation index subtree on a query tree;
the matching subtree is replaced with the aggregate index subtree.
Optionally, the step of querying the query tree for a matching subtree matching the aggregated indexed subtree comprises:
and traversing the query tree again, and searching the subtree which is the same as the query subtree and the subtree of which the sub-query contains the query condition of the aggregation index subtree as a matching subtree.
Optionally, the step of generating an execution plan of the query statement according to the optimized query tree comprises:
performing cost evaluation on the optimized query tree by an optimizer of the database;
and selecting an execution plan from the optimized query tree based on the result of the cost evaluation.
Optionally, after the step of generating an execution plan of the query statement according to the optimized query tree, the method further includes:
executing the execution plan by an executor of the database, and inquiring the index key value and the aggregation data meeting the conditions from the corresponding aggregation index in the process of processing the aggregation index subtree in the execution process.
Optionally, the step of querying the corresponding aggregated index to obtain the index key value meeting the condition and the aggregated data thereof further includes:
judging whether the aggregated data is directly the required result data;
if not, performing secondary calculation on the aggregated data to obtain result data.
Alternatively, after traversing the tables involved in the query tree, if all of the involved tables do not have an aggregated index, an execution plan is formulated by the optimizer of the database directly from the query tree.
Optionally, the method further includes:
acquiring data operation for triggering the update of the aggregation index of the database, and modifying the target data table of the database according to the data operation;
reading a statistical formula predefined by the aggregation index;
carrying out statistical calculation on the modified data by using the statistical formula to obtain statistical information;
updating aggregated data of the aggregation index using the statistical information.
According to yet another aspect of the present invention, there is also provided a computer device comprising a memory, a processor and a machine-executable program stored on the memory and running on the processor, and the processor, when executing the machine-executable program, implements the method of processing a database query of any of the above.
The processing method for database query of the invention uses the aggregation index to query when processing statistical query, does not need to scan the data table and perform complex aggregation operation, but only performs simple secondary operation on the statistical information stored in the index and then returns, thereby greatly shortening the query execution time of query sentences and reducing the occupation of hardware resources.
Furthermore, the processing method for database query of the invention uses the aggregated indexed data to complete query, and does not scan the original data table any more, so the influence on the execution time due to the increase of the data volume is not large, and the larger the data volume of the database is, more query processing time can be saved.
Furthermore, in the processing method of the database query of the present invention, the aggregated index has the following characteristics: the index value is not repeated, the index contains the specified statistical information when defining the index, the index is a sorting index, and the index can be updated in real time when the data changes. The data fed back by using the aggregation index is accurate and efficient.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the invention will be described in detail hereinafter, by way of illustration and not limitation, with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a diagram of a data structure of an aggregated index for which a method of processing a database query is directed, according to one embodiment of the invention;
FIG. 2 is a data diagram of a node in an aggregated index for which a method of processing a database query is directed, according to one embodiment of the invention;
FIG. 3 is a flowchart illustrating updating an aggregated index in a method for processing a database query according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a method of processing a database query according to one embodiment of the invention;
FIG. 5 is a schematic diagram of generating an execution plan in a method of processing a database query, according to one embodiment of the invention;
FIG. 6 is a data instance for executing a query using an aggregated index in a method of processing a database query according to one embodiment of the invention;
FIG. 7 is a graph comparing execution time before and after application of a method of processing a database query, according to one embodiment of the invention;
FIG. 8 is a schematic diagram of a machine-readable storage medium according to one embodiment of the invention; and
FIG. 9 is a schematic diagram of a computer device according to one embodiment of the invention.
Detailed Description
The database query processing method of the embodiment uses a novel index named as an aggregation index. The purpose of aggregating the index is to store the statistical information of the data in the form of the index, so that the problem that the processing consumes time and resources and the processing consumes a lot of resources because the statistical information is generated by using an aggregation operation when the statistical query is processed in the prior art is solved.
The aggregated index can utilize the orderliness of BTREE index or other index technology in the database to accelerate the retrieval of information, and the real-time performance of statistical information is ensured by means of the characteristic that the index is updated correspondingly with the updated data. The aggregated index can also ensure the visibility of data by using the index data update to be submitted when the transaction is submitted, and can even realize the acceleration of the access speed of the statistical information by using the partition of the index.
Fig. 1 is a schematic diagram of a data structure of an aggregation index for which a processing method of a database query according to an embodiment of the present invention is applied, and fig. 2 is a schematic diagram of data of a node in the aggregation index for which the processing method of a database query according to an embodiment of the present invention is applied.
Referring to the Btree structure as an example in fig. 1, the index portion 110 of the aggregated index is constructed according to index key values, where 4, 32, 64, and the like are example index key values, which may be data ids, and non-leaf nodes have pointers of next-layer nodes. The data portion 120 of the aggregate index stores the position of the data corresponding to the index key value, i.e., the pointer of the data.
The data portion 120 is provided with statistical information value in addition to the data position, as shown in fig. 2, the aggregate index is subjected to deduplication processing, the same index key 44 corresponds to three data, and the statistical information value of the data is also stored.
The data structure of the aggregation index is improved on the basis of the index structure (such as a Btree structure) of the database, so that the data structure of the aggregation index has the structure of the existing index on one hand, and new characteristics are added on the other hand.
The aggregation index targeted by the method of the embodiment is added with a list field for storing statistical information in the data structure of the index. The statistical information is recalculated each time the data is updated, based on the data pointer specified by the update index.
The length and content of the statistics field are defined when creating the index. For example, if the statistical information for the id column is selected id and sum (val) from t group by id, the sum (val) value corresponding to id =1 is recalculated and updated into the index file each time the data of id =1 is updated.
The data stored in the aggregate index may be selected to be a value after deduplication by setting, that is, in the aggregate index, there may be one index node for each data row, or one index node may be shared by all data rows with the same index key value.
For the aggregation operation to be saved being a grouping or deduplication operation, only the data from which the duplicate value is removed may be retained in the aggregation index. For example, in the case that the values of id columns of two rows of data inserted into a table (id int) are both 1, only one key value may be reserved in the aggregation index, the pointer pointing to data corresponding to the key value is an array, and the pointers pointing to the two rows of data are reserved in the array. The data equivalent to the original BTREE is saved in the form of two indexes: (key =1, value = (1)), (key =1, value = (1)). The data is stored in the aggregation index in the form of (key =1, value = ((1), (1)), aggval = ()). Wherein aggval = () the value of statistical information to be calculated. In this way, the time for the grouping and deduplication operations can be saved during data query.
As can be seen from the above description, the aggregation index further increases the position of the corresponding data (generally, the data row position corresponding to the index key value) and the statistical information value on the basis of the index data structure. I.e. two linked lists are newly added: and storing the linked list of all data line positions corresponding to the index key value and the linked list of all statistical values corresponding to the index key value.
The generation process of the aggregation index targeted by the database query processing method of the embodiment includes: acquiring a creating statement of a database index; semantic analysis is carried out on the created sentences to obtain indexed relationships and statistical formulas; creating an aggregation index; counting the indexed relation according to a statistical formula to obtain statistical information; and saving the statistical information as aggregation data in an aggregation index.
Where the create statement may specify the index type of the index and the index data structure (e.g., btree or hash structure). An alternative syntax for creating statements is:
CREATE[UNIQUE]INDEX[CONCURRENTLY]name ON table[USING method]
wherein CREATE is a CREATE command; [ UNIQUE ] indicates that each index key value of the index corresponds to a UNIQUE data record and is an optional setting item; [ CONCURRENTLY ] is an optional setting item indicating that DML (delete, modify, insert) operations are not blocked during the creation of the index; [ USING method ] is an index type. The type of the generated aggregation index of this embodiment may be defined as AGGREGATE.
For example for create statement 1: CREATE INDEX idx1 ON t using AGGREGEGEGEGEGEATE (select id, max (score) from t group by id). The function is to establish an aggregation index on an id column of a table t, and a max (score) value corresponding to the id is stored in the aggregation index.
For another example for create statement 2: CREATE INDEX idx1 ON t using AGGREGGREGATE (select count, progress, sum (num), max (score) from t group by count, progress). The method has the function of establishing an aggregation index on a county and a provice in a table t, and saving sum (num) and max (score) values corresponding to the county and the provice in the aggregation index.
The indexed relation obtained by creating the statement through semantic analysis may be a single table of the database, or a sub-query in which the database is associated with multiple tables, that is, the aggregate index may be created on the single table of the database, or on one sub-query in which multiple tables are associated. This is also an important difference from existing indexes.
For example, for the above creation statement 1, the indexed relationship is the id column of table t, and the statistical formula (statistical function) is max (score); for the above-mentioned creation statement 2, the indexed relationship is the counter and province columns of table t, and the statistical formula (statistical function) has two values: sum (num) and max (score).
When the database parser analyzes syntax and semantics of the created statement, if the index type is determined to be aggregate, the aggregate index creation process is called, and the creation of the aggregate index is started. For example, a statement of CREATE INDEX idx1 ON t using AGGREGATE (select id, max (score) from t group by id) is obtained ON an id column of a table t during parsing, an aggregation (AGGREGATE) INDEX is created, and aggregation information stored in a key value of the INDEX is max (score).
The step of creating an aggregation index may comprise: creating an index data table, wherein the index data table can adopt an ordered index structure such as a BTree data structure; and recording the corresponding relation between the statistical formula and the indexed data table and the indexed relation in a system table of the database. That is to say, in the process of creating the aggregated index, a data table t _ index is created first, then the corresponding relationship between the data table t (indexed relationship) and the index data table t _ index is recorded in the database system table, and other information of the index, such as type aggregate, corresponding key value id, corresponding statistical information calculation formula, etc., is also recorded in the system table.
The step of recording the statistical formula in the system table of the database may include: saving the statistical formula in the form of a created statement; or compiling the statistical formula to obtain an executable code corresponding to the statistical formula, and storing the executable code corresponding to the statistical formula. That is, when the statistical formula is saved, the SQL form of the calculation formula, such as max (score), sum (num), etc., may be saved, and the statistical formula may also be compiled into executable code by some compiling techniques (such as JIT, etc.) to speed up the compiling when modifying the index.
Before the step of recording the statistical formula in the system table of the database, whether the statistical formula defines a corresponding conversion rule can be judged; if yes, converting the statistics according to the corresponding conversion rule; and the step of recording the statistical formula in the system table of the database includes: and recording the converted statistical formula in a system table of the database.
The conversion requirements for different statistical formulas are different. Some statistical formulas do not need to be transformed, and some statistical formulas need to be transformed to a certain extent according to a defined rule so as to avoid the need of re-scanning the whole table for recalculation each time data is updated. For example, for creating statement 1, when max (score) is counted, it is only necessary to save max (score), so that when the value of data inserted into data table t is larger than the current value, updating is needed, and when the deleted value is the maximum value, the statistical information of the key value needs to be counted again. Also for example, if it is statistically calculated as average avg (score), information of sum (score) and count aggregation function needs to be stored because average avg = total/number = sum/count.
Considering that the value of the statistics is a collection of some integer or floating point numbers, the length and footprint of the list of statistics in defining the aggregation index may already be determined. Then after the step of creating the index data table may further comprise: determining the data length of the statistical information; and allocating space for the index data table according to the data length.
For example, the constraint conditions for defining the index are select id, sum (val), max (score) from t group by id. If val is int type and score is float type, then for the aggregation index listed as id, the extra space size to be allocated to each key value should be sizeof (int) + sizeof (float), i.e. the size of int type and the size of float type.
The statistical process of counting the indexed relation according to the statistical formula can be performed after reading the corresponding column of the indexed relation (data table t) according to the statistical formula.
The statistical information is stored as aggregated data in the aggregated index, that is, stored in a linked list for storing the statistical information corresponding to the key value of the aggregated index, and the stored statistical information is referred to as aggregated data of the aggregated index.
According to the method, after the creation sentences of the database index are obtained, the aggregation index is created, the aggregation index is used for counting in advance and storing the results of subsequent common statistical queries, and the data can be conveniently and quickly provided during the subsequent statistical queries. On one hand, the processing efficiency of statistical query is improved by means of the orderliness of indexes; on the other hand, the aggregation index is correspondingly updated along with the change of the data, so that the real-time performance of the data is ensured. The aggregated data of the aggregated index is obtained by counting all relevant data of the indexed relation (which can be a single table or a result table of sub-query), so that the accuracy of the statistical data is ensured.
The statistical information is stored in a data structure of the aggregation index. That is, for each index key, additional space is allocated in the data structure of the aggregate index to hold the statistics.
The aggregate index has the following characteristics: the index key values are not repeated, the aggregation index contains flexibly defined statistical information, the index is a sorting index, and the index is updated in real time when data changes.
Alternatively, the statistical information may be saved in another file to form a new data file. In the index file, only the position of the corresponding statistical information in the file is saved. Suppose key value A of an index file corresponds to data at position B of the data file. The traditional index file stores the corresponding relation (A- > B) between A and B. The method of this embodiment may add a statistics file C. The aggregate index file stores (A- > (B, C)). Wherein a change in the data file B causes a modification of the index file a and simultaneously causes a modification of the statistics file C. Except for the above trigger condition, the statistical file C is not actively updated.
Fig. 3 is a schematic flowchart of updating the aggregation index in the processing method of the database query according to an embodiment of the present invention. The updating method comprises the following steps:
step S302, acquiring a request for updating data; when a data table associated with the aggregation index is subjected to DML (INSERT, UPDATE, DELETE) operation, updating of the aggregation index is triggered at the same time;
in step S304, data update is performed. I.e., to modify data in a database table.
Step S306, performing aggregation index update. And for each row of changed data, all indexes corresponding to the table are obtained from the system table and are modified in sequence.
For the aggregate index of the BTREE structure, firstly, the query of index key values is carried out according to the BTREE index mode, and then the operation of updating the index is carried out.
When the aggregate index is processed in step S306, the index key is first queried in a BTREE index manner.
And searching the node position of the index key value in the BTREE aiming at the inserted data. If the index key value already exists in the BTREE node and the aggregation index requires that the key value is not repeated, a space is distributed on a linked list of the corresponding data position on the data space corresponding to the index key value to store the position of the current record. And then calculating the statistical information corresponding to the key value. If the index key value does not exist, a new index key value is allocated. And saving the data corresponding to the index key value. Then, the statistical information of the index key value is recalculated.
For deleting data, firstly updating a data position linked list, and deleting the data row corresponding to the index key value. If the chain table is empty after deletion, deleting the key value of the index key value, and otherwise, recalculating the statistical information value after deletion.
When modifying the aggregated data, the specific process may include: reading a statistical formula predefined by the aggregation index; carrying out statistical calculation on the modified data by using the statistical formula to obtain statistical information; updating aggregated data of the aggregation index using the statistical information. First, all the statistical formulas defined on the aggregation index are read from the database system table. And then traversing the set of statistical formulas, and performing statistical calculation on each value. During the statistical calculation, whether the statistical data needs to be recalculated or not is judged according to different statistical formulas. For some specific statistical formulas, such as max (maximum), min (minimum), count, etc., only the data value of the current row and the stored original aggregated data need to be operated. For other statistical formulas, all data rows corresponding to the index key value can be traversed, all data corresponding to the index value are sequentially taken, and the aggregated data is recalculated.
In the database mechanism, when data is updated, the index information is also updated. When the index information is updated, the statistical information in the index information is also updated.
For example: for the aggregation index: CREATE INDEX idx1 ON t using AGGREGEGEGATE (select id, max (score) from t group by id). When data operation of update t set score =100where id =1and val < -10 is performed, the update operation of the aggregation index at this time is: finding all data lines satisfying id =1 by an id =1 index; judging that val is less than 10 for each row of records, and executing data updating if the records meet the conditions; record calculate max (score) per line; the final max (score) result is updated into the aggregated data of the aggregation index.
When the statistical query is executed, the statistical query statement undergoes analysis and optimization, and a query result is finally returned after three stages of execution. Fig. 4 is a schematic diagram of a processing method of a database query according to an embodiment of the present invention, the processing method of the database query including:
step S402, obtaining a query statement, and analyzing the query prediction to obtain a query tree of the query statement;
step S404, traversing the tables related to the query tree to obtain the aggregation indexes generated in advance on the related tables;
step S406, replacing the matching subtree matched with the query subtree on the query tree by using the query subtree of the aggregated index, thereby obtaining an optimized query tree;
step S408, generating an execution plan of the query statement according to the optimized query tree.
The matching in step S406 may specifically include: compiling all the aggregation indexes obtained by traversal into aggregation index subtrees respectively; querying a matching subtree matched with the aggregation index subtree on a query tree; the matching subtree is replaced with the aggregate index subtree. The process of querying the query tree for a matching subtree that matches the aggregated indexed subtree may include: and traversing the query tree again, and searching the subtree which is the same as the query subtree and the subtree of which the sub-query contains the query condition of the aggregation index subtree as a matching subtree.
If after traversing the tables involved in the query tree, if all of the involved tables do not have an aggregated index, an execution plan is formulated by the database optimizer directly from the query tree. That is, if the aggregate index cannot provide the data needed for the statistical query, the optimizer processes the statistical query according to existing processes.
The process of generating the execution plan of the query statement at step S408 may include: performing cost evaluation on the optimized query tree by an optimizer of the database; and selecting an execution plan from the optimized query tree based on the result of the cost evaluation. After the execution plan is generated, the execution plan can be executed by an executor of the database, and in the process of processing the aggregation index subtree in the execution process, the index key value meeting the conditions and the aggregation data thereof are inquired from the corresponding aggregation index. Then judging whether the aggregated data is directly the required result data; if the aggregated data is directly the desired result data, this directly provides the aggregated data; if the aggregated data is not the desired result data, a secondary calculation may also be performed on the aggregated data to obtain the result data.
Fig. 5 is a schematic diagram of generating an execution plan in the processing method of the database query according to an embodiment of the present invention, where the step of generating the execution plan includes:
step S502, obtaining and analyzing statistical query;
step S504, judge whether there is an aggregation index on the column of the statistical inquiry;
step S506, the database optimizer directly makes an execution plan
Step S510, generating an execution plan by utilizing the aggregation index;
step S512, searching on the aggregation index;
step S514, a secondary calculation is performed on the search result.
When the statistical query is executed by using the aggregation index, the SQL statement is subjected to three stages of analysis, optimization and execution, and finally a query result is returned. Step S502 is an analysis process, and the optimizer completes step S504, step S610, and the executor completes step S512 and step S514.
In the optimizer stage, it is determined that if some of the queries of the statistical queries satisfy the query conditions of the aggregate index, an execution plan of the aggregate index is generated for the sub-queries.
The prior art plans to execute the query statement as follows:
NestLoop Join(a.id=b.id)
SeqScan on a
HashAggregate(b.id)
SeqScan on b
that is, in the prior art implementation plan, one scan and aggregate statistical calculation is performed for each record in table a and for table b. This lookup operation can be very time consuming when the data of table b is very large.
When the aggregation index implemented in this embodiment is applied, a create statement is used: CREATE INDEX idx1 ON b using AGGREGEGEGEGATE (select id, sum (score) from b group by id) CREATEs an aggregation INDEX.
When the optimizer generates an execution plan by using the aggregation index, it is determined that the aggregation index exists in the column id in the table b, and further, an operation of querying all aggregation queries exists in aggregation data formed in advance by the aggregation index, so that a query plan based on the aggregation index is generated for the table b. The execution plan may be:
NestLoop Join(a.id=b.id)
SeqScan on a
AggIndexScan on b(b.id)
i.e. one index scan is performed for each record of table a and for table b. And because the result of the statistical query is stored in the aggregated data of the aggregated index, the result of the aggregated data can be directly returned.
It can be seen from the above description that, by applying the method of this embodiment, the processing efficiency of the aggregated query can be greatly improved.
And when processing the statistic query request statement, the database optimizer judges the column where the statistic query relates to the aggregation index, and if the aggregation index meets the condition of aggregation operation, generates an execution plan for querying by using the aggregation index.
When the actuator executes the aggregation index, the aggregation index is inquired to obtain statistical information, and then secondary operation is performed by using the statistical information to return a final result. And the aggregation index is used for inquiring, the data table does not need to be scanned and complex aggregation operation is carried out, and only the statistical information stored in the index is simply subjected to secondary operation and then returned. The execution time of SQL and the occupation of hardware resources are greatly reduced.
The function of the optimizer is divided into rule-based optimization and cost-based optimization, and the optimization can be attributed to the rule-based optimization because the use of the aggregation index certainly brings performance improvement. The rule is applicable to all databases, and the process of technical implementation is described below by taking the KES database as an example:
the input of the database optimizer is a Query tree Query1 (an internal tree data structure) converted by an SQL string, and the output is the finally selected execution plan. The main task of logic optimization is to transform the query tree according to rules. When the aggregation index transformation rule is applied, all tables related in the query tree corresponding to the sub-query are traversed first, and whether an aggregation index exists in each data table is judged. All found aggregation indices are returned as one set.
And traversing the aggregation index obtained in the last step. For each aggregate index in the set, its definition is compiled into a query tree.
For example, for the indices defined below:
CREATE INDEX idx1 ON b using AGGREGATE(select id,sum(score)from b group by id)
and newly analyzing SQL by using a select id, sum (score) from b group by id to form a new Query tree Query2, traversing the tree structure Query1 to find whether a subtree is identical to the Query2, and if so, changing the subtree into a node AggIndexScan of an aggregation index.
When the child node of Query1 is matched with Query2, a fuzzy matching algorithm is needed, the child node and Query2 are not required to be completely identical, and the condition of the child Query includes Query 2. For example, a SQL is (select id, sum (score) from b group id where = 2) and aggregation index (select id, sum (score) from b group id) definitions may also be matched. Another example is that SQL is (select id, sum (score) from b group by id where = 2) and aggregation index (select id, name, sum (score) from b group by id, name) definitions can also be matched. For these cases, the executor needs to first read the statistics in the aggregation index and then perform a simple quadratic operation on the statistics.
After the rule-based optimization is completed, the optimizer runs cost-based optimization to select a final execution plan. At this time, for the AggIndexScan node, the optimizer directly generates an execution plan of AggIndexScan.
And the executor stage, namely reading the index file into a shared memory when executing the AggIndexScan execution plan, searching the B-tree index according to the query condition, finding all index nodes with the full set of query conditions, and then sequentially returning the index values of the nodes and all corresponding statistical information. For example, each id meeting the filtering condition in the table b and its corresponding sum (score) value are retrieved and returned.
In the SQL query, although the partial query statement does not exactly match the definition in the aggregate index, it can be calculated twice in the information stored in the aggregate index, and still use the set index.
Still take the above aggregation index as an example:
example 1: the following queries: select id, sum (score) × 10as n from b group by id having n-s 10, simple secondary calculation can be performed on the query return value of the aggregation index and then the result is returned quickly.
Example 2: the following queries: and a select distinct id from b can directly return an id value by using an aggregation index, so that the time of the distinct deduplication operation is saved.
In the above process, when the database optimizer processes the aggregation query request, it determines that the aggregation operation is performed on the column on which the aggregation index is located, and if the aggregation index meets the condition of aggregation operation, an execution plan for querying using the aggregation index is generated. When the actuator executes the aggregation index, the index is inquired to obtain statistical information, and then secondary operation is performed by using the statistical information to return a final result.
And the aggregation index is used for inquiring, so that the data table is not required to be scanned and complicated aggregation operation is carried out, and the statistical information stored in the index is simply subjected to secondary operation and then returned. The execution time of SQL and the occupation of hardware resources are greatly reduced.
FIG. 6 is an example of data for executing a query using an aggregated index in a method for processing a database query according to one embodiment of the invention. In the data processing process diagram on the right side of fig. 6, the data accessed when the aggregation index is used is shown in a dotted line box 60, and the data accessed when the aggregation index is not used is shown in a dotted line box 61.
The creation statement of the aggregation index in fig. 6 is:
CREATE index test_idx on students(id)as(select id,sum(score)from students group by id);
after processing the query statement: when the aggregation index is not used, all data in the data table needs to be scanned, and operations such as grouping and aggregation operation need to be performed on each piece of data. And when all data scanning is finished, the result can be returned to the client. On a table of tens of millions of levels of data, this operation may take several minutes.
After the aggregation index is used, the executor may locate the first record "32 (1)" to be returned and return the result to the client, and then read the next record "33 (1)" to return the result in turn, where the execution time of this operation is several milliseconds.
For some SQL query statements, the execution time is longer and longer as the amount of data in the database increases or the distribution of the data is different. The original data table is not scanned any more by using the aggregated index query, so that the increase of the data amount is not greatly influenced by the execution time.
Using a query statement on the data instance of fig. 6: select count (partition id) from clients, a Select count partition operation is an example of a query whose execution time is affected by data distribution.
When the aggregation index is not used, each record needs to be read to remove the duplicate value, and it takes a long time to perform the duplicate removal by using a sorting method or a hash algorithm, and the execution time is higher and higher under the condition that the data repetition degree on the data column is low.
In contrast, after the aggregation index is adopted, because of the characteristic that the index value is not repeated, the result can be obtained only by traversing the leaf nodes of the index and then adding the number of all the leaf nodes.
FIG. 7 is a comparison of execution times before and after application of a method of processing a database query, in accordance with one embodiment of the present invention. In fig. 7, a curve L1 is an execution time curve in the case where the aggregation index is not used, and a curve L2 is an execution time curve in the case where the aggregation index is used.
Through the above description, it can be seen that the query execution time of the query statement can be greatly shortened, the occupation of hardware resources is reduced, and the execution efficiency is improved by using the aggregation index of the embodiment.
The embodiment also provides a machine-readable storage medium and a computer device. Fig. 8 is a schematic diagram of a machine-readable storage medium 70 according to one embodiment of the invention, and fig. 9 is a schematic diagram of a computer device 80 according to one embodiment of the invention.
The machine-readable storage medium 70 has stored thereon a machine-executable program 71, and when the machine-executable program 71 is executed by the processor 810, the method for processing the database query according to any one of the embodiments described above is implemented.
Computer device 80 may include memory 820, processor 810, and machine-executable program 71 stored on memory 820 and running on processor 810, and when processor 810 executes machine-executable program 71, the method of processing a database query of any of the embodiments described above is implemented.
It should be noted that the logic and/or steps shown in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any machine-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a machine-readable storage medium 70 can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the machine-readable storage medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the machine-readable storage medium 70 may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system.
The computer device 80 may be, for example, a server, a desktop computer, a notebook computer, a tablet computer, or a smartphone. In some examples, computer device 80 may be a cloud computing node. Computer device 80 may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer device 80 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
The computer device 80 may include a processor 810 adapted to execute stored instructions, a memory 820 that provides temporary storage for the operation of the instructions during operation. Processor 810 may be a single core processor, a multi-core processor, a computing cluster, or any number of other configurations. Memory 820 may include Random Access Memory (RAM), read only memory, flash memory, or any other suitable storage system.
The processor 810 may be connected via a system interconnect (e.g., PCI-Express, etc.) to an I/O interface (input/output interface) suitable for connecting the computer device 80 to one or more I/O devices (input/output devices). The I/O devices may include, for example, a keyboard and a pointing device, wherein the pointing device may include a touchpad or a touchscreen, among others. The I/O devices may be built-in components of the computing device 80 or may be devices that are externally connected to the computing device.
Processor 810 may also be linked through a system interconnect to a display interface suitable for connecting computer device 80 to a display device. The display device may include a display screen as a built-in component of the computer device 80. The display device may also include a computer monitor, television, or projector, etc. externally connected to the computer device 80. In addition, a Network Interface Controller (NIC) may be adapted to connect computer device 80 to a network via a system interconnect. In some embodiments, the NIC may use any suitable interface or protocol (such as an internet small computer system interface, etc.) to transfer data. The network may be a cellular network, a radio network, a Wide Area Network (WAN)), a Local Area Network (LAN), the internet, or the like. The remote device may be connected to the computing device through a network.
The flowcharts provided by this embodiment are not intended to indicate that the operations of the method are to be performed in any particular order, or that all the operations of the method are included in each case. Further, the method may include additional operations. Additional variations on the above-described method are possible within the scope of the technical idea provided by the method of the present embodiment.
Thus, it should be appreciated by those skilled in the art that while a number of exemplary embodiments of the invention have been illustrated and described in detail herein, many other variations or modifications consistent with the principles of the invention may be directly determined or derived from the disclosure of the present invention without departing from the spirit and scope of the invention. Accordingly, the scope of the invention should be understood and interpreted to cover all such other variations or modifications.

Claims (10)

1. A method of processing a database query, comprising:
obtaining a query statement, and analyzing the query prediction to obtain a query tree of the query statement;
traversing the tables related to the query tree to obtain an aggregation index generated in advance on the related tables;
replacing a matching sub-tree on the query tree, which is matched with the query sub-tree, by using the query sub-tree of the aggregated index, so as to obtain an optimized query tree;
and generating an execution plan of the query statement according to the optimized query tree.
2. The method of processing a database query of claim 1, wherein the step of replacing a matching sub-tree on the query tree that matches the query sub-tree with the query sub-tree of the aggregated index comprises:
compiling all the aggregation indexes obtained by traversing into aggregation index subtrees respectively;
querying a matching subtree matched with the aggregation index subtree on the query tree;
replacing the matched sub-tree with the aggregate index sub-tree.
3. The method for processing the database query according to claim 2, wherein the step of querying the query tree for the matching subtree matching the aggregated indexing subtree comprises:
and traversing the query tree again, and searching the subtree which is the same as the query subtree and the subtree of which the sub-query contains the query condition of the aggregation index subtree as the matching subtree.
4. The method of processing a database query of claim 1, wherein generating an execution plan for the query statement from the optimized query tree comprises:
performing, by an optimizer of the database, a cost evaluation on the optimized query tree;
and selecting the execution plan from the optimized query tree based on the result of the cost evaluation.
5. The method of processing a database query according to claim 1, wherein the step of generating an execution plan of the query statement according to the optimized query tree further comprises, after the step of:
and executing the execution plan by an executor of the database, and inquiring the index key value and the aggregation data meeting the conditions from the corresponding aggregation index in the process of processing the aggregation index subtree in the execution process.
6. The method for processing the database query according to claim 5, wherein the step of querying the corresponding aggregated index to obtain the eligible index key values and the aggregated data thereof further comprises:
judging whether the aggregated data is directly the required result data;
if not, performing secondary calculation on the aggregated data to obtain the result data.
7. The method for processing a database query according to claim 1,
after traversing the tables involved in the query tree, if all of the involved tables do not have an aggregate index, an execution plan is formulated by an optimizer of the database directly from the query tree.
8. The method for processing the database query according to claim 1, further comprising:
acquiring data operation for triggering the update of the aggregation index of the database, and modifying the target data table of the database according to the data operation;
reading a statistical formula predefined by the aggregation index;
carrying out statistical calculation on the modified data by using the statistical formula to obtain statistical information;
updating aggregated data of the aggregation index using the statistical information.
9. A machine readable storage medium having stored thereon a machine executable program which when executed by a processor implements a method of processing a database query according to any of claims 1 to 8.
10. A computer device comprising a memory, a processor and a machine executable program stored on the memory and run on the processor, and the processor when executing the machine executable program implements a method of processing a database query according to any one of claims 1 to 8.
CN202211037293.1A 2022-08-26 2022-08-26 Database query processing method, storage medium and computer equipment Pending CN115391424A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211037293.1A CN115391424A (en) 2022-08-26 2022-08-26 Database query processing method, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211037293.1A CN115391424A (en) 2022-08-26 2022-08-26 Database query processing method, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN115391424A true CN115391424A (en) 2022-11-25

Family

ID=84122775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211037293.1A Pending CN115391424A (en) 2022-08-26 2022-08-26 Database query processing method, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN115391424A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117390064A (en) * 2023-12-12 2024-01-12 天津南大通用数据技术股份有限公司 Database query optimization method based on embeddable subgraph

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117390064A (en) * 2023-12-12 2024-01-12 天津南大通用数据技术股份有限公司 Database query optimization method based on embeddable subgraph
CN117390064B (en) * 2023-12-12 2024-03-19 天津南大通用数据技术股份有限公司 Database query optimization method based on embeddable subgraph

Similar Documents

Publication Publication Date Title
US10606834B2 (en) Methods and apparatus of shared expression evaluation across RDBMS and storage layer
US8332389B2 (en) Join order for a database query
US6850925B2 (en) Query optimization by sub-plan memoization
US9390115B2 (en) Tables with unlimited number of sparse columns and techniques for an efficient implementation
US9053210B2 (en) Graph query processing using plurality of engines
US6374232B1 (en) Method and mechanism for retrieving values from a database
US8396852B2 (en) Evaluating execution plan changes after a wakeup threshold time
US9141666B2 (en) Incremental maintenance of range-partitioned statistics for query optimization
CA2388515C (en) System for managing rdbm fragmentations
EP2605158A1 (en) Mixed join of row and column database tables in native orientation
US8924373B2 (en) Query plans with parameter markers in place of object identifiers
US20060200438A1 (en) System and method for retrieving data from a relational database management system
US20120078904A1 (en) Approximate Index in Relational Databases
CN104765731A (en) Database query optimization method and equipment
CA2427228A1 (en) Information retrieval systems for optimization of queries having maximum or minimum function aggregation predicates
CN113688127A (en) Data compression technique
Srivastava et al. TBSAM: An access method for efficient processing of statistical queries
CN115391424A (en) Database query processing method, storage medium and computer equipment
CN115374121A (en) Database index generation method, machine-readable storage medium and computer equipment
CN115391346A (en) Database aggregation index generation method, storage medium and computer equipment
US8832157B1 (en) System, method, and computer-readable medium that facilitates efficient processing of distinct counts on several columns in a parallel processing system
Rupley Jr Introduction to query processing and optimization
CN112818010B (en) Database query method and device
US20170031909A1 (en) Locality-sensitive hashing for algebraic expressions
CN115391363A (en) Database index updating method, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination