CN106446134B - Local multi-query optimization method based on predicate specification and cost estimation

Info

Publication number: CN106446134B
Application number: CN201610833428.3A
Authority: CN
Inventors: 陈岭; 杨谊
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2016-09-20
Filing date: 2016-09-20
Publication date: 2019-07-09
Anticipated expiration: 2036-09-20
Also published as: CN106446134A

Abstract

The present invention discloses a local multi-query optimization method based on predicate specification (that is, predicate reduction) and cost estimation, belonging to the field of big data query optimization. The method proceeds as follows. First, each query in the query set is optimized separately using the existing optimizer of the data query system and represented as a query tree, yielding a set of optimized query trees. Then, using the local multi-query optimization method, identical or similar subtasks across queries are handled by equivalence or reduction over multiple iterations, generating a global multi-query plan tree. Finally, based on the generated global multi-query plan and the reduction relations between subtasks, the cost of reusing intermediate results is estimated according to a cost model to decide whether a subtask is executed directly or reuses an intermediate result, and the global multi-query plan is optimized accordingly. The invention takes into account the trade-off between intermediate-result reuse and query concurrency, reduces repeated operations, and effectively improves query performance.

Description

Local multi-query optimization method based on predicate specification and cost estimation

Technical field

The present invention relates to the field of big data query optimization, and in particular to a local multi-query optimization method based on predicate specification and cost estimation.

Background technique

Early research on query optimization and scheduling focused mainly on single queries. As concurrency has increased and data query systems have continued to improve, concurrent query processing has become an essential function of modern data query systems. When the concurrent queries are related (that is, they involve identical or similar operations), traditional single-query optimization methods do not consider the correlation between queries, which limits the improvement of system query performance.

In multi-query optimization, when multiple queries are input to the system at the same time, the batch of queries is analyzed and the parts that involve identical or similar operations are merged to generate a global multi-query plan. Executing the multi-query plan completes the multiple queries simultaneously and improves query efficiency.

Multi-query optimization methods fall into two classes. The first is global multi-query optimization, whose input is a set of unoptimized queries. Its advantage is that it generates a large number of candidate execution plans, so the output is often better; its disadvantages are that the optimization search cost is high and that, because the optimizer of the data query system cannot be used, implementation is more difficult. The second is local multi-query optimization, whose input is the set of queries output by the optimizer of the data query system. Its advantages are a smaller search space and easier implementation. However, local multi-query optimization methods often do not consider the cost of reusing intermediate results, so in actual execution the cost of reusing an intermediate result may exceed the cost of direct execution, which lowers system query performance instead of raising it.

When existing multi-query optimization methods use a cost function to estimate the cost of a query plan, they consider only I/O cost, that is, the number of disk pages involved in query processing, and ignore CPU computation cost and network transmission cost. However, when the underlying layer uses a distributed computing architecture (as most big data query systems do) and join operations are present, CPU computation and network transmission costs cannot be ignored. In that case the original cost model cannot accurately estimate the cost of a query.

Summary of the invention

In view of the above deficiencies of the prior art, the present invention provides a local multi-query optimization method based on predicate specification and cost estimation, which makes use of identical or similar intermediate results across queries and reduces repeated operations, thereby improving concurrent query performance and reducing query response time.

The local multi-query optimization method based on predicate specification and cost estimation is divided into three stages: preprocessing, local multi-query optimization processing, and multi-query plan optimization. The specific implementation steps are as follows:

(1) Preprocessing stage:

Step (1-1): optimize each query in the query set using the existing query optimizer of the data query system, find the optimal query plan for each query, and represent it as a query plan tree, finally obtaining a set of query plan trees;

Step (1-2): redefine the node numbers in the query plan trees;

Step (1-3): define the node mapping map<Node key, Node value> M and initialize M to empty;

Step (1-4): add the nodes of all query plan trees to the global multi-query plan tree, and add a "super root node" to the global multi-query plan tree;

In step (1-2), the node numbers in the query plan trees are rewritten so that they can serve as the serial numbers of the nodes in the global multi-query plan tree.

In step (1-3), the mapping map<Node key, Node value> M indicates that the key node will reuse the result obtained by the value node. Here map is the data type and M is the name of a variable of that type, used to store the mapping from node key to node value.

In step (1-4), the "super root node" points to the root node of each query in the set of query plan trees.
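The preprocessing steps (1-2) to (1-4) can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Node` class, the `preprocess` function, and the post-order numbering are assumptions made here, since the source does not specify a numbering order or a data layout.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Node:
    op: str                                   # hypothetical operation label, e.g. "scan" or "join"
    children: list = field(default_factory=list)
    number: int = -1                          # global serial number assigned in step (1-2)


def preprocess(query_roots):
    """Renumber all nodes, create an empty mapping M, and add a super root node."""
    serial = count(1)

    def renumber(node):
        # Post-order numbering is an arbitrary choice for this sketch.
        for child in node.children:
            renumber(child)
        node.number = next(serial)

    for root in query_roots:                  # step (1-2)
        renumber(root)

    M = {}                                    # step (1-3): key node -> value node whose result it reuses
    super_root = Node("super_root", children=list(query_roots), number=0)  # step (1-4)
    return super_root, M


q1 = Node("join", [Node("scan"), Node("scan")])
q2 = Node("scan")
super_root, M = preprocess([q1, q2])
```

Numbering the nodes across all trees gives every node in the global plan a unique serial number, which the later merge step relies on when it keeps the smallest-numbered node among equivalents.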

By adopting a local multi-query optimization method, in which the existing query optimizer of the data query system finds an optimized query plan for each query statement, the present invention reduces the search space of the algorithm and shortens the execution time of query optimization.

(2) Local multi-query optimization processing stage:

Step (2-1): traverse the node set V = {v₁, v₂, ..., v_n} of the global multi-query plan tree. If the set V contains nodes equivalent to the current node, select the node v_j^* with the smallest number among the equivalent nodes in V and perform a merge operation, replacing every other equivalent node v_i ∈ V - {v_j^*} with v_j^*;

Step (2-2): for each node v_i, find the subtask task_j^* that satisfies the "strongest reduction condition", add the mapping from v_j^* to v_i in M, connect the two nodes with a directed edge v_j^* → v_i, and change the operation description of v_j^* so that the result of the subtask task_i corresponding to node v_i is used as the input of subtask task_j^*;

Step (2-3): repeat step (2-1) and step (2-2) to perform equivalence replacement and reduction on the nodes of the global multi-query plan tree until the tree cannot be simplified further;

In step (2-1), the process of judging whether two nodes are equivalent mainly includes:

(a) Check whether the two nodes have the same type and operate on the same tables and columns. If not, return false, indicating that the two nodes are not equivalent;

(b) Judge equivalence according to the node type. If the nodes are disk scan nodes, use the predicates in the nodes to determine whether both select the same range of data on the same column. If so, return true, indicating that the nodes are equivalent; otherwise return false, indicating that they are not;

(c) If the nodes are join nodes, first check whether the join conditions are identical. If they are, recursively judge whether the left and right children are equivalent. Because a join may appear in two tree shapes, that is, with the left and right child nodes exchanged, both shapes must be checked. If either shape satisfies the equivalence relation, return true; otherwise return false;

(d) If the nodes are of another type, including aggregation nodes and data transmission nodes, recursively judge whether their child nodes are equivalent.
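The equivalence checks (a) to (d) can be sketched as a recursive function. This is an illustrative sketch under stated assumptions: the `Node` fields are hypothetical, and comparing normalized predicate strings stands in for the "same range on the same column" test, which the source does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                  # "scan", "join", "aggregate", "exchange"
    table_cols: frozenset = frozenset()        # tables and columns the node operates on
    predicate: str = ""                        # normalized scan predicate or join condition
    children: list = field(default_factory=list)


def equivalent(a, b):
    # (a) same node type, same tables and columns
    if a.kind != b.kind or a.table_cols != b.table_cols:
        return False
    # (b) scan nodes: same selected range (simplified to equal normalized predicates)
    if a.kind == "scan":
        return a.predicate == b.predicate
    # (c) join nodes: same join condition, children equivalent in either order
    if a.kind == "join":
        if a.predicate != b.predicate:
            return False
        if len(a.children) != 2 or len(b.children) != 2:
            return False
        (al, ar), (bl, br) = a.children, b.children
        return (equivalent(al, bl) and equivalent(ar, br)) or (
            equivalent(al, br) and equivalent(ar, bl))
    # (d) other nodes: recurse into the children pairwise
    return len(a.children) == len(b.children) and all(
        equivalent(x, y) for x, y in zip(a.children, b.children))


cols = frozenset({"t.a", "u.b"})
s1 = Node("scan", frozenset({"t.a"}), "t.a > 3")
s2 = Node("scan", frozenset({"u.b"}), "u.b < 5")
j1 = Node("join", cols, "t.a = u.b", [s1, s2])
j2 = Node("join", cols, "t.a = u.b", [s2, s1])   # left and right children exchanged
j3 = Node("join", cols, "t.a = u.b", [s1, s1])
```

Here `j1` and `j2` differ only in child order and are judged equivalent, while `j3` is not equivalent to `j1` because one of its inputs differs.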

In step (2-2), v_j^* → v_i indicates the direction of data flow.

In step (2-2), for each subtask task_i, if the execution result of task_i contains the results of other subtasks, those subtasks are considered reducible to task_i (the relation rendered as "specification" in the title). Among all subtasks that can be reduced to task_i, the one whose result is most similar to that of task_i is called the subtask task_j^* satisfying the "strongest reduction condition".

In step (2-2), the result is the intermediate data obtained after a task finishes executing.

In step (2-3), the process of judging whether two nodes in the global multi-query plan tree satisfy the reduction relation mainly includes:

First, check whether both nodes are disk scan nodes. If not, return false, indicating that the two nodes do not satisfy the reduction relation;

Second, check whether the two nodes operate on the same table and column. If not, return false, indicating that the two nodes do not satisfy the reduction relation;

Finally, judge the reduction relation between the two nodes according to the reduction relation table. Return true if the relation holds; otherwise return false. The reduction relation table is shown in Table 1.

Table 1: Reduction relation dependency table

(1) (a1 => b1 and a2 => b2) or (a1 => b2 and a2 => b1) or (a => b1) or (a => b2)

In the table above, m and n denote integers or floating-point numbers. When m and n in predicates a and b satisfy the size relation given in the table, a can be reduced to b. When b consists of two predicates b1 and b2 joined by "and", the reduction relation between a and each of b1 and b2 must also be judged. For example, if b consists of b1: col1 < 8 and b2: col1 > 3, and a is col1 > 2, then b2 can be reduced to a, and therefore b can be reduced to a.
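The col1 example can be checked with a small interval-containment sketch. The full relation table is not reproduced in this text, so the code below is an assumption: it handles only strict "<" and ">" comparisons on one column, and picking the narrowest containing interval is one possible reading of "most similar result". The function names are hypothetical.

```python
import math


def interval(conjuncts):
    """Intersect strict comparisons such as (">", 3) and ("<", 8) into an open interval (lo, hi)."""
    lo, hi = -math.inf, math.inf
    for op, value in conjuncts:
        if op == ">":
            lo = max(lo, value)
        elif op == "<":
            hi = min(hi, value)
        else:
            raise ValueError(f"unsupported operator: {op}")
    return lo, hi


def can_reduce_to(b, a):
    """True if every row selected by b is also selected by a, so b can be computed from a's result."""
    b_lo, b_hi = interval(b)
    a_lo, a_hi = interval(a)
    return a_lo <= b_lo and b_hi <= a_hi


def strongest_reduction(b, candidates):
    """Among candidates that b can be reduced to, return the one with the narrowest interval."""
    usable = [a for a in candidates if can_reduce_to(b, a)]

    def width(a):
        lo, hi = interval(a)
        return hi - lo

    return min(usable, key=width, default=None)


b = [("<", 8), (">", 3)]          # b1: col1 < 8 and b2: col1 > 3
a = [(">", 2)]                    # a: col1 > 2
tighter = [(">", 0), ("<", 10)]   # another hypothetical scan on the same column
```

With these inputs, `can_reduce_to(b, a)` holds because the interval (3, 8) lies inside (2, +inf), matching the example in the text, and `strongest_reduction(b, [a, tighter])` prefers `tighter` because its result is closer in size to that of b.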

In step (2-3), because step (2-1) and step (2-2) change the structure of the global multi-query plan tree, the execution cost of the tasks corresponding to its nodes also changes. Step (2-1) and step (2-2) are therefore repeated to perform equivalence replacement and reduction on the nodes of the global multi-query plan tree until the tree cannot be simplified further.
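The iterate-until-stable behaviour of step (2-3) can be sketched for the merge part alone. This is a simplified illustration: nodes are a hypothetical dictionary of `number -> (description, child numbers)`, two nodes count as equivalent when their descriptions and (already merged) children are identical, and the reduction of step (2-2) is left out.

```python
def merge_equivalent(nodes):
    """Merge equivalent nodes into the smallest-numbered one until nothing changes."""
    nodes = dict(nodes)
    replacement = {}                       # removed node number -> surviving node number
    changed = True
    while changed:                         # step (2-3): repeat until no further simplification
        changed = False
        first_seen = {}                    # signature -> smallest node number seen with it
        for number in sorted(nodes):
            desc, children = nodes[number]
            # Rewrite child references so they point at surviving nodes.
            signature = (desc, tuple(replacement.get(c, c) for c in children))
            nodes[number] = signature
            if signature in first_seen:    # step (2-1): keep the smallest-numbered node
                replacement[number] = first_seen[signature]
                changed = True
            else:
                first_seen[signature] = number
        for number in replacement:
            nodes.pop(number, None)
    return nodes, replacement


plan = {
    0: ("super_root", (3, 4)),
    1: ("scan t where a > 3", ()),
    2: ("scan t where a > 3", ()),
    3: ("aggregate sum(a)", (1,)),
    4: ("aggregate sum(a)", (2,)),
}
merged, replaced = merge_equivalent(plan)
```

In this example, merging the two identical scans makes the two aggregation nodes identical as well, which is why a single pass is not enough in general and the steps are repeated.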

(3) Multi-query plan optimization stage:

Step (3-1): obtain the mapping and the global multi-query plan tree produced by the local multi-query optimization processing stage;

Step (3-2): traverse the nodes of the global multi-query plan tree according to the mapping map<Node key, Node value> M; if the traversal is complete, go to step (3-8);

Step (3-3): estimate the direct cost and the reuse cost according to the cost model corresponding to each task;

Step (3-4): compare the reuse cost with the direct cost. If the reuse cost is greater than the direct cost, go to step (3-5); if the reuse cost is less than the direct cost, go to step (3-6);

Step (3-5): map the node to itself in the mapping, so that the task corresponding to the node is executed directly without using an intermediate result, then go to step (3-7);

Step (3-6): repeat step (3-3) and step (3-4) to determine whether there is a node with a lower reuse cost; if there is, update the mapping so that the node is mapped to the node with the lower reuse cost;

Step (3-7): repeat step (3-2);

Step (3-8): return the global multi-query plan tree.
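Steps (3-2) to (3-8) amount to a loop over the mapping M that keeps, for each node, the cheaper of direct execution and reuse. The sketch below assumes the costs have already been estimated and are passed in as plain dictionaries; the function name, the argument layout, and the choice to execute directly when the two costs are equal are assumptions, since the source only covers the "greater" and "less" cases.

```python
def optimize_mapping(M, direct_cost, reuse_cost, candidates):
    """
    M:           {node: node whose result it currently reuses}
    direct_cost: {node: cost of executing the node's task directly}
    reuse_cost:  {(node, source): cost of computing node from source's result}
    candidates:  {node: other nodes whose result could be reused instead}
    """
    result = dict(M)
    for node, source in M.items():                    # step (3-2)
        direct = direct_cost[node]                    # step (3-3)
        reuse = reuse_cost[(node, source)]
        if reuse >= direct:                           # step (3-4) -> (3-5); tie handling is assumed
            result[node] = node                       # map the node to itself: execute directly
            continue
        best, best_cost = source, reuse               # step (3-6): look for a cheaper source
        for other in candidates.get(node, []):
            cost = reuse_cost[(node, other)]
            if cost < best_cost:
                best, best_cost = other, cost
        result[node] = best
    return result                                     # step (3-8)


M = {"n3": "n1", "n4": "n2"}
direct = {"n3": 10.0, "n4": 5.0}
reuse = {("n3", "n1"): 4.0, ("n3", "n2"): 3.0, ("n4", "n2"): 7.0}
optimized = optimize_mapping(M, direct, reuse, {"n3": ["n2"]})
```

In the example, `n3` switches to reusing `n2` because that is cheaper than both direct execution and its original source, while `n4` falls back to direct execution.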

In step (3-3), if a node is mapped to another node in the mapping, the execution result of the task corresponding to that node depends on the execution result of the task corresponding to the other node (the depended-on node).

In step (3-3), because a query consists of multiple subtasks, the direct cost, the reuse cost, and the network transmission cost of intermediate results are estimated according to the cost model corresponding to each task.

In step (3-3), the direct cost is the CPU computation cost of directly executing the task corresponding to a node in the global multi-query plan tree. The reuse cost is the CPU computation cost of the task corresponding to the dependent node plus the cost of transmitting over the network, and computing on, the result produced by the task corresponding to the depended-on node.

In step (3-3), a query plan tree consists of multiple plan nodes, each corresponding to one task in the query. Cost models are built mainly for disk scan tasks, join tasks, and network transmission tasks, and the cost of each task is estimated according to the node type. The cost models and cost estimation methods are as follows:

(a) Disk scan node

A disk scan node represents the process of reading data from disk into memory and filtering out the data that satisfies the predicate. The cost of reading data consists of the average seek time of the disk arm t_seek, the average rotational latency of the head t_latency, and the data read time t_read. The cost of predicate filtering depends on the specific predicate and the data distribution;

When a query plan tree node is a disk scan node, the disk scan cost model is used to estimate the cost of the disk scan task.
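One way to combine the disk scan cost terms is sketched below. The source names the components t_seek, t_latency, and t_read but gives no formula in this text, so the additive form, the single seek for a sequential scan, and the per-row filtering term are all assumptions, as are the function and parameter names.

```python
def disk_scan_cost(pages, t_seek, t_latency, t_read_per_page, rows, t_predicate_per_row):
    """Estimated scan time in seconds: positioning, sequential read, then predicate filtering."""
    read = t_seek + t_latency + pages * t_read_per_page   # assumed: one seek, sequential pages
    filtering = rows * t_predicate_per_row                # assumed: constant cost per evaluated row
    return read + filtering


# Hypothetical figures: 100 pages, 4 ms seek, 2 ms rotational latency,
# 0.1 ms per page, 10,000 rows at 0.1 microseconds per predicate evaluation.
cost = disk_scan_cost(100, 0.004, 0.002, 0.0001, 10_000, 1e-7)
```

A real model would let the filtering term vary with the predicate and the data distribution, as the text notes.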

(b) Join node

The time of a join operation consists of the time to compute a tuple's hash value t_hashTuple, the time to build a hash table in memory for the right-table data t_build, the time to insert a tuple participating in the join into the hash table t_insertTuple, and the time for left-table and right-table tuples to complete the join t_joinTuple. Each of these times is determined mainly by the CPU performance of the machine and the sizes of the left and right tables;

When a query plan tree node is a join node, the join cost model is used to estimate the cost of the join task.
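The four join time components can be combined as in the following hash-join sketch. The way the terms are weighted by table size is an assumption made for illustration (the source lists the components but not the formula): the right table is hashed and inserted during the build phase, and each left-table tuple is hashed and joined during the probe phase.

```python
def hash_join_cost(left_rows, right_rows, t_hash_tuple, t_build, t_insert_tuple, t_join_tuple):
    """Estimated hash-join time in seconds under a simple build-then-probe model."""
    # Build phase: fixed table setup plus hashing and inserting every right-table tuple.
    build = t_build + right_rows * (t_hash_tuple + t_insert_tuple)
    # Probe phase: hash every left-table tuple and complete the join for it.
    probe = left_rows * (t_hash_tuple + t_join_tuple)
    return build + probe


# Hypothetical per-tuple timings for a 1,000-row left table and a 100-row right table.
join_cost = hash_join_cost(1000, 100, 1e-7, 1e-3, 2e-7, 3e-7)
```

The per-tuple constants would be calibrated to the CPU of the target machine, which is the dependence on CPU performance that the text describes.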

(c) Data transmission node

A data transmission node is mainly responsible for receiving and aggregating transmitted data. Data transmission tasks can execute in parallel on different hosts, so the completion time t_exchange of the operation corresponding to this node depends on the moment at which the last data transmission task completes. The time cost of transmission is determined mainly by the number of bytes transferred, TransferByte, and the network bandwidth Net_band of the current cluster;

When a query plan tree node is a data transmission node, the data transmission cost model is used to estimate the cost of the data transmission task.
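Because the transfers run in parallel, the completion time is governed by the slowest host. A minimal sketch, assuming each host's transfer time is simply its byte count divided by a shared bandwidth figure (the function name and the uniform-bandwidth assumption are introduced here, not taken from the source):

```python
def exchange_cost(transfer_bytes_per_host, net_band):
    """Completion time t_exchange in seconds: the slowest of the parallel transfers."""
    if net_band <= 0:
        raise ValueError("network bandwidth must be positive")
    return max((nbytes / net_band for nbytes in transfer_bytes_per_host), default=0.0)


# Three hosts sending 1 MB, 4 MB and 2 MB over a hypothetical 100 MB/s link each.
t_exchange = exchange_cost([1e6, 4e6, 2e6], 1e8)
```

Taking the maximum rather than the sum reflects the statement that the node finishes when the last transfer task finishes.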

The present invention builds cost models for the query processing process and defines a query processing cost for each type of operation, which improves the accuracy of query cost estimation and helps the algorithm select an efficient multi-query plan.

In step (3-6), a reuse cost lower than the cost of directly executing the node's task indicates that the intermediate result of the depended-on task can be used. Step (3-3) and step (3-4) are then repeated to determine whether there is a node with a lower reuse cost; if there is, the mapping is updated so that the node is mapped to the node with the lower reuse cost.

The present invention uses the existing query optimizer of the data query system to find an optimized query plan for each query statement, performs equivalence replacement or reduction on identical or similar parts of the query plans over multiple iterations to generate a global multi-query plan, and, by estimating the reuse cost, decides whether a task is executed directly or reuses an intermediate result, thereby optimizing the global multi-query plan and reducing query response time. Compared with traditional multi-query optimization methods, the advantages of the present invention include:

(1) A local multi-query optimization method is adopted, in which the existing query optimizer of the data query system finds an optimized query plan for each query statement, which reduces the search space of the algorithm and shortens the execution time of query optimization;

(2) Cost models are built for the query processing process and a query processing cost is defined for each type of operation, which improves the accuracy of query cost estimation and helps the algorithm select an efficient multi-query plan;

(3) The generated global multi-query plan is optimized with full consideration of the cost incurred by reusing intermediate results, which prevents subsequent tasks from waiting too long, preserves system concurrency, and improves query execution efficiency.

Description of the drawings

Fig. 1: flow chart of the local multi-query optimization method based on predicate specification and cost estimation;

Fig. 2: schematic diagram of a query plan tree.

Specific embodiment

To describe the present invention more specifically, the technical solution of the invention is described in detail below with reference to the accompanying drawings and a specific embodiment.

As shown in Fig. 1, the local multi-query optimization method based on predicate specification and cost estimation is divided into three stages: preprocessing, local multi-query optimization processing, and multi-query plan optimization.

(1) The main steps of the preprocessing stage include:

Step (1-1): optimize each query in the input query set using the existing query optimizer of the data query system, find the optimal query plan for each query, and represent it as a query plan tree, finally obtaining a set of query plan trees;

A query plan tree is denoted T(V, E, D). V is the set of all nodes in the query plan tree; each subtask corresponds to one query plan tree node, and a node contains some operation information (such as the tables and columns involved in the operation). E is the set of all edges in the query plan tree, and D is the description of the concrete operation of each query node (including the predicates involved in the operation). The leaf nodes of a query plan are disk scan nodes, responsible for reading data, and non-leaf nodes represent different algebraic operations. A non-leaf node uses the data from its child nodes, and nodes are connected by an edge. A query plan tree is shown in Fig. 2;

Step (1-2): redefine the node numbers in each query so that they can serve as the serial numbers of the nodes in the global multi-query plan tree;

Step (1-3): define the node mapping map<Node key, Node value> M and initialize M to empty. The mapping indicates that the key node will reuse the result obtained by the value node. Then add all nodes to the global multi-query plan tree and add a "super root node" to it; this node points to the root node of each query in the set of query plan trees.
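The query plan tree T(V, E, D) described in this stage can be held in a small container like the one below. The class and method names are hypothetical, and the edge direction (child to parent, following the flow of data) is a representation choice made for this sketch rather than something the source prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class QueryPlanTree:
    V: dict = field(default_factory=dict)   # node id -> operation info (type, tables, columns)
    E: list = field(default_factory=list)   # (child id, parent id): data flows child -> parent
    D: dict = field(default_factory=dict)   # node id -> operation description, e.g. predicates

    def add_node(self, node_id, info, description, parent=None):
        self.V[node_id] = info
        self.D[node_id] = description
        if parent is not None:
            self.E.append((node_id, parent))

    def leaves(self):
        """Nodes that are nobody's parent; in a query plan these are the disk scan nodes."""
        parents = {p for _, p in self.E}
        return [n for n in self.V if n not in parents]


tree = QueryPlanTree()
tree.add_node(1, {"type": "join", "tables": ["t", "u"]}, "t.a = u.b")
tree.add_node(2, {"type": "scan", "tables": ["t"]}, "t.a > 3", parent=1)
tree.add_node(3, {"type": "scan", "tables": ["u"]}, "u.b < 5", parent=1)
```

Keeping D separate from V mirrors the text: V carries what a node operates on, while D carries the predicates that the equivalence and reduction checks compare.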

(2) The main steps of the local multi-query optimization processing stage include:

Step (2-1): traverse the set of query nodes V = {v₁, v₂, ..., v_n}. If the set V contains nodes equivalent to the current node, select the node v_j^* with the smallest number among the equivalent nodes in V and perform a merge operation, that is, replace every other equivalent node v_i ∈ V - {v_j^*} with v_j^*;

The process of judging whether two nodes are equivalent follows checks (a) to (d) described for step (2-1) in the summary above.

For each subtask task_i, if the result of task_i contains the results of other subtasks, those subtasks are considered reducible to task_i. Among all subtasks that can be reduced to task_i, the one whose result is most similar to that of task_i is called the subtask task_j^* satisfying the "strongest reduction condition".

The process of judging whether two nodes satisfy the reduction relation mainly includes:

Finally, judge the reduction relation between the two nodes according to the reduction relation table. Return true if the relation holds; otherwise return false. The reduction relation table is shown in Table 1.

Step (2-3): because step (2-1) and step (2-2) change the structure of the global multi-query plan tree, the execution cost of the tasks corresponding to its nodes also changes. Repeat step (2-1) and step (2-2) to perform equivalence replacement and reduction on the nodes of the plan tree until the global multi-query plan tree cannot be simplified further.

Table 1: Reduction relation dependency table

(1) (a1 => b1 and a2 => b2) or (a1 => b2 and a2 => b1) or (a => b1) or (a => b2)

(3) The multi-query plan optimization stage mainly comprises the following steps:

Step (3-3): because a query consists of multiple subtasks, estimate the direct cost, the reuse cost, and the network transmission cost of intermediate results according to the cost model corresponding to each task;

The cost estimation methods are as follows:

(a) Disk scan node

(b) Join node

(c) Data transmission node

A data transmission node is mainly responsible for receiving and aggregating transmitted data. Data transmission tasks can execute in parallel on different hosts, so the completion time t_exchange of the operation corresponding to this node depends on the moment at which the last data transmission task completes. The time cost of transmission is determined mainly by the number of bytes transferred, TransferByte, and the network bandwidth Net_band of the current cluster.

Step (3-7): repeat step (3-2);

Step (3-8): return the global multi-query plan tree.

The specific embodiment described above explains the technical solution and beneficial effects of the present invention in detail. It should be understood that the foregoing is only the most preferred embodiment of the present invention and is not intended to limit the invention; any modification, supplement, or equivalent replacement made within the scope of the principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A local multi-query optimization method based on predicate specification and cost estimation, characterized in that the method is divided into three stages, namely preprocessing, local multi-query optimization processing, and multi-query plan optimization, with the following specific steps:

(1) Preprocessing stage:

Step (1-1): optimize each query in the query set using the existing query optimizer of the data query system, find the optimal query plan for each query, and represent it as a query plan tree, obtaining a set of query plan trees;

Step (1-2): redefine the node numbers in the query plan trees;

Step (1-4): add the nodes of all query plan trees to a global multi-query plan tree, and add a "super root node" to the global multi-query plan tree, wherein the "super root node" points to the root node of each query in the set of query plan trees;

(2) Local multi-query optimization processing stage:

Step (2-1): traverse the node set V = {v₁, v₂, ..., v_n} of the global multi-query plan tree; if the set V contains nodes equivalent to the current node, select the node v_j^* with the smallest number among the equivalent nodes in V and perform a merge operation, replacing every other equivalent node v_i ∈ V - {v_j^*} with v_j^*;

Step (2-2): for each node v_i, find the subtask task_j^* that satisfies the "strongest reduction condition", add the mapping from v_j^* to v_i in M, connect the two nodes with a directed edge v_j^* → v_i, and change the operation description of v_j^* so that the result of the subtask task_i corresponding to node v_i is used as the input of subtask task_j^*, wherein, for each subtask task_i, if the execution result of task_i contains the results of other subtasks, those subtasks are considered reducible to task_i, and among all subtasks that can be reduced to task_i, the one whose result is most similar to that of task_i is called the subtask task_j^* satisfying the "strongest reduction condition";

Step (2-3): repeat step (2-1) and step (2-2) to perform equivalence replacement and reduction replacement on the nodes of the global multi-query plan tree until the tree cannot be simplified further;

(3) Multi-query plan optimization stage:

Step (3-2): traverse the nodes of the global multi-query plan tree according to the mapping map<Node key, Node value> M; if the traversal is complete, go to step (3-8);

Step (3-3): estimate the direct cost and the reuse cost according to the cost model corresponding to each task, wherein the direct cost is the CPU computation cost of directly executing the task corresponding to a node in the global multi-query plan tree, and the reuse cost is the CPU computation cost of the task corresponding to the dependent node plus the cost of transmitting over the network, and computing on, the result produced by the task corresponding to the depended-on node;

Step (3-4): compare the reuse cost with the direct cost; if the reuse cost is greater than the direct cost, go to step (3-5); if the reuse cost is less than the direct cost, go to step (3-6);

Step (3-5): map the node to itself in the mapping, so that the task corresponding to the node is executed directly without using an intermediate result, then go to step (3-7);

Step (3-6): repeat step (3-3) and step (3-4) to determine whether there is a node with a lower reuse cost; if there is, update the mapping so that the node is mapped to the node with the lower reuse cost;

Step (3-7): repeat step (3-2);

Step (3-8): return the global multi-query plan tree.

2. The local multi-query optimization method based on predicate specification and cost estimation according to claim 1, characterized in that, in step (2-1), the process of judging whether two nodes are equivalent mainly includes:

Step (a): check whether the two nodes have the same type and operate on the same tables and columns; if not, return false;

Step (b): judge equivalence according to the node type: if the nodes are disk scan nodes, use the predicates in the nodes to determine whether both select the same range of data on the same column; if so, return true, otherwise return false;

Step (c): if the nodes are join nodes, check whether the join conditions are identical; if they are, recursively judge whether the left and right children are equivalent for each of the two tree shapes; if either tree shape satisfies the equivalence relation, return true, otherwise return false;

Step (d): if the nodes are of another type, including aggregation nodes and data transmission nodes, recursively judge whether their child nodes are equivalent.

3. The local multi-query optimization method based on predicate specification and cost estimation according to claim 1, characterized in that, before the reduction replacement in step (2-3) is performed, whether the nodes in the global multi-query plan tree satisfy the reduction relation is judged, the process mainly including:

First, check whether both nodes are disk scan nodes; if not, return false;

Second, check whether the two nodes operate on the same table and column; if not, return false;

Finally, judge the reduction relation between the two nodes according to the reduction relation table; return true if the relation holds, otherwise return false.

4. The local multi-query optimization method based on predicate specification and cost estimation according to claim 1, characterized in that, in step (3-3), the cost estimation methods include a disk scan node method, a join node method, and a data transmission node method.