CN111881160A - Distributed query optimization method based on equivalent expansion method of relational algebra

Info

Publication number: CN111881160A
Application number: CN201910575857.9A
Authority: CN
Inventors: 秦小麟; 刘亮; 徐兴业
Original assignee: CHINA REALTIME DATABASE CO LTD; Nanjing University of Aeronautics and Astronautics; State Grid Corp of China SGCC; State Grid Jiangsu Electric Power Co Ltd; NARI Group Corp
Current assignee: CHINA REALTIME DATABASE CO LTD; Nanjing University of Aeronautics and Astronautics; State Grid Corp of China SGCC; State Grid Jiangsu Electric Power Co Ltd; NARI Group Corp
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2020-11-03

Abstract

The invention discloses a distributed query optimization method based on an equivalent expansion method of relational algebra. The method comprises the steps of sending a request, converting the query language, constructing a query tree, optimizing fragments, decomposing the query tree, judging whether a join operation exists, calculating the semi-join benefit, and judging whether traversal is complete. The method is scientifically and reasonably structured and is safe and convenient to use. Exploiting the equivalence between SQL and relational algebra, it converts a conventional SQL query into a relational algebra expression, which indirectly improves the efficiency of distributed queries. Decomposing the query tree while it is constructed better matches the discrete data distribution of a distributed system and reduces data coupling between queries, which correspondingly reduces communication cost. The join process is optimized with MapReduce, which greatly reduces the communication cost among sites during joins, shortens communication time, and thereby improves query efficiency.

Description

Distributed query optimization method based on equivalent expansion method of relational algebra

Technical Field

The invention relates to the technical field of databases, in particular to a distributed query optimization method based on an equivalent expansion method of relational algebra.

Background

With the advent of the big data era, the storage and querying of large data volumes have become research hotspots. Distributed database systems, which differ from centralized databases and are the product of combining database management techniques with network techniques, have emerged in response. In practice, a company may be geographically dispersed, with subsidiaries in different regions that each handle their own business, yet the branches still need to exchange and share their resources; this reflects both unity and autonomy. Following this idea, a distributed database system stores data on different nodes and supports interaction and access over a network.

In a distributed database system, query computation overhead is one of the main factors limiting system performance. This overhead comprises the local cost incurred by the CPU, I/O, and the like, and the communication cost of exchanging data among sites. Because different distributed systems adopt different data storage and query strategies, their distributed query overhead differs. Choosing a reasonable data distribution and query optimization method, so as to obtain the highest query efficiency at the lowest cost, is therefore extremely important.

Traditional distributed query optimization is based on SQL and approaches optimization from the perspective of the global query issued by the user. Because the data distribution at every site involved in the query must be considered, an optimization scheme can be determined only after judging from multiple angles. In addition, SQL is closer to natural language, so decomposing a query into sub-queries is cumbersome. Multi-table Join queries occur frequently in practice, and optimization algorithms based on direct Join tend to transmit excessive data and thus incur excessive communication cost. A distributed query optimization method based on an equivalent expansion method of relational algebra is therefore needed to solve these problems.

Disclosure of Invention

The invention provides a distributed query optimization method based on an equivalent expansion method of relational algebra, which can effectively solve the following problems of traditional distributed query optimization: it is based on SQL and approaches optimization from the perspective of the global query issued by the user; because the data distribution at every site involved in the query must be considered, an optimization scheme can be determined only after judging from multiple angles; SQL is closer to natural language, so decomposing a query into sub-queries is cumbersome; and multi-table Join queries occur frequently in practice, where optimization algorithms based on direct Join tend to transmit excessive data and incur excessive communication cost.

In order to achieve the purpose, the invention provides the following technical scheme: a distributed query optimization method based on an equivalent expansion method of relational algebra comprises the following steps:

S1, sending a request: the querying user submits a query request to the distributed database system, and the result of the request is the set of tuples satisfying the query;

S2, language conversion: the system converts the user's query request into an equivalent relational algebra expression and represents the query with that expression;

S3, constructing a query tree: the relational algebra expression is optimized according to the equivalence transformation rules and converted into the corresponding query tree;

S4, fragment optimization: each leaf node is optimized according to its fragmentation relation;

S5, query tree decomposition: according to the fragmentation, the query tree is decomposed into sub-query trees corresponding to the fragments and a parent query tree corresponding to the sub-query results;

S6, judging whether a join operation exists: all query trees are traversed and each query is checked for a join (connection) operation; if one is present, step S7 is performed; otherwise, the query result is output and the traversal of step S6 continues;

S7, benefit calculation: the semi-join benefit is calculated, the join scheme with the minimum cost for the query is chosen, and that scheme is executed to output the query result;

S8, judging whether traversal is complete: if the traversal is complete, the process terminates; otherwise, the process returns to step S6.

Preferably, in step S1, the querying user submits a query request to the distributed database system, and the result of the request is the set of tuples satisfying the query, where the tuple set is obtained by a multi-table Join operation performed with MapReduce, as follows:

S11, reading the rows of a table into memory through the map function, using the attribute on which the Join is performed as the key, computing the intermediate key-value pairs, and storing the result locally;

S12, performing the same operation on the data in the other tables;

S13, receiving, through the reduce function, the tuples with equal key values from the multiple tables, recombining the tuples that satisfy the matching condition into new tuples, constructing the result relation, and finally writing the Join result to the target database file system.
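Steps S11 to S13 can be illustrated with a minimal in-memory sketch of a reduce-side join. It is not the patented implementation and does not use a real MapReduce framework; the function names `map_phase` and `reduce_phase`, the tables `R` and `S`, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def map_phase(table_name, rows, join_attr):
    """S11/S12: emit (join key, (table name, row)) pairs for one table."""
    return [(row[join_attr], (table_name, row)) for row in rows]

def reduce_phase(pairs, left_name, right_name):
    """S13: group tagged rows by key and combine rows with equal keys."""
    groups = defaultdict(lambda: {left_name: [], right_name: []})
    for key, (table_name, row) in pairs:
        groups[key][table_name].append(row)
    joined = []
    for key in sorted(groups):
        for left in groups[key][left_name]:
            for right in groups[key][right_name]:
                # Recombine two matching tuples into one new tuple.
                joined.append({**left, **right})
    return joined

# Hypothetical sample tables sharing the join attribute "id".
R = [{"id": 1, "name": "pump"}, {"id": 2, "name": "valve"}]
S = [{"id": 1, "site": "site_a"}, {"id": 3, "site": "site_b"}]

pairs = map_phase("R", R, "id") + map_phase("S", S, "id")
result = reduce_phase(pairs, "R", "S")
print(result)
```

Only the key `1` appears in both tables, so only that pair of rows is combined. In a real cluster the grouping by key would be done by the shuffle between the map and reduce stages rather than by a dictionary.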

Preferably, in step S2, the system converts the user's query request into an equivalent relational algebra expression and represents the query with that expression, that is, the SQL statement is replaced by an equivalent relational algebra expression.

Preferably, in step S3, the relational algebra expression is optimized according to the equivalence transformation rules and converted into the corresponding query tree. The query tree contains selection, projection, join (connection), and union (merge) operations; all selection and projection operations are moved down and all join and union operations are moved up, so that selections and projections are completed before joins and unions. This yields the optimized relational algebra expression, from which the query tree is constructed: the root node of the query tree is the query result, and each leaf node is a relation or a fragment.
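The effect of moving selections below joins can be sketched on a toy query tree. This is a simplified illustration under stated assumptions, not the method's actual rule set: nodes are plain tuples, a selection predicate is reduced to the single attribute it tests, and the helper names `attrs` and `push_down` as well as the relations `EMP` and `DEPT` are hypothetical.

```python
def attrs(node):
    """Return the set of attributes produced by a query-tree node."""
    kind = node[0]
    if kind == "rel":            # ("rel", name, attribute tuple)
        return set(node[2])
    if kind == "select":         # ("select", attribute, child)
        return attrs(node[2])
    return attrs(node[1]) | attrs(node[2])   # ("join", left, right)

def push_down(node):
    """Move each selection below a join when only one side has its attribute."""
    kind = node[0]
    if kind == "rel":
        return node
    if kind == "join":
        return ("join", push_down(node[1]), push_down(node[2]))
    _, attr, child = node
    child = push_down(child)
    if child[0] == "join":
        _, left, right = child
        if attr in attrs(left) and attr not in attrs(right):
            return ("join", push_down(("select", attr, left)), right)
        if attr in attrs(right) and attr not in attrs(left):
            return ("join", left, push_down(("select", attr, right)))
    return ("select", attr, child)

emp = ("rel", "EMP", ("id", "salary"))
dept = ("rel", "DEPT", ("id", "dname"))
tree = ("select", "salary", ("join", emp, dept))
optimized = push_down(tree)
print(optimized)
```

After the rewrite the selection on `salary` sits directly above `EMP`, so it is evaluated before the join and fewer tuples reach the join operator.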

Preferably, in step S4, each leaf node is optimized according to its fragmentation relation: if the fragmentation is horizontal, the selection condition is moved down and compared with the qualifying condition of each fragment, and contradictory fragments are deleted; if the fragmentation is vertical, the attributes involved in the projection are compared with the attributes contained in each fragment, and irrelevant fragments are removed.
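The two pruning rules of step S4 might look as follows in a small sketch. It assumes, purely for illustration, that each horizontal fragment is defined by a half-open range on one attribute and that each vertical fragment lists its attributes including the key; the function names and fragment names are hypothetical.

```python
def prune_horizontal(fragments, lo, hi):
    """Keep fragments whose range [f_lo, f_hi) overlaps the selection range [lo, hi)."""
    return [name for name, (f_lo, f_hi) in fragments.items()
            if f_lo < hi and lo < f_hi]

def prune_vertical(fragments, needed, key):
    """Keep vertical fragments holding at least one needed non-key attribute."""
    return [name for name, frag_attrs in fragments.items()
            if (set(frag_attrs) - {key}) & set(needed)]

# Hypothetical horizontal fragments of one relation, split by id range.
horizontal = {"E1": (0, 100), "E2": (100, 200), "E3": (200, 300)}
kept_h = prune_horizontal(horizontal, 120, 180)   # selection: 120 <= id < 180

# Hypothetical vertical fragments, each carrying the key "id".
vertical = {"V1": ("id", "name"), "V2": ("id", "salary")}
kept_v = prune_vertical(vertical, {"name"}, "id")  # projection on name

print(kept_h, kept_v)
```

Fragments `E1` and `E3` contradict the selection range and are dropped, and `V2` contributes no projected attribute, so only `E2` and `V1` remain as leaves.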

Preferably, in step S5, the query tree is decomposed, according to the fragmentation, into sub-query trees corresponding to the fragments and a parent query tree corresponding to the sub-query results, as follows:

S51: decomposing the current query tree and expressing each fragmented relation through its fragment information;

S52: performing the unary relational operations, selection and projection, at the leaf end of the query tree;

S53: merging common relational algebra expression subtrees;

S54: removing sub-query trees whose query result is empty.

Preferably, in step S7, the semi-join benefit is calculated, the join scheme with the minimum cost for the query is chosen, and that scheme is executed to output the query result. Here a semi-join applies when, in a join of two tables, the number of distinct key values of the join attribute in one table is significantly smaller than in the other: all join key values of the smaller table are extracted in advance, either by client-side preprocessing or by one round of MapReduce, and cached on the nodes running the Map function; when the Map stage reads the larger table, tuples that cannot take part in the final join are discarded according to the cached key values, so that they never enter the Reduce stage and degrade overall efficiency.
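The map-side filtering described here can be sketched in a few lines. This is an in-memory illustration only; the function name `semi_join_filter` and the sample rows are hypothetical, and a real deployment would distribute the cached keys to the map tasks.

```python
def semi_join_filter(small_rows, large_rows, join_attr):
    """Drop rows of the large table whose join key is absent from the small table."""
    # Preprocessing: collect the distinct join keys of the smaller table.
    cached_keys = {row[join_attr] for row in small_rows}
    # Map stage over the large table: keep only rows that can still join.
    return [row for row in large_rows if row[join_attr] in cached_keys]

# Hypothetical sample data joined on the attribute "dept".
small = [{"dept": 10}, {"dept": 20}]
large = [
    {"emp": "a", "dept": 10},
    {"emp": "b", "dept": 30},
    {"emp": "c", "dept": 20},
]
survivors = semi_join_filter(small, large, "dept")
print(survivors)
```

The row with `dept` 30 has no partner in the small table, so it is removed before any data would be shuffled to the Reduce stage.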

Preferably, the semi-join method specifically comprises:

first, locally reducing all relations, constructing the candidate semi-join reduction model, deriving the corresponding static characteristic table from the current reduced relations, calculating the benefit and cost in the static characteristic table, and determining the semi-join fields, so as to obtain the static characteristic table of the final reduced relations and an initial semi-join program;

then, post-optimizing the initial semi-join program: if the selected query execution site is exactly the site reduced by the last semi-join of the program, that last semi-join is cancelled;

the cost of the semi-join reduction program is most apparent in the initial execution stage; if a semi-join relation exists, the relation is reduced first and the semi-join operation is then performed; otherwise, a natural join is performed directly.

Preferably, in step S7, the semi-join benefit is calculated as follows:

a: given two relations A and B, the semi-join selection factor of A and B is denoted:

F(A∝B)＝N(π_a(A))/N(B)；

where N(π_a(A)) is the number of tuples in the projection of relation A onto the field a common to A and B, and N(B) is the number of tuples in relation B;

b: given the two relations A and B, the semi-join cost formula is:

cost(A∝B)＝N(π_a(A))×size(a)；

where size(a) is the byte length of the common field a;

c: based on the above, the semi-join benefit formula is defined as:

benefit(A∝B)＝(1−F(A∝B))×N(π_a(B))×size(a)；

If the benefit calculated by this formula is higher than the calculated cost, the semi-join is selected; otherwise, a natural join is used.
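The decision rule of steps a to c can be worked through numerically. The sketch below assumes the formulas read F = N(π_a(A))/N(B), cost = N(π_a(A)) × size(a), and benefit = (1 − F) × N(π_a(B)) × size(a); the function names and the sample cardinalities are hypothetical.

```python
def selection_factor(n_proj_a_A, n_B):
    """Step a: F = N(pi_a(A)) / N(B)."""
    return n_proj_a_A / n_B

def semi_join_cost(n_proj_a_A, size_a):
    """Step b: bytes shipped to send the projected join field of A."""
    return n_proj_a_A * size_a

def semi_join_benefit(factor, n_proj_a_B, size_a):
    """Step c: benefit = (1 - F) * N(pi_a(B)) * size(a)."""
    return (1 - factor) * n_proj_a_B * size_a

def choose_join(n_proj_a_A, n_B, n_proj_a_B, size_a):
    """Pick the semi-join only when its benefit exceeds its cost."""
    factor = selection_factor(n_proj_a_A, n_B)
    cost = semi_join_cost(n_proj_a_A, size_a)
    benefit = semi_join_benefit(factor, n_proj_a_B, size_a)
    return "semi-join" if benefit > cost else "natural join"

# Hypothetical statistics: few distinct keys in A, so the semi-join pays off.
plan_small_A = choose_join(n_proj_a_A=100, n_B=10000, n_proj_a_B=5000, size_a=8)
# Hypothetical statistics: many distinct keys in A, so shipping them costs too much.
plan_large_A = choose_join(n_proj_a_A=9000, n_B=10000, n_proj_a_B=5000, size_a=8)
print(plan_small_A, plan_large_A)
```

In the first case the cost is 800 bytes against a benefit of about 39600, so the semi-join is chosen; in the second the cost of 72000 exceeds the benefit of about 4000, so a natural join is used.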

Preferably, query processing and optimization of a global query requires an associated query over the data of all sites involved. The associated query proceeds as follows: first, redundant data is materialized by replica selection, that is, a replica is chosen for each relation or fragment involved for subsequent processing; next, heuristic rules are applied to execute unary relational algebra operations as early as possible and re-determine the order of the relational operations, thereby improving execution efficiency; finally, the join operations are further optimized and their execution mode is determined;

the optimal replica selection scheme is chosen so as to: reduce remote accesses; reduce remote transmission of large data volumes, and thus communication overhead; and reduce the energy load and storage load of the nodes, and thus the basic cost overhead.

Compared with the prior art, the invention has the following beneficial effects. The invention is scientifically and reasonably structured and is safe and convenient to use. Exploiting the equivalence between SQL and relational algebra, it converts a conventional SQL query into a relational algebra expression, which indirectly improves the efficiency of distributed queries. Decomposing the query tree while it is constructed better matches the discrete data distribution of a distributed system and reduces data coupling between queries, which correspondingly reduces communication overhead. The join process is optimized with MapReduce, which greatly reduces the communication overhead between sites during joins, correspondingly shortens communication time, and improves query efficiency.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.

In the drawings:

FIG. 1 is a block diagram of a distributed query optimization method of the present invention;

FIG. 2 is a flow chart of a distributed query optimization method of the present invention;

FIG. 3 is a schematic diagram of the MapReduce data connection operation of the present invention;

FIG. 4 is a schematic diagram of the benefit-based semi-join selection method of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

Example: as shown in FIGS. 1-2, the present invention provides a technical solution, a distributed query optimization method based on an equivalent expansion method of relational algebra, comprising the following steps:

Further, in step S1, the querying user submits a query request to the distributed database system, and the result of the request is the set of tuples satisfying the query, where the tuple set is obtained by a multi-table Join operation performed with MapReduce, as follows:

S11, reading the rows of a table into memory through the map function, using the attribute on which the Join is performed as the key, computing the intermediate key-value pairs, and storing the result locally;

S12, performing the same operation on the data in the other tables;

Further, in step S2, the system converts the user's query request into an equivalent relational algebra expression and represents the query with that expression, that is, the SQL statement is replaced by an equivalent relational algebra expression.

Further, in step S3, the relational algebra expression is optimized according to the equivalence transformation rules and converted into the corresponding query tree. The query tree contains selection, projection, join (connection), and union (merge) operations; all selection and projection operations are moved down and all join and union operations are moved up, so that selections and projections are completed before joins and unions. This yields the optimized relational algebra expression, from which the query tree is constructed: the root node of the query tree is the query result, and each leaf node is a relation or a fragment.

Further, in step S4, each leaf node is optimized according to its fragmentation relation: if the fragmentation is horizontal, the selection condition is moved down and compared with the qualifying condition of each fragment, and contradictory fragments are deleted; if the fragmentation is vertical, the attributes involved in the projection are compared with the attributes contained in each fragment, and irrelevant fragments are removed.

Further, in step S5, the query tree is decomposed, according to the fragmentation, into sub-query trees corresponding to the fragments and a parent query tree corresponding to the sub-query results, as follows:

S53: merging common relational algebra expression subtrees;

S54: removing sub-query trees whose query result is empty.

Further, in step S7, the semi-join benefit is calculated, the join scheme with the minimum cost for the query is chosen, and that scheme is executed to output the query result. Here a semi-join applies when, in a join of two tables, the number of distinct key values of the join attribute in one table is significantly smaller than in the other: all join key values of the smaller table are extracted in advance, either by client-side preprocessing or by one round of MapReduce, and cached on the nodes running the Map function; when the Map stage reads the larger table, tuples that cannot take part in the final join are discarded according to the cached key values, so that they never enter the Reduce stage and degrade overall efficiency.

Further, the semi-join method specifically comprises:

Further, in step S7, the semi-join benefit is calculated as follows:

b: assume two existing relationships A, B, whose half-join cost formula is:

cost(A∝B)＝N(π_a(A))×size(a)；

Further, query processing and optimization of a global query requires an associated query over the data of all sites involved. The associated query proceeds as follows: first, redundant data is materialized by replica selection, that is, a replica is chosen for each relation or fragment involved for subsequent processing; next, heuristic rules are applied to execute unary relational algebra operations as early as possible and re-determine the order of the relational operations, thereby improving execution efficiency; finally, the join operations are further optimized and their execution mode is determined;

As shown in FIG. 3, a Join of table R and table S is performed with MapReduce as follows: the rows of table R are read into memory through the map function, the attribute on which the Join is performed is used as the key, the intermediate key-value pairs are computed, and the result is stored locally; the same operation is performed on the data of table S; the reduce function then receives the tuples with equal key values from tables R and S, recombines the tuples that satisfy the matching condition into new tuples, and constructs the result relation; finally, the Join result is written to the target database file system.

As shown in FIG. 4, the relation join process based on semi-joins maximizes the benefit of the join; the specific method comprises:

In summary, the advantages and positive effects of the invention are as follows. Exploiting the equivalence between SQL and relational algebra, the invention converts a conventional SQL query into a relational algebra expression, which indirectly improves the efficiency of distributed queries. Decomposing the query tree while it is constructed better matches the discrete data distribution of a distributed system and reduces data coupling between queries, which correspondingly reduces communication overhead. The join process is optimized with MapReduce, which greatly reduces the communication overhead between sites during joins, correspondingly shortens communication time, and improves query efficiency.

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A distributed query optimization method based on an equivalent expansion method of relational algebra, characterized in that the method comprises the following steps:

2. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: in step S1, the querying user submits a query request to the distributed database system, and the result of the request is the set of tuples satisfying the query, where the tuple set is obtained by a multi-table Join operation performed with MapReduce, as follows:

S12, performing the same operation on the data in the other tables;

3. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: in step S2, the system converts the user's query request into an equivalent relational algebra expression and represents the query with that expression, that is, the SQL statement is replaced by an equivalent relational algebra expression.

4. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: in step S3, the relational algebra expression is optimized according to the equivalence transformation rules and converted into the corresponding query tree; the query tree contains selection, projection, join (connection), and union (merge) operations; all selection and projection operations are moved down and all join and union operations are moved up, so that selections and projections are completed before joins and unions; this yields the optimized relational algebra expression, from which the query tree is constructed, where the root node of the query tree is the query result and each leaf node is a relation or a fragment.

5. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: in step S4, each leaf node is optimized according to its fragmentation relation: if the fragmentation is horizontal, the selection condition is moved down and compared with the qualifying condition of each fragment, and contradictory fragments are deleted; if the fragmentation is vertical, the attributes involved in the projection are compared with the attributes contained in each fragment, and irrelevant fragments are removed.

6. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: in step S5, the query tree is decomposed, according to the fragmentation, into sub-query trees corresponding to the fragments and a parent query tree corresponding to the sub-query results, as follows:

S53: merging common relational algebra expression subtrees;

S54: removing sub-query trees whose query result is empty.

7. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: in step S7, the semi-join benefit is calculated, the join scheme with the minimum cost for the query is chosen, and that scheme is executed to output the query result, where a semi-join applies when, in a join of two tables, the number of distinct key values of the join attribute in one table is significantly smaller than in the other: all join key values of the smaller table are extracted in advance, either by client-side preprocessing or by one round of MapReduce, and cached on the nodes running the Map function; when the Map stage reads the larger table, tuples that cannot take part in the final join are discarded according to the cached key values, so that they never enter the Reduce stage and degrade overall efficiency.

8. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 7, wherein: the semi-join method specifically comprises:

9. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 7, wherein: in step S7, the semi-join benefit is calculated as follows:

b: assume two existing relationships A, B, whose half-join cost formula is:

cost(A∝B)＝N(π_a(A))×size(a)；

10. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: query processing and optimization of a global query requires an associated query over the data of all sites involved, and the associated query proceeds as follows: first, redundant data is materialized by replica selection, that is, a replica is chosen for each relation or fragment involved for subsequent processing; next, heuristic rules are applied to execute unary relational algebra operations as early as possible and re-determine the order of the relational operations, thereby improving execution efficiency; finally, the join operations are further optimized and their execution mode is determined;