CN111881160A - Distributed query optimization method based on equivalent expansion method of relational algebra - Google Patents

Info

Publication number
CN111881160A
Authority
CN
China
Prior art keywords
query
connection
relational algebra
relation
join
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910575857.9A
Other languages
Chinese (zh)
Inventor
秦小麟
刘亮
徐兴业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHINA REALTIME DATABASE CO LTD
Nanjing University of Aeronautics and Astronautics
State Grid Corp of China SGCC
State Grid Jiangsu Electric Power Co Ltd
NARI Group Corp
Original Assignee
CHINA REALTIME DATABASE CO LTD
Nanjing University of Aeronautics and Astronautics
State Grid Corp of China SGCC
State Grid Jiangsu Electric Power Co Ltd
NARI Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHINA REALTIME DATABASE CO LTD, Nanjing University of Aeronautics and Astronautics, State Grid Corp of China SGCC, State Grid Jiangsu Electric Power Co Ltd, NARI Group Corp
Priority to CN201910575857.9A
Publication of CN111881160A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries

Abstract

The invention discloses a distributed query optimization method based on an equivalent expansion method of relational algebra. The method comprises the steps of sending a request, converting the language, constructing a query tree, optimizing fragments, decomposing the query tree, judging whether a join operation exists, calculating the benefit, and judging whether the traversal is complete. The method is soundly structured and safe and convenient to use. It exploits the equivalence of SQL and relational algebra to convert a conventional SQL statement into a relational algebra expression, which indirectly improves the efficiency of distributed queries. It decomposes the query tree while constructing it, which better matches the dispersed data distribution of a distributed system, reduces the data coupling between queries, and correspondingly reduces communication cost. It optimizes the join process with MapReduce, which greatly reduces the inter-site communication cost of joins, shortens communication time accordingly, and thereby improves query efficiency.

Description

Distributed query optimization method based on equivalent expansion method of relational algebra
Technical Field
The invention relates to the technical field of databases, in particular to a distributed query optimization method based on an equivalent expansion method of relational algebra.
Background
With the advent of the big data era, the storage and querying of large volumes of data have become a research focus. Distributed database systems, which differ from centralized databases and are the product of combining database management technology with network technology, have emerged accordingly. In practice a company may be geographically dispersed, with subsidiaries in different regions each handling its own business, yet the resources of the different branches still need to be exchanged and shared; this reflects both unity and autonomy. Following this idea, a distributed database system stores data on different nodes and performs interaction and access over an interconnecting network.
In a distributed database system, query computation overhead is one of the main factors limiting system performance. Query computation overhead comprises the local overhead of CPU, I/O, and the like, and the communication overhead of data exchange among sites. Distributed query computation overhead varies because different distributed systems adopt different data storage and query strategies; choosing a sound data distribution and query optimization method, so as to obtain the highest query efficiency at the lowest overhead, is therefore extremely important.
Conventional distributed query optimization is based on SQL and approaches optimization from the perspective of the global query issued by the user. Because the data distribution of every site involved in the query must be considered, this approach has to weigh several factors before an optimization scheme can be determined. Moreover, SQL is closer to natural language, so decomposing a query into sub-queries is cumbersome. Multi-table join queries occur frequently in practice, and optimization algorithms based on direct joins easily transmit too much data and thus incur excessive communication cost. A distributed query optimization method based on an equivalent expansion method of relational algebra is therefore urgently needed to solve these problems.
Disclosure of Invention
The invention provides a distributed query optimization method based on an equivalent expansion method of relational algebra, which effectively solves the following problems: conventional distributed query optimization is based on SQL and approaches optimization from the perspective of the global query issued by the user; because the data distribution of every site involved in the query must be considered, an optimization scheme can be determined only after weighing several factors; SQL is closer to natural language, so decomposition into sub-queries is cumbersome; and multi-table join queries occur frequently in practice, where optimization algorithms based on direct joins easily transmit too much data and incur excessive communication cost.
To achieve this purpose, the invention provides the following technical solution: a distributed query optimization method based on an equivalent expansion method of relational algebra, comprising the following steps:
S1, sending a request: the querying user submits a query request to the distributed database system, and the result of the request is the set of tuples satisfying the query;
S2, language conversion: the system converts the user's query request into an equivalent relational algebra expression and represents the query with that expression;
S3, constructing a query tree: the relational algebra expression is optimized according to the equivalence transformation rules and converted into the corresponding query tree;
S4, fragment optimization: each leaf node is optimized according to its fragmentation relation;
S5, query tree decomposition: according to the fragmentation, the query tree is decomposed into sub-query trees corresponding to the fragments and a parent query tree corresponding to the sub-query results;
S6, judging whether a join operation exists: all query trees are traversed and each query is checked for a join operation; if one exists, step S7 is performed; otherwise the query result is output and the traversal of step S6 continues;
S7, benefit calculation: the semi-join benefit is calculated, the join scheme with the minimum cost for the query is chosen, and that scheme is executed to output the query result;
S8, judging whether the traversal is complete: if it is, the process terminates; if not, the process returns to step S6.
Preferably, in step S1, the querying user submits a query request to the distributed database system, and the result of the request is the set of tuples satisfying the query, where the tuple set is obtained by a multi-table join performed with MapReduce, as follows:
S11, the data of one table is read into memory by the map function, the attribute on which the join is performed is used as the key, the intermediate key-value pairs are computed, and the result is stored locally;
S12, the same operation is performed on the data of the other tables;
S13, the reduce function receives the tuples with equal key values from the multiple tables, recombines the tuples that satisfy the matching condition into new tuples, constructs the result relation table, and finally writes the join result to the file system of the target database.
Preferably, in step S2, the system converts the user's query request into an equivalent relational algebra expression and represents the query with that expression, the SQL statement being replaced by its relational algebra equivalent.
Preferably, in step S3, the relational algebra expression is optimized according to the equivalence transformation rules and converted into the corresponding query tree. The query tree contains selection, projection, join, and union operations; all selection and projection operations are moved down and the join and union operations are moved up, so that selection and projection are completed before join and union. This yields the optimized relational algebra expression, from which the query tree is constructed; the root node of the query tree is the query result, and each leaf node is a relation or a fragment.
Preferably, in step S4, each leaf node is optimized according to its fragmentation relation: for horizontal fragmentation, the selection condition is moved down and compared with the qualifying condition of each fragment, and contradictory fragments are deleted; for vertical fragmentation, the attribute fields involved in the projection are compared with the attribute fields contained in each fragment, and irrelevant fragments are removed.
Preferably, in step S5, the query tree is decomposed, according to the fragmentation, into sub-query trees corresponding to the fragments and a parent query tree corresponding to the sub-query results, by the following steps:
S51: decompose the current query tree and express each fragmented relation through its fragment information;
S52: perform the unary relational operations, selection and projection, at the leaf end of the query tree;
S53: merge common relational algebra expression trees;
S54: remove sub-query trees whose query result is empty.
Preferably, in step S7, the semi-join benefit is calculated, the join scheme with the minimum cost for the query is chosen, and that scheme is executed to output the query result. Here a semi-join means the following: when, in a join of two tables, the number of distinct key values of the join attribute in one table is significantly smaller than in the other, all join-relevant key values of the smaller table are extracted by client preprocessing or by one round of MapReduce and cached at the nodes running the Map function; when the Map stage reads the larger table, tuples that cannot contribute to the final join are discarded according to the cached key values, so that they never enter the Reduce stage and do not affect overall efficiency.
Preferably, the semi-join method comprises:
first, all relations are reduced locally, the possible semi-join reduction model is constructed, the corresponding static characteristic table is derived from the existing reduced relations, the benefit and cost in the static characteristic table are calculated, and the semi-join field is determined, giving the static characteristic table of the final reduced relations and the initial semi-join program;
then the initial semi-join program is post-optimized: if the selected query execution site is exactly the site of the relation reduced by the last semi-join, the last semi-join is cancelled;
the cost of the semi-join reduction program is incurred mainly in the initial execution stage; if a semi-join applies, the relation is first reduced by the semi-join and the join is then performed on the reduced relation; otherwise the natural join is performed directly.
Preferably, in step S7, the semi-join benefit is calculated by the following steps:
a: let the two relations be A and B; the semi-join selection factor of A and B is: $F(A \propto B) = N(\pi_a(A)) / N(B)$;
where $N(\pi_a(A))$ is the number of tuples in the projection of relation A on the field a common to A and B, and $N(B)$ is the number of tuples in relation B;
b: for the two relations A and B, the semi-join cost is:
$\mathrm{cost}(A \propto B) = N(\pi_a(A)) \times \mathrm{size}(a)$;
where $\mathrm{size}(a)$ is the byte length of the common field a;
c: from the above, the semi-join benefit is defined as: $\mathrm{benefit}(A \propto B) = (1 - F(A \propto B)) \times N(\pi_a(B)) \times \mathrm{size}(a)$;
If the computed benefit exceeds the computed cost, the semi-join is selected; otherwise the natural join is used.
Preferably, query processing and optimization of a global query must correlate the data of all sites involved in the query. The correlation proceeds as follows: first, redundant data is materialized by replica selection, choosing for each relation or fragment the replica to be used in further processing; then heuristic rules are applied to execute unary relational algebra operations as early as possible and to re-determine the order of the relational operations, thereby improving execution efficiency; finally, the join operations are further optimized and their execution mode is determined;
the optimal replica selection scheme is chosen by the following criteria: fewer remote accesses; less remote transmission of large volumes of data, and thus lower communication overhead; lower energy load and storage load on the nodes, and thus lower basic cost.
Compared with the prior art, the invention has the following beneficial effects. The invention is soundly structured and safe and convenient to use. It exploits the equivalence of SQL and relational algebra to convert a conventional SQL statement into a relational algebra expression, which indirectly improves the efficiency of distributed queries. It decomposes the query tree while constructing it, which better matches the dispersed data distribution of a distributed system, reduces the data coupling between queries, and correspondingly reduces communication overhead. It optimizes the join process with MapReduce, which greatly reduces inter-site communication overhead during joins, shortens communication time accordingly, and improves query efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
In the drawings:
FIG. 1 is a block diagram of a distributed query optimization method of the present invention;
FIG. 2 is a flow chart of a distributed query optimization method of the present invention;
FIG. 3 is a schematic diagram of the MapReduce join operation of the present invention;
FIG. 4 is a schematic diagram of the benefit-based semi-join selection method of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Embodiment: as shown in FIGS. 1-2, the present invention provides a technical solution, a distributed query optimization method based on an equivalent expansion method of relational algebra, comprising the following steps:
S1, sending a request: the querying user submits a query request to the distributed database system, and the result of the request is the set of tuples satisfying the query;
S2, language conversion: the system converts the user's query request into an equivalent relational algebra expression and represents the query with that expression;
S3, constructing a query tree: the relational algebra expression is optimized according to the equivalence transformation rules and converted into the corresponding query tree;
S4, fragment optimization: each leaf node is optimized according to its fragmentation relation;
S5, query tree decomposition: according to the fragmentation, the query tree is decomposed into sub-query trees corresponding to the fragments and a parent query tree corresponding to the sub-query results;
S6, judging whether a join operation exists: all query trees are traversed and each query is checked for a join operation; if one exists, step S7 is performed; otherwise the query result is output and the traversal of step S6 continues;
S7, benefit calculation: the semi-join benefit is calculated, the join scheme with the minimum cost for the query is chosen, and that scheme is executed to output the query result;
S8, judging whether the traversal is complete: if it is, the process terminates; if not, the process returns to step S6.
Further, in step S1, the querying user submits a query request to the distributed database system, and the result of the request is the set of tuples satisfying the query, where the tuple set is obtained by a multi-table join performed with MapReduce, as follows:
S11, the data of one table is read into memory by the map function, the attribute on which the join is performed is used as the key, the intermediate key-value pairs are computed, and the result is stored locally;
S12, the same operation is performed on the data of the other tables;
S13, the reduce function receives the tuples with equal key values from the multiple tables, recombines the tuples that satisfy the matching condition into new tuples, constructs the result relation table, and finally writes the join result to the file system of the target database.
Further, in step S2, the system converts the user's query request into an equivalent relational algebra expression and represents the query with that expression, the SQL statement being replaced by its relational algebra equivalent.
Further, in step S3, the relational algebra expression is optimized according to the equivalence transformation rules and converted into the corresponding query tree. The query tree contains selection, projection, join, and union operations; all selection and projection operations are moved down and the join and union operations are moved up, so that selection and projection are completed before join and union. This yields the optimized relational algebra expression, from which the query tree is constructed; the root node of the query tree is the query result, and each leaf node is a relation or a fragment.
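By way of non-limiting illustration, the downward movement of selection operations in step S3 might be sketched in Python as follows. The node classes `Relation`, `Select`, and `Join`, the helper `attrs_of`, the restriction to equality predicates and natural joins, and the sample relations EMP and ASG are all assumptions introduced for this sketch and do not appear in the original disclosure.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Relation:
    name: str
    attrs: frozenset  # attribute names of the relation or fragment

@dataclass(frozen=True)
class Select:
    attr: str       # attribute tested by the predicate
    value: object   # the predicate is attr == value
    child: "Node"

@dataclass(frozen=True)
class Join:         # natural join of two subtrees
    left: "Node"
    right: "Node"

Node = Union[Relation, Select, Join]

def attrs_of(node):
    """Return the set of attributes produced by a subtree."""
    if isinstance(node, Relation):
        return node.attrs
    if isinstance(node, Select):
        return attrs_of(node.child)
    return attrs_of(node.left) | attrs_of(node.right)

def push_down(node):
    """Move each selection below any join whose operand supplies its attribute."""
    if isinstance(node, Relation):
        return node
    if isinstance(node, Join):
        return Join(push_down(node.left), push_down(node.right))
    child = push_down(node.child)
    if isinstance(child, Join):
        if node.attr in attrs_of(child.left):
            moved = push_down(Select(node.attr, node.value, child.left))
            return Join(moved, child.right)
        if node.attr in attrs_of(child.right):
            moved = push_down(Select(node.attr, node.value, child.right))
            return Join(child.left, moved)
    return Select(node.attr, node.value, child)

emp = Relation("EMP", frozenset({"eno", "dept"}))
asg = Relation("ASG", frozenset({"eno", "pno"}))
optimized = push_down(Select("pno", "P1", Join(emp, asg)))
```

In this sketch the selection on `pno` ends up directly above the leaf ASG, so it would be evaluated before the join, which is the ordering that step S3 calls for.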
Further, in step S4, each leaf node is optimized according to its fragmentation relation: for horizontal fragmentation, the selection condition is moved down and compared with the qualifying condition of each fragment, and contradictory fragments are deleted; for vertical fragmentation, the attribute fields involved in the projection are compared with the attribute fields contained in each fragment, and irrelevant fragments are removed.
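As a hedged illustration of the fragment optimization of step S4, the following Python sketch prunes horizontal fragments whose qualifying range contradicts the selection range, and vertical fragments that hold none of the projected attributes. The function names, the representation of a horizontal fragment as a half-open range `[lo, hi)` on one attribute, and the fragment names EMP1 to EMP3 and V1 to V2 are assumptions made for this example only.

```python
def prune_horizontal(fragments, low, high):
    """Keep fragments whose range [lo, hi) overlaps the query range [low, high)."""
    return [name for name, (lo, hi) in fragments.items()
            if lo < high and low < hi]

def prune_vertical(fragments, needed, key):
    """Keep fragments holding at least one needed non-key attribute.

    If only the key is requested, any single fragment suffices, so the
    first one is kept (an arbitrary choice made for this sketch).
    """
    kept = [name for name, attrs in fragments.items()
            if (attrs - {key}) & needed]
    return kept or list(fragments)[:1]

# Horizontal fragments of a relation, split by ranges of the attribute eno.
horizontal = {"EMP1": (0, 100), "EMP2": (100, 200), "EMP3": (200, 300)}
# A selection 150 <= eno < 180 contradicts EMP1 and EMP3.
relevant_rows = prune_horizontal(horizontal, 150, 180)

# Vertical fragments, each repeating the key eno.
vertical = {"V1": {"eno", "name"}, "V2": {"eno", "salary"}}
# A projection on name makes V2 irrelevant.
relevant_cols = prune_vertical(vertical, {"name"}, "eno")
```

Only the fragments that survive pruning would then remain as leaf nodes of the query tree.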
Further, in step S5, the query tree is decomposed, according to the fragmentation, into sub-query trees corresponding to the fragments and a parent query tree corresponding to the sub-query results, by the following steps:
S51: decompose the current query tree and express each fragmented relation through its fragment information;
S52: perform the unary relational operations, selection and projection, at the leaf end of the query tree;
S53: merge common relational algebra expression trees;
S54: remove sub-query trees whose query result is empty.
Further, in step S7, the semi-join benefit is calculated, the join scheme with the minimum cost for the query is chosen, and that scheme is executed to output the query result. Here a semi-join means the following: when, in a join of two tables, the number of distinct key values of the join attribute in one table is significantly smaller than in the other, all join-relevant key values of the smaller table are extracted by client preprocessing or by one round of MapReduce and cached at the nodes running the Map function; when the Map stage reads the larger table, tuples that cannot contribute to the final join are discarded according to the cached key values, so that they never enter the Reduce stage and do not affect overall efficiency.
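The map-side filtering described above can be illustrated with a minimal Python sketch. It runs in a single process and merely imitates the Map stage; the function names `collect_join_keys` and `map_filter` and the sample rows are hypothetical and are not taken from the original disclosure.

```python
def collect_join_keys(small_rows, key_field):
    """Preprocessing round: extract the distinct join keys of the smaller table."""
    return {row[key_field] for row in small_rows}

def map_filter(big_rows, key_field, cached_keys):
    """Map stage over the larger table: emit only tuples whose key is cached,
    so that non-matching tuples never reach the Reduce stage."""
    for row in big_rows:
        if row[key_field] in cached_keys:
            yield row[key_field], row

small = [{"id": 1}, {"id": 2}]
big = [{"id": 1, "v": "a"}, {"id": 3, "v": "b"}, {"id": 2, "v": "c"}]

cached = collect_join_keys(small, "id")
emitted = list(map_filter(big, "id", cached))
```

Here the tuple with `id` 3 is dropped in the Map stage because no tuple of the smaller table carries that key.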
Further, the semi-join method comprises:
first, all relations are reduced locally, the possible semi-join reduction model is constructed, the corresponding static characteristic table is derived from the existing reduced relations, the benefit and cost in the static characteristic table are calculated, and the semi-join field is determined, giving the static characteristic table of the final reduced relations and the initial semi-join program;
then the initial semi-join program is post-optimized: if the selected query execution site is exactly the site of the relation reduced by the last semi-join, the last semi-join is cancelled;
the cost of the semi-join reduction program is incurred mainly in the initial execution stage; if a semi-join applies, the relation is first reduced by the semi-join and the join is then performed on the reduced relation; otherwise the natural join is performed directly.
Further, in step S7, the semi-join benefit is calculated by the following steps:
a: let the two relations be A and B; the semi-join selection factor of A and B is: $F(A \propto B) = N(\pi_a(A)) / N(B)$;
where $N(\pi_a(A))$ is the number of tuples in the projection of relation A on the field a common to A and B, and $N(B)$ is the number of tuples in relation B;
b: for the two relations A and B, the semi-join cost is:
$\mathrm{cost}(A \propto B) = N(\pi_a(A)) \times \mathrm{size}(a)$;
where $\mathrm{size}(a)$ is the byte length of the common field a;
c: from the above, the semi-join benefit is defined as: $\mathrm{benefit}(A \propto B) = (1 - F(A \propto B)) \times N(\pi_a(B)) \times \mathrm{size}(a)$;
If the computed benefit exceeds the computed cost, the semi-join is selected; otherwise the natural join is used.
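The decision rule of steps a to c can be sketched in Python under one stated assumption: the benefit formula is read here as (1 - F) × N(π_a(B)) × size(a), with F = N(π_a(A)) / N(B) and cost = N(π_a(A)) × size(a). The function names and the sample cardinalities are illustrative assumptions, not values from the original disclosure.

```python
def selection_factor(n_proj_a, n_b):
    """F = N(pi_a(A)) / N(B)."""
    return n_proj_a / n_b

def semi_join_cost(n_proj_a, size_a):
    """cost = N(pi_a(A)) * size(a): bytes shipped for the projected join column."""
    return n_proj_a * size_a

def semi_join_benefit(factor, n_proj_b, size_a):
    """benefit = (1 - F) * N(pi_a(B)) * size(a), as read in this sketch."""
    return (1 - factor) * n_proj_b * size_a

def choose_join(n_proj_a, n_b, n_proj_b, size_a):
    """Select the semi-join only when its benefit exceeds its cost."""
    factor = selection_factor(n_proj_a, n_b)
    cost = semi_join_cost(n_proj_a, size_a)
    benefit = semi_join_benefit(factor, n_proj_b, size_a)
    return "semi-join" if benefit > cost else "natural join"

# Few distinct join values in A: cost 400, benefit about 2880.
selective = choose_join(n_proj_a=100, n_b=1000, n_proj_b=800, size_a=4)
# Many distinct join values in A: cost 3600, benefit about 320.
unselective = choose_join(n_proj_a=900, n_b=1000, n_proj_b=800, size_a=4)
```

With these sample numbers the first case favors the semi-join and the second falls back to the natural join, matching the rule that the semi-join is chosen only when benefit exceeds cost.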
Further, query processing and optimization of a global query must correlate the data of all sites involved in the query. The correlation proceeds as follows: first, redundant data is materialized by replica selection, choosing for each relation or fragment the replica to be used in further processing; then heuristic rules are applied to execute unary relational algebra operations as early as possible and to re-determine the order of the relational operations, thereby improving execution efficiency; finally, the join operations are further optimized and their execution mode is determined;
the optimal replica selection scheme is chosen by the following criteria: fewer remote accesses; less remote transmission of large volumes of data, and thus lower communication overhead; lower energy load and storage load on the nodes, and thus lower basic cost.
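One possible way to apply the three replica selection criteria is sketched below in Python. The disclosure does not specify how the criteria are weighted, so the lexicographic ordering used here (local before remote, then lower transfer cost, then lower node load), the function name `choose_replica`, the dictionary fields `site`, `link_cost`, and `load`, and the sample site names and values are all assumptions made for illustration.

```python
def choose_replica(replicas, query_site):
    """Pick one replica of a fragment for a query issued at query_site.

    Each replica is a dict with hypothetical fields:
      site      - name of the site storing the replica
      link_cost - assumed cost of shipping the fragment to query_site
      load      - assumed current load of the storing node
    """
    def score(replica):
        is_remote = replica["site"] != query_site
        transfer = replica["link_cost"] if is_remote else 0
        # Tuples compare lexicographically: avoid remote access first,
        # then minimize transfer cost, then minimize node load.
        return (is_remote, transfer, replica["load"])
    return min(replicas, key=score)

replicas = [
    {"site": "Nanjing", "link_cost": 5, "load": 0.9},
    {"site": "Suzhou", "link_cost": 2, "load": 0.3},
    {"site": "Wuxi", "link_cost": 2, "load": 0.1},
]
local_choice = choose_replica(replicas, "Nanjing")["site"]
remote_choice = choose_replica(replicas, "Beijing")["site"]
```

A query issued at a site that holds a replica reads it locally; otherwise the sketch falls back to the cheapest link and, on a tie, to the least loaded node.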
As shown in FIG. 3, a join of table R and table S is performed with MapReduce as follows: the data of table R is read into memory by the map function, the attribute on which the join is performed is used as the key, the intermediate key-value pairs are computed, and the result is stored locally; the same operation is performed on the data of table S; the reduce function then receives the tuples with equal key values from tables R and S, recombines the tuples that satisfy the matching condition into new tuples, and constructs the result relation table; finally, the join result is written to the file system of the target database.
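The join of tables R and S described with reference to FIG. 3 may be illustrated by the following single-process Python sketch. It only imitates the map, shuffle, and reduce stages in memory and is not tied to any particular MapReduce framework; the function names, the `shuffle` helper standing in for the framework's grouping step, and the sample rows are assumptions introduced here.

```python
from collections import defaultdict

def map_phase(table_name, rows, key_field):
    """Emit (join key, (table name, row)) pairs for one table."""
    for row in rows:
        yield row[key_field], (table_name, row)

def shuffle(pairs):
    """Group intermediate pairs by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Combine every R tuple with every S tuple that shares the key."""
    r_rows = [row for name, row in values if name == "R"]
    s_rows = [row for name, row in values if name == "S"]
    for r in r_rows:
        for s in s_rows:
            yield {**r, **s}

def mapreduce_join(r_rows, s_rows, key_field):
    """Run the three stages in sequence and collect the joined tuples."""
    pairs = list(map_phase("R", r_rows, key_field))
    pairs += list(map_phase("S", s_rows, key_field))
    result = []
    for values in shuffle(pairs).values():
        result.extend(reduce_phase(values))
    return result

R = [{"id": 1, "name": "pump"}, {"id": 2, "name": "valve"}]
S = [{"id": 1, "site": "Nanjing"}, {"id": 3, "site": "Suzhou"}]
joined = mapreduce_join(R, S, "id")
```

Only the key 1 occurs in both tables, so the result holds a single combined tuple; in a real deployment the final list would be written to the target file system rather than kept in memory.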
As shown in FIG. 4, the relation join process based on the semi-join maximizes the benefit of the join process; the specific method comprises:
first, all relations are reduced locally, the possible semi-join reduction model is constructed, the corresponding static characteristic table is derived from the existing reduced relations, the benefit and cost in the static characteristic table are calculated, and the semi-join field is determined, giving the static characteristic table of the final reduced relations and the initial semi-join program;
then the initial semi-join program is post-optimized: if the selected query execution site is exactly the site of the relation reduced by the last semi-join, the last semi-join is cancelled;
the cost of the semi-join reduction program is incurred mainly in the initial execution stage; if a semi-join applies, the relation is first reduced by the semi-join and the join is then performed on the reduced relation; otherwise the natural join is performed directly.
In summary, the advantages and positive effects of the invention are as follows. It exploits the equivalence of SQL and relational algebra to convert a conventional SQL statement into a relational algebra expression, which indirectly improves the efficiency of distributed queries. It decomposes the query tree while constructing it, which better matches the dispersed data distribution of a distributed system, reduces the data coupling between queries, and correspondingly reduces communication overhead. It optimizes the join process with MapReduce, which greatly reduces inter-site communication overhead during joins, shortens communication time accordingly, and improves query efficiency.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A distributed query optimization method based on an equivalent expansion method of relational algebra, characterized in that the method comprises the following steps:
S1, sending a request: the querying user submits a query request to the distributed database system, and the result of the request is the set of tuples satisfying the query;
S2, language conversion: the system converts the user's query request into an equivalent relational algebra expression and represents the query with that expression;
S3, constructing a query tree: the relational algebra expression is optimized according to the equivalence transformation rules and converted into the corresponding query tree;
S4, fragment optimization: each leaf node is optimized according to its fragmentation relation;
S5, query tree decomposition: according to the fragmentation, the query tree is decomposed into sub-query trees corresponding to the fragments and a parent query tree corresponding to the sub-query results;
S6, judging whether a join operation exists: all query trees are traversed and each query is checked for a join operation; if one exists, step S7 is performed; otherwise the query result is output and the traversal of step S6 continues;
S7, benefit calculation: the semi-join benefit is calculated, the join scheme with the minimum cost for the query is chosen, and that scheme is executed to output the query result;
S8, judging whether the traversal is complete: if it is, the process terminates; if not, the process returns to step S6.
2. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: in step S1, the query user submits a query request to the distributed database system, and the requested result is the set of tuples satisfying the query request, the tuple set being obtained by a multi-table Join operation performed with MapReduce, specifically as follows:
S11, reading the information of a table into memory through the map function, using the attribute on which the Join operation is performed as the key, computing the intermediate key-value pairs, and storing the result locally;
S12, performing the same operation on the data of the other tables;
S13, receiving, through the reduce function, the tuples with equal key values from the multiple tables, recombining the tuples that satisfy the matching condition into new tuples, constructing the resulting relation table, and finally writing the result of the Join to the target database file system.
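The map and reduce phases of S11 to S13 can be sketched as a single-process simulation. This is only an illustration under stated assumptions, not the claimed implementation: the function names `map_phase` and `reduce_phase`, the table tags, and the sample rows are all hypothetical, and no real MapReduce framework is involved.

```python
from collections import defaultdict
from itertools import product


def map_phase(table_name, rows, join_attr):
    """Emit (join key, (table tag, row)) pairs, as in S11 and S12."""
    for row in rows:
        yield row[join_attr], (table_name, row)


def reduce_phase(pairs, left_name, right_name):
    """Group pairs by key and combine matching tuples of both tables (S13)."""
    groups = defaultdict(lambda: {left_name: [], right_name: []})
    for key, (tag, row) in pairs:
        groups[key][tag].append(row)
    joined = []
    for key in sorted(groups):
        bucket = groups[key]
        # Keys present in only one table produce an empty product.
        for left, right in product(bucket[left_name], bucket[right_name]):
            joined.append({**left, **right})
    return joined


# Hypothetical sample tables sharing the join attribute "cust_id".
orders = [
    {"cust_id": 1, "item": "pump"},
    {"cust_id": 2, "item": "valve"},
    {"cust_id": 3, "item": "meter"},
]
customers = [
    {"cust_id": 1, "name": "Li"},
    {"cust_id": 2, "name": "Wang"},
]

pairs = list(map_phase("orders", orders, "cust_id"))
pairs += list(map_phase("customers", customers, "cust_id"))
result = reduce_phase(pairs, "orders", "customers")
```

In this sketch the order for `cust_id` 3 has no matching customer, so it reaches the reduce step but contributes nothing to the result; that wasted transfer is the kind of overhead the semi-join of claim 7 aims to avoid.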
3. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: in step S2, the system converts the query request submitted by the user into an equivalent relational algebra expression and uses that expression to represent the query, the SQL statement being equivalently replaced by relational algebra.
4. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: in step S3, the relational algebra expression is optimized according to the equivalence transformation rules and converted into the corresponding query tree, the query tree comprising selection, projection, join, and union operations, wherein all selection and projection operations are moved down and the join and union operations are moved up, ensuring that selection and projection are completed before join and union, so as to obtain the optimized relational algebra expression; the query tree is constructed from that expression, its root node being the query result and each leaf node being a relation or a fragment.
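Moving a selection below a join, as described in this claim, can be illustrated on a toy query tree. The sketch below is an assumption-laden example rather than the patented procedure: the node classes `Relation`, `Select`, and `Join`, the helper names, and the sample relations are hypothetical, `Join` is taken to be a natural join, and each predicate is assumed to refer to a single attribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str
    attrs: frozenset


@dataclass(frozen=True)
class Select:
    attr: str    # the single attribute the predicate refers to
    pred: str    # predicate text, kept opaque in this sketch
    child: object


@dataclass(frozen=True)
class Join:      # natural join of two subtrees
    left: object
    right: object


def attrs_of(node):
    """Return the attributes produced by a subtree."""
    if isinstance(node, Relation):
        return node.attrs
    if isinstance(node, Select):
        return attrs_of(node.child)
    return attrs_of(node.left) | attrs_of(node.right)


def push_down(node):
    """Move each selection as close to the leaves as possible."""
    if isinstance(node, Relation):
        return node
    if isinstance(node, Join):
        return Join(push_down(node.left), push_down(node.right))
    child = push_down(node.child)
    if isinstance(child, Join):
        # Push the selection into the side that provides its attribute.
        if node.attr in attrs_of(child.left):
            return Join(push_down(Select(node.attr, node.pred, child.left)), child.right)
        if node.attr in attrs_of(child.right):
            return Join(child.left, push_down(Select(node.attr, node.pred, child.right)))
    return Select(node.attr, node.pred, child)


emp = Relation("EMP", frozenset({"eno", "name", "dept"}))
asg = Relation("ASG", frozenset({"eno", "pno", "dur"}))
tree = Select("dur", "dur > 12", Join(emp, asg))
optimized = push_down(tree)
```

Here the selection on `dur` ends up directly above `ASG`, so the join receives an already filtered input. When the attribute is the shared join attribute, filtering only one side is still equivalent for a natural join, because joined tuples agree on that attribute.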
5. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: in step S4, each leaf node is optimized according to the fragmentation relation, wherein, if the fragmentation is horizontal, the selection condition is moved down, the selection condition is compared with the qualification condition of each fragment, and the fragments that contradict it are deleted; if the fragmentation is vertical, the attribute fields involved in the projection condition are compared with the attribute fields contained in each fragment, and the irrelevant fragments are removed.
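The two pruning rules of this claim can be sketched as follows. The example assumes, purely for illustration, that horizontal fragments are defined by half-open numeric ranges on one attribute and that every vertical fragment repeats the key attribute; the function names and fragment names are hypothetical.

```python
def prune_horizontal(fragments, lo, hi):
    """Drop horizontal fragments whose range contradicts lo <= x < hi.

    fragments maps a fragment name to its half-open range (f_lo, f_hi).
    """
    return [name for name, (f_lo, f_hi) in fragments.items()
            if f_lo < hi and lo < f_hi]


def prune_vertical(fragments, projected, key):
    """Drop vertical fragments that hold none of the projected attributes.

    Every vertical fragment repeats the key, so the key alone never
    justifies keeping more than one fragment.
    """
    wanted = set(projected) - {key}
    if not wanted:
        # Only the key is requested: any single fragment can answer it.
        return list(fragments)[:1]
    return [name for name, attrs in fragments.items() if wanted & set(attrs)]


# Hypothetical fragmentation of an employee relation.
horizontal = {"EMP1": (0, 100), "EMP2": (100, 200), "EMP3": (200, 300)}
vertical = {
    "V1": ["eno", "name"],
    "V2": ["eno", "salary"],
    "V3": ["eno", "dept"],
}

kept_h = prune_horizontal(horizontal, 120, 180)
kept_v = prune_vertical(vertical, ["name", "dept"], "eno")
```

With these inputs only `EMP2` can contain tuples in the selected range, and only `V1` and `V3` carry the projected attributes, so the remaining fragments need not be accessed at all.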
6. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: in step S5, the query tree is decomposed, according to the fragmentation condition, into sub-query trees corresponding to the fragments and a parent query tree corresponding to the sub-query results, the specific decomposition steps being as follows:
S51: decomposing the current query tree and expressing each fragmented relation through its fragment information;
S52: performing the unary relational operations (selection and projection) at the leaves of the query tree;
S53: merging the common relational algebra expression subtrees;
S54: removing the sub-query trees whose query result is empty.
7. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: in step S7, the semi-join benefit is calculated, the join scheme with the minimum cost for the corresponding query is decided, and the scheme is executed to output the query result, wherein the semi-join means that, when the set of key values of the join attribute in one of the two joined tables is significantly smaller than in the other table, all join-related key values of the smaller table are extracted by client-side preprocessing or by one round of MapReduce and cached on the nodes running the Map function; when the Map stage reads the larger table, the tuples of the larger table that are irrelevant to the final join are removed according to the cached key values, so that they do not enter the Reduce stage and degrade overall efficiency.
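The map-side filtering described in this claim can be sketched as below. This is a minimal single-process illustration, assuming the key set of the smaller table fits in memory; the function names and the sample tables are hypothetical and no MapReduce framework API is used.

```python
def collect_join_keys(small_rows, join_attr):
    """Preprocessing step: extract the join key values of the smaller table."""
    return {row[join_attr] for row in small_rows}


def map_large_table(large_rows, join_attr, cached_keys):
    """Map step over the larger table: emit only tuples that can join."""
    for row in large_rows:
        if row[join_attr] in cached_keys:
            yield row[join_attr], row


# Hypothetical data: a small dimension table and a larger fact table.
stations = [
    {"station_id": 7, "region": "north"},
    {"station_id": 9, "region": "south"},
]
readings = [
    {"station_id": 7, "value": 1.2},
    {"station_id": 8, "value": 3.4},
    {"station_id": 9, "value": 5.6},
    {"station_id": 8, "value": 7.8},
]

cached = collect_join_keys(stations, "station_id")
survivors = list(map_large_table(readings, "station_id", cached))
```

The two readings for station 8 are dropped during the map step, so only tuples that can actually contribute to the join would be shuffled to the reduce step.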
8. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 7, wherein the semi-join is specifically performed as follows:
first, all relations are reduced locally, the possible semi-join reduction models are constructed, the corresponding static characteristic table is derived from the existing reduced relations, the benefit and the cost in the static characteristic table are calculated, and the semi-join field is determined, so as to obtain the static characteristic table of the final reduced relations and an initial semi-join program;
then, the initial semi-join program is further optimized: if the selected query execution site is exactly the site of the relation reduced by the last semi-join, the last semi-join is cancelled;
the cost of the semi-join reduction program is incurred mainly in the initial execution stage; if a semi-join relation exists, the relation is reduced and the semi-join operation is then carried out on the reduced relation; otherwise, a natural join is performed directly.
9. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 7, wherein: in step S7, the semi-join benefit is calculated by the following steps:
a: given two relations A and B, the semi-join selection factor of A and B is denoted $F(A \propto B) = N(\pi_a(A)) / N(B)$,
where $N(\pi_a(A))$ is the number of tuples in the projection of relation A on the field a common to A and B, and $N(B)$ is the number of tuples in relation B;
b: for the two relations A and B, the semi-join cost formula is:
$\mathrm{cost}(A \propto B) = N(\pi_a(A)) \times \mathrm{size}(a)$,
where $\mathrm{size}(a)$ is the byte length of the common field a;
c: based on the above steps, the semi-join benefit formula is defined as: $\mathrm{benefit}(A \propto B) = (1 - F(A \propto B)) \times N(\pi_a(B)) \times \mathrm{size}(a)$;
if the benefit calculated by this formula is higher than the calculated cost, the semi-join is selected; otherwise, a natural join is used.
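The decision rule of this claim can be worked through numerically. The sketch below assumes the formulas read as F = N(π_a(A)) / N(B), cost = N(π_a(A)) × size(a), and benefit = (1 − F) × N(π_a(B)) × size(a); this reading is an assumption where the published text is garbled, and the function name, parameter names, and input figures are hypothetical.

```python
def semi_join_decision(n_proj_a, n_b, n_proj_b, size_a):
    """Compare semi-join benefit and cost for relations A and B.

    n_proj_a: tuples in the projection of A on the common field a
    n_b:      tuples in relation B
    n_proj_b: tuples in the projection of B on the common field a
    size_a:   byte length of field a
    """
    selectivity = n_proj_a / n_b                     # F(A ∝ B)
    cost = n_proj_a * size_a                         # bytes shipped for the semi-join
    benefit = (1 - selectivity) * n_proj_b * size_a  # bytes saved by reducing B
    return {
        "selectivity": selectivity,
        "cost": cost,
        "benefit": benefit,
        "use_semi_join": benefit > cost,
    }


# Hypothetical statistics: A has few distinct join values relative to B.
selective = semi_join_decision(n_proj_a=100, n_b=10_000, n_proj_b=5_000, size_a=8)

# Hypothetical statistics: A has almost as many join values as B has tuples.
unselective = semi_join_decision(n_proj_a=9_000, n_b=10_000, n_proj_b=5_000, size_a=8)
```

In the first case the cost is 800 bytes against a benefit of roughly 39,600 bytes, so the semi-join is chosen; in the second the cost of 72,000 bytes exceeds the benefit of roughly 4,000 bytes, so a natural join would be used instead.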
10. The distributed query optimization method based on the equivalent expansion method of relational algebra as claimed in claim 1, wherein: the query processing and optimization of a global query needs to correlate and query the data of all relevant sites, the correlated query comprising: first, materializing redundant data by replica selection, that is, selecting the replica to be used for each relation or fragment; then, applying heuristic rules so that unary relational algebra operations are executed as early as possible and the order of the relational operations is re-determined, thereby improving execution efficiency; and finally, further optimizing the join operations and determining how each join operation is executed;
the optimal replica selection scheme is chosen so as to: reduce remote accesses; reduce the remote transmission of large volumes of data and thereby the communication overhead; and reduce the energy load and storage load of the nodes and thereby the basic cost overhead.
CN201910575857.9A 2019-06-28 2019-06-28 Distributed query optimization method based on equivalent expansion method of relational algebra Pending CN111881160A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910575857.9A CN111881160A (en) 2019-06-28 2019-06-28 Distributed query optimization method based on equivalent expansion method of relational algebra

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910575857.9A CN111881160A (en) 2019-06-28 2019-06-28 Distributed query optimization method based on equivalent expansion method of relational algebra

Publications (1)

Publication Number Publication Date
CN111881160A true CN111881160A (en) 2020-11-03

Family

ID=73153773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910575857.9A Pending CN111881160A (en) 2019-06-28 2019-06-28 Distributed query optimization method based on equivalent expansion method of relational algebra

Country Status (1)

Country Link
CN (1) CN111881160A (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张瑞芳: "分布式数据库的查询优化方法设计与实现", 《中国优秀博硕士学位论文全文数据库(硕士)-信息科技辑》 *
张继福等: "基于MapReduce与相关子空间的局部离散数据挖掘算法", 《软件学报》 *
郑勇明等: "分布式数据库查询优化处理-基于关系代数等价变换的查询优化处理", 《数据库及信息管理》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468383A (en) * 2021-07-19 2021-10-01 北京明略软件系统有限公司 Family relation map searching method and device, electronic equipment and storage medium
CN113535828A (en) * 2021-09-17 2021-10-22 西安热工研究院有限公司 Aggregation query method, system, equipment and storage medium of time sequence data
CN113535828B (en) * 2021-09-17 2021-11-30 西安热工研究院有限公司 Aggregation query method, system, equipment and storage medium of time sequence data
CN115114327A (en) * 2022-07-28 2022-09-27 昆明理工大学 Database query relation modeling method capable of reducing repeated calculation
CN116150162A (en) * 2023-04-20 2023-05-23 北京锐服信科技有限公司 Data chart updating method and device based on time slicing and electronic equipment
CN117435594A (en) * 2023-12-18 2024-01-23 天津南大通用数据技术股份有限公司 Optimization method for distributed database distribution key
CN117435594B (en) * 2023-12-18 2024-04-16 天津南大通用数据技术股份有限公司 Optimization method for distributed database distribution key

Similar Documents

Publication Publication Date Title
CN111881160A (en) Distributed query optimization method based on equivalent expansion method of relational algebra
US11126626B2 (en) Massively parallel and in-memory execution of grouping and aggregation in a heterogeneous system
CN109299102B (en) HBase secondary index system and method based on Elastcissearch
US10606834B2 (en) Methods and apparatus of shared expression evaluation across RDBMS and storage layer
US8886631B2 (en) Query execution systems and methods
US8935232B2 (en) Query execution systems and methods
US8126870B2 (en) System and methodology for parallel query optimization using semantic-based partitioning
US10585887B2 (en) Multi-system query execution plan
US9292570B2 (en) System and method for optimizing pattern query searches on a graph database
US9390115B2 (en) Tables with unlimited number of sparse columns and techniques for an efficient implementation
CN106547796B (en) Database execution method and device
US7171399B2 (en) Method for efficient query execution using dynamic queries in database environments
US20120246147A1 (en) Modular query optimizer
CN107169033A (en) Relation data enquiring and optimizing method with parallel framework is changed based on data pattern
Hubail et al. Couchbase analytics: NoETL for scalable NoSQL data analysis
US8554760B2 (en) System and method for optimizing queries
CN112015741A (en) Method and device for storing massive data in different databases and tables
CN104484472A (en) Database cluster for mixing various heterogeneous data sources and implementation method
Yuanyuan et al. Distributed database system query optimization algorithm research
US7958160B2 (en) Executing filter subqueries using a parallel single cursor model
US20040249845A1 (en) Efficient processing of multi-column and function-based in-list predicates
CN110032676B (en) SPARQL query optimization method and system based on predicate association
CN107609091B (en) Method for realizing cross-database multi-table combined query system
Arnold et al. HRDBMS: Combining the best of modern and traditional relational databases
Floratos et al. DBSpinner: Making a Case for Iterative Processing in Databases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201103