CN114860764A - Optimization method and system for distributed database query and electronic equipment - Google Patents
Optimization method and system for distributed database query and electronic equipment Download PDFInfo
- Publication number
- CN114860764A CN114860764A CN202210600055.0A CN202210600055A CN114860764A CN 114860764 A CN114860764 A CN 114860764A CN 202210600055 A CN202210600055 A CN 202210600055A CN 114860764 A CN114860764 A CN 114860764A
- Authority
- CN
- China
- Prior art keywords
- query
- database
- node
- data
- acyclic graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 65
- 238000005457 optimization Methods 0.000 title claims abstract description 42
- 238000013467 fragmentation Methods 0.000 claims description 42
- 238000006062 fragmentation reaction Methods 0.000 claims description 42
- 238000004891 communication Methods 0.000 claims description 19
- 239000012634 fragment Substances 0.000 claims description 13
- 230000008569 process Effects 0.000 claims description 9
- 238000004590 computer program Methods 0.000 claims description 5
- 238000004458 analytical method Methods 0.000 claims description 3
- 238000005192 partition Methods 0.000 description 10
- 238000010586 diagram Methods 0.000 description 5
- 238000005516 engineering process Methods 0.000 description 5
- 230000009471 action Effects 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000007792 addition Methods 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000012217 deletion Methods 0.000 description 1
- 230000037430 deletion Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 239000000835 fiber Substances 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000007726 management method Methods 0.000 description 1
- 239000000178 monomer Substances 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000002093 peripheral effect Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000012827 research and development Methods 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24558—Binary matching operations
- G06F16/2456—Join operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Fuzzy Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an optimization method, a system and electronic equipment for distributed database query, which are applied to the technical field of database query. In the invention, the nodes which have the same splitting rules between the distributed data tables and have the equivalence relation between the splitting columns of the two tables are preferentially selected as the root nodes or the secondary nodes, so that the data tables with the equivalence relation between the splitting columns are adjacent and have priority in the obtained topological ordering. Therefore, in the distributed query plan and the actual execution, join connection queries with equivalence relations among the split columns in the SQL query statement can be put down and executed in the database nodes, and the technical problem that in the prior art, if the join connection queries cannot be put down and executed in the database nodes due to the fact that the uniform equivalence relations do not exist among the tables when a plurality of data tables are subjected to connection queries, operation pressure and memory pressure of the distributed database middleware are increased finally is solved.
Description
Technical Field
The invention relates to the technical field of database query, in particular to a distributed database query optimization method, a distributed database query optimization system and electronic equipment.
Background
The database system hides the details of the data structure for the user, provides an interface irrelevant to the data, and the user only needs to simply describe the query behavior without paying attention to the internal process of acquiring the data. Query optimization is to minimize the total resource occupation and the total time consumption of a query as much as possible in the process of generating an execution strategy by a query execution engine. The database management system may have a built-in query optimizer for controlling and accelerating the query execution and data transmission process, and when the user query expression is parsed by the query processor, the query optimizer selects an appropriate data access policy for the query as output. However, query optimization has been a complex problem, particularly for queries of large amounts of data.
Aiming at the query of mass data, in order to reduce the operation pressure and the memory pressure of distributed database middleware, at present, a commonly used query optimization method is as follows: the distributed database middleware analyzes the SQL query statement generated by the application terminal after receiving the SQL query statement, and performs splitting optimization on the analyzed SQL query statement according to the equivalent relation of the two tables in the SQL query statement, so that part of join connection query is put into the storage library nodes for performing, and further the operation pressure and the memory pressure of the distributed database middleware are reduced.
However, because the SQL query statement optimization method based on the partition table equivalence relation in the prior art is only applicable to the case that the number of data tables to be queried related in the SQL query statement is two and there is only a uniform partition table equivalence relation between the two data tables, when the SQL query statement relates to queries of a plurality of data tables (the number of the data tables is greater than or equal to 3), if there is no uniform partition table equivalence relation between the plurality of data table tables, the SQL query statement optimization method in the prior art cannot be used, so that join queries cannot be transferred to a database node for execution, and finally, the operation pressure and the memory pressure of the distributed database middleware are increased.
Disclosure of Invention
The invention aims to provide an optimization method, an optimization system and electronic equipment for distributed database query, so as to reduce the operation pressure and the memory pressure of a database middleware.
In a first aspect, to solve the above technical problem, the present invention provides an optimization method for a distributed database query, applied to a middleware of a database, where the distributed database includes a plurality of database nodes for storing data tables, and the optimization method includes: receiving a data query request sent by a client, wherein the data query request comprises at least one SQL query statement, and the SQL query statement relates to JOIN connection query of a plurality of data tables and equivalence connection query of split columns among split tables contained in the plurality of data tables.
Analyzing the SQL query statement to obtain metadata corresponding to the SQL query statement, wherein the metadata comprises a plurality of data tables and an incidence relation among the data tables, and the number of the data tables included in the metadata is not less than 3.
And generating a directed acyclic graph with the data table as the node according to the metadata.
And generating a distributed query execution plan according to the directed acyclic graph, and respectively acquiring corresponding query results from a plurality of database nodes based on the distributed query execution plan.
Further, the association relationship between the data tables may include a fragmentation table equivalence relationship having the same splitting rule and an equivalence relationship between the split columns of the two tables.
Further, the step of analyzing the SQL query statement to obtain metadata corresponding to the SQL query statement, where the metadata includes a plurality of data tables and an association relationship between the data tables, may include: and analyzing the SQL query statement by using a preset analyzer, at least extracting the name of each data table and the associated field name of the equivalent relation of the fragment table from an analysis result, and taking the name and the associated field name as the associated relation.
Further, the step of generating a directed acyclic graph with the data tables as nodes according to the metadata may include: and according to the appearance sequence of the data tables in the metadata, taking the data table with the first appearance as a root node, taking the data tables with the association relation with the root node as child nodes in sequence, and taking the data tables with the association relation with the child nodes as secondary nodes in sequence.
Further, before generating the directed acyclic graph with the data tables as nodes according to the metadata, the method may further include: and determining whether the SQL query statement has INNER JOIN in-connection query of a plurality of data tables, if so, extrapolating the non-fragmented table equivalence relation in the INNER JOIN in-connection query to simplify the SQL query statement.
Further, the distributed query execution plan includes a plurality of execution sub-plans, and each execution sub-plan corresponds to a priority.
After generating the directed acyclic graph with the data tables as nodes according to the metadata, and before generating a distributed query execution plan according to the directed acyclic graph and respectively obtaining corresponding query results in a plurality of database nodes based on the distributed query execution plan, the method may further include: and aiming at the topological sorting of the directed acyclic graph, starting from a root node of the directed acyclic graph, using a data table with a fragmentation table equivalence relation with another data table as an optimized root node or an optimized child node, using a data table with a non-fragmentation table equivalence relation with another data table as a child node or a secondary node of the optimized root node or the optimized child node, so as to optimize the topological sorting of the directed acyclic graph, and enabling the priority of an execution sub-plan corresponding to the node with the fragmentation table equivalence relation in the topological sorting of the optimized directed acyclic graph to be higher than the priority of an execution sub-plan corresponding to the node with the non-fragmentation table equivalence relation.
Further, each node corresponds to a plurality of execution sub-plans.
The step of generating a query execution plan according to the directed acyclic graph, and obtaining corresponding query results in a plurality of database nodes based on the distributed query execution plan may include: and traversing each node in the optimized directed acyclic graph from the root node of the optimized directed acyclic graph according to the topological sorting of the optimized directed acyclic graph.
And in the process of traversing each node in the optimized directed acyclic graph, executing an execution sub-plan with high priority corresponding to each node with a fragmentation table equivalence relation, and after the execution of the execution sub-plan with high priority corresponding to each node with the fragmentation table equivalence relation in the optimized directed acyclic graph is finished, sequentially executing an execution sub-plan with low priority corresponding to each node with the fragmentation table equivalence relation and an execution sub-plan corresponding to each node with a non-fragmentation table equivalence relation.
In a second aspect, based on the same inventive concept, the present invention further provides an optimization system for distributed database query, which specifically includes: the system comprises a plurality of database nodes and database middleware respectively connected with the database nodes.
The database middleware is configured to receive a data query request sent by a client, optimize an SQL query statement in the analyzed data query request by using the distributed database query optimization method, so as to obtain an optimized directed acyclic graph to generate a distributed query execution plan, and obtain corresponding query results in a plurality of database nodes based on the distributed query execution plan.
The database nodes are used for storing data tables, and sub-table equivalence relations which have the same splitting rules and equivalence relations between splitting columns of two tables exist among database tables stored in the plurality of database nodes.
In a third aspect, based on the foregoing optimization method for distributed database query, the present invention further provides an electronic device, which includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus.
A memory for storing a computer program.
The processor is used for realizing the optimization method of the distributed database query when executing the program stored on the memory.
Compared with the prior art, the technical scheme of the invention has at least one of the following beneficial effects:
the invention provides an optimization method of multi-table complex query of a distributed database in an application scene of middleware of the distributed database, which comprises the steps of determining an incidence relation (including a fragmentation table equivalence relation) among a plurality of data tables by analyzing an SQL query statement, then, establishing a directed acyclic graph which takes the data tables as nodes and takes the incidence relation as a connecting line after simplifying the incidence relation among the data tables; selecting a traversal mode that nodes corresponding to the data table with the equivalent relation of the fragmentation table are earlier than nodes corresponding to the data table with the equivalent relation of the non-fragmentation table in multiple modes by utilizing the mode that the directed acyclic graph has multiple topological sorting; in the invention, the node where the equivalent relation of the fragment table of the distributed splitting rule is located is preferentially selected as a root node or a secondary node, so that the data table with the uniform equivalent relation of the fragment table in the obtained topological sorting can be preferentially processed, and therefore, when the topological sorting generated according to the topological sorting is inquired, the join connection inquiry containing the equivalent relation of the fragment table in the SQL inquiry statement can be put down into the database node for execution, thereby avoiding the technical problems that in the prior art, when a plurality of data tables are inquired in a connection mode, part of sub-plans of the join connection inquiry cannot be put down into the database node for execution if the uniform equivalent relation of the fragment table does not exist among the plurality of data tables, and finally the operation pressure and the memory pressure of the middleware of the distributed database are aggravated.
Drawings
Fig. 1 is a schematic diagram of a connection relationship structure between a client, a database middleware, and a plurality of database nodes in the prior art.
Fig. 2 is a schematic flowchart of an optimization method for a distributed database query according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a directed acyclic graph according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of another directed acyclic graph according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of another directed acyclic graph according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The following describes in more detail embodiments of the present invention with reference to the schematic drawings. The advantages and features of the present invention will become more apparent from the following description. It is to be noted that the drawings are in a very simplified form and are not to precise scale, which is provided for the purpose of facilitating and clearly illustrating embodiments of the present invention.
As described in the background art, in the prior art, with the development of information technology, the data scale of various industries has been increased greatly, and the problems of massive concurrency and massive data need to be solved, and at the same time, high reliability is required. In the past, the problem is solved by using a medium-large computer in the industry. Because mainframes are generally expensive, the original requirements can be realized by using a relatively low-cost X86 machine by using a distributed architecture, the monopoly of the IOE in the aspect is broken, and partial independent and independent research and development are realized. Distributed database technology is then developed to address this problem. Taking a distributed database middleware dble as an example, the problem of massive data and concurrency is solved by splitting data to different database nodes in a certain splitting mode, and the high availability of the original single database can be utilized, and the simple structure is shown in fig. 1. Thus our concurrency and data volumes are spread across the databases of individual monomers. A simple SQL query sentence with addition, deletion and modification can be sent to one database node, and extra pressure brought by an application side is dispersed. However, in actual service, the query is not only performed according to a single data table, but also the JOIN connection query of multiple data tables may be performed.
For JOIN connection query of multiple data tables, a method commonly used before is to execute JOIN connection query of multiple data tables in the database middleware, which inevitably causes the problems of low performance and large memory occupation of the database middleware. For the problem, there is a method in the industry to perform optimization, that is, splitting is performed according to the partition table equivalence relationship between two tables, that is, the middleware of the distributed database parses an SQL query statement generated by an application terminal after receiving the SQL query statement, and performs splitting optimization on the parsed SQL query statement according to the partition table equivalence relationship between two tables in the SQL query statement, so that part of join connection queries are put into a repository node for performing, and further, the operation pressure and the memory pressure of the middleware of the distributed database are reduced.
However, because the SQL query statement optimization method based on the partition table equivalence relation in the prior art is only applicable to the case that the number of data tables to be queried related in the SQL query statement is two and there is only a uniform partition table equivalence relation between the two data tables, when the SQL query statement relates to queries of a plurality of data tables (the number of the data tables is greater than or equal to 3), if there is no uniform partition table equivalence relation between the plurality of data table tables, the SQL query statement optimization method in the prior art cannot be used, so that join queries cannot be transferred to a database node for execution, and finally, the operation pressure and the memory pressure of the distributed database middleware are increased.
Therefore, the invention provides an optimization method, a system and electronic equipment for distributed database query, so as to reduce the operation pressure and the memory pressure of a database middleware.
First, a method for optimizing a distributed database query according to an embodiment of the present invention is described below.
It should be noted that, in the embodiment of the present invention, as shown in fig. 1, the distributed database may include a plurality of database nodes Node m for storing database tables, for example, the database nodes Node1, the database nodes Node2, and the like, and the data stored between the database nodes may be the same or different.
In addition, in order to implement data transmission, the database middleware is connected with the database nodes in a one-to-one correspondence manner, and the connection technology may be bluetooth or a local area network. Of course, the connection technology in the embodiment of the present invention may be any connection technology that can connect the database middleware (the server in which the database middleware resides) with each database node, and is not limited herein.
Referring to fig. 2, fig. 2 is a schematic flowchart of an optimization method for a distributed database query according to an embodiment of the present invention, and as shown in fig. 2, the optimization method at least includes the following steps:
step S100, receiving a data query request sent by a client, where the data query request includes at least one SQL query statement, and the SQL query statement relates to JOIN connection queries of multiple data tables and equivalence connection queries of split columns between split tables included in the multiple data tables.
In this embodiment, a target user may send a data query request to the database middleware through a client, for example, an application terminal such as a mobile phone, a computer, a tablet, or the like, according to an actual situation, after receiving the data query request, the database middleware may query a configuration file according to request information in the data query request, so as to determine in which database node the data requested by the client exists, send a query execution plan to the database node, and finally send data returned by the database node to the client.
Moreover, the data query request may include one or more SQL query statements, and each SQL query statement may include a plurality of sub-statements. Because the optimization method for distributed database query provided by the invention is a query optimization method for JOIN connection query of a plurality of data tables without uniform partition table equivalence relation related to SQL query statement, the SQL query statement in the embodiment of the invention must relate to JOIN connection query of a plurality of data tables and equivalence connection query of split columns among split tables contained in the plurality of data tables.
Illustratively, the SQL query statement in the embodiment of the present invention may be:
SELECT a.Name,a.DeptName,b.Manager,c.salary FROM Employee a LEFT JOIN Level c on a.level=c.levelname LEFT JOIN Dept b on a.DeptName=b.DeptName order by a.name。
wherein, table a (employee) and table b (dept) are as follows:
TABLE a
Name | EmpId | DeptName | Level |
Mary | 1257 | HumanResources | P7 |
Harriet | 2201 | Sales | P8 |
Sally | 2242 | Sales | P7 |
George | 3402 | Finance | P8 |
Harry | 3415 | Finance | P7 |
Table b
DeptName | | Manager |
Finance | ||
2 | George | |
Sales | 3 | Harriet |
Table c
LevelName | LevelId | Salary |
P7 | 7 | 10000 |
P8 | 8 | 15000 |
Step S200, analyzing the SQL query statement to obtain metadata corresponding to the SQL query statement, wherein the metadata comprises a plurality of data tables and an incidence relation among the data tables.
In this embodiment, after the database middleware receives the data query request sent by the client, it first needs to perform syntax parsing on an SQL query statement in the data query request by using a parser to obtain a database node and a syntax tree structure that can be processed by the database middleware, where the syntax tree structure includes metadata corresponding to the SQL query statement, and specifically, the metadata may include a plurality of data tables and an association relationship between the plurality of data tables. The number of the data tables is not less than 3, and the association relationship among the plurality of data tables may include a fragmentation table equivalence relationship, and of course, other association relationships may also be used. The equivalence relation of the fragment tables is specifically that the fragment tables have the same splitting rule and the splitting columns of the two tables have the equivalence relation.
As a specific example, in a specific embodiment of the present invention, a step of parsing the SQL query statement to obtain metadata corresponding to the SQL query statement is further provided, where the metadata includes a plurality of data tables and an association relationship between the plurality of data tables, and the step includes:
and analyzing the SQL query statement by using a preset analyzer, at least extracting the name of each data table and the associated field name of the equivalent relation of the fragment table from an analysis result, and taking the name and the associated field name as the associated relation.
Illustratively, if the SQL query statement received by the database middleware is:
SELECT a.Name,a.DeptName,b.Manager,c.salary FROM Employee a LEFT JOIN Level c on a.level=c.levelname LEFT JOIN Dept b on a.DeptName=b.DeptName order by a.name。
then, the name of the data table obtained by analyzing the SQL query statement by using the analyzer is as follows: employee a, Level c and Dept b, and the obtained association relationship is as follows: level and a. Wherein, level name and DeptName are field names in the data table respectively.
And step S300, generating a directed acyclic graph taking the data table as a node according to the metadata.
In this embodiment, after obtaining the plurality of data tables included in the SQL query statement and the association relationships among the plurality of data tables in step S200, the association relationships of the SQL query statement may be simplified, and then, the names of the data tables are used as nodes, and the association relationships among the data tables are used as connecting lines, so as to construct a directed acyclic graph.
As a specific example, in a specific embodiment of the present invention, there is further provided a step of generating a directed acyclic graph with nodes being the data tables according to the metadata, where the step may include:
and according to the appearance sequence of the data tables in the metadata, taking the data table with the first appearance as a root node, taking the data tables with the association relation with the root node as child nodes in sequence, and taking the data tables with the association relation with the child nodes as secondary nodes in sequence.
Further, before generating the directed acyclic graph with the data tables as nodes according to the metadata, the optimization method for the distributed database query provided by the present invention may further include:
whether the InNER JOIN intra-connection query of a plurality of data tables exists in the association relationship of the SQL query statement or not can be determined, if yes, the non-fragmented table equivalence relationship in the InNER JOIN intra-connection query is extrapolated, and therefore the SQL query statement is simplified.
Example 1: if the SQL query statement sent by the client is:
a. name, a.deptname, b.manager, c.saray FROM Employee LEFT JOIN Level c on.level ═ c.levelname LEFT JOIN Dept b on.deptname ═ b.deptname order by a.name, then the directed acyclic graph formed by the present invention is as shown in fig. 3.
Example 2: if the SQL query statement sent by the client is:
a. name, a.deptname, b.manager, c.sarary FROM LEFT empty LEFT JOIN Level c on.level ═ c.levelname LEFT JOIN depth b. deptname ═ b.deptname and b.deptid ═ 2order by a. name, then the directed acyclic graph formed by the present invention is as shown in fig. 4.
It should be noted that, in the embodiment of the present invention, after the step S200 parses the SQL query statement to obtain the metadata corresponding to the SQL query statement, and before the step S300 generates the directed acyclic graph by using the metadata, the SQL query statement sent by the client may be simplified according to the following four relational algebra rules. The four relational algebra rules are specifically as follows:
rule A, when a plurality of data tables have join connection query but the latter two data tables have no direct fragmentation table equivalence relation, the results are not influenced by the exchange of join connection sequence of the data tables.
Rule B, if a plurality of data tables have leftjoin left connection query and the connection condition of the following third data table depends on the preceding 2 data tables, the join connection sequence cannot be exchanged.
Rule C the on condition for leftjoin left connection is that where cannot be extrapolated.
Rule D the on condition for the intra-innerjoin connection is where it can be extrapolated.
Example 1, assume that the SQL query statement sent by the client is:
SELECT a.Name,a.DeptName,b.Manager,c.salary FROM Employee a INNER JOIN Level c on a.level=c.levelname INNER JOIN Dept b on a.DeptName=b.DeptName AND b.DeptId=2order by a.name。
then, according to rule D, the above SQL query statement may be equivalent to:
SELECT a.Name,a.DeptName,b.Manager,c.salary FROM Employee a INNER JOIN Level c on a.level=c.levelname INNER JOIN Dept b on a.DeptName=b.DeptName WHERE b.DeptId=2order by a.name。
thus, the condition of b.deptid ═ 2 does not need to be considered during the construction of the ordered acyclic graph and optimization of the optimized SQL query statement. Then, the constructed directed acyclic graph will be like fig. 3, and the final topological ordering (the order can be exchanged according to rule a) is realized, thereby achieving the purpose of optimizing the distributed database query proposed by the present invention.
Example 2, suppose the SQL query statement sent by the client is:
SELECT a.Name,a.DeptName,b.Manager,c.salary FROM Employee a LEFT JOIN Level c on a.level=c.levelname LEFT JOIN Dept b on a.DeptName=b.DeptName AND b.DeptId=2order by a.name。
then according to rule C, b.deptid ═ 2 cannot be extrapolated. Then the directed acyclic graph constructed for the SQL query statement will have the final topological ordering (the order can be exchanged according to rule a) as shown in fig. 4, thereby achieving the purpose of optimizing the distributed database query proposed by the present invention.
Example 3, assume that the SQL query statement sent by the client is:
SELECT a.Name,a.DeptName,b.Manager,c.salary FROM Employee a LEFT JOIN Level c on a.level=c.levelname LEFT JOIN Dept b on a.DeptName=b.DeptName AND c.levelId=2order by a.name。
then the order of the three tables involved in the SQL statement cannot be swapped according to rule B. Therefore, the directed acyclic graph constructed for the SQL query statement is as shown in fig. 5, and the final topological ordering (the order cannot be exchanged according to rule B) is only one, so the SQL query statement in the embodiment cannot be optimized.
And step S400, generating a distributed query execution plan according to the directed acyclic graph, and respectively acquiring corresponding query results from a plurality of database nodes based on the distributed query execution plan.
The distributed query execution plan comprises a plurality of execution sub-plans, and each execution sub-plan corresponds to a priority; each node corresponds to a plurality of execution sub-plans.
In this embodiment, after the directed acyclic graph is formed, the topology ranking in the directed acyclic graph may be optimized, and then the nodes corresponding to the databases having the fragmentation table equivalence relation in the optimized directed acyclic graph are earlier than the nodes corresponding to the databases having the non-fragmentation table equivalence relation. And then, generating a distributed query execution plan according to the optimized directed acyclic graph, and then respectively sending the distributed query execution plan to a plurality of database nodes to perform corresponding query tasks to obtain corresponding query results.
Specifically, the database middleware may traverse each node in the optimized directed acyclic graph from a root node of the optimized directed acyclic graph according to a topological sort of the optimized directed acyclic graph; and in the process of traversing each node in the optimized directed acyclic graph, executing an execution sub-plan with high priority corresponding to each node with a fragmentation table equivalence relation, and after the execution of the execution sub-plan with high priority corresponding to each node with the fragmentation table equivalence relation in the optimized directed acyclic graph is finished, sequentially executing an execution sub-plan with low priority corresponding to each node with the fragmentation table equivalence relation and an execution sub-plan corresponding to each node with a non-fragmentation table equivalence relation.
Further, after generating a directed acyclic graph with the data tables as nodes according to the metadata, and before generating a distributed query execution plan according to the directed acyclic graph and obtaining corresponding query results in a plurality of database nodes respectively based on the distributed query execution plan, the optimization method may further include:
and aiming at the topological sorting of the directed acyclic graph, starting from a root node of the directed acyclic graph, using a data table with a fragmentation table equivalence relation with another data table as an optimized root node or an optimized child node, using a data table with a non-fragmentation table equivalence relation with another data table as a child node or a secondary node of the optimized root node or the optimized child node, so as to optimize the topological sorting of the directed acyclic graph, and enabling the priority of an execution sub-plan corresponding to the node with the fragmentation table equivalence relation in the topological sorting of the optimized directed acyclic graph to be higher than the priority of an execution sub-plan corresponding to the node with the non-fragmentation table equivalence relation.
In this embodiment, the purpose of optimizing the initially formed directed acyclic graph is to preferentially select a node where a fragmentation table equivalence relation of the distributed splitting rule is located from a plurality of topology sequences of the initially formed directed acyclic graph, then place a JOIN connection distributed query execution sub-plan corresponding to the nodes where the fragmentation table equivalence relation exists in a corresponding database node, execute a corresponding JOIN connection query in the database node, and execute another query execution sub-plan corresponding to the node where the fragmentation table equivalence relation exists or an execution sub-plan corresponding to the node where the non-fragmentation table equivalence relation exists after all JOIN connection distributed query execution sub-plans corresponding to the nodes where the fragmentation table equivalence relation exists are executed. In the invention, the association relation among a plurality of data tables with non-uniform sharding table equivalence relation is constructed and optimized by the directed acyclic graph, so that the data tables with sharding table equivalence relation are adjacent and have priority in the obtained topological ordering. Therefore, in the distributed query plan and the actual execution, the join connection query with the partition table equivalence relation in the SQL query statement can be put into the database node for execution, and the purpose of the invention is realized.
It should be noted that, when the JOIN connection query execution sub-plans corresponding to the nodes having the equivalent relationship of the fragmentation table are put into the corresponding database nodes for query, there is no order between them.
For ease of understanding, the following presents, in a complete form, a method for optimizing a distributed database query, using the present invention, in accordance with one embodiment.
Firstly, assume that the SQL query statement received by the database middleware is: SQL2-1: SELECT a.name, a.deptname, b.manager, c.sarary FROM Employee a LEFT JOIN Level c on.level ═ c.levelname LEFT JOIN depth b.deptname order by a.name;
then, the names of a plurality of database tables obtained by analyzing the SQL query statement are respectively: employee a, Dept b and Level c, and the obtained correlation among the data tables is as follows: level and a.
Secondly, a directed acyclic graph is formed based on the names of the plurality of database tables and the incidence relations thereof, as shown in fig. 3.
Finally, a distributed query execution plan is generated based on the directed acyclic graph shown in fig. 3, and corresponding query steps are respectively obtained from a plurality of database nodes based on the distributed query execution plan, which are briefly described as follows:
step 1.1 and step 1.2, a JOIN connection query is made to a part of data of the table Exployee and the table Dept in the database Node 1.
Step 2.1 and step 2.2, a JOIN connection query is made to the other part of data of table explicit and table Dept inside the database Node 2.
And 3.1 and 3.2, merging the data prepared in the steps 1.1 and 1.2 and the data prepared in the steps 2.1 and 2.2.
And 4.1 and 4.2, extracting the data in the Level table from the database Node1 and the database Node 2.
And 5.1 and 5.2, taking the data in the step 3.1 and the step 3.2 as a left table, taking the data in the Level table as a right table, performing JOIN connection to obtain a temporary database table, and returning the temporary database table to the client.
Based on the same technical concept, corresponding to the embodiment of the method shown in fig. 2, the embodiment of the present invention further provides an optimization system for distributed database query, including: the system comprises a plurality of database nodes and database middleware respectively connected with the database nodes.
The database middleware is configured to receive a data query request sent by a client, optimize an SQL query statement in the analyzed data query request by using the distributed database query optimization method, so as to obtain an optimized directed acyclic graph to generate a distributed query execution plan, and obtain corresponding query results in a plurality of database nodes based on the distributed query execution plan.
The database nodes are used for storing data tables, and the data tables stored in the database nodes have a fragmentation table equivalence relation.
In summary, the present invention provides an optimization method for multi-table complex query of a distributed database in an application scenario of distributed database middleware, which determines an association relationship (including a fragmentation table equivalence relationship) between a plurality of data tables by analyzing SQL query statements, and then establishes a directed acyclic graph using the data tables as nodes and the association relationship as a connection line after simplifying the association relationship between the plurality of data tables; selecting a traversal mode that nodes corresponding to the data table with the equivalent relation of the fragmentation table are earlier than nodes corresponding to the data table with the equivalent relation of the non-fragmentation table in multiple modes by utilizing the mode that the directed acyclic graph has multiple topological sorting; in the invention, the node where the equivalent relation of the fragment table of the distributed splitting rule is located is preferentially selected as a root node or a secondary node, so that the data table with the uniform equivalent relation of the fragment table in the obtained topological sorting can be preferentially processed, and therefore, when the topological sorting generated according to the topological sorting is inquired, the join connection inquiry containing the equivalent relation of the fragment table in the SQL inquiry statement can be put down into the database node for execution, thereby avoiding the technical problems that in the prior art, when a plurality of data tables are inquired in a connection mode, part of sub-plans of the join connection inquiry cannot be put down into the database node for execution if the uniform equivalent relation of the fragment table does not exist among the plurality of data tables, and finally the operation pressure and the memory pressure of the middleware of the distributed database are aggravated.
An embodiment of the present invention further provides an electronic device, as shown in fig. 6, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 complete mutual communication through the communication bus 604,
a memory 603 for storing a computer program;
the processor 601 is configured to implement the distributed database query optimization method provided by the embodiment of the present invention when executing the program stored in the memory 603.
Specifically, the method for optimizing query of distributed database includes: the method comprises the steps of pre-receiving a data query request sent by a client, wherein the data query request comprises at least one SQL query statement, and the SQL query statement relates to JOIN connection query of a plurality of data tables and equivalence connection query of split columns among split tables contained in the data tables.
And analyzing the SQL query statement to obtain metadata corresponding to the SQL query statement, wherein the metadata comprises a plurality of data tables and an incidence relation among the data tables.
And generating a directed acyclic graph with the data table as the node according to the metadata.
And generating a distributed query execution plan according to the directed acyclic graph, and respectively acquiring corresponding query results from a plurality of database nodes based on the distributed query execution plan.
In addition, other implementation manners of the distributed database query optimization method implemented by the processor 601 executing the program stored in the memory 603 are the same as the implementation manners mentioned in the foregoing method embodiment, and are not described herein again.
The communication bus mentioned above for the control terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
In yet another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the distributed database query optimization method described in any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to be performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus, the electronic device, and the computer-readable storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and in relation to the description, reference may be made to some portions of the description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (9)
1. An optimization method for a distributed database query, applied to a database middleware, wherein the distributed database comprises a plurality of database nodes for storing data tables, the optimization method comprising:
receiving a data query request sent by a client, wherein the data query request comprises at least one SQL query statement, and the SQL query statement relates to JOIN connection query of a plurality of data tables and equivalence connection query of split columns among split tables contained in the plurality of data tables;
analyzing the SQL query statement to obtain metadata corresponding to the SQL query statement, wherein the metadata comprises a plurality of data tables and an incidence relation among the data tables, and the number of the data tables contained in the metadata is not less than 3;
generating a directed acyclic graph with the data table as a node according to the metadata;
and generating a distributed query execution plan according to the directed acyclic graph, and respectively acquiring corresponding query results from a plurality of database nodes based on the distributed query execution plan.
2. The method of optimizing distributed database queries of claim 1, wherein the associations between the data tables include shard table isograms having the same splitting rules and an isogram between split columns of two tables.
3. The method for optimizing distributed database query according to claim 2, wherein the step of parsing the SQL query statement to obtain metadata corresponding to the SQL query statement, the metadata including a plurality of data tables, and an association relationship between the plurality of data tables includes:
and analyzing the SQL query statement by using a preset analyzer, at least extracting the name of each data table and the associated field name of the equivalent relation of the fragment table from an analysis result, and taking the name and the associated field name as the associated relation.
4. The method for optimizing distributed database queries of claim 1 wherein the step of generating a directed acyclic graph having nodes from said data tables based on said metadata comprises:
and according to the appearance sequence of the data tables in the metadata, taking the data table with the first appearance as a root node, taking the data tables with the association relation with the root node as child nodes in sequence, and taking the data tables with the association relation with the child nodes as secondary nodes in sequence.
5. The method of optimizing distributed database queries of claim 2, wherein prior to generating a directed acyclic graph from the metadata that is a node of the data table, the method further comprises:
and determining whether the SQL query statement has INNER JOIN in-connection query of a plurality of data tables, if so, extrapolating the non-fragmented table equivalence relation in the INNER JOIN in-connection query to simplify the SQL query statement.
6. The method of claim 4, wherein the distributed query execution plan comprises a plurality of execution sub-plans, each of the execution sub-plans corresponding to a priority level;
after generating a directed acyclic graph with the data tables as nodes according to the metadata, and before generating a distributed query execution plan according to the directed acyclic graph and respectively obtaining corresponding query results in a plurality of database nodes based on the distributed query execution plan, the method further includes:
and aiming at the topological sorting of the directed acyclic graph, starting from a root node of the directed acyclic graph, using a data table with a fragmentation table equivalence relation with another data table as an optimized root node or an optimized child node, using a data table with a non-fragmentation table equivalence relation with another data table as a child node or a secondary node of the optimized root node or the optimized child node, so as to optimize the topological sorting of the directed acyclic graph, and enabling the priority of an execution sub-plan corresponding to the node with the fragmentation table equivalence relation in the topological sorting of the optimized directed acyclic graph to be higher than the priority of an execution sub-plan corresponding to the node with the non-fragmentation table equivalence relation.
7. The method for optimizing distributed database queries of claim 6, wherein each node corresponds to a plurality of execution sub-plans;
the step of generating a query execution plan according to the directed acyclic graph and obtaining corresponding query results in a plurality of database nodes respectively based on the distributed query execution plan includes:
traversing each node in the optimized directed acyclic graph from a root node of the optimized directed acyclic graph according to the topological sorting of the optimized directed acyclic graph;
and in the process of traversing each node in the optimized directed acyclic graph, executing an execution sub-plan with high priority corresponding to each node with a fragmentation table equivalence relation, and after the execution of the execution sub-plan with high priority corresponding to each node with the fragmentation table equivalence relation in the optimized directed acyclic graph is finished, sequentially executing an execution sub-plan with low priority corresponding to each node with the fragmentation table equivalence relation and an execution sub-plan corresponding to each node with a non-fragmentation table equivalence relation.
8. A system for optimizing distributed database queries, comprising: a plurality of database nodes and a database middleware respectively connected with the database nodes, wherein,
the database middleware is used for receiving a data query request sent by a client, optimizing an SQL query statement in the analyzed data query request by using the distributed database query optimization method according to any one of claims 1 to 7 to obtain an optimized directed acyclic graph to generate a distributed query execution plan, and respectively obtaining corresponding query results from a plurality of database nodes based on the distributed query execution plan;
the database nodes are used for storing data tables, and sub-table equivalence relations which have the same splitting rules and equivalence relations between splitting columns of two tables exist among database tables stored in the plurality of database nodes.
9. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor and the communication interface are used for realizing mutual communication by the memory through the communication bus;
a memory for storing a computer program;
a processor for implementing a method of optimizing a distributed database query as claimed in any one of claims 1 to 7 when executing a program stored on a memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210600055.0A CN114860764A (en) | 2022-05-26 | 2022-05-26 | Optimization method and system for distributed database query and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210600055.0A CN114860764A (en) | 2022-05-26 | 2022-05-26 | Optimization method and system for distributed database query and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114860764A true CN114860764A (en) | 2022-08-05 |
Family
ID=82641560
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210600055.0A Pending CN114860764A (en) | 2022-05-26 | 2022-05-26 | Optimization method and system for distributed database query and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114860764A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116383238A (en) * | 2023-06-06 | 2023-07-04 | 湖南红普创新科技发展有限公司 | Data virtualization system, method, device, equipment and medium based on graph structure |
CN117131078A (en) * | 2023-10-20 | 2023-11-28 | 天津南大通用数据技术股份有限公司 | Cross-distribution mode-based associated query optimization method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111352950A (en) * | 2020-03-04 | 2020-06-30 | 上海达梦数据库有限公司 | Database table equivalent connection optimization method and device, server and storage medium |
CN112395303A (en) * | 2019-08-15 | 2021-02-23 | 阿里巴巴集团控股有限公司 | Query execution method and device, electronic equipment and computer readable medium |
CN113297250A (en) * | 2021-05-28 | 2021-08-24 | 北京思特奇信息技术股份有限公司 | Method and system for multi-table association query of distributed database |
-
2022
- 2022-05-26 CN CN202210600055.0A patent/CN114860764A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112395303A (en) * | 2019-08-15 | 2021-02-23 | 阿里巴巴集团控股有限公司 | Query execution method and device, electronic equipment and computer readable medium |
CN111352950A (en) * | 2020-03-04 | 2020-06-30 | 上海达梦数据库有限公司 | Database table equivalent connection optimization method and device, server and storage medium |
CN113297250A (en) * | 2021-05-28 | 2021-08-24 | 北京思特奇信息技术股份有限公司 | Method and system for multi-table association query of distributed database |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116383238A (en) * | 2023-06-06 | 2023-07-04 | 湖南红普创新科技发展有限公司 | Data virtualization system, method, device, equipment and medium based on graph structure |
CN116383238B (en) * | 2023-06-06 | 2023-08-29 | 湖南红普创新科技发展有限公司 | Data virtualization system, method, device, equipment and medium based on graph structure |
CN117131078A (en) * | 2023-10-20 | 2023-11-28 | 天津南大通用数据技术股份有限公司 | Cross-distribution mode-based associated query optimization method |
CN117131078B (en) * | 2023-10-20 | 2024-02-06 | 天津南大通用数据技术股份有限公司 | Cross-distribution mode-based associated query optimization method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109766345B (en) | Metadata processing method and device, equipment and readable storage medium | |
US10635671B2 (en) | Sort-merge band join optimization | |
CN114860764A (en) | Optimization method and system for distributed database query and electronic equipment | |
US8224807B2 (en) | Enhanced utilization of query optimization | |
US10831737B2 (en) | Method and device for partitioning association table in distributed database | |
US7606827B2 (en) | Query optimization using materialized views in database management systems | |
EP2637111A1 (en) | Data management system and method using database middleware | |
US11514009B2 (en) | Method and systems for mapping object oriented/functional languages to database languages | |
CN111581234B (en) | RAC multi-node database query method, device and system | |
US11176133B2 (en) | Filter evaluation for table fragments | |
US8812492B2 (en) | Automatic and dynamic design of cache groups | |
CN101916280A (en) | Parallel computing system and method for carrying out load balance according to query contents | |
CN113094387A (en) | Data query method and device, electronic equipment and machine-readable storage medium | |
CN114490724B (en) | Method and device for processing database query statement | |
CN111782634B (en) | Data distributed storage method, device, electronic equipment and storage medium | |
CN112818010B (en) | Database query method and device | |
CN114168620A (en) | Execution plan processing method and device | |
CN114547083A (en) | Data processing method and device and electronic equipment | |
CN114238387A (en) | Data query method and device, electronic equipment and storage medium | |
CN110413642B (en) | Application-unaware fragmentation database parsing and optimizing method | |
US20190370259A1 (en) | Devices and methods for implementing dynamic collaborative workflow systems | |
CN117573730B (en) | Data processing method, apparatus, device, readable storage medium, and program product | |
CN112988809B (en) | Data query method, device, equipment and medium based on relational database | |
WO2021215101A1 (en) | Data management system and data management method | |
US20240126757A1 (en) | Outer semi join for disjunctive subquery unnesting |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20220805 |
|
RJ01 | Rejection of invention patent application after publication |