CN114860764A - Optimization method and system for distributed database query and electronic equipment - Google Patents

Optimization method and system for distributed database query and electronic equipment Download PDF

Info

Publication number
CN114860764A
CN114860764A CN202210600055.0A CN202210600055A CN114860764A CN 114860764 A CN114860764 A CN 114860764A CN 202210600055 A CN202210600055 A CN 202210600055A CN 114860764 A CN114860764 A CN 114860764A
Authority
CN
China
Prior art keywords
query
database
node
data
acyclic graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210600055.0A
Other languages
Chinese (zh)
Inventor
阎虎青
鲍凤其
董诚怡
袁琳铸
郭奥门
文韵涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Aikesheng Information Technology Co ltd
Original Assignee
Shanghai Aikesheng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Aikesheng Information Technology Co ltd filed Critical Shanghai Aikesheng Information Technology Co ltd
Priority to CN202210600055.0A priority Critical patent/CN114860764A/en
Publication of CN114860764A publication Critical patent/CN114860764A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24558Binary matching operations
    • G06F16/2456Join operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an optimization method, a system and electronic equipment for distributed database query, which are applied to the technical field of database query. In the invention, the nodes which have the same splitting rules between the distributed data tables and have the equivalence relation between the splitting columns of the two tables are preferentially selected as the root nodes or the secondary nodes, so that the data tables with the equivalence relation between the splitting columns are adjacent and have priority in the obtained topological ordering. Therefore, in the distributed query plan and the actual execution, join connection queries with equivalence relations among the split columns in the SQL query statement can be put down and executed in the database nodes, and the technical problem that in the prior art, if the join connection queries cannot be put down and executed in the database nodes due to the fact that the uniform equivalence relations do not exist among the tables when a plurality of data tables are subjected to connection queries, operation pressure and memory pressure of the distributed database middleware are increased finally is solved.

Description

Optimization method and system for distributed database query and electronic equipment
Technical Field
The invention relates to the technical field of database query, in particular to a distributed database query optimization method, a distributed database query optimization system and electronic equipment.
Background
The database system hides the details of the data structure for the user, provides an interface irrelevant to the data, and the user only needs to simply describe the query behavior without paying attention to the internal process of acquiring the data. Query optimization is to minimize the total resource occupation and the total time consumption of a query as much as possible in the process of generating an execution strategy by a query execution engine. The database management system may have a built-in query optimizer for controlling and accelerating the query execution and data transmission process, and when the user query expression is parsed by the query processor, the query optimizer selects an appropriate data access policy for the query as output. However, query optimization has been a complex problem, particularly for queries of large amounts of data.
Aiming at the query of mass data, in order to reduce the operation pressure and the memory pressure of distributed database middleware, at present, a commonly used query optimization method is as follows: the distributed database middleware analyzes the SQL query statement generated by the application terminal after receiving the SQL query statement, and performs splitting optimization on the analyzed SQL query statement according to the equivalent relation of the two tables in the SQL query statement, so that part of join connection query is put into the storage library nodes for performing, and further the operation pressure and the memory pressure of the distributed database middleware are reduced.
However, because the SQL query statement optimization method based on the partition table equivalence relation in the prior art is only applicable to the case that the number of data tables to be queried related in the SQL query statement is two and there is only a uniform partition table equivalence relation between the two data tables, when the SQL query statement relates to queries of a plurality of data tables (the number of the data tables is greater than or equal to 3), if there is no uniform partition table equivalence relation between the plurality of data table tables, the SQL query statement optimization method in the prior art cannot be used, so that join queries cannot be transferred to a database node for execution, and finally, the operation pressure and the memory pressure of the distributed database middleware are increased.
Disclosure of Invention
The invention aims to provide an optimization method, an optimization system and electronic equipment for distributed database query, so as to reduce the operation pressure and the memory pressure of a database middleware.
In a first aspect, to solve the above technical problem, the present invention provides an optimization method for a distributed database query, applied to a middleware of a database, where the distributed database includes a plurality of database nodes for storing data tables, and the optimization method includes: receiving a data query request sent by a client, wherein the data query request comprises at least one SQL query statement, and the SQL query statement relates to JOIN connection query of a plurality of data tables and equivalence connection query of split columns among split tables contained in the plurality of data tables.
Analyzing the SQL query statement to obtain metadata corresponding to the SQL query statement, wherein the metadata comprises a plurality of data tables and an incidence relation among the data tables, and the number of the data tables included in the metadata is not less than 3.
And generating a directed acyclic graph with the data table as the node according to the metadata.
And generating a distributed query execution plan according to the directed acyclic graph, and respectively acquiring corresponding query results from a plurality of database nodes based on the distributed query execution plan.
Further, the association relationship between the data tables may include a fragmentation table equivalence relationship having the same splitting rule and an equivalence relationship between the split columns of the two tables.
Further, the step of analyzing the SQL query statement to obtain metadata corresponding to the SQL query statement, where the metadata includes a plurality of data tables and an association relationship between the data tables, may include: and analyzing the SQL query statement by using a preset analyzer, at least extracting the name of each data table and the associated field name of the equivalent relation of the fragment table from an analysis result, and taking the name and the associated field name as the associated relation.
Further, the step of generating a directed acyclic graph with the data tables as nodes according to the metadata may include: and according to the appearance sequence of the data tables in the metadata, taking the data table with the first appearance as a root node, taking the data tables with the association relation with the root node as child nodes in sequence, and taking the data tables with the association relation with the child nodes as secondary nodes in sequence.
Further, before generating the directed acyclic graph with the data tables as nodes according to the metadata, the method may further include: and determining whether the SQL query statement has INNER JOIN in-connection query of a plurality of data tables, if so, extrapolating the non-fragmented table equivalence relation in the INNER JOIN in-connection query to simplify the SQL query statement.
Further, the distributed query execution plan includes a plurality of execution sub-plans, and each execution sub-plan corresponds to a priority.
After generating the directed acyclic graph with the data tables as nodes according to the metadata, and before generating a distributed query execution plan according to the directed acyclic graph and respectively obtaining corresponding query results in a plurality of database nodes based on the distributed query execution plan, the method may further include: and aiming at the topological sorting of the directed acyclic graph, starting from a root node of the directed acyclic graph, using a data table with a fragmentation table equivalence relation with another data table as an optimized root node or an optimized child node, using a data table with a non-fragmentation table equivalence relation with another data table as a child node or a secondary node of the optimized root node or the optimized child node, so as to optimize the topological sorting of the directed acyclic graph, and enabling the priority of an execution sub-plan corresponding to the node with the fragmentation table equivalence relation in the topological sorting of the optimized directed acyclic graph to be higher than the priority of an execution sub-plan corresponding to the node with the non-fragmentation table equivalence relation.
Further, each node corresponds to a plurality of execution sub-plans.
The step of generating a query execution plan according to the directed acyclic graph, and obtaining corresponding query results in a plurality of database nodes based on the distributed query execution plan may include: and traversing each node in the optimized directed acyclic graph from the root node of the optimized directed acyclic graph according to the topological sorting of the optimized directed acyclic graph.
And in the process of traversing each node in the optimized directed acyclic graph, executing an execution sub-plan with high priority corresponding to each node with a fragmentation table equivalence relation, and after the execution of the execution sub-plan with high priority corresponding to each node with the fragmentation table equivalence relation in the optimized directed acyclic graph is finished, sequentially executing an execution sub-plan with low priority corresponding to each node with the fragmentation table equivalence relation and an execution sub-plan corresponding to each node with a non-fragmentation table equivalence relation.
In a second aspect, based on the same inventive concept, the present invention further provides an optimization system for distributed database query, which specifically includes: the system comprises a plurality of database nodes and database middleware respectively connected with the database nodes.
The database middleware is configured to receive a data query request sent by a client, optimize an SQL query statement in the analyzed data query request by using the distributed database query optimization method, so as to obtain an optimized directed acyclic graph to generate a distributed query execution plan, and obtain corresponding query results in a plurality of database nodes based on the distributed query execution plan.
The database nodes are used for storing data tables, and sub-table equivalence relations which have the same splitting rules and equivalence relations between splitting columns of two tables exist among database tables stored in the plurality of database nodes.
In a third aspect, based on the foregoing optimization method for distributed database query, the present invention further provides an electronic device, which includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus.
A memory for storing a computer program.
The processor is used for realizing the optimization method of the distributed database query when executing the program stored on the memory.
Compared with the prior art, the technical scheme of the invention has at least one of the following beneficial effects:
the invention provides an optimization method of multi-table complex query of a distributed database in an application scene of middleware of the distributed database, which comprises the steps of determining an incidence relation (including a fragmentation table equivalence relation) among a plurality of data tables by analyzing an SQL query statement, then, establishing a directed acyclic graph which takes the data tables as nodes and takes the incidence relation as a connecting line after simplifying the incidence relation among the data tables; selecting a traversal mode that nodes corresponding to the data table with the equivalent relation of the fragmentation table are earlier than nodes corresponding to the data table with the equivalent relation of the non-fragmentation table in multiple modes by utilizing the mode that the directed acyclic graph has multiple topological sorting; in the invention, the node where the equivalent relation of the fragment table of the distributed splitting rule is located is preferentially selected as a root node or a secondary node, so that the data table with the uniform equivalent relation of the fragment table in the obtained topological sorting can be preferentially processed, and therefore, when the topological sorting generated according to the topological sorting is inquired, the join connection inquiry containing the equivalent relation of the fragment table in the SQL inquiry statement can be put down into the database node for execution, thereby avoiding the technical problems that in the prior art, when a plurality of data tables are inquired in a connection mode, part of sub-plans of the join connection inquiry cannot be put down into the database node for execution if the uniform equivalent relation of the fragment table does not exist among the plurality of data tables, and finally the operation pressure and the memory pressure of the middleware of the distributed database are aggravated.
Drawings
Fig. 1 is a schematic diagram of a connection relationship structure between a client, a database middleware, and a plurality of database nodes in the prior art.
Fig. 2 is a schematic flowchart of an optimization method for a distributed database query according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a directed acyclic graph according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of another directed acyclic graph according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of another directed acyclic graph according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The following describes in more detail embodiments of the present invention with reference to the schematic drawings. The advantages and features of the present invention will become more apparent from the following description. It is to be noted that the drawings are in a very simplified form and are not to precise scale, which is provided for the purpose of facilitating and clearly illustrating embodiments of the present invention.
As described in the background art, in the prior art, with the development of information technology, the data scale of various industries has been increased greatly, and the problems of massive concurrency and massive data need to be solved, and at the same time, high reliability is required. In the past, the problem is solved by using a medium-large computer in the industry. Because mainframes are generally expensive, the original requirements can be realized by using a relatively low-cost X86 machine by using a distributed architecture, the monopoly of the IOE in the aspect is broken, and partial independent and independent research and development are realized. Distributed database technology is then developed to address this problem. Taking a distributed database middleware dble as an example, the problem of massive data and concurrency is solved by splitting data to different database nodes in a certain splitting mode, and the high availability of the original single database can be utilized, and the simple structure is shown in fig. 1. Thus our concurrency and data volumes are spread across the databases of individual monomers. A simple SQL query sentence with addition, deletion and modification can be sent to one database node, and extra pressure brought by an application side is dispersed. However, in actual service, the query is not only performed according to a single data table, but also the JOIN connection query of multiple data tables may be performed.
For JOIN connection query of multiple data tables, a method commonly used before is to execute JOIN connection query of multiple data tables in the database middleware, which inevitably causes the problems of low performance and large memory occupation of the database middleware. For the problem, there is a method in the industry to perform optimization, that is, splitting is performed according to the partition table equivalence relationship between two tables, that is, the middleware of the distributed database parses an SQL query statement generated by an application terminal after receiving the SQL query statement, and performs splitting optimization on the parsed SQL query statement according to the partition table equivalence relationship between two tables in the SQL query statement, so that part of join connection queries are put into a repository node for performing, and further, the operation pressure and the memory pressure of the middleware of the distributed database are reduced.
However, because the SQL query statement optimization method based on the partition table equivalence relation in the prior art is only applicable to the case that the number of data tables to be queried related in the SQL query statement is two and there is only a uniform partition table equivalence relation between the two data tables, when the SQL query statement relates to queries of a plurality of data tables (the number of the data tables is greater than or equal to 3), if there is no uniform partition table equivalence relation between the plurality of data table tables, the SQL query statement optimization method in the prior art cannot be used, so that join queries cannot be transferred to a database node for execution, and finally, the operation pressure and the memory pressure of the distributed database middleware are increased.
Therefore, the invention provides an optimization method, a system and electronic equipment for distributed database query, so as to reduce the operation pressure and the memory pressure of a database middleware.
First, a method for optimizing a distributed database query according to an embodiment of the present invention is described below.
It should be noted that, in the embodiment of the present invention, as shown in fig. 1, the distributed database may include a plurality of database nodes Node m for storing database tables, for example, the database nodes Node1, the database nodes Node2, and the like, and the data stored between the database nodes may be the same or different.
In addition, in order to implement data transmission, the database middleware is connected with the database nodes in a one-to-one correspondence manner, and the connection technology may be bluetooth or a local area network. Of course, the connection technology in the embodiment of the present invention may be any connection technology that can connect the database middleware (the server in which the database middleware resides) with each database node, and is not limited herein.
Referring to fig. 2, fig. 2 is a schematic flowchart of an optimization method for a distributed database query according to an embodiment of the present invention, and as shown in fig. 2, the optimization method at least includes the following steps:
step S100, receiving a data query request sent by a client, where the data query request includes at least one SQL query statement, and the SQL query statement relates to JOIN connection queries of multiple data tables and equivalence connection queries of split columns between split tables included in the multiple data tables.
In this embodiment, a target user may send a data query request to the database middleware through a client, for example, an application terminal such as a mobile phone, a computer, a tablet, or the like, according to an actual situation, after receiving the data query request, the database middleware may query a configuration file according to request information in the data query request, so as to determine in which database node the data requested by the client exists, send a query execution plan to the database node, and finally send data returned by the database node to the client.
Moreover, the data query request may include one or more SQL query statements, and each SQL query statement may include a plurality of sub-statements. Because the optimization method for distributed database query provided by the invention is a query optimization method for JOIN connection query of a plurality of data tables without uniform partition table equivalence relation related to SQL query statement, the SQL query statement in the embodiment of the invention must relate to JOIN connection query of a plurality of data tables and equivalence connection query of split columns among split tables contained in the plurality of data tables.
Illustratively, the SQL query statement in the embodiment of the present invention may be:
SELECT a.Name,a.DeptName,b.Manager,c.salary FROM Employee a LEFT JOIN Level c on a.level=c.levelname LEFT JOIN Dept b on a.DeptName=b.DeptName order by a.name。
wherein, table a (employee) and table b (dept) are as follows:
TABLE a
Name EmpId DeptName Level
Mary 1257 HumanResources P7
Harriet 2201 Sales P8
Sally 2242 Sales P7
George 3402 Finance P8
Harry 3415 Finance P7
Table b
DeptName DeptId Manager
Finance
2 George
Sales 3 Harriet
Table c
LevelName LevelId Salary
P7 7 10000
P8 8 15000
Step S200, analyzing the SQL query statement to obtain metadata corresponding to the SQL query statement, wherein the metadata comprises a plurality of data tables and an incidence relation among the data tables.
In this embodiment, after the database middleware receives the data query request sent by the client, it first needs to perform syntax parsing on an SQL query statement in the data query request by using a parser to obtain a database node and a syntax tree structure that can be processed by the database middleware, where the syntax tree structure includes metadata corresponding to the SQL query statement, and specifically, the metadata may include a plurality of data tables and an association relationship between the plurality of data tables. The number of the data tables is not less than 3, and the association relationship among the plurality of data tables may include a fragmentation table equivalence relationship, and of course, other association relationships may also be used. The equivalence relation of the fragment tables is specifically that the fragment tables have the same splitting rule and the splitting columns of the two tables have the equivalence relation.
As a specific example, in a specific embodiment of the present invention, a step of parsing the SQL query statement to obtain metadata corresponding to the SQL query statement is further provided, where the metadata includes a plurality of data tables and an association relationship between the plurality of data tables, and the step includes:
and analyzing the SQL query statement by using a preset analyzer, at least extracting the name of each data table and the associated field name of the equivalent relation of the fragment table from an analysis result, and taking the name and the associated field name as the associated relation.
Illustratively, if the SQL query statement received by the database middleware is:
SELECT a.Name,a.DeptName,b.Manager,c.salary FROM Employee a LEFT JOIN Level c on a.level=c.levelname LEFT JOIN Dept b on a.DeptName=b.DeptName order by a.name。
then, the name of the data table obtained by analyzing the SQL query statement by using the analyzer is as follows: employee a, Level c and Dept b, and the obtained association relationship is as follows: level and a. Wherein, level name and DeptName are field names in the data table respectively.
And step S300, generating a directed acyclic graph taking the data table as a node according to the metadata.
In this embodiment, after obtaining the plurality of data tables included in the SQL query statement and the association relationships among the plurality of data tables in step S200, the association relationships of the SQL query statement may be simplified, and then, the names of the data tables are used as nodes, and the association relationships among the data tables are used as connecting lines, so as to construct a directed acyclic graph.
As a specific example, in a specific embodiment of the present invention, there is further provided a step of generating a directed acyclic graph with nodes being the data tables according to the metadata, where the step may include:
and according to the appearance sequence of the data tables in the metadata, taking the data table with the first appearance as a root node, taking the data tables with the association relation with the root node as child nodes in sequence, and taking the data tables with the association relation with the child nodes as secondary nodes in sequence.
Further, before generating the directed acyclic graph with the data tables as nodes according to the metadata, the optimization method for the distributed database query provided by the present invention may further include:
whether the InNER JOIN intra-connection query of a plurality of data tables exists in the association relationship of the SQL query statement or not can be determined, if yes, the non-fragmented table equivalence relationship in the InNER JOIN intra-connection query is extrapolated, and therefore the SQL query statement is simplified.
Example 1: if the SQL query statement sent by the client is:
a. name, a.deptname, b.manager, c.saray FROM Employee LEFT JOIN Level c on.level ═ c.levelname LEFT JOIN Dept b on.deptname ═ b.deptname order by a.name, then the directed acyclic graph formed by the present invention is as shown in fig. 3.
Example 2: if the SQL query statement sent by the client is:
a. name, a.deptname, b.manager, c.sarary FROM LEFT empty LEFT JOIN Level c on.level ═ c.levelname LEFT JOIN depth b. deptname ═ b.deptname and b.deptid ═ 2order by a. name, then the directed acyclic graph formed by the present invention is as shown in fig. 4.
It should be noted that, in the embodiment of the present invention, after the step S200 parses the SQL query statement to obtain the metadata corresponding to the SQL query statement, and before the step S300 generates the directed acyclic graph by using the metadata, the SQL query statement sent by the client may be simplified according to the following four relational algebra rules. The four relational algebra rules are specifically as follows:
rule A, when a plurality of data tables have join connection query but the latter two data tables have no direct fragmentation table equivalence relation, the results are not influenced by the exchange of join connection sequence of the data tables.
Rule B, if a plurality of data tables have leftjoin left connection query and the connection condition of the following third data table depends on the preceding 2 data tables, the join connection sequence cannot be exchanged.
Rule C the on condition for leftjoin left connection is that where cannot be extrapolated.
Rule D the on condition for the intra-innerjoin connection is where it can be extrapolated.
Example 1, assume that the SQL query statement sent by the client is:
SELECT a.Name,a.DeptName,b.Manager,c.salary FROM Employee a INNER JOIN Level c on a.level=c.levelname INNER JOIN Dept b on a.DeptName=b.DeptName AND b.DeptId=2order by a.name。
then, according to rule D, the above SQL query statement may be equivalent to:
SELECT a.Name,a.DeptName,b.Manager,c.salary FROM Employee a INNER JOIN Level c on a.level=c.levelname INNER JOIN Dept b on a.DeptName=b.DeptName WHERE b.DeptId=2order by a.name。
thus, the condition of b.deptid ═ 2 does not need to be considered during the construction of the ordered acyclic graph and optimization of the optimized SQL query statement. Then, the constructed directed acyclic graph will be like fig. 3, and the final topological ordering (the order can be exchanged according to rule a) is realized, thereby achieving the purpose of optimizing the distributed database query proposed by the present invention.
Example 2, suppose the SQL query statement sent by the client is:
SELECT a.Name,a.DeptName,b.Manager,c.salary FROM Employee a LEFT JOIN Level c on a.level=c.levelname LEFT JOIN Dept b on a.DeptName=b.DeptName AND b.DeptId=2order by a.name。
then according to rule C, b.deptid ═ 2 cannot be extrapolated. Then the directed acyclic graph constructed for the SQL query statement will have the final topological ordering (the order can be exchanged according to rule a) as shown in fig. 4, thereby achieving the purpose of optimizing the distributed database query proposed by the present invention.
Example 3, assume that the SQL query statement sent by the client is:
SELECT a.Name,a.DeptName,b.Manager,c.salary FROM Employee a LEFT JOIN Level c on a.level=c.levelname LEFT JOIN Dept b on a.DeptName=b.DeptName AND c.levelId=2order by a.name。
then the order of the three tables involved in the SQL statement cannot be swapped according to rule B. Therefore, the directed acyclic graph constructed for the SQL query statement is as shown in fig. 5, and the final topological ordering (the order cannot be exchanged according to rule B) is only one, so the SQL query statement in the embodiment cannot be optimized.
And step S400, generating a distributed query execution plan according to the directed acyclic graph, and respectively acquiring corresponding query results from a plurality of database nodes based on the distributed query execution plan.
The distributed query execution plan comprises a plurality of execution sub-plans, and each execution sub-plan corresponds to a priority; each node corresponds to a plurality of execution sub-plans.
In this embodiment, after the directed acyclic graph is formed, the topology ranking in the directed acyclic graph may be optimized, and then the nodes corresponding to the databases having the fragmentation table equivalence relation in the optimized directed acyclic graph are earlier than the nodes corresponding to the databases having the non-fragmentation table equivalence relation. And then, generating a distributed query execution plan according to the optimized directed acyclic graph, and then respectively sending the distributed query execution plan to a plurality of database nodes to perform corresponding query tasks to obtain corresponding query results.
Specifically, the database middleware may traverse each node in the optimized directed acyclic graph from a root node of the optimized directed acyclic graph according to a topological sort of the optimized directed acyclic graph; and in the process of traversing each node in the optimized directed acyclic graph, executing an execution sub-plan with high priority corresponding to each node with a fragmentation table equivalence relation, and after the execution of the execution sub-plan with high priority corresponding to each node with the fragmentation table equivalence relation in the optimized directed acyclic graph is finished, sequentially executing an execution sub-plan with low priority corresponding to each node with the fragmentation table equivalence relation and an execution sub-plan corresponding to each node with a non-fragmentation table equivalence relation.
Further, after generating a directed acyclic graph with the data tables as nodes according to the metadata, and before generating a distributed query execution plan according to the directed acyclic graph and obtaining corresponding query results in a plurality of database nodes respectively based on the distributed query execution plan, the optimization method may further include:
and aiming at the topological sorting of the directed acyclic graph, starting from a root node of the directed acyclic graph, using a data table with a fragmentation table equivalence relation with another data table as an optimized root node or an optimized child node, using a data table with a non-fragmentation table equivalence relation with another data table as a child node or a secondary node of the optimized root node or the optimized child node, so as to optimize the topological sorting of the directed acyclic graph, and enabling the priority of an execution sub-plan corresponding to the node with the fragmentation table equivalence relation in the topological sorting of the optimized directed acyclic graph to be higher than the priority of an execution sub-plan corresponding to the node with the non-fragmentation table equivalence relation.
In this embodiment, the purpose of optimizing the initially formed directed acyclic graph is to preferentially select a node where a fragmentation table equivalence relation of the distributed splitting rule is located from a plurality of topology sequences of the initially formed directed acyclic graph, then place a JOIN connection distributed query execution sub-plan corresponding to the nodes where the fragmentation table equivalence relation exists in a corresponding database node, execute a corresponding JOIN connection query in the database node, and execute another query execution sub-plan corresponding to the node where the fragmentation table equivalence relation exists or an execution sub-plan corresponding to the node where the non-fragmentation table equivalence relation exists after all JOIN connection distributed query execution sub-plans corresponding to the nodes where the fragmentation table equivalence relation exists are executed. In the invention, the association relation among a plurality of data tables with non-uniform sharding table equivalence relation is constructed and optimized by the directed acyclic graph, so that the data tables with sharding table equivalence relation are adjacent and have priority in the obtained topological ordering. Therefore, in the distributed query plan and the actual execution, the join connection query with the partition table equivalence relation in the SQL query statement can be put into the database node for execution, and the purpose of the invention is realized.
It should be noted that, when the JOIN connection query execution sub-plans corresponding to the nodes having the equivalent relationship of the fragmentation table are put into the corresponding database nodes for query, there is no order between them.
For ease of understanding, the following presents, in a complete form, a method for optimizing a distributed database query, using the present invention, in accordance with one embodiment.
Firstly, assume that the SQL query statement received by the database middleware is: SQL2-1: SELECT a.name, a.deptname, b.manager, c.sarary FROM Employee a LEFT JOIN Level c on.level ═ c.levelname LEFT JOIN depth b.deptname order by a.name;
then, the names of a plurality of database tables obtained by analyzing the SQL query statement are respectively: employee a, Dept b and Level c, and the obtained correlation among the data tables is as follows: level and a.
Secondly, a directed acyclic graph is formed based on the names of the plurality of database tables and the incidence relations thereof, as shown in fig. 3.
Finally, a distributed query execution plan is generated based on the directed acyclic graph shown in fig. 3, and corresponding query steps are respectively obtained from a plurality of database nodes based on the distributed query execution plan, which are briefly described as follows:
step 1.1 and step 1.2, a JOIN connection query is made to a part of data of the table Exployee and the table Dept in the database Node 1.
Step 2.1 and step 2.2, a JOIN connection query is made to the other part of data of table explicit and table Dept inside the database Node 2.
And 3.1 and 3.2, merging the data prepared in the steps 1.1 and 1.2 and the data prepared in the steps 2.1 and 2.2.
And 4.1 and 4.2, extracting the data in the Level table from the database Node1 and the database Node 2.
And 5.1 and 5.2, taking the data in the step 3.1 and the step 3.2 as a left table, taking the data in the Level table as a right table, performing JOIN connection to obtain a temporary database table, and returning the temporary database table to the client.
Based on the same technical concept, corresponding to the embodiment of the method shown in fig. 2, the embodiment of the present invention further provides an optimization system for distributed database query, including: the system comprises a plurality of database nodes and database middleware respectively connected with the database nodes.
The database middleware is configured to receive a data query request sent by a client, optimize an SQL query statement in the analyzed data query request by using the distributed database query optimization method, so as to obtain an optimized directed acyclic graph to generate a distributed query execution plan, and obtain corresponding query results in a plurality of database nodes based on the distributed query execution plan.
The database nodes are used for storing data tables, and the data tables stored in the database nodes have a fragmentation table equivalence relation.
In summary, the present invention provides an optimization method for multi-table complex query of a distributed database in an application scenario of distributed database middleware, which determines an association relationship (including a fragmentation table equivalence relationship) between a plurality of data tables by analyzing SQL query statements, and then establishes a directed acyclic graph using the data tables as nodes and the association relationship as a connection line after simplifying the association relationship between the plurality of data tables; selecting a traversal mode that nodes corresponding to the data table with the equivalent relation of the fragmentation table are earlier than nodes corresponding to the data table with the equivalent relation of the non-fragmentation table in multiple modes by utilizing the mode that the directed acyclic graph has multiple topological sorting; in the invention, the node where the equivalent relation of the fragment table of the distributed splitting rule is located is preferentially selected as a root node or a secondary node, so that the data table with the uniform equivalent relation of the fragment table in the obtained topological sorting can be preferentially processed, and therefore, when the topological sorting generated according to the topological sorting is inquired, the join connection inquiry containing the equivalent relation of the fragment table in the SQL inquiry statement can be put down into the database node for execution, thereby avoiding the technical problems that in the prior art, when a plurality of data tables are inquired in a connection mode, part of sub-plans of the join connection inquiry cannot be put down into the database node for execution if the uniform equivalent relation of the fragment table does not exist among the plurality of data tables, and finally the operation pressure and the memory pressure of the middleware of the distributed database are aggravated.
An embodiment of the present invention further provides an electronic device, as shown in fig. 6, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 complete mutual communication through the communication bus 604,
a memory 603 for storing a computer program;
the processor 601 is configured to implement the distributed database query optimization method provided by the embodiment of the present invention when executing the program stored in the memory 603.
Specifically, the method for optimizing query of distributed database includes: the method comprises the steps of pre-receiving a data query request sent by a client, wherein the data query request comprises at least one SQL query statement, and the SQL query statement relates to JOIN connection query of a plurality of data tables and equivalence connection query of split columns among split tables contained in the data tables.
And analyzing the SQL query statement to obtain metadata corresponding to the SQL query statement, wherein the metadata comprises a plurality of data tables and an incidence relation among the data tables.
And generating a directed acyclic graph with the data table as the node according to the metadata.
And generating a distributed query execution plan according to the directed acyclic graph, and respectively acquiring corresponding query results from a plurality of database nodes based on the distributed query execution plan.
In addition, other implementation manners of the distributed database query optimization method implemented by the processor 601 executing the program stored in the memory 603 are the same as the implementation manners mentioned in the foregoing method embodiment, and are not described herein again.
The communication bus mentioned above for the control terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
In yet another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the distributed database query optimization method described in any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to be performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus, the electronic device, and the computer-readable storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and in relation to the description, reference may be made to some portions of the description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. An optimization method for a distributed database query, applied to a database middleware, wherein the distributed database comprises a plurality of database nodes for storing data tables, the optimization method comprising:
receiving a data query request sent by a client, wherein the data query request comprises at least one SQL query statement, and the SQL query statement relates to JOIN connection query of a plurality of data tables and equivalence connection query of split columns among split tables contained in the plurality of data tables;
analyzing the SQL query statement to obtain metadata corresponding to the SQL query statement, wherein the metadata comprises a plurality of data tables and an incidence relation among the data tables, and the number of the data tables contained in the metadata is not less than 3;
generating a directed acyclic graph with the data table as a node according to the metadata;
and generating a distributed query execution plan according to the directed acyclic graph, and respectively acquiring corresponding query results from a plurality of database nodes based on the distributed query execution plan.
2. The method of optimizing distributed database queries of claim 1, wherein the associations between the data tables include shard table isograms having the same splitting rules and an isogram between split columns of two tables.
3. The method for optimizing distributed database query according to claim 2, wherein the step of parsing the SQL query statement to obtain metadata corresponding to the SQL query statement, the metadata including a plurality of data tables, and an association relationship between the plurality of data tables includes:
and analyzing the SQL query statement by using a preset analyzer, at least extracting the name of each data table and the associated field name of the equivalent relation of the fragment table from an analysis result, and taking the name and the associated field name as the associated relation.
4. The method for optimizing distributed database queries of claim 1 wherein the step of generating a directed acyclic graph having nodes from said data tables based on said metadata comprises:
and according to the appearance sequence of the data tables in the metadata, taking the data table with the first appearance as a root node, taking the data tables with the association relation with the root node as child nodes in sequence, and taking the data tables with the association relation with the child nodes as secondary nodes in sequence.
5. The method of optimizing distributed database queries of claim 2, wherein prior to generating a directed acyclic graph from the metadata that is a node of the data table, the method further comprises:
and determining whether the SQL query statement has INNER JOIN in-connection query of a plurality of data tables, if so, extrapolating the non-fragmented table equivalence relation in the INNER JOIN in-connection query to simplify the SQL query statement.
6. The method of claim 4, wherein the distributed query execution plan comprises a plurality of execution sub-plans, each of the execution sub-plans corresponding to a priority level;
after generating a directed acyclic graph with the data tables as nodes according to the metadata, and before generating a distributed query execution plan according to the directed acyclic graph and respectively obtaining corresponding query results in a plurality of database nodes based on the distributed query execution plan, the method further includes:
and aiming at the topological sorting of the directed acyclic graph, starting from a root node of the directed acyclic graph, using a data table with a fragmentation table equivalence relation with another data table as an optimized root node or an optimized child node, using a data table with a non-fragmentation table equivalence relation with another data table as a child node or a secondary node of the optimized root node or the optimized child node, so as to optimize the topological sorting of the directed acyclic graph, and enabling the priority of an execution sub-plan corresponding to the node with the fragmentation table equivalence relation in the topological sorting of the optimized directed acyclic graph to be higher than the priority of an execution sub-plan corresponding to the node with the non-fragmentation table equivalence relation.
7. The method for optimizing distributed database queries of claim 6, wherein each node corresponds to a plurality of execution sub-plans;
the step of generating a query execution plan according to the directed acyclic graph and obtaining corresponding query results in a plurality of database nodes respectively based on the distributed query execution plan includes:
traversing each node in the optimized directed acyclic graph from a root node of the optimized directed acyclic graph according to the topological sorting of the optimized directed acyclic graph;
and in the process of traversing each node in the optimized directed acyclic graph, executing an execution sub-plan with high priority corresponding to each node with a fragmentation table equivalence relation, and after the execution of the execution sub-plan with high priority corresponding to each node with the fragmentation table equivalence relation in the optimized directed acyclic graph is finished, sequentially executing an execution sub-plan with low priority corresponding to each node with the fragmentation table equivalence relation and an execution sub-plan corresponding to each node with a non-fragmentation table equivalence relation.
8. A system for optimizing distributed database queries, comprising: a plurality of database nodes and a database middleware respectively connected with the database nodes, wherein,
the database middleware is used for receiving a data query request sent by a client, optimizing an SQL query statement in the analyzed data query request by using the distributed database query optimization method according to any one of claims 1 to 7 to obtain an optimized directed acyclic graph to generate a distributed query execution plan, and respectively obtaining corresponding query results from a plurality of database nodes based on the distributed query execution plan;
the database nodes are used for storing data tables, and sub-table equivalence relations which have the same splitting rules and equivalence relations between splitting columns of two tables exist among database tables stored in the plurality of database nodes.
9. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor and the communication interface are used for realizing mutual communication by the memory through the communication bus;
a memory for storing a computer program;
a processor for implementing a method of optimizing a distributed database query as claimed in any one of claims 1 to 7 when executing a program stored on a memory.
CN202210600055.0A 2022-05-26 2022-05-26 Optimization method and system for distributed database query and electronic equipment Pending CN114860764A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210600055.0A CN114860764A (en) 2022-05-26 2022-05-26 Optimization method and system for distributed database query and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210600055.0A CN114860764A (en) 2022-05-26 2022-05-26 Optimization method and system for distributed database query and electronic equipment

Publications (1)

Publication Number Publication Date
CN114860764A true CN114860764A (en) 2022-08-05

Family

ID=82641560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210600055.0A Pending CN114860764A (en) 2022-05-26 2022-05-26 Optimization method and system for distributed database query and electronic equipment

Country Status (1)

Country Link
CN (1) CN114860764A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383238A (en) * 2023-06-06 2023-07-04 湖南红普创新科技发展有限公司 Data virtualization system, method, device, equipment and medium based on graph structure
CN117131078A (en) * 2023-10-20 2023-11-28 天津南大通用数据技术股份有限公司 Cross-distribution mode-based associated query optimization method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111352950A (en) * 2020-03-04 2020-06-30 上海达梦数据库有限公司 Database table equivalent connection optimization method and device, server and storage medium
CN112395303A (en) * 2019-08-15 2021-02-23 阿里巴巴集团控股有限公司 Query execution method and device, electronic equipment and computer readable medium
CN113297250A (en) * 2021-05-28 2021-08-24 北京思特奇信息技术股份有限公司 Method and system for multi-table association query of distributed database

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395303A (en) * 2019-08-15 2021-02-23 阿里巴巴集团控股有限公司 Query execution method and device, electronic equipment and computer readable medium
CN111352950A (en) * 2020-03-04 2020-06-30 上海达梦数据库有限公司 Database table equivalent connection optimization method and device, server and storage medium
CN113297250A (en) * 2021-05-28 2021-08-24 北京思特奇信息技术股份有限公司 Method and system for multi-table association query of distributed database

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383238A (en) * 2023-06-06 2023-07-04 湖南红普创新科技发展有限公司 Data virtualization system, method, device, equipment and medium based on graph structure
CN116383238B (en) * 2023-06-06 2023-08-29 湖南红普创新科技发展有限公司 Data virtualization system, method, device, equipment and medium based on graph structure
CN117131078A (en) * 2023-10-20 2023-11-28 天津南大通用数据技术股份有限公司 Cross-distribution mode-based associated query optimization method
CN117131078B (en) * 2023-10-20 2024-02-06 天津南大通用数据技术股份有限公司 Cross-distribution mode-based associated query optimization method

Similar Documents

Publication Publication Date Title
CN109766345B (en) Metadata processing method and device, equipment and readable storage medium
US10635671B2 (en) Sort-merge band join optimization
CN114860764A (en) Optimization method and system for distributed database query and electronic equipment
US8224807B2 (en) Enhanced utilization of query optimization
US10831737B2 (en) Method and device for partitioning association table in distributed database
US7606827B2 (en) Query optimization using materialized views in database management systems
EP2637111A1 (en) Data management system and method using database middleware
US11514009B2 (en) Method and systems for mapping object oriented/functional languages to database languages
CN111581234B (en) RAC multi-node database query method, device and system
US11176133B2 (en) Filter evaluation for table fragments
US8812492B2 (en) Automatic and dynamic design of cache groups
CN101916280A (en) Parallel computing system and method for carrying out load balance according to query contents
CN113094387A (en) Data query method and device, electronic equipment and machine-readable storage medium
CN114490724B (en) Method and device for processing database query statement
CN111782634B (en) Data distributed storage method, device, electronic equipment and storage medium
CN112818010B (en) Database query method and device
CN114168620A (en) Execution plan processing method and device
CN114547083A (en) Data processing method and device and electronic equipment
CN114238387A (en) Data query method and device, electronic equipment and storage medium
CN110413642B (en) Application-unaware fragmentation database parsing and optimizing method
US20190370259A1 (en) Devices and methods for implementing dynamic collaborative workflow systems
CN117573730B (en) Data processing method, apparatus, device, readable storage medium, and program product
CN112988809B (en) Data query method, device, equipment and medium based on relational database
WO2021215101A1 (en) Data management system and data management method
US20240126757A1 (en) Outer semi join for disjunctive subquery unnesting

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220805

RJ01 Rejection of invention patent application after publication