EP3951609A1 - Query optimization method and apparatus - Google Patents

Query optimization method and apparatus Download PDF

Info

Publication number
EP3951609A1
EP3951609A1 EP21182225.9A EP21182225A EP3951609A1 EP 3951609 A1 EP3951609 A1 EP 3951609A1 EP 21182225 A EP21182225 A EP 21182225A EP 3951609 A1 EP3951609 A1 EP 3951609A1
Authority
EP
European Patent Office
Prior art keywords
query
subtask
task
subtasks
merged
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP21182225.9A
Other languages
German (de)
French (fr)
Inventor
Jian NIU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Publication of EP3951609A1 publication Critical patent/EP3951609A1/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24542Plan optimisation
    • G06F16/24544Join order optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24535Query rewriting; Transformation of sub-queries or views
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24542Plan optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases

Definitions

  • the present specification relates to the field of Internet technologies, and in particular, to a query optimization method and apparatus.
  • an online query scenario there is often a query requirement of high concurrency, high throughput, and low delay.
  • OTP online transaction processing
  • Each batch query request may include dozens or even hundreds of query tasks, and a query response time is required to be within 10 ms.
  • the query tasks need to be optimized to improve query efficiency.
  • a query logic of a single query task can be optimized only in a range of the query task.
  • many repeated query logics still exist in different query tasks. Therefore, query optimization in the range of the single query task still has low query efficiency.
  • At least one embodiment of the present specification provides a query optimization method to improve query efficiency of a query task.
  • a query optimization method including: obtaining a plurality of to-be-optimized query tasks, each query task including at least one subtask, and when a same query task includes at least two subtasks, a query dependence relationship existing between the at least two subtasks; combining the plurality of query tasks based on the query dependence relationship to obtain one combined task; and merging at least two subtasks in different query tasks based on the combined task to obtain a first merged subtask, the first merged subtask being executed to obtain query results of the at least two subtasks in the different query tasks.
  • a query optimization apparatus including: an acquisition module, configured to obtain a plurality of to-be-optimized query tasks, each query task including at least one subtask, and when a same query task includes at least two subtasks, a query dependence relationship existing between the at least two subtasks; a combination module, configured to combine the plurality of query tasks based on the query dependence relationship to obtain one combined task; and a merging module, configured to merge at least two subtasks in different query tasks based on the combined task to obtain a first merged subtask, the first merged subtask being executed to obtain query results of the at least two subtasks in the different query tasks.
  • a computer device including a memory, a processor, and a computer program that is stored in the memory and that can run on the processor, the processor implementing the query optimization method in any embodiment of the present specification when executing the program.
  • a computer readable storage medium storing a computer program, and steps of the query optimization method in any embodiment of the present specification being implemented when the program is executed by a processor.
  • a plurality of query tasks are combined to obtain a combined task, and subtasks in different query tasks are merged into one task based on the complete combined task. Because subtasks optimized based on the combined task belong to different query tasks, in this solution, query optimization in ranges of a plurality of different query tasks is implemented. Compared with an optimization method of a single query task, repeated query logics in different query tasks can be merged and optimized, thereby improving query efficiency. Further, because tasks can be merged in the ranges of the plurality of different query task, a high throughput of hundreds of thousands or even hundreds of thousands of batch queries per second is possible.
  • first, second, third, etc. may be used in the present specification to describe various types of information, the information is not limited to the terms. These terms are merely used to differentiate between information of the same type. For example, without departing from the scope of the present specification, first information can also be referred to as second information, and similarly, the second information can be referred to as the first information. Depending on the context, for example, the word “if” used herein can be explained as “while”, “when”, or "in response to determining”.
  • an online query scenario there is often a query requirement of high concurrency, high throughput, and low delay.
  • OTP online transaction processing
  • Each batch query request may include dozens or even hundreds of query tasks, and a query response time is required to be within 10 ms.
  • the query tasks need to be optimized to improve query efficiency.
  • a query logic of a single query task can be optimized only in a range of the query task. For example, two same query logics in a same query task are merged into one query logic, thereby avoiding repeated execution of a same query logic. However, many repeated query logics still exist in different query tasks. When different query tasks are executed, many same query logics are repeatedly executed. It can be learned that, in a method for optimizing a single query task, query efficiency is still low and even query resources are wasted.
  • the present specification discloses a query optimization method to merge a plurality of query tasks into one combined task.
  • query logics that belong to different query tasks can also be optimized.
  • query optimization method can be applied to various online queries that can include but are not limited to an OLTP query or an online analytical processing (OLAP) query.
  • OLTP online analytical processing
  • FIG. 1 is a flowchart illustrating a query optimization method according to an embodiment of the present specification. As shown in FIG. 1 , the process includes the following steps.
  • Step 101 Obtain a plurality of to-be-optimized query tasks, each query task including at least one subtask, and when a same query task includes at least two subtasks, a query dependence relationship existing between the at least two subtasks.
  • the query task is a process of obtaining a final query result from a data source based on a query condition. For example, when the query condition is "male name”, a "name" that meets the condition can be found from a corresponding data source based on the query condition, and used as a query result. This is a complete query task.
  • Each query task includes different query logics, and a query result of the query task can be finally obtained by executing the query logics in the query task.
  • different query logics are represented by using subtasks in a query task, and each query task can include one or more subtasks.
  • the query task can include only one subtask: a name that meets the condition is found from a data source based on the query condition "male”. That is, a query result of the query task can be obtained by querying the subtask.
  • the query task can include two subtasks: subtask 1: name information that meets a height condition needs to be found from a data source based on the query condition "taller than 175 cm", and is used as a query result of subtask 1; and subtask 2: a name that meets a gender condition needs to be found from the query result of subtask 1 based on the query condition "male”.
  • a query result of the query task needs to be obtained after the two subtasks are both queried.
  • a query task may include three or more subtasks.
  • a query dependence relationship exists between the plurality of subtasks. For example, if execution of one subtask in a query task needs to depend on an execution result of another subtask in the query task, a query dependence relationship exists between the two subtasks.
  • FIG. 2 illustrates a plurality of query tasks in a form of a directed acyclic graph (DAG): query task 1, query task 2, and query task 3.
  • Query task 1 is used as an example for description.
  • Node "Input1" represents a query entry in which a query condition of a query task including the node can be input.
  • Node "A” represents a query data source corresponding to the query task including the node, such as an HBASE database or a MySQL database.
  • Node “Join1” represents a query result obtained through query based on the query condition in "Input1" and the query data source in "A".
  • query task 1 includes only subtask a, and subtask a is obtained by connecting the three nodes "Input1", "A", and "Join1".
  • query can be performed based on the query condition in Input1 and the query data source in A, to obtain a query result of subtask a. Because query task 1 includes only subtask a, the query result of subtask a is a query result of query task 1.
  • query task 2 includes two subtasks: subtask b and subtask c.
  • Subtask b is obtained by connecting three nodes: "Input2", "B", and "Join2". Executing subtask b can obtain a query result of subtask b. A process of obtaining the query result of subtask b is the same as the process of obtaining the query result of subtask a, and details are omitted for simplicity.
  • Subtask c is obtained by connecting three nodes: "Join2", “C”, and "Join3". "Join2" further exists as a query condition of subtask c while representing the query result of subtask b.
  • query can be performed based on the query result "Join2" of subtask b and a query data source in C, to obtain a query result of subtask c. Because no other subtask needs to be executed after subtask c, the obtained query result of subtask c is a query result of the query task 2.
  • query task 3 includes subtask d and subtask e, and subtask e depends on subtask d.
  • query dependence relationship exists between subtask b and subtask c.
  • the query dependence relationship is specifically that subtask c depends on subtask b.
  • Step 102 Combine the plurality of query tasks based on the query dependence relationship to obtain one combined task.
  • a plurality of different query tasks can be combined based on a query dependence relationship between subtasks in each query task, to obtain a complete combined task.
  • a to-be-connected subtask can be determined based on the query dependence relationship between the subtasks in each query task.
  • FIG. 2 is used as an example. Because query task 1 includes only subtask a, subtask a can be used as a to-be-connected subtask. Based on the query dependence relationship between subtask b and subtask c in query task 2, subtask c can be used as a to-be-connected subtask because subtask c is the last subtask when query task 2 is executed. Based on the query dependence relationship between subtask d and subtask e in query task 3, subtask e can be used as a to-be-connected subtask.
  • the to-be-connected subtasks are connected to connect the query tasks together to obtain a complete combined task.
  • query task 1, query task 2, and query task 3 can be connected by using two nodes “JOIN1” and "JOIN2", to obtain complete combined task 1.
  • Nodes "JOIN1” and "JOIN2" do not affect final query results of the connected different query tasks, that is, a query result of each query task obtained based on combined task 1 is the same as a query result obtained by separately executing each query task.
  • each subtask includes a query data source
  • query data sources can be data sources of different physical architectures.
  • FIG. 2 is used as an example.
  • Query data source "A" of subtask a can be an HBASE database.
  • Query data source "B" of subtask b can be a MySQL database.
  • Query data source "C" of subtask c cam be an HBASE database.
  • query data sources can be heterogeneous data sources.
  • query data sources can be heterogeneous data sources.
  • a physical structure of a data source in a subtask is not limited, and a query data source in any subtask can be a heterogeneous data source.
  • the method before the combining the plurality of query tasks to obtain the one combined task, the method further includes: determining a first query logic from a query task based on a predetermined determining condition, the first query logic including at least one subtask in the same query task; and storing a query result of the first query logic.
  • the method further includes: if determining that a query logic that is the same as the first query logic exists in the combined task, using the query result of the first query logic as a query result of the query logic that is the same as the first query logic.
  • a common query logic can be determined in the same query task as the first query logic based on the predetermined determining condition.
  • the predetermined determining condition can be a number of execution times, an execution frequency, etc. For example, if a number of execution times of a certain query logic is greater than a predetermined number of times, it can be determined that the query logic is a common query logic and is used as the first query logic.
  • the above query logic can be one subtask or a plurality of subtasks in a same query task that have a query dependence relationship.
  • the query result of the first query logic can be directly obtained without repeatedly executing the same query logic.
  • Combined task 1 shown in FIG. 2 is used as an example.
  • the first query logic can be first determined.
  • subtask b can be used as the first query logic
  • query task 2 including subtask b and subtask c can be used as the first query logic. If the determining condition is that the number of execution times is greater than the predetermined number of times, subtask b is used as the first query logic when a number of execution times of subtask b is greater than the predetermined number of times to execute subtask b and store the query result.
  • Step 103 Merge at least two subtasks in different query tasks based on the combined task to obtain a first merged subtask, the first merged subtask being executed to obtain query results of the at least two subtasks in the different query tasks.
  • query optimization can be performed only on query task 1, query optimization can be performed only on query task 2, or query optimization can be performed only on query task 3.
  • At least two subtasks that belong to different query tasks can be merged based on the combined task obtained by combining different query tasks, to obtain the first merged subtask.
  • query results of a plurality of subtasks can be obtained by performing the first merged subtask once.
  • a method for merging subtasks can be merging a plurality of same subtasks into one task, to remove a repeated subtask.
  • the plurality of subtasks can be merged into one task based on a same part of the plurality of subtasks. For example, subtasks whose storage paths have a same part can be merged into one executable task.
  • Combined task 1 shown in FIG. 2 is used as an example.
  • subtasks that belong to different query tasks can be merged based on combined task 1.
  • subtask a in query task 1 is the same as subtask b in query task 2
  • subtask a and subtask b can be merged into a first merged subtask.
  • the query results of both subtask a and subtask b can be obtained by executing the first merged subtask only once.
  • subtask a can be executed only once, or subtask b can be executed only once.
  • a same determining condition of the two subtasks can be: query conditions and query data sources that are included in the two subtasks are the same.
  • this optimization method repeatedly executing a query process of a subtask is removed, and an effect of query optimization is implemented.
  • this optimization method implements cross-query task optimization.
  • subtask a is the same as subtask c
  • subtask a and subtask c can be merged to obtain a first merged subtask.
  • the query results of both subtask a and subtask c are obtained by executing the first merged subtask.
  • subtasks in the combined task for optimization belong to different query tasks
  • cross-query task optimization is implemented based on the combined task in this method. Compared with optimization of a single query task, query efficiency is improved. Further, because cross-query task optimization is implemented in the method, repeated queries existing in different query tasks can be greatly reduced, and query resources can be saved.
  • the merging the at least two subtasks in different query tasks to obtain the first merged subtask includes: determining a non-dependent subtask in each query task based on the query dependence relationship to form a first task set of the combined task; and merging at least two subtasks in the first task set to obtain the first merged subtask.
  • a subtask that can be executed without depending on another subtask needs to be determined from each query task based on a query dependence relationship between subtasks in the query task, and subtasks determined from the query tasks form the first task set.
  • Combined task 1 shown in FIG. 2 is used as an example.
  • Subtask a in query task 1 can be executed without depending on another subtask.
  • Subtask b in query task 2 does not depend on another subtask.
  • Subtask d in query task 3 does not depend on another subtask. Therefore, subtask a, subtask b, and subtask d can form a set to obtain the first task set of combined task 1.
  • optimization can be performed based on a feature of the subtask in the first task set because all the subtasks in the first task set are non-dependent subtasks.
  • the at least two subtasks in the first task set can be merged to obtain the first merged subtask.
  • query results of a plurality of subtasks used for combination can be obtained by executing the first merged subtask once.
  • the subtasks in the first task set belong to different query tasks
  • the subtasks of the first task set are optimized, thereby implementing an effect of cross-query task optimization. Therefore, an optimization effect and query efficiency are improved.
  • cross-query task optimization can reduce repeated query logics in different query tasks and save query resources.
  • the method further includes: concurrently executing the subtasks in the first task set; and placing, into a cache, a query result of a subtask whose query duration meets a predetermined condition.
  • a plurality of subtasks in the first task set can be executed concurrently. Therefore, in a process of concurrently executing the plurality of subtasks, query duration used by the subtasks is also different because specific query processes of the subtasks are different. Query duration of one or more subtasks may be longer (longer than query duration of another subtask), and query duration of concurrent execution may be affected by the one or more query tasks.
  • the query result of the subtask whose query duration meets the predetermined condition can be placed into the cache to accelerate query.
  • the predetermined condition can include but is not limited to a query duration threshold. For example, the predetermined condition can be: "query duration is longer than 10 ms".
  • each subtask includes a query condition and a query data source.
  • the merging the at least two subtasks in the first task set to obtain the first merged subtask includes: merging at least two subtasks in the first task set that have a same query condition and query data source, to obtain the first merged subtask.
  • each subtask includes a query condition and a query data source.
  • subtask a in combined task 1 includes the query condition input in node "Input1" and the query data source in node "A”.
  • Subtask b includes the query condition input in node “Input2” and the query data source in node "B”.
  • Subtask c can use the query result "Join2" of subtask b as the query condition, and further includes the query data source in node "C”.
  • subtasks that have a same query condition and query data source can be determined from the subtasks in the first task set for de-duplication.
  • query conditions and query data sources of two subtasks are the same, it can be considered that the two subtasks are the same, and a same query result is obtained by executing the two subtasks. Therefore, a plurality of subtasks in the first task set that have a same query condition and query data source can be merged to obtain the first merged subtask. Query results of the plurality of subtasks can be obtained by executing the first merged subtask once.
  • the first merged subtask can be one of a plurality of subtasks used for merging. Therefore, only one of the subtasks can be executed after the merging, to avoid repeatedly executing a same query logic.
  • FIG. 2 is used as an example.
  • de-duplication can be performed if the query condition input in "Input1" of subtask a is the same as the query condition input in "Input2" of subtask b and query data source "A" of subtask a is the same as query data source "B” of subtask b. That is, subtask a and subtask b can be merged to obtain the first merged subtask, for example, subtask a is used as the first merged subtask. As such, the query results of both subtask a and subtask b are obtained by executing subtask a once.
  • de-duplication optimization can be performed on the subtasks in the first task set. Because the subtasks in the first task set belong to different query tasks, in this de-optimization method, cross-query task optimization is implemented, repeated query logics in different query tasks are reduced, query efficiency is improved, and query resources are saved.
  • each subtask includes a query data source.
  • the merging the at least two subtasks in the first task set to obtain the first merged subtask includes: determining a plurality of subtasks with a same query data source from the first task set; and merging at least two subtasks whose query paths have a same part in the plurality of subtasks with the same query data source, to obtain the first merged subtask.
  • FIG. 2 is used as an example.
  • the query data source of subtask a is an HBASE database
  • the query data source of subtask b is also an HBASE database.
  • the HBASE database is used as the query data source.
  • Each query task includes a table, a row, a column cluster, and a column.
  • a query path is successively as follows: table-Tl, row-KEY1, column cluster-f, and column-col1. Finally, a query result of query task Query 1 is obtained.
  • subtasks whose query paths have a same part in subtasks with a same query data source can be merged into the first merged subtask.
  • the HBASE database is still used as an example. Assume that the first task set includes a total of five subtasks: query tasks Query1 to 5. Because query paths of two subtasks Query1 and Query2 are totally the same, only subtask Query1 can be executed, and a query result of Query2 can be obtained at the same time. Query1 can be referred to as a first merged subtask obtained by merging Query 1 and Query2.
  • the two query tasks can be merged into a "Get task".
  • the "Get task” can be referred to as a first merged subtask obtained by merging Query 1 and Query3.
  • the "MultiGet task” can be referred to as a first merged subtask obtained by merging Query1, Query3, and Query4.
  • Query5 can only serve as a "new Get task”. Based on the above merging processes, a total of five subtasks Query1 to 5 can be merged into two first merged subtasks: a "MultiGet task” and a "new Get task”.
  • the subtasks in the first task set can be merged based on a query feature of the database. It can be understood that in the above merging process, merging is performed based on a same query path. In another feasible embodiment, another form of task merging can be performed based on a query feature of a query data source.
  • Query results of a plurality of subtasks used for merging can be obtained after the first merged subtask is executed.
  • query results of a total of four subtasks Query 1 to 4 can be obtained after the "MultiGet task" is executed.
  • the subtasks can be merged based on the query path.
  • a number of query times is reduced, a number of times of accessing a query data source is reduced, and a query time is reduced.
  • cross-query task optimization is also implemented, query efficiency is improved, and query resources are saved.
  • the method further includes: determining, from the combined task based on the query dependence relationship, subtasks that depend on each subtask in the first task set, to form a second task set of the combined task; executing the subtasks in the first task set, and storing query results; and if the second task set and the first task set have same subtasks with a same query condition and query data source, using a query result of the subtask in the first task set as a query result of the same subtask in the second task set.
  • the subtasks in the second task set can be determined based on the subtasks in the first task set and a query dependence relationship between subtasks in each query task.
  • the subtask in the second task set depends on the subtask in the first task set.
  • FIG. 2 is used as an example.
  • Subtask c can be used as a subtask in the second task set because subtask c depends on subtask b in the first task set.
  • Subtask e can be used as a subtask in the second task set because subtask e depends on subtask d in the first task set.
  • the subtasks in the first task set are executed to obtain and store query results. For example, a plurality of subtasks in the first task set are concurrently executed to obtain query results of the plurality of subtasks, and then the query results are correspondingly stored. For example, in FIG. 2 , the subtasks in the first task set can be concurrently executed, and the query results are correspondingly stored.
  • a query result obtained after the subtask in the first task set is executed can be used as a query condition of the subtask in the second task set.
  • the subtask in the second task set can be compared with a subtask that has been executed in the first task set. If the second task set includes a same subtask that has a same query condition and query data source as the subtask in the first task set, a query result of the corresponding same subtask can be directly obtained from the stored query results, and a query logic of the same subtask in the second task set does not need to be executed again.
  • Combined task 1 shown in FIG. 2 is used as an example.
  • Combined task 1 includes the first task set and the second task set.
  • a plurality of subtasks in the first task set are executed in advance, and corresponding query results are stored.
  • query data source "E" of subtask e in the second task set is the same as query data source "B" of subtask b in the first task set and both are the HBASE database
  • the query result of subtask d in the first task set is the same as the query condition input in "Input" of subtask b.
  • it can be determined that subtask e in the second task set and subtask b in the first task set are same subtasks with a same query condition and query data source. Because subtask b has been already executed and the query result is stored, the query result of subtask b can be directly obtained as the query result of subtask e without executing subtask e.
  • query results of executed subtasks can be stored.
  • a corresponding query result is directly obtained if a same subtask exists.
  • cross-query task optimization is still implemented, repeated queries in different query tasks are reduced, query efficiency is improved, and query resources are saved.
  • the present specification provides a query optimization apparatus, and the apparatus can execute the query optimization method in any embodiment of the present specification.
  • the apparatus can include an acquisition module 201, a combination module 202, and a merging module 203.
  • the acquisition module 201 is configured to obtain a plurality of to-be-optimized query tasks, each query task including at least one subtask, and when a same query task includes at least two subtasks, a query dependence relationship existing between the at least two subtasks.
  • the combination module 202 is configured to combine the plurality of query tasks based on the query dependence relationship to obtain one combined task.
  • the merging module 203 is configured to merge at least two subtasks in different query tasks based on the combined task to obtain a first merged subtask, the first merged subtask being executed to obtain query results of the at least two subtasks in the different query tasks.
  • the merging module 203 includes: a first set submodule 301, configured to determine a non-dependent subtask in each query task based on the query dependence relationship to form a first task set of the combined task; and a merging submodule 302, configured to merge at least two subtasks in the first task set to obtain the first merged subtask.
  • each subtask includes a query condition and a query data source
  • each subtask includes a query data source
  • the merging module 203 further includes: a second set submodule 401, configured to determine, from the combined task based on the query dependence relationship, subtasks that depend on each subtask in the first task set, to form a second task set of the combined task; a first execution submodule 402, configured to execute the subtasks in the first task set, and store query results; and a result acquisition submodule 403, configured to: if the second task set and the first task set have same subtasks with a same query condition and query data source, obtain a query result of the subtask in the first task set as a query result of the same subtask in the second task set.
  • a second set submodule 401 configured to determine, from the combined task based on the query dependence relationship, subtasks that depend on each subtask in the first task set, to form a second task set of the combined task
  • a first execution submodule 402 configured to execute the subtasks in the first task set, and store query results
  • a result acquisition submodule 403 configured to:
  • each subtask includes a query data source
  • the query data source includes at least two data sources of different physical architectures.
  • the merging module 203 further includes: a concurrence execution submodule 501, configured to concurrently execute the subtasks in the first task set; and a caching submodule 502, configured to place, into a cache, a query result of a subtask whose query duration meets a predetermined condition.
  • a concurrence execution submodule 501 configured to concurrently execute the subtasks in the first task set
  • a caching submodule 502 configured to place, into a cache, a query result of a subtask whose query duration meets a predetermined condition.
  • the apparatus further includes: a query logic determining module 601, configured to determine first query logic from a query task based on a predetermined determining condition, the first query logic including at least one subtask in the same query task; a result storage module 602, configured to store a query result of the first query logic; and a result acquisition module 603, configured to: if determining that query logic that is the same as the first query logic exists in the combined task, obtain the query result of the first query logic as a query result of the query logic that is the same as the first query logic.
  • a query logic determining module 601 configured to determine first query logic from a query task based on a predetermined determining condition, the first query logic including at least one subtask in the same query task
  • a result storage module 602 configured to store a query result of the first query logic
  • a result acquisition module 603 configured to: if determining that query logic that is the same as the first query logic exists in the combined task, obtain the query result of the first query logic as a query result of the query logic
  • an apparatus embodiment corresponds to a method embodiment, for related parts, references can be made to related descriptions in the method embodiment.
  • the apparatus embodiment described above is merely an example.
  • the units described as separate parts can or cannot be physically separate, and parts displayed as units can or cannot be physical units, can be located in one position, or can be distributed on a plurality of network units. Some or all of the modules can be selected based on actual needs to achieve the objectives of the solutions in at least one embodiment of the present specification. A person of ordinary skill in the art can understand and implement the embodiments of the present application without creative efforts.
  • the present specification further provides a computer device, including a memory, a processor, and a computer program that is stored in the memory and that can run on the processor.
  • the processor can implement the query optimization method in any implementation of the present specification when executing the program.
  • the present specification further provides a computer readable storage medium, and the computer readable storage medium stores a computer program. Steps of the query optimization method in any implementation of the present specification are implemented when the program is executed by a processor.
  • the non-transitory computer-readable storage medium can be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc. No limitation is imposed in the present specification.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Operations Research (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present specification provides a query optimization method and apparatus. The method includes: obtaining a plurality of to-be-optimized query tasks, each query task including at least one subtask, and when a same query task includes at least two subtasks, a query dependence relationship existing between the at least two subtasks; combining the plurality of query tasks based on the query dependence relationship to obtain one combined task; and merging at least two subtasks in different query tasks based on the combined task to obtain a first merged subtask, the first merged subtask being executed to obtain query results of the at least two subtasks in the different query tasks.

Description

    TECHNICAL FIELD
  • The present specification relates to the field of Internet technologies, and in particular, to a query optimization method and apparatus.
  • BACKGROUND
  • In an online query scenario, there is often a query requirement of high concurrency, high throughput, and low delay. For example, in an online transaction processing (OLTP) query scenario, there may be tens of thousands of batch query requests per second. Each batch query request may include dozens or even hundreds of query tasks, and a query response time is required to be within 10 ms. In such a scenario of a large number of query tasks, the query tasks need to be optimized to improve query efficiency.
  • In the related query optimization technology, a query logic of a single query task can be optimized only in a range of the query task. However, many repeated query logics still exist in different query tasks. Therefore, query optimization in the range of the single query task still has low query efficiency.
  • SUMMARY
  • At least one embodiment of the present specification provides a query optimization method to improve query efficiency of a query task.
  • According to a first aspect, a query optimization method is provided, the method including: obtaining a plurality of to-be-optimized query tasks, each query task including at least one subtask, and when a same query task includes at least two subtasks, a query dependence relationship existing between the at least two subtasks; combining the plurality of query tasks based on the query dependence relationship to obtain one combined task; and merging at least two subtasks in different query tasks based on the combined task to obtain a first merged subtask, the first merged subtask being executed to obtain query results of the at least two subtasks in the different query tasks.
  • According to a second aspect, a query optimization apparatus is provided, the apparatus including: an acquisition module, configured to obtain a plurality of to-be-optimized query tasks, each query task including at least one subtask, and when a same query task includes at least two subtasks, a query dependence relationship existing between the at least two subtasks; a combination module, configured to combine the plurality of query tasks based on the query dependence relationship to obtain one combined task; and a merging module, configured to merge at least two subtasks in different query tasks based on the combined task to obtain a first merged subtask, the first merged subtask being executed to obtain query results of the at least two subtasks in the different query tasks.
  • According to a third aspect, a computer device is provided, including a memory, a processor, and a computer program that is stored in the memory and that can run on the processor, the processor implementing the query optimization method in any embodiment of the present specification when executing the program.
  • According to a fourth aspect, a computer readable storage medium is provided, the computer readable storage medium storing a computer program, and steps of the query optimization method in any embodiment of the present specification being implemented when the program is executed by a processor.
  • It can be learned from the above technical solution that, in the at least one embodiment of the present specification, a plurality of query tasks are combined to obtain a combined task, and subtasks in different query tasks are merged into one task based on the complete combined task. Because subtasks optimized based on the combined task belong to different query tasks, in this solution, query optimization in ranges of a plurality of different query tasks is implemented. Compared with an optimization method of a single query task, repeated query logics in different query tasks can be merged and optimized, thereby improving query efficiency. Further, because tasks can be merged in the ranges of the plurality of different query task, a high throughput of hundreds of thousands or even hundreds of thousands of batch queries per second is possible.
  • It should be understood that the general descriptions above and the detailed descriptions below are merely examples for description, and cannot limit the present specification.
  • BRIEF DESCRIPTION OF DRAWINGS
    • FIG. 1 is a flowchart illustrating a query optimization method according to an example embodiment;
    • FIG. 2 is a schematic diagram illustrating a combined task according to an example embodiment;
    • FIG. 3 is a schematic diagram illustrating a query optimization apparatus according to an example embodiment;
    • FIG. 4 is a schematic diagram illustrating another query optimization apparatus according to an example embodiment;
    • FIG. 5 is a schematic diagram illustrating another query optimization apparatus according to an example embodiment;
    • FIG. 6 is a schematic diagram illustrating another query optimization apparatus according to an example embodiment; and
    • FIG. 7 is a schematic diagram illustrating another query optimization apparatus according to an example embodiment.
    DETAILED DESCRIPTION
  • Embodiments are described in detail herein, and examples of which are presented in the accompanying drawings. When the following description refers to the accompanying drawings, unless specified otherwise, same numbers in different accompanying drawings represent same or similar elements. Implementations described in the following example embodiments do not represent all implementations consistent with the present specification. On the contrary, the implementations are merely examples of apparatuses and methods that are described in the appended claims in details and consistent with some aspects of the present specification.
  • The terms used in the present specification are merely for illustrating specific embodiments, and are not intended to limit the present specification. The terms "a" and "the" of singular forms used in the present specification and the appended claims are also intended to include plural forms, unless otherwise specified in the context clearly. It should be further understood that the term "and/or" used in the present specification indicates and includes any or all possible combinations of one or more associated listed items.
  • It should be understood that although terms "first", "second", "third", etc. may be used in the present specification to describe various types of information, the information is not limited to the terms. These terms are merely used to differentiate between information of the same type. For example, without departing from the scope of the present specification, first information can also be referred to as second information, and similarly, the second information can be referred to as the first information. Depending on the context, for example, the word "if" used herein can be explained as "while", "when", or "in response to determining".
  • In an online query scenario, there is often a query requirement of high concurrency, high throughput, and low delay. For example, in an online transaction processing (OLTP) query scenario during promotion of online retailers, there may be tens of thousands of batch query requests per second. Each batch query request may include dozens or even hundreds of query tasks, and a query response time is required to be within 10 ms. In such a scenario of a large number of query tasks, the query tasks need to be optimized to improve query efficiency.
  • In the related query optimization technology, a query logic of a single query task can be optimized only in a range of the query task. For example, two same query logics in a same query task are merged into one query logic, thereby avoiding repeated execution of a same query logic. However, many repeated query logics still exist in different query tasks. When different query tasks are executed, many same query logics are repeatedly executed. It can be learned that, in a method for optimizing a single query task, query efficiency is still low and even query resources are wasted.
  • Therefore, the present specification discloses a query optimization method to merge a plurality of query tasks into one combined task. In ranges of different query tasks included in the combined task, query logics that belong to different query tasks can also be optimized.
  • It should be noted that the query optimization method provided in the present specification can be applied to various online queries that can include but are not limited to an OLTP query or an online analytical processing (OLAP) query.
  • To make the query optimization method provided in the present specification clearer, the following describes the solution execution process provided in the present specification in detail with reference to the accompanying drawings and the embodiments.
  • FIG. 1 is a flowchart illustrating a query optimization method according to an embodiment of the present specification. As shown in FIG. 1, the process includes the following steps.
  • Step 101: Obtain a plurality of to-be-optimized query tasks, each query task including at least one subtask, and when a same query task includes at least two subtasks, a query dependence relationship existing between the at least two subtasks.
  • In this embodiment, the query task is a process of obtaining a final query result from a data source based on a query condition. For example, when the query condition is "male name", a "name" that meets the condition can be found from a corresponding data source based on the query condition, and used as a query result. This is a complete query task.
  • Each query task includes different query logics, and a query result of the query task can be finally obtained by executing the query logics in the query task. In this embodiment, different query logics are represented by using subtasks in a query task, and each query task can include one or more subtasks.
  • For example, when the query condition is "male name", the query task can include only one subtask: a name that meets the condition is found from a data source based on the query condition "male". That is, a query result of the query task can be obtained by querying the subtask.
  • For example, when the query condition is "name of a male taller than 175 cm", the query task can include two subtasks: subtask 1: name information that meets a height condition needs to be found from a data source based on the query condition "taller than 175 cm", and is used as a query result of subtask 1; and subtask 2: a name that meets a gender condition needs to be found from the query result of subtask 1 based on the query condition "male".
  • That is, a query result of the query task needs to be obtained after the two subtasks are both queried. Based on a same principle, it can be understood that when a query condition is more complex, a query task may include three or more subtasks.
  • When a same query task includes two or more subtasks, a query dependence relationship exists between the plurality of subtasks. For example, if execution of one subtask in a query task needs to depend on an execution result of another subtask in the query task, a query dependence relationship exists between the two subtasks.
  • FIG. 2 illustrates a plurality of query tasks in a form of a directed acyclic graph (DAG): query task 1, query task 2, and query task 3. Query task 1 is used as an example for description. Node "Input1" represents a query entry in which a query condition of a query task including the node can be input. Node "A" represents a query data source corresponding to the query task including the node, such as an HBASE database or a MySQL database. Node "Join1" represents a query result obtained through query based on the query condition in "Input1" and the query data source in "A".
  • Referring to FIG. 2, query task 1 includes only subtask a, and subtask a is obtained by connecting the three nodes "Input1", "A", and "Join1". When subtask a is executed, query can be performed based on the query condition in Input1 and the query data source in A, to obtain a query result of subtask a. Because query task 1 includes only subtask a, the query result of subtask a is a query result of query task 1.
  • Referring to FIG. 2, query task 2 includes two subtasks: subtask b and subtask c.
  • Subtask b is obtained by connecting three nodes: "Input2", "B", and "Join2". Executing subtask b can obtain a query result of subtask b. A process of obtaining the query result of subtask b is the same as the process of obtaining the query result of subtask a, and details are omitted for simplicity.
  • Subtask c is obtained by connecting three nodes: "Join2", "C", and "Join3". "Join2" further exists as a query condition of subtask c while representing the query result of subtask b. When subtask c is executed, query can be performed based on the query result "Join2" of subtask b and a query data source in C, to obtain a query result of subtask c. Because no other subtask needs to be executed after subtask c, the obtained query result of subtask c is a query result of the query task 2. Based on a same principle as query task 2, query task 3 includes subtask d and subtask e, and subtask e depends on subtask d.
  • For query task 2 shown in FIG. 2, because execution of subtask c depends on the query result of subtask b as the query condition, a query dependence relationship exists between subtask b and subtask c. In addition, the query dependence relationship is specifically that subtask c depends on subtask b.
  • Step 102: Combine the plurality of query tasks based on the query dependence relationship to obtain one combined task.
  • In the present step, a plurality of different query tasks can be combined based on a query dependence relationship between subtasks in each query task, to obtain a complete combined task.
  • A to-be-connected subtask can be determined based on the query dependence relationship between the subtasks in each query task. FIG. 2 is used as an example. Because query task 1 includes only subtask a, subtask a can be used as a to-be-connected subtask. Based on the query dependence relationship between subtask b and subtask c in query task 2, subtask c can be used as a to-be-connected subtask because subtask c is the last subtask when query task 2 is executed. Based on the query dependence relationship between subtask d and subtask e in query task 3, subtask e can be used as a to-be-connected subtask.
  • The to-be-connected subtasks are connected to connect the query tasks together to obtain a complete combined task. For example, query task 1, query task 2, and query task 3 can be connected by using two nodes "JOIN1" and "JOIN2", to obtain complete combined task 1. Nodes "JOIN1" and "JOIN2" do not affect final query results of the connected different query tasks, that is, a query result of each query task obtained based on combined task 1 is the same as a query result obtained by separately executing each query task.
  • In an example, each subtask includes a query data source, and query data sources can be data sources of different physical architectures. FIG. 2 is used as an example. Query data source "A" of subtask a can be an HBASE database. Query data source "B" of subtask b can be a MySQL database. Query data source "C" of subtask c cam be an HBASE database. Between different subtasks in a same query task, query data sources can be heterogeneous data sources. Alternatively, between subtasks in different query tasks, query data sources can be heterogeneous data sources. In this embodiment, a physical structure of a data source in a subtask is not limited, and a query data source in any subtask can be a heterogeneous data source.
  • In an example, before the combining the plurality of query tasks to obtain the one combined task, the method further includes: determining a first query logic from a query task based on a predetermined determining condition, the first query logic including at least one subtask in the same query task; and storing a query result of the first query logic. After the combining the plurality of query tasks to obtain the one combined task, the method further includes: if determining that a query logic that is the same as the first query logic exists in the combined task, using the query result of the first query logic as a query result of the query logic that is the same as the first query logic.
  • In the above example, before a subtask of a query task in the combined task is optimized, a common query logic can be determined in the same query task as the first query logic based on the predetermined determining condition. The predetermined determining condition can be a number of execution times, an execution frequency, etc. For example, if a number of execution times of a certain query logic is greater than a predetermined number of times, it can be determined that the query logic is a common query logic and is used as the first query logic. The above query logic can be one subtask or a plurality of subtasks in a same query task that have a query dependence relationship.
  • When a subtask in the combined task is executed, if determining that a query logic that is the same as the first query logic exists, the query result of the first query logic can be directly obtained without repeatedly executing the same query logic.
  • Combined task 1 shown in FIG. 2 is used as an example. Before a subtask in combined task 1 is executed, the first query logic can be first determined. For example, subtask b can be used as the first query logic, or query task 2 including subtask b and subtask c can be used as the first query logic. If the determining condition is that the number of execution times is greater than the predetermined number of times, subtask b is used as the first query logic when a number of execution times of subtask b is greater than the predetermined number of times to execute subtask b and store the query result.
  • When a subtask in combined task 1 is executed, if determining that a query logic "subtask b" that is the same as the first query logic exists in combined task 1, the stored query result of the first query logic can be used as the query result of subtask b without repeatedly executing a query process of subtask b.
  • Step 103: Merge at least two subtasks in different query tasks based on the combined task to obtain a first merged subtask, the first merged subtask being executed to obtain query results of the at least two subtasks in the different query tasks.
  • In the related query optimization technology, only a single query task can be optimized. FIG. 2 is used as an example. In the related technology, query optimization can be performed only on query task 1, query optimization can be performed only on query task 2, or query optimization can be performed only on query task 3.
  • In the present step, at least two subtasks that belong to different query tasks can be merged based on the combined task obtained by combining different query tasks, to obtain the first merged subtask. As such, query results of a plurality of subtasks can be obtained by performing the first merged subtask once. A method for merging subtasks can be merging a plurality of same subtasks into one task, to remove a repeated subtask. Alternatively, the plurality of subtasks can be merged into one task based on a same part of the plurality of subtasks. For example, subtasks whose storage paths have a same part can be merged into one executable task.
  • Combined task 1 shown in FIG. 2 is used as an example. In the present step, subtasks that belong to different query tasks can be merged based on combined task 1. For example, if subtask a in query task 1 is the same as subtask b in query task 2, subtask a and subtask b can be merged into a first merged subtask. As such, the query results of both subtask a and subtask b can be obtained by executing the first merged subtask only once. Specifically, for example, subtask a can be executed only once, or subtask b can be executed only once. A same determining condition of the two subtasks can be: query conditions and query data sources that are included in the two subtasks are the same. In this method, repeatedly executing a query process of a subtask is removed, and an effect of query optimization is implemented. In addition, in the "de-duplication optimization" process, because subtask a and subtask b belong to different query tasks, this optimization method implements cross-query task optimization.
  • It should be noted that in the combined task, in addition to the above "de-duplication optimization" method, there are many specific methods for merging subtasks in different query tasks. No limitation is imposed in this embodiment. For example, if subtask a is the same as subtask c, subtask a and subtask c can be merged to obtain a first merged subtask. The query results of both subtask a and subtask c are obtained by executing the first merged subtask.
  • In the query optimization method in this embodiment, subtasks in the combined task for optimization belong to different query tasks, cross-query task optimization is implemented based on the combined task in this method. Compared with optimization of a single query task, query efficiency is improved. Further, because cross-query task optimization is implemented in the method, repeated queries existing in different query tasks can be greatly reduced, and query resources can be saved.
  • In an example, the merging the at least two subtasks in different query tasks to obtain the first merged subtask includes: determining a non-dependent subtask in each query task based on the query dependence relationship to form a first task set of the combined task; and merging at least two subtasks in the first task set to obtain the first merged subtask.
  • In the above embodiment, a subtask that can be executed without depending on another subtask needs to be determined from each query task based on a query dependence relationship between subtasks in the query task, and subtasks determined from the query tasks form the first task set. Combined task 1 shown in FIG. 2 is used as an example. Subtask a in query task 1 can be executed without depending on another subtask. Subtask b in query task 2 does not depend on another subtask. Subtask d in query task 3 does not depend on another subtask. Therefore, subtask a, subtask b, and subtask d can form a set to obtain the first task set of combined task 1.
  • After the first task set of the combined task is determined, optimization can be performed based on a feature of the subtask in the first task set because all the subtasks in the first task set are non-dependent subtasks. In this embodiment, the at least two subtasks in the first task set can be merged to obtain the first merged subtask. As such, query results of a plurality of subtasks used for combination can be obtained by executing the first merged subtask once.
  • In this embodiment, because the subtasks in the first task set belong to different query tasks, the subtasks of the first task set are optimized, thereby implementing an effect of cross-query task optimization. Therefore, an optimization effect and query efficiency are improved. In addition, cross-query task optimization can reduce repeated query logics in different query tasks and save query resources.
  • In an example, after the forming the first task set of the combined task, the method further includes: concurrently executing the subtasks in the first task set; and placing, into a cache, a query result of a subtask whose query duration meets a predetermined condition.
  • In the above embodiment, a plurality of subtasks in the first task set can be executed concurrently. Therefore, in a process of concurrently executing the plurality of subtasks, query duration used by the subtasks is also different because specific query processes of the subtasks are different. Query duration of one or more subtasks may be longer (longer than query duration of another subtask), and query duration of concurrent execution may be affected by the one or more query tasks. In the above embodiment, the query result of the subtask whose query duration meets the predetermined condition can be placed into the cache to accelerate query. The predetermined condition can include but is not limited to a query duration threshold. For example, the predetermined condition can be: "query duration is longer than 10 ms".
  • In an example, each subtask includes a query condition and a query data source. The merging the at least two subtasks in the first task set to obtain the first merged subtask includes: merging at least two subtasks in the first task set that have a same query condition and query data source, to obtain the first merged subtask.
  • In the above embodiment, each subtask includes a query condition and a query data source.
  • As shown in FIG. 2, subtask a in combined task 1 includes the query condition input in node "Input1" and the query data source in node "A". Subtask b includes the query condition input in node "Input2" and the query data source in node "B". Subtask c can use the query result "Join2" of subtask b as the query condition, and further includes the query data source in node "C".
  • In this embodiment, subtasks that have a same query condition and query data source can be determined from the subtasks in the first task set for de-duplication.
  • If query conditions and query data sources of two subtasks are the same, it can be considered that the two subtasks are the same, and a same query result is obtained by executing the two subtasks. Therefore, a plurality of subtasks in the first task set that have a same query condition and query data source can be merged to obtain the first merged subtask. Query results of the plurality of subtasks can be obtained by executing the first merged subtask once. The first merged subtask can be one of a plurality of subtasks used for merging. Therefore, only one of the subtasks can be executed after the merging, to avoid repeatedly executing a same query logic.
  • FIG. 2 is used as an example. In the first task set, de-duplication can be performed if the query condition input in "Input1" of subtask a is the same as the query condition input in "Input2" of subtask b and query data source "A" of subtask a is the same as query data source "B" of subtask b. That is, subtask a and subtask b can be merged to obtain the first merged subtask, for example, subtask a is used as the first merged subtask. As such, the query results of both subtask a and subtask b are obtained by executing subtask a once.
  • In this embodiment, when the first task set is determined from the combined task, de-duplication optimization can be performed on the subtasks in the first task set. Because the subtasks in the first task set belong to different query tasks, in this de-optimization method, cross-query task optimization is implemented, repeated query logics in different query tasks are reduced, query efficiency is improved, and query resources are saved.
  • In an example, each subtask includes a query data source. The merging the at least two subtasks in the first task set to obtain the first merged subtask includes: determining a plurality of subtasks with a same query data source from the first task set; and merging at least two subtasks whose query paths have a same part in the plurality of subtasks with the same query data source, to obtain the first merged subtask.
  • In the above embodiment, when the first task set is determined from the combined task, subtasks whose storage paths have a same part can be merged.
  • FIG. 2 is used as an example. In the first task set, assume that the query data source of subtask a is an HBASE database, and the query data source of subtask b is also an HBASE database. In this embodiment, it can be determined from the first task set that subtask a and subtask b are subtasks with a same query data source.
  • For example, the HBASE database is used as the query data source. The following illustrates five query tasks:
    • Query 1: T1, KEY1, f, col1;
    • Query2: T1, KEY1, f, coll;
    • Query3: T1, KEY1, f, col2;
    • Query4: T1, KEY2, f, col3;
    • Query5: T2, KEY3, f, col4;
  • Each query task includes a table, a row, a column cluster, and a column. For example, when a query process of query task Query 1 is executed, a query path is successively as follows: table-Tl, row-KEY1, column cluster-f, and column-col1. Finally, a query result of query task Query 1 is obtained.
  • In this embodiment, subtasks whose query paths have a same part in subtasks with a same query data source can be merged into the first merged subtask.
  • The HBASE database is still used as an example. Assume that the first task set includes a total of five subtasks: query tasks Query1 to 5. Because query paths of two subtasks Query1 and Query2 are totally the same, only subtask Query1 can be executed, and a query result of Query2 can be obtained at the same time. Query1 can be referred to as a first merged subtask obtained by merging Query 1 and Query2.
  • Because query paths of two subtasks Query1 and Query3 are different only in terms of "column", the two query tasks can be merged into a "Get task". The "Get task" can be referred to as a first merged subtask obtained by merging Query 1 and Query3.
  • Because query paths of Query1 and the Query4 are different in terms of "row", Query4 and the "Get task" can be further merged into a "MultiGet task". The "MultiGet task" can be referred to as a first merged subtask obtained by merging Query1, Query3, and Query4.
  • Because query paths of Query5 and other subtasks are different in terms of "table", Query5 can only serve as a "new Get task". Based on the above merging processes, a total of five subtasks Query1 to 5 can be merged into two first merged subtasks: a "MultiGet task" and a "new Get task".
  • It should be noted that a specific implementation process of merging the subtasks in the first task set based on the HBASE database as a query data source is described above. For different query data sources such as the MySQL database, the subtasks in the first task set can be merged based on a query feature of the database. It can be understood that in the above merging process, merging is performed based on a same query path. In another feasible embodiment, another form of task merging can be performed based on a query feature of a query data source.
  • Query results of a plurality of subtasks used for merging can be obtained after the first merged subtask is executed. With reference to the above example of the HBASE database, query results of a total of four subtasks Query 1 to 4 can be obtained after the "MultiGet task" is executed.
  • In this embodiment, after a plurality of subtasks with a same data source are determined from the first task set, the subtasks can be merged based on the query path. In this method, a number of query times is reduced, a number of times of accessing a query data source is reduced, and a query time is reduced. In addition, in this merging method, cross-query task optimization is also implemented, query efficiency is improved, and query resources are saved.
  • In an example, after the forming the first task set of the combined task, the method further includes: determining, from the combined task based on the query dependence relationship, subtasks that depend on each subtask in the first task set, to form a second task set of the combined task; executing the subtasks in the first task set, and storing query results; and if the second task set and the first task set have same subtasks with a same query condition and query data source, using a query result of the subtask in the first task set as a query result of the same subtask in the second task set.
  • In the above embodiment, the subtasks in the second task set can be determined based on the subtasks in the first task set and a query dependence relationship between subtasks in each query task. The subtask in the second task set depends on the subtask in the first task set.
  • FIG. 2 is used as an example. Subtask c can be used as a subtask in the second task set because subtask c depends on subtask b in the first task set. Subtask e can be used as a subtask in the second task set because subtask e depends on subtask d in the first task set.
  • The subtasks in the first task set are executed to obtain and store query results. For example, a plurality of subtasks in the first task set are concurrently executed to obtain query results of the plurality of subtasks, and then the query results are correspondingly stored. For example, in FIG. 2, the subtasks in the first task set can be concurrently executed, and the query results are correspondingly stored.
  • Because the subtask in the second task set depends on the subtask in the first task set, a query result obtained after the subtask in the first task set is executed can be used as a query condition of the subtask in the second task set. In this case, the subtask in the second task set can be compared with a subtask that has been executed in the first task set. If the second task set includes a same subtask that has a same query condition and query data source as the subtask in the first task set, a query result of the corresponding same subtask can be directly obtained from the stored query results, and a query logic of the same subtask in the second task set does not need to be executed again.
  • Combined task 1 shown in FIG. 2 is used as an example. Combined task 1 includes the first task set and the second task set. A plurality of subtasks in the first task set are executed in advance, and corresponding query results are stored. Assume that query data source "E" of subtask e in the second task set is the same as query data source "B" of subtask b in the first task set and both are the HBASE database, and the query result of subtask d in the first task set is the same as the query condition input in "Input" of subtask b. In this case, it can be determined that subtask e in the second task set and subtask b in the first task set are same subtasks with a same query condition and query data source. Because subtask b has been already executed and the query result is stored, the query result of subtask b can be directly obtained as the query result of subtask e without executing subtask e.
  • In this embodiment, query results of executed subtasks can be stored. When a subtask is executed subsequently, a corresponding query result is directly obtained if a same subtask exists. In a process of optimizing the subtask in the second task set by using the subtask in the first task set, cross-query task optimization is still implemented, repeated queries in different query tasks are reduced, query efficiency is improved, and query resources are saved.
  • As shown in FIG. 3, the present specification provides a query optimization apparatus, and the apparatus can execute the query optimization method in any embodiment of the present specification. The apparatus can include an acquisition module 201, a combination module 202, and a merging module 203.
  • The acquisition module 201 is configured to obtain a plurality of to-be-optimized query tasks, each query task including at least one subtask, and when a same query task includes at least two subtasks, a query dependence relationship existing between the at least two subtasks.
  • The combination module 202 is configured to combine the plurality of query tasks based on the query dependence relationship to obtain one combined task.
  • The merging module 203 is configured to merge at least two subtasks in different query tasks based on the combined task to obtain a first merged subtask, the first merged subtask being executed to obtain query results of the at least two subtasks in the different query tasks.
  • Optionally, as shown in FIG. 4, the merging module 203 includes: a first set submodule 301, configured to determine a non-dependent subtask in each query task based on the query dependence relationship to form a first task set of the combined task; and a merging submodule 302, configured to merge at least two subtasks in the first task set to obtain the first merged subtask.
  • Optionally, each subtask includes a query condition and a query data source, and that the merging submodule 302 is configured to merge the at least two subtasks in the first task set to obtain the first merged subtask includes: merging at least two subtasks in the first task set that have a same query condition and query data source, to obtain the first merged subtask.
  • Optionally, each subtask includes a query data source, and that the merging submodule 302 is configured to merge the at least two subtasks in the first task set to obtain the first merged subtask includes: determining a plurality of subtasks with a query data source from the first task set; and merging at least two subtasks whose query paths have a same part in the plurality of subtasks with the same query data source, to obtain the first merged subtask.
  • Optionally, as shown in FIG. 5, the merging module 203 further includes: a second set submodule 401, configured to determine, from the combined task based on the query dependence relationship, subtasks that depend on each subtask in the first task set, to form a second task set of the combined task; a first execution submodule 402, configured to execute the subtasks in the first task set, and store query results; and a result acquisition submodule 403, configured to: if the second task set and the first task set have same subtasks with a same query condition and query data source, obtain a query result of the subtask in the first task set as a query result of the same subtask in the second task set.
  • Optionally, each subtask includes a query data source, and the query data source includes at least two data sources of different physical architectures.
  • Optionally, as shown in FIG. 6, the merging module 203 further includes: a concurrence execution submodule 501, configured to concurrently execute the subtasks in the first task set; and a caching submodule 502, configured to place, into a cache, a query result of a subtask whose query duration meets a predetermined condition.
  • Optionally, as shown in FIG. 7, the apparatus further includes: a query logic determining module 601, configured to determine first query logic from a query task based on a predetermined determining condition, the first query logic including at least one subtask in the same query task; a result storage module 602, configured to store a query result of the first query logic; and a result acquisition module 603, configured to: if determining that query logic that is the same as the first query logic exists in the combined task, obtain the query result of the first query logic as a query result of the query logic that is the same as the first query logic.
  • For an implementation process of functions and roles of each module in the apparatus, references can be made to an implementation process of a corresponding step in the above method. Details are omitted herein for simplicity.
  • Because an apparatus embodiment corresponds to a method embodiment, for related parts, references can be made to related descriptions in the method embodiment. The apparatus embodiment described above is merely an example. The units described as separate parts can or cannot be physically separate, and parts displayed as units can or cannot be physical units, can be located in one position, or can be distributed on a plurality of network units. Some or all of the modules can be selected based on actual needs to achieve the objectives of the solutions in at least one embodiment of the present specification. A person of ordinary skill in the art can understand and implement the embodiments of the present application without creative efforts.
  • The present specification further provides a computer device, including a memory, a processor, and a computer program that is stored in the memory and that can run on the processor. The processor can implement the query optimization method in any implementation of the present specification when executing the program.
  • The present specification further provides a computer readable storage medium, and the computer readable storage medium stores a computer program. Steps of the query optimization method in any implementation of the present specification are implemented when the program is executed by a processor.
  • The non-transitory computer-readable storage medium can be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc. No limitation is imposed in the present specification.
  • A person skilled in the art can easily figure out another implementation of the present specification after thinking over the specification and practicing the present specification herein. The present specification is intended to cover any variations, uses, or adaptations of the present specification, and these variations, uses, or adaptations follow the general principles of the present specification and include common knowledge or conventional techniques that are not disclosed in the technical field of the present specification. The present specification and the embodiments are merely considered as examples, and the actual scope and the spirit of the present specification are pointed out by the following claims.
  • It should be understood that the present specification is not limited to the precise structures that have been described above and illustrated in the accompanying drawings, and various modifications and changes can be made without departing from the scope of the present specification. The scope of the present specification is limited by the appended claims only.
  • The previous descriptions are merely preferred embodiments of the present specification, but are not intended to limit the present specification. Any modification, equivalent replacement, or improvement made without departing from the spirit and principle of the present specification shall fall within the protection scope of the present specification.

Claims (15)

  1. A query optimization method, the method comprising:
    obtaining a plurality of to-be-optimized query tasks, each query task including at least one subtask, and when a same query task includes at least two subtasks, a query dependence relationship existing between the at least two subtasks;
    combining the plurality of query tasks based on the query dependence relationship to obtain one combined task; and
    merging at least two subtasks in different query tasks based on the combined task to obtain a first merged subtask, the first merged subtask being executed to obtain query results of the at least two subtasks in the different query tasks.
  2. The method according to claim 1, wherein the merging the at least two subtasks in the different query tasks to obtain the first merged subtask includes:
    determining a non-dependent subtask in each query task based on the query dependence relationship to form a first task set of the combined task; and
    merging at least two subtasks in the first task set to obtain the first merged subtask.
  3. The method according to claim 2, wherein each subtask includes a query condition and a query data source; and
    the merging the at least two subtasks in the first task set to obtain the first merged subtask includes:
    merging at least two subtasks in the first task set that have a same query condition and query data source, to obtain the first merged subtask.
  4. The method according to claim 2, wherein each subtask includes a query data source; and
    the merging the at least two subtasks in the first task set to obtain the first merged subtask includes:
    determining a plurality of subtasks with a same query data source from the first task set; and
    merging at least two subtasks whose query paths have a same part in the plurality of subtasks with the same query data source, to obtain the first merged subtask.
  5. The method according to claim 2, after the forming the first task set of the combined task, further comprising:
    determining, from the combined task based on the query dependence relationship, subtasks that depend on each subtask in the first task set, to form a second task set of the combined task;
    executing the subtasks in the first task set, and storing query results; and
    if the second task set and the first task set have same subtasks with a same query condition and query data source, using a query result of the subtask in the first task set as a query result of the same subtask in the second task set.
  6. The method according to claim 1 or 2, wherein each subtask includes a query data source, and the query data source includes at least two data sources of different physical architectures.
  7. The method according to claim 2, after the forming the first task set of the combined task, further comprising:
    concurrently executing the subtasks in the first task set; and
    placing, into a cache, a query result of a subtask whose query duration meets a predetermined condition.
  8. The method according to claim 1, before the combining the plurality of query tasks to obtain the one combined task, further comprising:
    determining a first query logic from a query task based on a predetermined determining condition, the first query logic including at least one subtask in the same query task; and
    storing a query result of the first query logic; and
    after the combining the plurality of query tasks to obtain the one combined task, further comprising:
    if determining that a query logic that is the same as the first query logic exists in the combined task, using the query result of the first query logic as a query result of the query logic that is the same as the first query logic.
  9. A query optimization apparatus, the apparatus comprising:
    an acquisition module, configured to obtain a plurality of to-be-optimized query tasks, each query task including at least one subtask, and when a same query task includes at least two subtasks, a query dependence relationship existing between the at least two subtasks;
    a combination module, configured to combine the plurality of query tasks based on the query dependence relationship to obtain one combined task; and
    a merging module, configured to merge at least two subtasks in different query tasks based on the combined task to obtain a first merged subtask, the first merged subtask being executed to obtain query results of the at least two subtasks in the different query tasks.
  10. The apparatus according to claim 9, wherein the merging module includes:
    a first set submodule, configured to determine a non-dependent subtask in each query task based on the query dependence relationship to form a first task set of the combined task; and
    a merging submodule, configured to merge at least two subtasks in the first task set to obtain the first merged subtask.
  11. The apparatus according to claim 10, wherein each subtask includes a query condition and a query data source; and
    that the merging submodule is configured to merge the at least two subtasks in the first task set to obtain the first merged subtask includes:
    merging at least two subtasks in the first task set that have a same query condition and query data source, to obtain the first merged subtask.
  12. The apparatus according to claim 10, wherein each subtask includes a query data source; and
    that the merging submodule is configured to merge the at least two subtasks in the first task set to obtain the first merged subtask includes:
    determining a plurality of subtasks with a same query data source from the first task set; and
    merging at least two subtasks whose query paths have a same part in the plurality of subtasks with the same query data source, to obtain the first merged subtask.
  13. The apparatus according to claim 10, wherein the merging module further includes:
    a second set submodule, configured to determine, from the combined task based on the query dependence relationship, subtasks that depend on each subtask in the first task set, to form a second task set of the combined task;
    a first execution submodule, configured to execute the subtasks in the first task set, and store query results; and
    a result acquisition submodule, configured to: if the second task set and the first task set have same subtasks with a same query condition and query data source, obtain a query result of the subtask in the first task set as a query result of the same subtask in the second task set.
  14. A computer device, comprising a memory, a processor, and a computer program that is stored in the memory and that can run on the processor, the processor implementing the method according to any one of claims 1 to 8 when executing the program.
  15. A computer readable storage medium, the computer readable storage medium storing a computer program, and steps of the method according to any one of claims 1 to 8 being implemented when the program is executed by a processor.
EP21182225.9A 2020-08-03 2021-06-29 Query optimization method and apparatus Pending EP3951609A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010766906.XA CN111913986B (en) 2020-08-03 2020-08-03 Query optimization method and device

Publications (1)

Publication Number Publication Date
EP3951609A1 true EP3951609A1 (en) 2022-02-09

Family

ID=73288035

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21182225.9A Pending EP3951609A1 (en) 2020-08-03 2021-06-29 Query optimization method and apparatus

Country Status (3)

Country Link
US (1) US20220035813A1 (en)
EP (1) EP3951609A1 (en)
CN (1) CN111913986B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506953A (en) * 2020-12-11 2021-03-16 中国邮政储蓄银行股份有限公司 Query method, device and storage medium based on Structured Query Language (SQL)
CN113806049A (en) * 2021-09-17 2021-12-17 金蝶软件(中国)有限公司 Task queuing method and device, computer equipment and storage medium
CN115081235A (en) * 2022-06-30 2022-09-20 北京奇艺世纪科技有限公司 Feature processing method, feature processing apparatus, storage medium, and electronic device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060253422A1 (en) * 2005-05-06 2006-11-09 Microsoft Corporation Efficient computation of multiple group by queries
US20110082859A1 (en) * 2009-10-07 2011-04-07 International Business Machines Corporation Information theory based result merging for searching hierarchical entities across heterogeneous data sources
WO2017019879A1 (en) * 2015-07-29 2017-02-02 Algebraix Data Corp. Multi-query optimization

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8694524B1 (en) * 2006-08-28 2014-04-08 Teradata Us, Inc. Parsing a query
US10255583B2 (en) * 2007-05-01 2019-04-09 Oracle International Corporation Nested hierarchical rollups by level using a normalized table
US8352459B2 (en) * 2009-07-23 2013-01-08 International Business Machines Corporation SQL queries for simple network management protocol management information base tables
JP5443945B2 (en) * 2009-10-28 2014-03-19 株式会社ビーイング Project planning device and project planning program
US9183254B1 (en) * 2012-05-04 2015-11-10 Paraccel Llc Optimizing database queries using subquery composition
CN103678303A (en) * 2012-08-31 2014-03-26 国际商业机器公司 Method and system used for group set inquiry of database
US9529848B2 (en) * 2013-12-06 2016-12-27 Vmware, Inc. Predictive query result computation
CN107402926B (en) * 2016-05-18 2021-02-23 华为技术有限公司 Query method and query device
CN110263105B (en) * 2019-05-21 2021-09-10 北京百度网讯科技有限公司 Query processing method, query processing system, server, and computer-readable medium
US11409744B2 (en) * 2019-08-01 2022-08-09 Thoughtspot, Inc. Query generation based on merger of subqueries
CN110532084B (en) * 2019-09-04 2024-05-31 深圳前海微众银行股份有限公司 Platform task scheduling method, device, equipment and storage medium
CN111190924A (en) * 2019-12-18 2020-05-22 中思博安科技(北京)有限公司 Cross-domain data query method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060253422A1 (en) * 2005-05-06 2006-11-09 Microsoft Corporation Efficient computation of multiple group by queries
US20110082859A1 (en) * 2009-10-07 2011-04-07 International Business Machines Corporation Information theory based result merging for searching hierarchical entities across heterogeneous data sources
WO2017019879A1 (en) * 2015-07-29 2017-02-02 Algebraix Data Corp. Multi-query optimization

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHUAN ZHANG ET AL: "An Evolutionary Approach to Materialized Views Selection in a Data Warehouse Environment", IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: PART C:APPLICATIONS AND REVIEWS, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 31, no. 3, 1 August 2001 (2001-08-01), XP011057265, ISSN: 1094-6977 *
LING CHEN ET AL: "Query grouping-based multi-query optimization framework for interactive SQL query engines on Hadoop", CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE, WILEY, LONDON, GB, vol. 30, no. 19, 9 August 2018 (2018-08-09), pages n/a, XP072309082, ISSN: 1532-0626, DOI: 10.1002/CPE.4676 *
PIETRO MICHIARDI ET AL: "Cache-based Multi-query Optimization for Data-intensive Scalable Computing Frameworks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 22 May 2018 (2018-05-22), XP080881072 *
TIMOS K SELLIS: "Multiple-query optimization", ACM TRANSACTIONS ON DATABASE SYSTEMS, ACM, NEW YORK, NY, US, vol. 13, no. 1, 1 March 1988 (1988-03-01), pages 23 - 52, XP058138822, ISSN: 0362-5915, DOI: 10.1145/42201.42203 *

Also Published As

Publication number Publication date
CN111913986A (en) 2020-11-10
US20220035813A1 (en) 2022-02-03
CN111913986B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
EP3951609A1 (en) Query optimization method and apparatus
US10754874B2 (en) Query dispatching system and method
US10185744B2 (en) Caching views on historical data
US8943103B2 (en) Improvements to query execution in a parallel elastic database management system
US10191932B2 (en) Dependency-aware transaction batching for data replication
US11003649B2 (en) Index establishment method and device
US10706052B2 (en) Method for performing in-memory hash join processing in relational database systems
US20170116271A1 (en) Static data caching for queries with a clause that requires multiple iterations to execute
US8010554B1 (en) Processing a temporal aggregate query in a database system
US8301652B2 (en) Parallel algorithm to implement temporal aggregate queries
US11475006B2 (en) Query and change propagation scheduling for heterogeneous database systems
CN110837585A (en) Multi-source heterogeneous data association query method and system
US11449550B2 (en) Ad-hoc graph definition
CN112988782B (en) Hive-supported interactive query method and device and storage medium
US20170270162A1 (en) Query optimization method in distributed query engine and apparatus thereof
US20140188924A1 (en) Techniques for ordering predicates in column partitioned databases for query optimization
US9552392B2 (en) Optimizing nested database queries that include windowing operations
JP2018509666A (en) Method and apparatus for determining a SQL execution plan
US9870399B1 (en) Processing column-partitioned data for row-based operations in a database system
US10997175B2 (en) Method for predicate evaluation in relational database systems
WO2022006794A1 (en) Routing directives for partitioned databases
US20190179930A1 (en) Optimistic concurrency control for database transactions
US11625399B2 (en) Methods and devices for dynamic filter pushdown for massive parallel processing databases on cloud
Qu et al. Distributed snapshot maintenance in wide-column NoSQL databases using partitioned incremental ETL pipelines
US9378229B1 (en) Index selection based on a compressed workload

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220725

RBV Designated contracting states (corrected)

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20230918