CN116662449B - OLAP query optimization method and system based on broadcast sub-query cache

Info

Publication number: CN116662449B
Application number: CN202310704298.3A
Authority: CN
Inventors: 吕彪; 程鹏; 戚依宁; 方崇荣; 陈积明
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2023-06-14
Filing date: 2023-06-14
Publication date: 2024-06-04
Anticipated expiration: 2043-06-14
Also published as: CN116662449A

Abstract

The invention discloses an OLAP query optimization method and system based on broadcast sub-query caching. In time-series data analysis scenarios, the query result caching schemes of existing OLAP systems have a very low cache hit rate. The invention provides a new, finer-grained and more flexible caching scheme that caches the local results of sub-query operators, which avoids the design limitation of query result caching and can be applied directly to distributed big data analysis systems. Through a cache broadcasting mechanism, the scheme can be applied directly in a cluster environment, so that the sub-query cache is available on all nodes, the overall cache hit rate is improved, and OLAP query performance is accelerated by making full use of the distributed cluster's capacity.

Description

OLAP query optimization method and system based on broadcast sub-query cache

Technical Field

The invention relates to the field of cloud network observability data analysis, and in particular to a method and system for accelerating OLAP query performance.

Background

OLAP (Online Analytical Processing) is mainly used for querying data. As OLAP has developed, OLAP products have emerged one after another, and most OLAP systems are based on either a ROLAP (relational OLAP) system or a single MOLAP (multidimensional OLAP) system.

With the rapid growth of commercial data volumes, conventional stand-alone OLAP databases can no longer meet users' needs, and modern OLAP systems generally adopt a distributed architecture. A distributed OLAP system performs distributed optimization of SQL queries and can therefore support querying and analyzing massive data.

In modern OLAP systems, the most common SQL query optimization is the query result cache (Query Result Cache), which caches the result set of a query statement. If the same query is issued again, the result is read directly from the result set cache without re-execution, which greatly improves query performance.

However, as the volume of time-series queries grows, a design constraint of the query result cache becomes a problem: the SQL of each query must be exactly identical to hit the cache. In time-series queries, only the inner sub-query stays the same between requests, while the time parameters of the outer SQL differ. The cache therefore cannot be hit effectively, which degrades the query throughput of the whole system.
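
The hit-rate problem can be illustrated with a small sketch. The SQL text, the table names `metrics` and `hosts`, and the helpers `cache_key`, `build_query` and `inner_subquery` are hypothetical and not taken from the patent; the string-based sub-query extraction is a toy stand-in for real plan analysis. The sketch only shows that a key over the whole statement changes whenever the outer time window moves, while a key over the unchanged inner sub-query does not.

```python
import hashlib


def cache_key(text: str) -> str:
    """Return an MD5 hex digest used as a cache key (illustrative helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_query(start: str, end: str) -> str:
    """Build a time-series query whose outer time window changes per request."""
    return (
        "SELECT ts, avg(latency) FROM metrics "
        f"WHERE ts BETWEEN '{start}' AND '{end}' "
        "AND host_id IN (SELECT host_id FROM hosts WHERE region = 'cn-hangzhou') "
        "GROUP BY ts"
    )


def inner_subquery(sql: str) -> str:
    """Toy extraction of the text inside 'IN (...)'; assumes no nested parentheses."""
    start = sql.index("IN (") + len("IN (")
    end = sql.index(")", start)
    return sql[start:end]


# Two consecutive requests that differ only in the outer time window.
q1 = build_query("2023-06-14 10:00:00", "2023-06-14 10:05:00")
q2 = build_query("2023-06-14 10:01:00", "2023-06-14 10:06:00")

# Keyed on the full statement, the second request cannot reuse the first result.
full_statement_hit = cache_key(q1) == cache_key(q2)

# Keyed on the inner sub-query, the second request can reuse the cached result.
subquery_hit = cache_key(inner_subquery(q1)) == cache_key(inner_subquery(q2))
```

Here `full_statement_hit` is false and `subquery_hit` is true, which is the gap that sub-query caching is meant to close.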

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art by providing an OLAP query optimization method and system based on broadcast sub-query caching for accelerating OLAP query performance. Compared with the query result caching schemes of existing OLAP systems, the invention uses a finer-grained and more flexible caching algorithm, can be applied directly to distributed big data analysis systems, and accelerates OLAP query performance by making full use of the distributed cluster's capacity.

The object of the invention is achieved by the following technical solution:

According to a first aspect of the present specification, there is provided an OLAP query optimization method based on broadcast sub-query caching, the method including the steps of:

S1: when a management node receives an SQL query request for time-series data, it splits the request by operator into sub-queries and a final result aggregation query, and the execution results of the sub-query physical plans are cached on all working nodes of the OLAP cluster by means of a broadcast mechanism;

S2: when a working node executes an SQL query for time-series data, it looks up the sub-query physical plan in its local cache; on a cache hit, it proceeds directly to the next operator; on a cache miss, it executes the sub-query operators, caches the execution result of the sub-query physical plan locally, and propagates the result to all working nodes of the OLAP cluster by broadcast.

Further, on receiving an SQL query request for time-series data, the management node first parses the SQL into a logical plan, optimizes the logical plan into a physical plan, and then splits the physical plan by operator into sub-queries and a final result aggregation query.

Further, the working node hashes the received sub-query physical plan to obtain an identification ID and uses that ID to try to fetch the cached result of the sub-query physical plan from the local cache module. If a cached result is found, it is loaded into the execution pipeline and execution enters the hash join operator stage. Otherwise, the sub-query physical plan is executed locally, the execution result is loaded into the execution pipeline, and execution enters the hash join operator stage; at the same time, the identification ID of the sub-query physical plan and the execution result are combined into a cache entry and written to the local cache module.

Further, the local cache structure of the working node is expressed as HashMap<HashID, Result>, where HashID is the MD5 value computed over the sub-query physical plan and Result is the result data structure held in the working node's memory.
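
As a rough illustration of how such a HashID might be derived, the sketch below serializes a physical plan to a canonical string and takes its MD5 digest. The dictionary-based plan representation and the function name `plan_hash_id` are assumptions made for this example; the patent states only that the ID is an MD5 value computed over the sub-query physical plan.

```python
import hashlib
import json


def plan_hash_id(plan: dict) -> str:
    """Compute an MD5 identification ID for a plan given as a nested dict.

    Assumption: the plan can be serialized to JSON. Keys are sorted so that
    two structurally identical plans always yield the same ID.
    """
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Two descriptions of the same hypothetical sub-query plan, with keys in
# different order.
plan_a = {
    "op": "Filter",
    "predicate": "region = 'cn-hangzhou'",
    "child": {"op": "Scan", "table": "hosts"},
}
plan_b = {
    "child": {"table": "hosts", "op": "Scan"},
    "predicate": "region = 'cn-hangzhou'",
    "op": "Filter",
}

# A plan that differs in its predicate and therefore gets a different ID.
plan_c = dict(plan_a, predicate="region = 'cn-beijing'")

same_id = plan_hash_id(plan_a) == plan_hash_id(plan_b)
different_id = plan_hash_id(plan_a) != plan_hash_id(plan_c)
```

Canonical serialization matters here: without a stable ordering, the same plan could produce different digests on different nodes, and broadcast entries would never be hit.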

Further, when a query on the working node does not hit the cache, the sub-query physical plan is executed on the working node and the execution result is written to the local cache structure HashMap<HashID, Result>; at the same time, the RPC addresses of all working nodes are obtained from the management node, and the HashMap<HashID, Result> entry is broadcast to all working nodes.

According to a second aspect of the present specification, there is provided an OLAP query optimization system based on broadcast sub-query caching, the system comprising the following modules:

Sub-query module: deployed on the working nodes of the OLAP cluster; it extracts the sub-query physical plan, hashes the physical plan to obtain an identification ID, and uses the ID to try to fetch the cached result of the sub-query physical plan from the cache module. If a cached result is found, it is loaded into the execution pipeline and execution enters the hash join operator stage; otherwise, the sub-query physical plan is executed locally, the execution result is loaded into the execution pipeline, execution enters the hash join operator stage, and the identification ID and the execution result are combined into a cache entry and written to the cache module;

Cache module: deployed on the working nodes of the OLAP cluster; it provides cache writes and parallel reads to the sub-query module. When an entry is written to the cache for the first time, it calls the broadcasting module to broadcast the entry to all working nodes of the OLAP cluster;

Broadcasting module: provides the cache broadcasting service. On receiving a cache broadcast request from the cache module, it obtains the RPC addresses of all working nodes from the management node and broadcasts the execution result of the sub-query physical plan to all working nodes.

Further, the cache module provides a cache service based on the LRU policy, and the cache structure is expressed as HashMap<HashID, Result>, where HashID is the MD5 value computed over the sub-query physical plan and Result is the result data structure held in the working node's memory.

Further, the system can be deployed on various kinds of computing node platforms, including ECS, Docker, and physical machine environments.

Compared with the prior art, the invention has the following advantages:

First, for time-series data queries, a new, finer-grained and more flexible caching algorithm is provided. It avoids the design limitation of the query result cache (Query Result Cache), can be applied directly to distributed big data analysis systems, and accelerates OLAP query performance by making full use of the distributed cluster's capacity.

Second, through the cache broadcasting mechanism, the method can be applied directly in a cluster environment, so that the sub-query cache is available on all nodes and the overall cache hit rate is improved.

Drawings

FIG. 1 is a flowchart of an OLAP query optimization method based on broadcast sub-query caching according to an embodiment of the present invention;

FIG. 2 is a block diagram of an OLAP query optimization system based on broadcast sub-query caching according to an embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to the appended drawings.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may, however, be practiced in ways other than those described herein, and persons skilled in the art will readily appreciate that it is not limited to the specific embodiments disclosed below.

As shown in FIG. 1, the OLAP query optimization method based on broadcast sub-query caching provided by this embodiment of the application can be implemented in three steps.

(1) A new caching scheme is designed that caches the local results of sub-query operators instead of SQL query results.

The caching scheme is implemented in the working nodes (Workers) of the OLAP cluster. A Worker node is responsible for executing the physical plan of the SQL. The traditional query result cache (Query Result Cache) caches SQL query results at the proxy layer (Proxy), and its cache hit rate is very low in time-series data analysis scenarios. The present scheme caches local results at the operator layer of the SQL physical plan. It is finer-grained and more flexible, can greatly improve the cache hit rate in time-series data analysis scenarios, and accelerates OLAP query performance by making full use of the distributed cluster's capacity.

(2) When the management node (Master) receives an SQL query request for time-series data, it parses the SQL into a logical plan, optimizes the logical plan into a physical plan, and splits the physical plan by operator into sub-queries and a final result aggregation query. The execution results of the sub-query physical plans are cached on every Worker node of the distributed system by means of a broadcast mechanism.
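
One way to picture the operator-level split is the sketch below. It is only an illustration under stated assumptions: the `PlanNode` tree, the operator names `SubQuery` and `SubQueryRef`, and the function `split_plan` are hypothetical, and the patent does not specify how the split is represented. The sketch detaches each sub-query subtree from the physical plan and leaves a reference in the remaining final aggregation plan.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PlanNode:
    """A node of a toy physical plan tree (hypothetical representation)."""
    op: str
    detail: str = ""
    children: List["PlanNode"] = field(default_factory=list)


def split_plan(root: PlanNode) -> Tuple[PlanNode, List[PlanNode]]:
    """Return (final aggregation plan, list of detached sub-query plans).

    Every subtree rooted at a 'SubQuery' node is detached and replaced by a
    'SubQueryRef' placeholder that stores the sub-query's index. The input
    tree is not modified.
    """
    subqueries: List[PlanNode] = []

    def rewrite(node: PlanNode) -> PlanNode:
        if node.op == "SubQuery":
            subqueries.append(node)
            return PlanNode("SubQueryRef", detail=str(len(subqueries) - 1))
        return PlanNode(node.op, node.detail, [rewrite(c) for c in node.children])

    return rewrite(root), subqueries


# Toy plan: aggregate over a hash join whose second input is a sub-query.
plan = PlanNode("Aggregate", "avg(latency) GROUP BY ts", [
    PlanNode("HashJoin", "metrics.host_id = hosts.host_id", [
        PlanNode("Scan", "metrics WHERE ts BETWEEN ? AND ?"),
        PlanNode("SubQuery", "hosts in region", [PlanNode("Scan", "hosts")]),
    ]),
])

final_plan, sub_plans = split_plan(plan)
```

In this toy plan the time-dependent scan stays in `final_plan`, while the time-independent sub-query ends up in `sub_plans`, which is the part whose result is worth caching and broadcasting.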

The operator split is applied to the physical plan, in which sub-queries are typically optimized into hash joins (HashJoin). The Worker node hashes the received sub-query physical plan to obtain an identification ID and uses that ID to try to fetch the cached result of the sub-query physical plan from the Worker node's local cache module. If a cached result is found, it is loaded into the execution pipeline (Pipeline) and execution enters the HashJoin operator stage. If no cached result is found, the sub-query physical plan is executed locally, the execution result is loaded into the Pipeline, and execution enters the HashJoin operator stage; at the same time, the identification ID of the sub-query physical plan and the execution result are combined into a cache entry HashMap<HashID, Result> and written to the Worker node's local cache module.

In the cache structure HashMap<HashID, Result>, HashID is the MD5 value computed over the sub-query physical plan, and Result is the result data structure held in the Worker node's memory.
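
The lookup-or-execute flow described above can be sketched as follows. This is a minimal single-process sketch, not the patented implementation: the plan is represented as plain text, `local_cache` is an ordinary dictionary standing in for the local cache module, and `load_subquery_result` and `fake_execute` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Row = Tuple  # a result row; the real Result structure is not specified

# Stand-in for the Worker's local cache module: HashID -> Result.
local_cache: Dict[str, List[Row]] = {}


def load_subquery_result(
    plan_text: str,
    execute: Callable[[str], List[Row]],
) -> Tuple[List[Row], bool]:
    """Return (rows, cache_hit) for a sub-query plan.

    On a hit the cached rows are returned; on a miss the plan is executed
    and the rows are written to the local cache under the plan's MD5 ID.
    Either way the rows are then ready for the hash join stage.
    """
    hash_id = hashlib.md5(plan_text.encode("utf-8")).hexdigest()
    cached = local_cache.get(hash_id)
    if cached is not None:
        return cached, True
    rows = execute(plan_text)
    local_cache[hash_id] = rows
    return rows, False


# Hypothetical executor that records how often it is actually invoked.
executions: List[str] = []


def fake_execute(plan_text: str) -> List[Row]:
    executions.append(plan_text)
    return [(1,), (2,)]


rows_first, hit_first = load_subquery_result("Scan(hosts)", fake_execute)
rows_second, hit_second = load_subquery_result("Scan(hosts)", fake_execute)
```

The second call returns the same rows without invoking the executor again, which is the saving the scheme relies on when the outer time window changes but the sub-query does not.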

(3) When a Worker node executes an SQL query for time-series data, it looks up the sub-query physical plan in its local cache. On a cache hit, it proceeds directly to the next operator. On a cache miss, it executes the sub-query operators, caches the execution result of the sub-query physical plan locally, and propagates the result to all Worker nodes of the OLAP cluster by broadcast.

When a query does not hit the cache, the sub-query physical plan is executed on the Worker node and the execution result is written to the local cache structure HashMap<HashID, Result>; at the same time, the RPC addresses of all Worker nodes are obtained from the Master node, and the HashMap<HashID, Result> entry is broadcast to all Worker nodes. In this way, the sub-query cache is available on all nodes, the overall cache hit rate is improved, and OLAP query performance is accelerated by making full use of the distributed cluster's capacity.
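
The broadcast step can be sketched in a single process as shown below. All names (`Worker`, `Master`, `receive_broadcast`, `cache_and_broadcast`, the example addresses) are hypothetical, and the direct method call stands in for the network call that a real deployment would make to each address obtained from the Master node.

```python
from typing import Dict, List


class Worker:
    """Toy Worker node holding a local HashID -> Result cache."""

    def __init__(self, address: str) -> None:
        self.address = address
        self.cache: Dict[str, list] = {}

    def receive_broadcast(self, hash_id: str, result: list) -> None:
        # Keep an existing entry if this Worker already cached the plan.
        self.cache.setdefault(hash_id, result)


class Master:
    """Toy management node that knows the address of every Worker."""

    def __init__(self) -> None:
        self.workers: Dict[str, Worker] = {}

    def register(self, worker: Worker) -> None:
        self.workers[worker.address] = worker

    def worker_addresses(self) -> List[str]:
        return list(self.workers)


def cache_and_broadcast(origin: Worker, master: Master,
                        hash_id: str, result: list) -> int:
    """Write the entry locally, then push it to every other Worker.

    Returns the number of Workers the entry was sent to.
    """
    origin.cache[hash_id] = result
    sent = 0
    for address in master.worker_addresses():
        if address == origin.address:
            continue
        # Stand-in for a remote call to the Worker at this address.
        master.workers[address].receive_broadcast(hash_id, result)
        sent += 1
    return sent


master = Master()
workers = [Worker(f"10.0.0.{i}:9000") for i in (1, 2, 3)]
for w in workers:
    master.register(w)

sent_count = cache_and_broadcast(workers[0], master, "abc123", [(1,), (2,)])
```

After the call, all three Workers hold the entry, so a later query routed to any of them hits the cache even though only the first Worker executed the sub-query.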

As shown in FIG. 2, an OLAP query optimization system based on broadcast sub-query caching according to an embodiment of the present application includes the following modules:

Sub-query module: deployed on the Worker nodes; it extracts the physical plan of the sub-query, hashes the physical plan to obtain an identification ID, and uses the ID to try to fetch the cached result of the sub-query physical plan from the cache module. If a cached result is found, it is loaded into the Pipeline and execution enters the HashJoin operator stage. If no cached result is found, the sub-query physical plan is executed locally, the execution result is loaded into the Pipeline, and execution enters the HashJoin operator stage; at the same time, the identification ID of the sub-query physical plan and the execution result are combined into a cache entry HashMap<HashID, Result> and written to the cache module.

Cache module: deployed on the Worker nodes; it provides a cache service based on the LRU policy and offers cache writes and parallel reads to the sub-query module. Its core is a hash table (HashTable) guarded by a read-write lock, and the cache structure is HashMap<HashID, Result>. When an entry is written to the cache for the first time, the broadcasting module is called to broadcast it to all Worker nodes of the OLAP cluster.
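
A cache module along these lines might look like the following sketch. It makes several assumptions that go beyond the text: the class name `SubQueryCache`, the `on_first_write` callback used to reach the broadcasting module, and the capacity parameter are all hypothetical. Python's standard library has no read-write lock, so a plain `threading.Lock` is swapped in here; it is simpler but serializes readers, unlike the read-write lock named above.

```python
import threading
from collections import OrderedDict
from typing import Callable, Optional


class SubQueryCache:
    """LRU cache from HashID to Result with a first-write hook."""

    def __init__(
        self,
        capacity: int,
        on_first_write: Optional[Callable[[str, object], None]] = None,
    ) -> None:
        self._capacity = capacity
        self._entries: "OrderedDict[str, object]" = OrderedDict()
        self._lock = threading.Lock()  # stand-in for a read-write lock
        self._on_first_write = on_first_write

    def get(self, hash_id: str) -> Optional[object]:
        with self._lock:
            if hash_id not in self._entries:
                return None
            # Mark the entry as most recently used.
            self._entries.move_to_end(hash_id)
            return self._entries[hash_id]

    def put(self, hash_id: str, result: object) -> None:
        with self._lock:
            first_write = hash_id not in self._entries
            self._entries[hash_id] = result
            self._entries.move_to_end(hash_id)
            if len(self._entries) > self._capacity:
                # Evict the least recently used entry.
                self._entries.popitem(last=False)
        # Call the hook outside the lock so a slow broadcast cannot block readers.
        if first_write and self._on_first_write is not None:
            self._on_first_write(hash_id, result)


broadcast_log = []
cache = SubQueryCache(2, on_first_write=lambda k, v: broadcast_log.append(k))
cache.put("a", [1])
cache.put("b", [2])
cache.get("a")        # "a" becomes most recently used
cache.put("c", [3])   # capacity exceeded, so "b" is evicted
cache.put("a", [1])   # not a first write, so no broadcast
```

Invoking the broadcast hook only on the first write of a key keeps repeated local writes from flooding the cluster with duplicate entries.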

Broadcasting module: provides the cache broadcasting service. On receiving a cache broadcast request from the cache module, it obtains the RPC addresses of all Worker nodes from the Master node and broadcasts the HashMap<HashID, Result> entry to all Worker nodes. In this way, the sub-query cache is available on all nodes, the overall cache hit rate is improved, and OLAP query performance is accelerated by making full use of the distributed cluster's capacity.

The embodiment of the application implements a prototype system on the Alibaba Cloud ECS platform and tests the effect of the method. Depending on the hardware used by the computing nodes, the embodiment can also be deployed on other platforms such as physical machines and Docker.

The foregoing is merely a preferred embodiment of the present invention, and the present invention has been disclosed in the above description of the preferred embodiment, but is not limited thereto. Any person skilled in the art can make many possible variations and modifications to the technical solution of the present invention or modifications to equivalent embodiments using the methods and technical contents disclosed above, without departing from the scope of the technical solution of the present invention. Therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention still fall within the scope of the technical solution of the present invention.

Claims

1. An OLAP query optimization method based on broadcast sub-query caching, comprising the steps of:

S1: when a management node receives an SQL query request for time-series data, it first parses the SQL into a logical plan, optimizes the logical plan into a physical plan, and splits the physical plan by operator into sub-queries and a final result aggregation query, and the execution results of the sub-query physical plans are cached on all working nodes of the OLAP cluster by means of a broadcast mechanism;

The working node hashes the received sub-query physical plan to obtain an identification ID and uses that ID to try to fetch the cached result of the sub-query physical plan from a local cache module; if a cached result is found, it is loaded into the execution pipeline and execution enters the hash join operator stage; otherwise, the sub-query physical plan is executed locally, the execution result is loaded into the execution pipeline, execution enters the hash join operator stage, and the identification ID of the sub-query physical plan and the execution result are combined into a cache entry and written to the local cache module;

2. The method of claim 1, wherein the local cache structure of the working node is expressed as HashMap<HashID, Result>, wherein HashID is the MD5 value computed over the sub-query physical plan, and Result is the result data structure held in the working node's memory.

3. The method according to claim 2, wherein, when a query on the working node does not hit the cache, the sub-query physical plan is executed on the working node and the execution result is written to the local cache structure HashMap<HashID, Result>; at the same time, the RPC addresses of all working nodes are obtained from the management node, and the HashMap<HashID, Result> entry is broadcast to all working nodes.

4. An OLAP query optimization system implemented using the method of any one of claims 1-3, comprising:

Sub-query module: deployed on the working nodes of the OLAP cluster; it extracts the sub-query physical plan, hashes the sub-query physical plan to obtain an identification ID, and uses the ID to try to fetch the cached result of the sub-query physical plan from a cache module; if a cached result is found, it is loaded into the execution pipeline and execution enters the hash join operator stage; otherwise, the sub-query physical plan is executed locally, the execution result is loaded into the execution pipeline, execution enters the hash join operator stage, and the identification ID of the sub-query physical plan and the execution result are combined into a cache entry and written to the cache module;

5. The system of claim 4, wherein the cache module provides a cache service based on an LRU policy, and the cache structure is expressed as HashMap<HashID, Result>, where HashID is the MD5 value computed over the sub-query physical plan, and Result is the result data structure held in the working node's memory.

6. The system of claim 4, wherein the system is capable of being deployed on various kinds of computing node platforms, including ECS, Docker, and physical machine environments.