CN116226095A - Storage-compute separation system of a shared-nothing architecture database - Google Patents

Storage-compute separation system of a shared-nothing architecture database

Info

Publication number
CN116226095A
Authority
CN
China
Prior art keywords
module
persistent data
query
node
execution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310508386.6A
Other languages
Chinese (zh)
Inventor
江大白
胡增
汪刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Applied Technology Co Ltd
Original Assignee
China Applied Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Applied Technology Co Ltd filed Critical China Applied Technology Co Ltd
Priority to CN202310508386.6A
Publication of CN116226095A
Legal status: Pending (current)

Classifications

    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/172 Caching, prefetching or hoarding of files
    • G06F16/182 Distributed file systems
    • G06F16/24569 Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/5077 Logical partitioning of resources; Management or configuration of virtualized resources
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a storage-compute separation system for a shared-nothing architecture database, comprising a cloud service control module, an elastic computing module, an elastic local temporary storage module, a persistent data storage module and a query execution module. The cloud service control module orchestrates the centralized services of end-to-end query execution; the elastic computing module accesses computing resources in the cloud service system through the virtual warehouse abstraction; the elastic local temporary storage module builds a distributed temporary storage system that serves intermediate data; the persistent data storage module stores persistent data; and the query execution module generates the required execution tasks through the cloud server and schedules them on the nodes of the virtual warehouse. The invention caches frequently read data in non-persistent local storage to reduce network traffic and improve data locality.

Description

Storage-compute separation system of a shared-nothing architecture database
Technical Field
The invention relates to the technical field of resource scheduling, and in particular to a storage-compute separation system for a shared-nothing architecture database.
Background
Conventional database systems are designed to handle repeated queries over data with predictable volumes and arrival rates, e.g., data from within an organization: transaction systems, enterprise resource planning applications, customer relationship management applications, and the like. Today, more and more data comes from uncontrolled external sources (e.g., application logs, social media, web applications, and mobile systems), resulting in ad hoc, time-varying, and unpredictable query workloads. For such workloads, a shared-nothing architecture results in high cost, inflexibility, low performance, and inefficiency, hampering production applications and cluster deployments.
The shared-nothing architecture has been the basis of traditional query execution engines and database systems, providing the underlying data storage services for today's cloud service platforms; over the past few years, shared-nothing databases have evolved to serve thousands of clients, executing millions of queries per day over petabytes of data. In such an architecture, persistent data (e.g., customer data stored in table form) is partitioned across a set of compute nodes, each of which is responsible only for its local data. This shared-nothing architecture enables the query execution engine to scale well, providing cross-job isolation and good data locality, and delivering high performance for a variety of workloads. However, these benefits come at the cost of several major drawbacks:
1. Hardware does not match the workload: the shared-nothing architecture makes it difficult to strike a perfect balance between the CPU, memory, storage and bandwidth resources provided by the compute nodes and the resources required by the workload. For example, a node configuration that is well suited for bandwidth-intensive, compute-light bulk loading may be ill suited for compute-intensive, bandwidth-light complex queries. However, many customers wish to run mixed workloads without having to set up a separate cluster for each query type. As a result, resources must often be over-provisioned to meet performance goals, which leads to low average resource utilization and higher costs.
2. Lack of elasticity: even if the hardware resources on the compute nodes match workload demands, the static parallelism and data partitioning inherent in the (inelastic) shared-nothing architecture limit its ability to adapt to data skew and time-varying workloads. For example, customer queries exhibit extremely skewed intermediate data sizes, varying by more than five orders of magnitude, while CPU requirements vary by as much as an order of magnitude within the same hour. Furthermore, the shared-nothing architecture offers no effective elasticity: the usual approach of adding or removing nodes to scale resources requires large amounts of data to be redistributed, which not only increases network bandwidth demands but also causes significant performance degradation, especially while the cluster must continue serving queries during the redistribution.
No effective solution to these problems in the related art has yet been proposed.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides a storage-compute separation system for a shared-nothing architecture database, which aims to overcome the above technical problems in the prior art.
To this end, the invention adopts the following specific technical scheme:
the storage-compute separation system comprises a cloud service control module, an elastic computing module, an elastic local temporary storage module, a persistent data storage module and a query execution module;
the cloud service control module is used for orchestrating the centralized services of end-to-end query execution;
the elastic computing module is used for accessing computing resources in the cloud service system through the virtual warehouse abstraction, and for serving clients from a pre-warmed node pool;
the elastic local temporary storage module is used for constructing a distributed temporary storage system and serving intermediate data from it;
the persistent data storage module is used for storing persistent data;
and the query execution module is used for generating the required execution tasks through the cloud server and scheduling them on the nodes of the virtual warehouse.
Further, the centralized services include access control, query optimization and planning, scheduling, transaction management, and concurrency control.
Further, the persistent data is customer data and is stored in the database in the form of tables;
the intermediate data is generated by a query operator.
Further, the elastic computing module comprises a resource access module and a node pool preheating module;
the resource access module is used for enabling a user to access computing resources in the cloud service system through the virtual warehouse abstraction;
the node pool preheating module is configured to serve clients from the pre-warmed node pool and to provide compute elasticity on a fine-grained time scale.
Further, the persistent data storage module comprises a file dividing module, an attribute compression module and a stored file query execution module;
the file dividing module is used for horizontally dividing the storage table into storage files;
the attribute compression module is used for grouping and compressing the values of the individual attributes or columns in each storage file;
the stored file query execution module is used for reading the file header of a storage file and using the per-column offsets recorded in the header to read only the columns required for query execution.
Further, generating the required execution tasks through the cloud server and scheduling them on the nodes of the virtual warehouse comprises the following steps:
the client submits the query text to the cloud server;
the cloud server receives the query text and performs query analysis, query planning and optimization to generate the required execution tasks;
the execution tasks are scheduled on the nodes of the virtual warehouse, performing read and write operations against the distributed temporary storage system and the persistent data storage module;
the cloud server tracks query progress in real time and monitors the nodes using collected performance counters;
after a node failure is detected, the queries on the nodes of the virtual warehouse are rescheduled and the query results are obtained;
and the query results are returned to the virtual warehouse and from the virtual warehouse to the client.
Further, the query execution module comprises a locality-aware task scheduling module and an unbalanced partition balancing strategy module;
the locality-aware task scheduling module is used for co-locating execution tasks with persistent data files and caching the persistent data files in the distributed temporary storage system, using a locality-aware scheduling mechanism;
the unbalanced partition balancing strategy module is used for optimizing the nodes and improving load balancing.
Further, the locality-aware task scheduling module comprises a persistent data file distribution module and an execution task scheduling module;
the persistent data file distribution module is used for assigning persistent data files to compute nodes by consistent hashing of their file names;
the execution task scheduling module is used for scheduling a task that operates on a persistent data file to the node to which that file hashes.
Further, the unbalanced partition balancing strategy module comprises a node allocation module and an optimal point acquisition module;
the node allocation module is used for letting a node take over a task from another node when the task does not meet its expected completion time, reading the persistent data files required by the task from the persistent data storage module;
the optimal point acquisition module is used for selecting, via the scheduler, optimal nodes between two extremes and scheduling individual execution tasks onto them.
Further, the two extremes are co-locating execution tasks with the cached persistent data files and placing all execution tasks on a single node.
The beneficial effects of the invention are as follows:
1. The invention decouples computation from persistent storage to achieve elasticity, disaggregates compute and storage through a distributed temporary storage system, and caches frequently read data in non-persistent local storage to reduce network traffic and improve data locality.
2. The invention serves clients from a pre-warmed node pool, achieving compute elasticity on a fine-grained time scale, so that per-hour cloud pricing becomes cost-effective and the CPU is used efficiently: remote storage is used efficiently without excessive CPU consumption.
3. The invention implements the centralized services of end-to-end query execution through the cloud service side, covering access control, query optimization and planning, scheduling, transaction management, concurrency control and the like; the cloud service side is designed and implemented as a multi-tenant, long-lived service with enough replication to achieve high availability and scalability, so that the failure of a single service node does not cause loss of state or availability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1 is a functional block diagram of a storage-compute separation system for a shared-nothing architecture database according to an embodiment of the present invention;
FIG. 2 is an overall architecture diagram of a storage-compute separation system for a shared-nothing architecture database according to an embodiment of the invention.
In the figure:
1. cloud service control module; 2. elastic computing module; 3. elastic local temporary storage module; 4. persistent data storage module; 5. query execution module.
Detailed Description
For the purpose of further illustrating the various embodiments, the present invention provides the accompanying drawings, which form part of this disclosure. They serve mainly to illustrate the embodiments and, together with the description, to explain the principles of their operation; with reference to them, a person skilled in the art will recognize other possible embodiments and advantages of the present invention.
According to an embodiment of the invention, a storage-compute separation system for a shared-nothing architecture database is provided.
The present invention is further described below with reference to the accompanying drawings and specific embodiments. As shown in FIG. 1, the storage-compute separation system for a shared-nothing architecture database according to an embodiment of the present invention comprises a cloud service control module 1, an elastic computing module 2, an elastic local temporary storage module 3, a persistent data storage module 4, and a query execution module 5.
Specifically, in a cloud service environment, coupling storage and computation causes considerable resource waste and other overhead and reduces system performance and processing speed. Separating the storage unit from the compute unit enhances the scalability of the system: a user can independently add database servers to increase processing capacity, or add storage servers to expand database capacity. It also enhances fault tolerance: under the separated architecture, single points of failure at any stage can be avoided through redundant configuration, strengthening the continuous service capability of the database system.
The cloud service control module 1 is used for orchestrating the centralized services of end-to-end query execution.
Wherein the centralized services include access control, query optimization and planning, scheduling, transaction management, and concurrency control.
Specifically, as shown in fig. 2, centralized control is performed by the cloud service: all users interact with a centralized layer named the Cloud Service (CS) layer to submit queries, and this layer is responsible for access control, query optimization and planning, scheduling, transaction management, concurrency control, and so on. The cloud service is designed and implemented as a multi-tenant, long-lived service with enough replication to achieve high availability and scalability. Thus, the failure of a single service node does not cause loss of state or availability, although some queries may fail and be transparently re-executed.
The elastic computing module 2 is configured to access computing resources in the cloud service system through the virtual warehouse abstraction, and to serve clients from the pre-warmed node pool.
The elastic computing module 2 comprises a resource access module and a node pool preheating module.
The resource access module is used for enabling the user to access the computing resources in the cloud service system through the abstraction of the virtual warehouse.
Specifically, a Virtual Warehouse (VW) abstracts the allocation of physical resources into a set of virtual machines running on the cloud service system. The node pool preheating module is configured to serve clients from the pre-warmed node pool and to provide compute elasticity on a fine-grained time scale.
Specifically, a user accesses computing resources in the cloud service system through the Virtual Warehouse (VW) abstraction. Each virtual warehouse is essentially a set of AWS EC2 instances on which customer queries are executed in a distributed fashion; customers pay for compute time based on the virtual warehouse size, and each virtual warehouse can be elastically scaled according to the customer's requirements. To support elasticity on a fine-grained time scale (e.g., tens of seconds), a pre-warmed pool of EC2 instances is maintained, which works as follows:
after a request is received, EC2 instances only need to be added to or removed from the virtual warehouse (in the case of an addition, most requests can be satisfied directly from the pre-warmed instance pool, avoiding the start-up of a new EC2 instance and reducing response time). Each virtual warehouse may run multiple concurrent queries; in fact, many clients run multiple virtual warehouses (e.g., one for data ingestion and one for executing OLAP queries).
Specifically, AWS EC2 is a cloud service provided by Amazon, and an EC2 instance is a machine configuration model within that service; after a request is received, instances only need to be added to or removed from the virtual warehouse, with most additions served directly from the warm instance pool to avoid instance start-up time. Compute elasticity is thus achieved on a fine-grained time scale.
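The warm-pool mechanism can be illustrated with a minimal sketch, assuming the hypothetical names WarmNodePool and _start_instance in place of real EC2 provisioning calls; it shows only the fast path (hand out a pre-warmed node) versus the slow path (boot a fresh instance), not an actual implementation.

```python
# Minimal sketch of a pre-warmed node pool; _start_instance() is a
# hypothetical stand-in for booting a real cloud instance.
import collections
import uuid


class WarmNodePool:
    """Keeps pre-started nodes so a virtual warehouse can grow in seconds."""

    def __init__(self, target_warm: int = 4):
        self.target_warm = target_warm
        self.warm = collections.deque(self._start_instance() for _ in range(target_warm))

    def _start_instance(self) -> str:
        # Hypothetical slow path: in reality this would boot a fresh instance.
        return f"node-{uuid.uuid4().hex[:8]}"

    def acquire(self) -> str:
        # Fast path: hand out a pre-warmed node; slow path: boot a new one.
        node = self.warm.popleft() if self.warm else self._start_instance()
        self._replenish()
        return node

    def release(self, node: str) -> None:
        # A node removed from a warehouse returns to the warm pool.
        self.warm.append(node)

    def _replenish(self) -> None:
        # Keep the pool topped up so later requests also hit the fast path.
        while len(self.warm) < self.target_warm:
            self.warm.append(self._start_instance())


pool = WarmNodePool()
warehouse = [pool.acquire() for _ in range(3)]  # scale up within seconds
pool.release(warehouse.pop())                   # scale down returns a node
print(warehouse, len(pool.warm))
```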
The elastic local temporary storage module 3 is used for constructing a distributed temporary storage system and serving intermediate data from it.
Specifically, intermediate data has performance requirements different from those of persistent data, which existing persistent data stores do not meet (e.g., S3 does not provide the low latency and high throughput required for intermediate data to keep compute nodes from blocking). A distributed temporary storage system is therefore built to meet the needs of intermediate data; it is co-located with the database's compute nodes and is designed to expand automatically as nodes are added or removed. As nodes come and go, the distributed temporary storage system does not need to re-partition or reshuffle data, which relaxes a core limitation of the shared-nothing architecture. Each virtual warehouse runs an independent distributed temporary storage system used only by queries running on that particular virtual warehouse.
The persistent data storage module 4 is used for storing persistent data.
The persistent data storage module comprises a file dividing module, an attribute compression module and a stored file query execution module.
The file dividing module is used for horizontally dividing the storage table into storage files.
The attribute compression module is used for grouping and compressing the values of the individual attributes or columns in the storage file.
Specifically, the values of the individual attributes or columns in a storage file are grouped together and compressed according to user-defined rules.
The stored file query execution module is used for reading the file header of a storage file and using the per-column offsets recorded in the header to read only the columns required for query execution.
Specifically, all persistent data is stored in a remote, disaggregated persistent data store. S3 is a cloud service provided by Amazon for storing customer data; despite its relatively low latency and throughput performance, persistent data is kept in S3 because of its elasticity, high availability and durability. S3 stores immutable files: a file can only be overwritten in full, and even append operations are not allowed; however, S3 supports read requests for parts of a file. To store a table in S3, it is horizontally partitioned into large immutable files, which correspond to blocks in a conventional database system. Within each file, the values of each individual attribute or column are grouped together and compressed, and each file has a header that stores the offset of every column in the file, so that S3's partial-read capability can be used to read only the columns required for query execution. In particular, all virtual warehouses belonging to the same customer can access the same shared tables through the remote persistent store, eliminating the need to physically copy data from one virtual warehouse to another.
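This file layout can be illustrated with a minimal sketch, assuming a local file in place of an S3 object, seek()-based reads in place of S3 ranged GET requests, and a JSON-plus-zlib encoding that is purely illustrative: the header maps each column to the offset and length of its compressed block, so only the requested columns are fetched.

```python
# Minimal sketch of a columnar file whose header stores per-column offsets.
import io
import json
import struct
import zlib


def write_columnar_file(path: str, columns: dict[str, list]) -> None:
    blocks = {name: zlib.compress(json.dumps(vals).encode()) for name, vals in columns.items()}
    header, offset, body = {}, 0, io.BytesIO()
    for name, blob in blocks.items():
        header[name] = (offset, len(blob))      # where each column block lives
        body.write(blob)
        offset += len(blob)
    header_bytes = json.dumps(header).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(header_bytes)))  # 4-byte header length
        f.write(header_bytes)
        f.write(body.getvalue())


def read_column(path: str, column: str) -> list:
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<I", f.read(4))
        header = json.loads(f.read(header_len))
        offset, length = header[column]
        f.seek(4 + header_len + offset)         # the stand-in "ranged read"
        return json.loads(zlib.decompress(f.read(length)))


write_columnar_file("part-0001.dat", {"id": [1, 2, 3], "city": ["Hefei", "Wuhu", "Bengbu"]})
print(read_column("part-0001.dat", "city"))     # reads only the 'city' block
```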
The query execution module 5 is configured to generate the required execution tasks through the cloud server and to schedule them on the nodes of the virtual warehouse.
Generating the required execution tasks through the cloud server and scheduling them on the nodes of the virtual warehouse comprises the following steps (a minimal sketch of this flow follows the steps):
the client submits the query text to the cloud server;
the cloud server receives the query text and performs query analysis, query planning and optimization to generate the required execution tasks;
the execution tasks are scheduled on the nodes of the virtual warehouse, performing read and write operations against the distributed temporary storage system and the persistent data storage module 4;
the cloud server tracks query progress in real time and monitors the nodes using collected performance counters. Specifically, the cloud server is a server program deployed in the cloud that provides Internet-based services such as data storage; it continuously tracks the progress of each query, collects performance counters, and reschedules queries on the compute nodes of the virtual warehouse after detecting a node failure;
after a node failure is detected, the queries on the nodes of the virtual warehouse are rescheduled and the query results are obtained;
and the query results are returned to the virtual warehouse and from the virtual warehouse to the client.
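A minimal sketch of this flow, with plan_query() and run_task() as hypothetical stand-ins for the cloud server's planner and the warehouse nodes' executors; the simulated one-off failure of node 1 exercises the rescheduling path.

```python
# Minimal sketch of the submit -> plan -> schedule -> monitor -> return flow.

def plan_query(sql: str) -> list[str]:
    # Stand-in for query analysis, planning and optimization:
    # pretend the optimized plan is four parallel scan tasks.
    return [f"task{i}<{sql[:24]}>" for i in range(4)]


failed_once: set[int] = set()


def run_task(task: str, node: int) -> str:
    # Stand-in executor: node 1 fails the first time it is used.
    if node == 1 and node not in failed_once:
        failed_once.add(node)
        raise RuntimeError(f"node {node} failed")
    return f"{task}@node{node}"


def execute(sql: str, nodes: list[int]) -> list[str]:
    results = []
    for i, task in enumerate(plan_query(sql)):
        node = nodes[i % len(nodes)]
        try:
            results.append(run_task(task, node))
        except RuntimeError:
            # Failure surfaces via the collected performance counters;
            # reschedule the task on a surviving node of the warehouse.
            survivor = next(n for n in nodes if n != node)
            results.append(run_task(task, survivor))
    return results


# Results flow back through the virtual warehouse to the client.
print(execute("SELECT city, COUNT(*) FROM t GROUP BY city", nodes=[0, 1, 2]))
```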
The query execution module 5 comprises a locality-aware task scheduling module and an unbalanced partition balancing strategy module.
Specifically, to make full use of the distributed temporary storage system, each task is co-located with its persistent data files: files are assigned to compute nodes by consistent hashing over their file names, using a locality-aware scheduling mechanism (recall that the files may be cached in the temporary storage system). Thus, for a fixed virtual warehouse size, each persistent data file is cached on a particular node, and tasks that operate on a persistent data file are scheduled to the node to which that file consistently hashes.
In particular, the result of this scheduling scheme is that query parallelism is tightly coupled with the consistent hashing of files across nodes: a query scheduled for cache locality can be spread across all the nodes in the virtual warehouse.
Consider, for example, a customer with one million persistent data files running a virtual warehouse with 10 nodes, and two queries, the first reading 100 files and the second 100,000 files. Both queries are likely to run on all 10 nodes, because the files are consistently hashed across all 10 nodes. The volume of persistent data read and written is almost independent of the number of nodes in the virtual warehouse, whereas the intermediate data exchanged over the network grows as more nodes are used.
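A minimal sketch of this placement rule, assuming illustrative node names and a 64-virtual-node ring; each file name consistently hashes to one node, so queries of any size spread across the whole warehouse.

```python
# Minimal sketch of locality-aware placement via consistent hashing of file names.
import bisect
import hashlib


def _h(key: str) -> int:
    # Stable hash (md5) so placement survives process restarts.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes: list[str], vnodes: int = 64):
        # Each node appears vnodes times on the ring to smooth the split.
        self._ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [k for k, _ in self._ring]

    def node_for(self, filename: str) -> str:
        # First ring position at or after the file's hash (wrapping around).
        idx = bisect.bisect(self._keys, _h(filename)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing([f"node{i}" for i in range(10)])
# Whether a query touches 100 files or 100000, every file hashes to a fixed
# node, so both queries spread across all 10 nodes of the warehouse.
placement = [ring.node_for(f"file-{i:07d}") for i in range(100)]
print({n: placement.count(n) for n in sorted(set(placement))})
```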
The locality-aware task scheduling module is used for co-locating execution tasks with persistent data files and caching the persistent data files in the distributed temporary storage system, using a locality-aware scheduling mechanism.
The unbalanced partition balancing strategy module is used for optimizing the nodes and improving load balancing.
The locality-aware task scheduling module comprises a persistent data file distribution module and an execution task scheduling module.
The persistent data file distribution module is used for assigning persistent data files to compute nodes by consistent hashing of their file names.
The execution task scheduling module is used for scheduling a task that operates on a persistent data file to the node to which that file hashes.
The unbalanced partition balancing strategy module comprises a node distribution module and an optimal point acquisition module.
The node allocation module is configured to let a node take over a task from another node when the task does not meet its expected completion time, reading the persistent data files required by the task from the persistent data storage module 4.
The optimal point acquisition module is used for selecting, via the scheduler, optimal nodes between the two extremes and scheduling individual execution tasks onto them.
The two extremes are co-locating execution tasks with the cached persistent data files and placing all execution tasks on a single node.
Specifically, consistent hashing can produce unbalanced partitions. To avoid node overload and improve load balancing, an unbalanced-partition balancing strategy is used: a simple optimization that lets a node take over a task from another node if the task's expected completion time (the sum of execution time and waiting time) on the new node is lower. When this happens, the persistent data files needed by the task are read from the remote persistent data store rather than from the node on which the task was originally scheduled, which avoids adding load to that already overloaded node (task stealing only occurs when the original node is overloaded).
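The stealing rule reduces to a comparison of expected completion times; the sketch below uses an illustrative FIFO queueing estimate and timings, not the patent's actual estimator.

```python
# Minimal sketch of the task-stealing rule: steal when the expected
# completion time (execution time plus waiting time) is lower on the idle
# node, even after paying to read the file from remote persistent storage.

def expected_completion(queue_len: int, exec_time: float, remote_read_penalty: float = 0.0) -> float:
    wait = queue_len * exec_time                 # crude FIFO wait estimate
    return wait + exec_time + remote_read_penalty


def should_steal(loaded_queue: int, idle_queue: int, exec_time: float, remote_penalty: float) -> bool:
    stay = expected_completion(loaded_queue, exec_time)
    steal = expected_completion(idle_queue, exec_time, remote_penalty)
    return steal < stay


# Overloaded node (12 queued tasks) vs an idle one: even a 2.0s penalty for
# reading the file remotely is worth it; with a short queue it is not.
print(should_steal(loaded_queue=12, idle_queue=0, exec_time=1.0, remote_penalty=2.0))  # True
print(should_steal(loaded_queue=1, idle_queue=0, exec_time=1.0, remote_penalty=2.0))   # False
```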
The scheduler could place tasks on nodes using either of two extreme options. One is to co-locate tasks with the cached persistent data, which may schedule every query across all the nodes in the virtual warehouse; this policy minimizes the network traffic for reading persistent data but increases the network traffic for exchanging intermediate data. The other extreme is to place all tasks on a single node, which avoids network transfers for intermediate data exchange but increases the network traffic for reading persistent data. Neither extreme is the right choice for all queries, so a co-designed query scheduler selects the right set of nodes at an optimal point between the two extremes and then schedules the individual tasks onto those nodes.
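A minimal sketch of such a cost trade-off; the quadratic per-pair exchange overhead and the constants are illustrative assumptions rather than the patent's cost model, but they show an interior optimum between the two extremes.

```python
# Minimal sketch of choosing how many nodes to use between the two extremes:
# remote persistent reads shrink as more cache-local nodes join, while
# intermediate-exchange overhead grows with the number of node pairs.

def network_cost(n: int, total_nodes: int, persistent_gb: float, pair_overhead_gb: float) -> float:
    remote_reads = persistent_gb * (1 - n / total_nodes)  # files cached on unused nodes
    exchange = pair_overhead_gb * n * (n - 1)             # per-pair shuffle overhead
    return remote_reads + exchange


costs = {n: network_cost(n, total_nodes=10, persistent_gb=100.0, pair_overhead_gb=1.0)
         for n in range(1, 11)}
best = min(costs, key=costs.get)
print(best, costs[best])  # an interior optimum (5 nodes here), not either extreme
```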
The persistent data is customer data stored in the database in the form of tables; the intermediate data is generated by query operators.
Specifically, the persistent data is customer data, stored in the database in the form of tables. Over time each table can be read by many queries, even simultaneously, so these tables live for a long time and require strong durability and availability guarantees.
The intermediate data is generated by query operators (e.g., joins) and is typically consumed only by the nodes participating in the query, so its useful lifetime is relatively short. Furthermore, to keep nodes from blocking during intermediate data access, low-latency, high-throughput access is prioritized over strong durability; if a failure occurs during the (short) lifecycle of the intermediate data, the failed part of the query can simply be re-run.
The metadata comprises the mappings from object catalogs or database tables into persistent storage, such as the corresponding files, statistics, transaction logs and locks.
In summary, by means of the above technical solution, the invention decouples computation from persistent storage to achieve elasticity, disaggregates compute and storage through a distributed temporary storage system, and caches frequently read data in non-persistent local storage to reduce network traffic and improve data locality. The invention serves clients from a pre-warmed node pool, achieving compute elasticity on a fine-grained time scale, so that per-hour cloud pricing becomes cost-effective and the CPU is used efficiently: remote storage is used efficiently without excessive CPU consumption. The invention implements the centralized services of end-to-end query execution through the cloud service side, covering access control, query optimization and planning, scheduling, transaction management, concurrency control and the like; the cloud service side is designed and implemented as a multi-tenant, long-lived service with enough replication to achieve high availability and scalability, so that the failure of a single service node does not cause loss of state or availability.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (10)

1. A storage-compute separation system for a shared-nothing architecture database, characterized by comprising a cloud service control module, an elastic computing module, an elastic local temporary storage module, a persistent data storage module and a query execution module;
the cloud service control module is used for orchestrating the centralized services of end-to-end query execution;
the elastic computing module is used for accessing computing resources in the cloud service system through the virtual warehouse abstraction, and for serving clients from a pre-warmed node pool;
the elastic local temporary storage module is used for constructing a distributed temporary storage system and serving intermediate data from it;
the persistent data storage module is used for storing persistent data;
and the query execution module is used for generating the required execution tasks through the cloud server and scheduling them on the nodes of the virtual warehouse.
2. The system of claim 1, wherein the centralized services include access control, query optimization and planning, scheduling, transaction management, and concurrency control.
3. The storage-compute separation system for a shared-nothing architecture database according to claim 2, wherein the persistent data is customer data stored in the database in the form of tables;
the intermediate data is generated by a query operator.
4. The storage-compute separation system for a shared-nothing architecture database according to claim 3, wherein the elastic computing module comprises a resource access module and a node pool preheating module;
the resource access module is used for enabling a user to access computing resources in the cloud service system through the virtual warehouse abstraction;
the node pool preheating module is configured to serve clients from the pre-warmed node pool and to provide compute elasticity on a fine-grained time scale.
5. The system of claim 4, wherein the persistent data storage module comprises a file partitioning module, an attribute compression module, and a stored file query execution module;
the file dividing module is used for horizontally dividing the storage table into storage files;
the attribute compression module is used for grouping and compressing the values of the individual attributes or columns in the storage file;
the stored file query execution module is used for reading the file header of a storage file and using the per-column offsets recorded in the header to read only the columns required for query execution.
6. The storage-compute separation system for a shared-nothing architecture database according to claim 5, wherein generating the required execution tasks through the cloud server and scheduling them on the nodes of the virtual warehouse comprises the following steps:
the client submits the query text to the cloud server;
the cloud server receives the query text and executes query analysis, query planning and optimization operations to generate required execution tasks;
scheduling the execution task on a node of the virtual warehouse, and executing read-write operation on the distributed temporary storage system and the persistent data storage module;
the cloud server tracks query progress in real time and monitors the nodes using collected performance counters;
after a node failure is detected, the queries on the nodes of the virtual warehouse are rescheduled and the query results are obtained;
and the query results are returned to the virtual warehouse and from the virtual warehouse to the client.
7. The system of claim 6, wherein the query execution module comprises a locality-aware task scheduling module and an unbalanced partition balancing strategy module;
the locality-aware task scheduling module is used for co-locating execution tasks with persistent data files and caching the persistent data files in the distributed temporary storage system, using a locality-aware scheduling mechanism;
the unbalanced partition balancing strategy module is used for optimizing the nodes and improving load balancing.
8. The system of claim 7, wherein the locality-aware task scheduling module comprises a persistent data file distribution module and an execution task scheduling module;
the persistent data file distribution module is used for assigning persistent data files to compute nodes by consistent hashing of their file names;
the execution task scheduling module is used for scheduling a task that operates on a persistent data file to the node to which that file hashes.
9. The system of claim 8, wherein the unbalanced partition balancing strategy module comprises a node allocation module and an optimal point acquisition module;
the node allocation module is used for letting a node take over a task from another node when the task does not meet its expected completion time, reading the persistent data files required by the task from the persistent data storage module;
the optimal point acquisition module is used for selecting, via the scheduler, optimal nodes between two extremes and scheduling individual execution tasks onto them.
10. The system of claim 9, wherein the two extremes are co-locating execution tasks with the cached persistent data files and placing all execution tasks on a single node.
CN202310508386.6A 2023-05-08 2023-05-08 Storage-compute separation system of a shared-nothing architecture database Pending CN116226095A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310508386.6A CN116226095A (en) Storage-compute separation system of a shared-nothing architecture database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310508386.6A CN116226095A (en) Storage-compute separation system of a shared-nothing architecture database

Publications (1)

Publication Number Publication Date
CN116226095A 2023-06-06

Family

ID=86584706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310508386.6A Pending CN116226095A (en) 2023-05-08 2023-05-08 Memory calculation separation system of shared-architecture-free database

Country Status (1)

Country Link
CN (1) CN116226095A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150234896A1 (en) * 2014-02-19 2015-08-20 Snowflake Computing Inc. Adaptive distribution method for hash operations
CN109075994A (en) * 2016-04-28 2018-12-21 斯诺弗雷克计算公司 More depot complexes
CN109923533A (en) * 2016-11-10 2019-06-21 华为技术有限公司 It will calculate and separate with storage to improve elasticity in the database
US20200026695A1 (en) * 2018-07-17 2020-01-23 Snowflake Inc. Incremental Clustering Of Database Tables
CN113261000A (en) * 2019-11-27 2021-08-13 斯诺弗雷克公司 Dynamic shared data object masking


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ACEVOLVE: "Snowflake and Delta Lake: a comparative analysis of two new data warehouses" [Snowflake、Delta Lake 两大新型数仓对比分析], Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/350958074> *
数据一哥: "Snowflake & Delta Lake: a comparative analysis of two new data warehouses" [Snowflake & Delta Lake两大新型数仓对比分析], pages 1-14, Retrieved from the Internet <URL:https://blog.csdn.net/Arvinzr/article/details/121143514> *

Similar Documents

Publication Publication Date Title
Vuppalapati et al. Building an elastic query engine on disaggregated storage
CN112534396B Journal tables in a database system
US20230385262A1 (en) System And Method For Large-Scale Data Processing Using An Application-Independent Framework
KR102441299B1 (en) Batch data collection into database system
Xue et al. Seraph: an efficient, low-cost system for concurrent graph processing
Fernandez et al. Liquid: Unifying Nearline and Offline Big Data Integration.
CN102495857B (en) Load balancing method for distributed database
Deka A survey of cloud database systems
CN107329982A Big data parallel computing method and system based on distributed columnar storage
US20060277155A1 (en) Virtual solution architecture for computer data systems
US9774676B2 (en) Storing and moving data in a distributed storage system
JP2005196602A (en) System configuration changing method in unshared type database management system
Esteves et al. Quality-of-service for consistency of data geo-replication in cloud computing
US11698886B2 (en) Cluster instance balancing of a database system across zones
US9697220B2 (en) System and method for supporting elastic data metadata compression in a distributed data grid
US11080207B2 (en) Caching framework for big-data engines in the cloud
Chandra et al. A study on cloud database
EP3818453A1 (en) System for optimizing storage replication in a distributed data analysis system using historical data access patterns
US11609910B1 (en) Automatically refreshing materialized views according to performance benefit
Poess et al. Large scale data warehouses on grid: Oracle database 10 g and HP proliant servers
CN116226095A (en) Storage-compute separation system of a shared-nothing architecture database
Schall et al. Energy and Performance: Can a Wimpy-Node Cluster Challenge a Brawny Server?
Babu et al. Dynamic colocation algorithm for Hadoop
Louis Rodríguez et al. Workload management for dynamic partitioning schemes in replicated databases
US11966368B2 (en) Cluster balancing for zones of a database system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination