WO2020220717A1 - Decoupling elastic data warehouse architecture - Google Patents

Authority: WIPO (PCT)
Prior art keywords: data, data warehouse, dex, spark, postgresql
Application number: PCT/CN2019/130535
Other languages: French (fr), Chinese (zh)
Inventors: 伍浩文, 白童心, 须成忠
Original assignee: 中国科学院深圳先进技术研究院 (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2020220717A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Definitions

  • The invention relates to the field of data warehouses, and in particular to a decoupled elastic data warehouse architecture.
  • Data warehouses have existed for decades, and their main architecture has evolved from the symmetric multi-processor (SMP) to the massively parallel processor (MPP).
  • Current MPP data warehouses are statically installed on a few non-shared computer nodes. This architecture cannot exploit the versatility and power of the cloud for planning and resource allocation, which prevents users and cloud providers from achieving their expected performance, quality-of-service, and budget-control goals.
  • Elasticity, that is, the ability to independently and adaptively scale system components, is the main attribute that cloud data warehouses should support.
  • Building an elastic data warehouse is not a simple task, owing to limitations in current data warehouse software design. That design assumes a symmetrical model in which each node is attached to local storage and all nodes are homogeneous. In MPP settings, this strongly coupled model benefits performance, but in cloud configurations the software design becomes an obstacle to performance and cost-effectiveness. To obtain the required elasticity, the software must support some degree of separation between computing and storage, so that more computing resources can be added when the workload requires it.
  • Azure SQL Data Warehouse is a large-scale data warehouse service available on the Microsoft Azure cloud.
  • It is based on Microsoft SQL Server and supports both relational and non-relational data.
  • It stores and accesses all its data through the Azure Blob storage service. Because of this physical separation, storage and computing are independent, and computing can even be suspended so that users pay only for storage.
  • The embodiment of the present invention provides a decoupled elastic data warehouse architecture, to at least solve the technical problem that existing data warehouses cannot separate data management from data computation.
  • A decoupled elastic data warehouse architecture, including:
  • a data warehouse front-end, which uses PostgreSQL as its basis to process incoming and outgoing data, provide control and query user interfaces, and manage the underlying storage; in the underlying-storage management, all sharding logic is implemented in a PostgreSQL extension called xschema;
  • a data warehouse back-end, used for scalable and elastic resource management and for single or concurrent queries; resource allocation is carried out in two stages: in the first stage, at installation time, the total resources are initially allocated from the cloud; in the second stage, after the cluster is set up, users can pass parameters when a session starts; Spark SQL is used as the underlying query engine;
  • Dex middleware, which includes a Dex server, a PostgreSQL adapter, and a Spark adapter, all interacting through the Dex communication API.
  • The Dex communication API provides an intermediate layer, in which:
  • the PostgreSQL adapter converts database queries, communicates with the Dex server, and then converts the responses returned from the back-end cluster;
  • the Dex server maintains the query context, monitors session state transitions, and provides Dex services through the Dex communication API;
  • the Spark adapter accepts and parses Dex requests, converts them into Spark computing tasks, and sends the results back to the Dex server once the response is ready.
  • Dex interoperability is a stateful service managed by the Dex Context in the Dex server.
  • Its internal work is driven by messages exchanged between PostgreSQL and Spark.
  • The Dex Context supports both single-backend and multi-backend settings.
  • In the single-backend setting, the Dex Context proxies communication between the PostgreSQL back-end process and Spark through the Dex Context API; when starting a new session, the client application first creates or reuses a Dex Context instance by submitting a connection request to the Dex server.
  • Once the Dex Context is set up, the client application can use the Dex Context API to start calling services.
  • The Dex Context also supports multiple back-ends: when multiple back-ends are connected in a single session, the Dex server refers to the Dex context manager, and each session assigns one Dex Context per back-end.
  • The PostgreSQL adapter is implemented as a PostgreSQL extension, providing a client library for the Dex communication API; it also includes internal functions for converting database queries into Dex requests and converting the results back into PostgreSQL data records.
  • The Spark adapter parses each Dex request into the corresponding Spark function, executes the task, and returns the final result.
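As a minimal illustrative sketch, the message flow among the three components above can be modeled as follows. All function and field names here are assumptions for illustration, not the patent's actual Dex API.

```python
# Hypothetical sketch of the Dex message flow; names are illustrative.

def pg_adapter_convert(sql):
    # PostgreSQL adapter: wrap a database query as a Dex request.
    return {"type": "SQL", "body": sql}

class DexServer:
    # Dex server: keeps per-session query context and brokers requests
    # between the front-end adapter and the back-end adapter.
    def __init__(self, backend):
        self.backend = backend
        self.context = {"state": "idle"}

    def handle(self, request):
        self.context["state"] = "running"
        response = self.backend(request)
        self.context["state"] = "idle"
        return response

def spark_adapter(request):
    # Spark adapter: parse the Dex request into a compute task and
    # return the response once it is ready (execution stubbed here).
    assert request["type"] in ("SQL", "UDF")
    return {"status": "ok", "rows": [("r1",), ("r2",)]}

server = DexServer(spark_adapter)
resp = server.handle(pg_adapter_convert("SELECT * FROM t"))
```

The real system exchanges these messages over the network; the stub keeps everything in-process to show only the division of responsibilities.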
  • When processing incoming and outgoing data, the data warehouse front-end can support various data sources, including local and network file systems and relational and non-relational databases, and handles specific types of data sources through data-ingestion drivers.
  • The data warehouse front-end provides a control and query user interface.
  • The user interface inherits SQL syntax and serves as a unified portal for system control and interactive queries.
  • In the underlying storage, the front-end runs a shard controller on a designated master node; all user data is stored on data nodes; through sharding, system administrators register and release data nodes, and users define partitioning schemes for distributed fact tables.
  • The front-end exposes an analysis service interface over the underlying storage so that users can start analysis workloads on the back-end; service calls are re-analyzed and re-planned in PostgreSQL.
  • The data warehouse back-end is a computer cluster managed by a software stack, with data warehouse layers designated for different functions, including resource allocation, task scheduling, and query composition.
  • Execution efficiency is determined jointly by the query optimizer and the execution framework; for many concurrently running queries, the overall efficiency requirement also involves the task scheduler.
  • The data warehouse back-end uses Spark SQL as the underlying query engine.
  • The back-end is a specific server cluster with a single Spark installation managed by YARN; multiple sessions connected to the same back-end are maintained independently, and separate Spark sessions are served by different Spark jobs.
  • The data warehouse can have multiple Spark clusters as back-ends; Spark clusters can have different sizes and configurations, and within a Spark cluster any session can specify its own resource requirements.
  • All back-end information is recorded in a shared directory, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts; back-ends can be added, deleted, or selected at the request of privileged users.
  • The data warehouse back-end includes a back-end session manager.
  • The back-end session manager shares common information with the shared directory, which maintains a superset of all active sessions; each session on the back-end session manager stores only session-specific metadata. This metadata includes important facts about the data warehouse front-end, the data warehouse back-end, and the network connections, and is used for request processing.
  • Once a front-end request is received and parsed, a request data structure of type ReqStruct is generated.
  • The session metadata is then searched to identify the inputs and functions appearing in the request; communication between the front-end and the back-end is implemented using ZeroMQ, a high-speed asynchronous network I/O library.
  • The Spark adapter is responsible for processing requests from the Dex middleware; when a front-end request is received, it is converted into Spark commands, including the issued SQL query or Spark function, together with prologue and epilogue commands that define auxiliary RDDs.
  • DuoSQL supports two types of analysis requests: SQL queries and UDF calls; each Dex request encodes its type in its header.
  • JdbcRDD serves as a standard API for importing data from remote databases; it allows parallel connections to multiple partitions of a single table. DuoRDD, a sharded JdbcRDD, enhances support for sharded databases; using DuoRDD, Spark executors can load data from multiple shard nodes in parallel.
  • The decoupled elastic data warehouse architecture in the embodiment of the present invention explores a cloud elastic data warehouse architecture by separating data management and data computing functions to achieve independent scalability.
  • The data warehouse front-end receives data, manages storage, and provides high availability.
  • The data warehouse back-end is used for data analysis queries. By separating data management and data computation, the present invention obtains elasticity within a single data warehouse.
  • Figure 1 is a schematic diagram of the architecture of DuoSQL in the present invention.
  • Figure 2 is a schematic diagram of the structure of Dex as middleware for data coordination in the present invention.
  • Figure 3 is a process diagram in which the new Spark API of the present invention extends JdbcRDD.
  • Figure 4 is a flowchart of the execution of the DuoSQL system in the present invention.
  • The present invention explores an architecture that separates data management and data computation. By separating these two parts, the system gains more flexibility and adaptability.
  • The present invention constructs a prototype system, DuoSQL, based on PostgreSQL and Spark.
  • The present invention uses the TPC-H benchmark to verify the system. Experimental results show that the decoupled architecture has great performance potential.
  • The present invention separates data management and data computing functions to achieve independent scalability, exploring a cloud elastic data warehouse architecture.
  • The data warehouse front-end (the data management unit) receives data, manages storage, and provides high availability.
  • The data warehouse back-end (the data computing unit) is used for data analysis queries.
  • The present invention obtains elasticity within a single data warehouse, a characteristic that existing systems do not have.
  • The Microsoft Azure cloud database only allows elastic processing across multiple data warehouses.
  • The invention enables the data warehouse to adapt better to constantly changing workload requirements.
  • The invention implements the architecture using an RDBMS and an in-memory cluster computing engine with SQL support.
  • The present invention builds a prototype system named DuoSQL based on PostgreSQL and Spark.
  • The present invention proposes a cloud data warehouse architecture that decouples data management and data computation, thereby achieving elasticity within a single data warehouse.
  • The prototype system DuoSQL, built on PostgreSQL and Spark, shows good performance potential in experiments.
  • The present invention discloses a decoupled elastic data warehouse architecture.
  • The present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.
  • The elastic data warehouse architecture is built on elastic configuration, mainly of the storage and computing platforms, assisted by communication middleware.
  • The elastic data warehouse is introduced from the following aspects:
  • Figure 1 shows the architecture of DuoSQL.
  • The overall system structure combines a data management front-end with a data computing back-end.
  • The architecture supports various front-end and back-end subsystems.
  • An internal operation middleware is needed to manage network connections, user sessions, request proxying, query translation, and data transmission.
  • The front-end is the data management component of the data warehouse, and its main responsibilities are as follows:
  • The front-end can support various data sources, such as local and network file systems, relational databases, and non-relational databases.
  • A data ingestion driver may be required to handle specific types of data sources.
  • The user interface inherits SQL syntax and serves as a unified portal for system control and interactive queries.
  • The back-end is where the actual analysis and computation are performed. It is a computer cluster fully managed by the software stack to ensure efficiency and elasticity.
  • Data warehouse layers are designated for different functions, such as resource allocation, task scheduling, and query composition. Together, these layers serve two purposes:
  • The central task of the back-end is to complete submitted queries as quickly as possible.
  • Execution efficiency is determined by the query optimizer and the execution framework.
  • The overall efficiency requirement also involves the task scheduler. Much work has been done to improve query efficiency for big-data analytics; the present invention builds its prototype system on this existing technology.
  • The back-end should ensure efficient processing of both single and concurrently running queries.
  • Many modern distributed computing frameworks, such as Apache Spark SQL and Apache Calcite, can be used as the back-end software foundation with the help of a qualified resource manager such as Apache YARN.
  • Middleware is the key component that supports the separation of the data warehouse. At the upper level, it provides interfaces and semantic abstractions through which the front-end and back-end communicate with each other. At the lower level, it directs message exchange and data transmission between client and server.
  • The design and implementation of the middleware should solve the following problems:
  • An RDBMS uses SQL to manage structured data, while most big data platforms use imperative language interfaces to process unstructured data. To communicate and interoperate across heterogeneous systems, a general abstraction of data models and query interfaces must first be developed.
  • DuoSQL uses PostgreSQL as the basis of the front-end.
  • The advantage that makes PostgreSQL stand out is its excellent support for writing database extensions.
  • Most of the front-end logic of DuoSQL is implemented in PostgreSQL extensions, and most user interface functions take the form of UDFs.
  • The shard controller runs on a designated master node. All user data is stored on data nodes. Through sharding, system administrators can register and release data nodes for data distribution, and users can define partitioning schemes for distributed fact tables.
  • The present invention also needs an interface for users to start analysis workloads on the back-end. Finally, by re-analyzing and re-planning in PostgreSQL, service calls can be made transparent to users.
  • The DuoSQL analysis service is invoked through a set of UDFs, including functions that connect to the back-end, run SQL queries, and call remote functions.
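The user-defined partitioning scheme mentioned above can be sketched as a simple placement function. The patent implements the real sharding logic in its xschema PostgreSQL extension, whose internals are not disclosed; the hash-based scheme below is an assumption for illustration only.

```python
# Illustrative placement of fact-table partitions onto data nodes.
# This is NOT the xschema extension's actual logic.

DATA_NODES = ["node0", "node1", "node2"]  # nodes registered via the shard controller

def stable_hash(key):
    # Simple deterministic string hash; Python's built-in hash() is
    # salted per process, so it is unsuitable for stable placement.
    h = 0
    for ch in str(key):
        h = (h * 31 + ord(ch)) % (2 ** 32)
    return h

def node_for(partition_key, nodes=DATA_NODES):
    # Map a fact-table partition key to the data node that stores it.
    return nodes[stable_hash(partition_key) % len(nodes)]

placement = {k: node_for(k) for k in ["2019-01", "2019-02", "2019-03"]}
```

A deterministic placement function of this kind lets every front-end process compute, without coordination, which data node holds a given partition.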
  • The present invention uses Dex, as shown in Figure 2, as middleware to coordinate message passing and data transmission between the front-end and the back-end.
  • Dex was originally designed as an interoperability framework connecting heterogeneous data platforms such as PostgreSQL and Spark.
  • Dex does not support a sharded front-end, nor does it fully utilize Spark's SQL query engine.
  • The present invention adapts it in DuoSQL to support both.
  • The Dex middleware is composed of three main components, the Dex server, the PostgreSQL adapter, and the Spark adapter, which interact through the Dex communication API.
  • The PostgreSQL adapter converts the database query, communicates with the Dex server, and then converts the response returned from the back-end cluster.
  • The Dex server maintains the query context, monitors session state transitions, and provides Dex services through the Dex communication API.
  • The Spark adapter accepts and parses Dex requests, converts each Dex request into a series of Spark computing tasks, and sends the result back to the Dex server once the response is ready.
  • The Dex communication API provides an intermediate layer that enables the end systems to communicate with clean isolation and abstraction.
  • Dex interoperability is a stateful service managed around the Dex Context in the Dex server. The internal work is driven by messages exchanged between PostgreSQL and Spark.
  • The Dex Context supports two settings: single backend and multiple backends. For the single-backend setting, the Dex Context uses the Dex Context API to proxy communication between the PostgreSQL back-end process and Spark. To start a new session, the client application must first create or reuse a Dex Context instance by submitting a connection request to the Dex server. Once the Dex Context is set up, the client application can use the Dex Context API to start calling services. The Dex Context also supports multiple backends.
  • The Dex server needs to be able to maintain all session states.
  • This introduces the Dex context manager, which ensures that each session assigns one Dex Context per backend.
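The create-or-reuse behavior of the Dex context manager can be sketched as a map keyed by session and backend. Class and method names here are assumptions for illustration, not the actual Dex implementation.

```python
# Hypothetical sketch of a Dex context manager: one Dex Context per
# (session, backend) pair, created on first use and reused afterwards.

class DexContext:
    def __init__(self, session_id, backend):
        self.session_id = session_id
        self.backend = backend
        self.state = "connected"

class DexContextManager:
    def __init__(self):
        self._contexts = {}

    def get_or_create(self, session_id, backend):
        # Reuse an existing context for this session/backend pair,
        # otherwise create one (mirrors "create or reuse" in the text).
        key = (session_id, backend)
        if key not in self._contexts:
            self._contexts[key] = DexContext(session_id, backend)
        return self._contexts[key]

mgr = DexContextManager()
ctx_a = mgr.get_or_create("s1", "spark-cluster-a")
ctx_b = mgr.get_or_create("s1", "spark-cluster-b")   # same session, second backend
ctx_a2 = mgr.get_or_create("s1", "spark-cluster-a")  # reused, not recreated
```

Keying on the pair rather than on the session alone is what allows a single session to hold independent contexts for multiple back-ends.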
  • The PostgreSQL adapter is implemented as a PostgreSQL extension. It serves as a client library providing the Dex communication API interface, and also includes internal functions for converting database queries into Dex requests and converting the results back into PostgreSQL data records.
  • The Spark adapter is a module in Spark that parses Dex requests into the corresponding Spark functions, executes the tasks, and returns the final results.
  • The back-end of DuoSQL uses Spark SQL as the underlying query engine.
  • The present invention always refers to the back-end as a specific server cluster, whether virtually or physically distributed.
  • For a back-end cluster there is only one corresponding Spark installation, managed by YARN. However, multiple sessions connected to the same back-end are maintained independently. On the Spark side, separate Spark sessions are served by different Spark jobs.
  • Back-end elasticity is provided at multiple levels.
  • The data warehouse can have multiple Spark clusters as back-ends.
  • Spark clusters can have different sizes and configurations.
  • DuoSQL records all back-end information in a shared directory, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts. Since DuoSQL is a decoupled system, back-ends can be added, deleted, or selected at the request of privileged users.
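The shared back-end directory can be sketched as a small catalog supporting the three privileged operations named above. All names and the permission check are illustrative assumptions; the real directory lives on the PostgreSQL master server.

```python
# Hypothetical sketch of the shared back-end directory: a catalog of
# Spark back-ends that privileged users may add, delete, or select.

class BackendDirectory:
    def __init__(self):
        self._backends = {}

    def add(self, name, config, privileged):
        if not privileged:
            raise PermissionError("only privileged users may modify backends")
        self._backends[name] = config

    def delete(self, name, privileged):
        if not privileged:
            raise PermissionError("only privileged users may modify backends")
        self._backends.pop(name, None)

    def select(self, name):
        # Selection is a read and needs no privilege in this sketch.
        return self._backends[name]

    def names(self):
        return sorted(self._backends)

directory = BackendDirectory()
directory.add("small", {"executors": 2}, privileged=True)
directory.add("large", {"executors": 16}, privileged=True)
chosen = directory.select("large")
directory.delete("small", privileged=True)
```

Because every Dex Context reads the same directory, a newly added cluster becomes visible to all sessions without restarting the front-end.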
  • Session management: the back-end session manager shares some common information with the shared directory, which maintains a superset of all active sessions. Each session on the back-end stores only session-specific metadata. This metadata includes important facts about the front-end, the back-end, and the network connections, such as shard nodes, database names, shard tables, partitions, functions available on Spark, database connection strings, and ZeroMQ handlers. The metadata is used for request processing. Once a front-end request is received and parsed, a request data structure of type ReqStruct is generated. At this point, the session metadata is looked up to identify the inputs and functions that appear in the request. Communication between the front-end and the back-end is implemented using ZeroMQ, a high-speed asynchronous network I/O library.
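A possible shape for the ReqStruct record and the metadata lookup described above is sketched below. The field names and the validation rule are assumptions; the patent does not disclose ReqStruct's actual layout.

```python
# Hypothetical ReqStruct built from a parsed front-end request and
# resolved against session-specific metadata.

from dataclasses import dataclass, field

SESSION_METADATA = {
    "shard_nodes": ["node0", "node1"],
    "database": "tpch",
    "functions": {"kmeans"},  # functions available on Spark
}

@dataclass
class ReqStruct:
    req_type: str                 # "SQL" or "UDF"
    body: str
    inputs: list = field(default_factory=list)

def build_reqstruct(req_type, body, metadata=SESSION_METADATA):
    req = ReqStruct(req_type, body)
    # Look up session metadata to validate functions named in the request.
    if req_type == "UDF" and body not in metadata["functions"]:
        raise ValueError(f"unknown function: {body}")
    # Record which shard nodes supply the request's inputs.
    req.inputs = metadata["shard_nodes"]
    return req

udf_req = build_reqstruct("UDF", "kmeans")
sql_req = build_reqstruct("SQL", "SELECT count(*) FROM lineitem")
```

Keeping only session-specific metadata in each ReqStruct keeps request processing independent of other sessions, matching the isolation described in the text.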
  • The Spark adapter is responsible for processing requests from the Dex middleware. When a front-end request is received, it may be converted into a series of Spark commands, which may include issued SQL queries or Spark functions, as well as prologue and epilogue commands that define auxiliary RDDs. DuoSQL supports two types of analysis requests: SQL queries and UDF calls. Each Dex request encodes its type in its header, and the request processor uses this information to determine how to translate the request content.
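Encoding the request type in the header, as the text describes, can be sketched with a one-byte tag. The wire format below is invented for illustration; the patent does not specify Dex's actual encoding.

```python
# Hypothetical Dex wire format: a one-byte type tag followed by the
# UTF-8 payload. The request processor dispatches on the tag.

REQ_SQL, REQ_UDF = 0x01, 0x02

def encode_request(req_type, payload):
    header = bytes([REQ_SQL if req_type == "SQL" else REQ_UDF])
    return header + payload.encode("utf-8")

def decode_request(data):
    req_type = "SQL" if data[0] == REQ_SQL else "UDF"
    return req_type, data[1:].decode("utf-8")

msg = encode_request("SQL", "SELECT count(*) FROM lineitem")
kind, body = decode_request(msg)
```

Reading the tag before the payload lets the Spark adapter pick the translation path (SQL query vs. UDF call) without parsing the body first.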
  • JdbcRDD is a standard API for importing data from remote databases.
  • One feature of JdbcRDD is that it allows parallel connections to multiple partitions of a single table, thereby achieving parallel data transmission.
  • However, JdbcRDD's parallel-connection feature does not apply to sharded databases.
  • The present invention therefore develops a new Spark API named DuoRDD, which extends JdbcRDD and enhances support for sharded databases; it is a sharded JdbcRDD.
  • Using DuoRDD, Spark executors can load data from multiple shard nodes in parallel.
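DuoRDD's core idea, loading from multiple shard nodes in parallel, can be modeled without Spark by fanning one loader out over a thread pool. The shard contents and function names below are stand-ins; the real DuoRDD extends Spark's JdbcRDD and opens a JDBC connection per shard.

```python
# Simplified model of DuoRDD-style parallel shard loading. A thread
# pool stands in for Spark executors; shard data is stubbed in memory.

from concurrent.futures import ThreadPoolExecutor

SHARDS = {
    "node0": [(1, "a"), (2, "b")],
    "node1": [(3, "c")],
    "node2": [(4, "d"), (5, "e")],
}

def load_shard(node):
    # In the real system this would open a JDBC connection to one shard.
    return SHARDS[node]

def duo_load(nodes):
    # Load all shards concurrently and flatten the partitions.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        parts = list(pool.map(load_shard, nodes))
    return [row for part in parts for row in part]

rows = duo_load(sorted(SHARDS))
```

Because `pool.map` preserves input order, the flattened result is deterministic even though the shard fetches run concurrently.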
  • In step 1, the user submits an SQL request to PostgreSQL.
  • In step 2, the interface calls the context API to execute the corresponding request. If the user submits further requests while one is executing, DuoSQL creates another context for the other requests.
  • In step 3, the middleware adapter analyzes the request initiated by the user and generates different ReqStructs according to the different functions.
  • In step 4, the Spark adapter analyzes the request sent by the middleware adapter, executes it, and returns the result; this also includes Spark requesting data from the PostgreSQL cluster.
  • DuoSQL launches the Spark jobs under the cluster management.
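The four steps above can be condensed into one illustrative pipeline. Every function here is a stub standing in for a DuoSQL component; the names are assumptions, not the system's actual API.

```python
# Hypothetical end-to-end sketch of the DuoSQL execution flow.

def context_api(request_id):
    # Step 2: create (or reuse) a context for this request.
    return {"context_id": request_id}

def middleware_adapter(sql):
    # Step 3: analyze the user's request and build a ReqStruct.
    return {"type": "SQL", "body": sql}

def spark_adapter_execute(reqstruct):
    # Step 4: execute the request on the back-end and return the result.
    return {"status": "ok", "echo": reqstruct["body"]}

def execute(sql, request_id=1):
    # Step 1: the user submits an SQL request.
    ctx = context_api(request_id)
    reqstruct = middleware_adapter(sql)
    result = spark_adapter_execute(reqstruct)
    result["context_id"] = ctx["context_id"]
    return result

out = execute("SELECT count(*) FROM orders")
```

In the real system each stub crosses a process or network boundary (PostgreSQL, Dex server, Spark cluster); the sketch shows only the order of the hand-offs.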
  • The present invention proposes an elastic data warehouse architecture by decoupling data management and computation.
  • The present invention constructs a prototype system, DuoSQL, using the Dex interoperability middleware, a sharded PostgreSQL database as the front-end, and a Spark cluster as the back-end.
  • The present invention evaluates the performance potential of DuoSQL by comparing it with standalone PostgreSQL, both with and without parallel query support.
  • The invention runs test experiments with different workloads and input types. The results show that DuoSQL not only has clear performance advantages but also excellent robustness.
  • The decoupled elastic data warehouse architecture proposed by the present invention has three important characteristics: decoupling, elasticity, and diversity.
  • Decoupling: the design of the present invention completely separates computation from storage.
  • Elasticity is the foremost consideration of the present invention.
  • Elasticity, that is, the ability to independently and adaptively scale system components, is the main property that cloud data warehouses should support.
  • The architecture can also exploit back-end diversity to diversify the data warehouse.
  • The experiments make use of Spark features such as in-memory iterative computation.
  • The back end to use can be determined by the user; different back ends are supported.
  • The OLAP data analysis benchmark TPC-H and machine learning algorithms verify the effectiveness of the system architecture of the present invention.
  • The architecture of the present invention was compared against the case without storage-compute separation, including TPC-H experiments at scale factors 30, 50, and 75.
  • For machine learning, the present invention is compared with PostgreSQL clustering algorithms under the Apache MADlib framework, using the skin_noskin and KGEE data sets from UCI.
  • The experimental results are, in general, better than those of the existing architecture.
  • the disclosed technical content can be implemented in other ways.
  • the system embodiment described above is only illustrative.
  • the division of units may be a logical function division, and there may be other divisions in actual implementation.
  • Multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • The technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods in the various embodiments of the present invention.
  • The aforementioned storage media include: USB flash disk, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, optical disk, and other media that can store program code.


Abstract

Provided is a decoupling elastic data warehouse architecture. The decoupling elastic data warehouse architecture comprises: a data warehouse front end for using PostgreSQL as a basis for the data warehouse front end, processing incoming and outgoing data, providing a control and query user interface, and managing underlying storage; a data warehouse back end for performing extensible and elastic resource management and performing single or concurrent querying; and separation data warehouse middleware for coordinating, by using Dex, message transfer and data transmission between the data warehouse front end and the data warehouse back end. By means of separating a data management function from a data calculation function, independent extensibility is achieved, and a cloud elastic data warehouse system structure is explored. The data warehouse front end receives data, manages storage and provides high availability. The data warehouse back end is used for data analysis querying. By means of separating data management from data calculation, the present data warehouse architecture can acquire elasticity from a single data warehouse.

Description

一种解耦合的弹性数据仓库架构A decoupled elastic data warehouse architecture 技术领域Technical field
本发明涉及数据仓库领域,具体而言,涉及一种解耦合的弹性数据仓库架构。The invention relates to the field of data warehouses, in particular to a decoupled elastic data warehouse architecture.
背景技术Background technique
随着云在提供共享和管理的IT基础设施方面越来越受欢迎，如今的公司迫切希望将其数据平台资产转移到云上，以减少设备、公用设施和维护支出。将数据仓库移动到云端是当今公司考虑的一种经济高效的数据管理趋势。为了充分实现经济目标，云数据仓库系统应该能够调整其资源配置，以适应不断变化的工作负载需求。然而，传统的数据仓库体系结构不够灵活，不允许按需资源控制，这严重限制了云提供商和用户的优化总成本和保持期望所需的服务质量。As the cloud becomes more and more popular in providing shared and managed IT infrastructure, companies today are eager to transfer their data platform assets to the cloud to reduce equipment, utilities, and maintenance expenses. Moving data warehouses to the cloud is a cost-effective data management trend considered by companies today. In order to fully achieve economic goals, a cloud data warehouse system should be able to adjust its resource allocation to adapt to changing workload requirements. However, the traditional data warehouse architecture is not flexible enough to allow on-demand resource control, which severely limits the ability of cloud providers and users to optimize total cost and maintain the desired quality of service.
数据仓库已经存在几十年了,它的主要架构已经从对称多处理器(SMP)转变为大规模并行处理器(MPP)。然而,云计算和大数据的出现需要一种新的范式变革,这种变革比以前的范式更为紧迫和具有破坏性。目前的MPP数据仓库静态地安装在少数不共享的计算机节点上。这种架构无法利用云的多功能和强大的功能进行计划和资源分配,从而阻碍用户和云提供商实现预期的性能、服务质量和预算控制目标。The data warehouse has existed for decades, and its main architecture has changed from symmetric multi-processor (SMP) to massively parallel processor (MPP). However, the emergence of cloud computing and big data requires a new paradigm change, which is more urgent and destructive than the previous paradigm. The current MPP data warehouse is statically installed on a few computer nodes that are not shared. This architecture cannot make use of the multi-function and powerful functions of the cloud for planning and resource allocation, which prevents users and cloud providers from achieving the expected performance, service quality, and budget control goals.
传统上,由于良好的可扩展性,MPP在数据仓库中得到了普及。但是,这种可扩展性几乎只在安装时提供。在安装之前,工作负载和数据量的类型是非常清楚的。但对于MPP数据仓库,很难支持异构工作负载,计算密集型算法和细粒度资源管理。有了云,现代业务工作流程的波动性质是不可避免的。首先,工作流程必须处理多样化的数据源。传统上,数据仓库设置为分析从内部数据源集成的数据。在云时代,分析数据的可能性越来越大,来自各种各样 的应用程序,并且速度差别很大。此外,分析请求由外部客户按需进行。当在短时间内提交大量请求时,系统承受着很大的压力来处理它们。在这种情况下,扩展计算资源的能力对于保证服务质量至关重要。其次,分析将采用更复杂和迭代的算法,这些算法比传统的分析工作负载需要更多的计算能力。数据挖掘和机器学习方面的现代算法已经成功地实现了识别模式和发现商业数据中的有效数据,现代分析使用高级算法来推动应用,例如个性化推荐,欺诈检测和业务决策。因此,数据仓库中将运行比以往更多的CPU密集型工作负载。Traditionally, due to good scalability, MPP has gained popularity in data warehouses. However, this scalability is almost only provided at installation time. Before installation, the type of workload and data volume is very clear. But for MPP data warehouses, it is difficult to support heterogeneous workloads, computationally intensive algorithms and fine-grained resource management. With the cloud, the volatile nature of modern business workflow is inevitable. First, the workflow must deal with diverse data sources. Traditionally, data warehouses are set up to analyze data integrated from internal data sources. In the cloud era, the possibility of analyzing data is increasing, coming from a variety of applications, and the speed varies greatly. In addition, analysis requests are made on demand by external customers. When a large number of requests are submitted in a short period of time, the system is under great pressure to process them. In this case, the ability to expand computing resources is essential to ensure service quality. Second, the analysis will use more complex and iterative algorithms that require more computing power than traditional analysis workloads. Modern algorithms in data mining and machine learning have succeeded in identifying patterns and discovering valid data in business data. Modern analytics use advanced algorithms to drive applications, such as personalized recommendations, fraud detection, and business decision-making. 
As a result, more CPU-intensive workloads will run in the data warehouse than ever before.
基于上述观察,可以认为弹性即能够独立自适应地扩展系统组件的能力,是云数据仓库应该支持的主要属性。但是建立一个弹性数据仓库对于当前数据仓库软件设计的限制并不是一个简单的任务,它假设一个对称模型,其中每个节点连接一个本地存储,所有节点都是同构的。在MPP设置中,使用强耦合模型处理系统会对性能产生积极影响。但在云配置中,该软件设计成为性能和成本效益的障碍。为了获得所需的弹性,软件必须支持一定程度的计算和存储分离,以便在工作负载需要时可以添加更多的计算资源。Based on the above observations, it can be considered that elasticity, that is, the ability to independently and adaptively expand system components, is the main attribute that cloud data warehouses should support. However, the limitation of the current data warehouse software design to establish an elastic data warehouse is not a simple task. It assumes a symmetrical model in which each node is connected to a local storage, and all nodes are homogeneous. In MPP settings, using a strongly coupled model to process the system will have a positive impact on performance. But in cloud configurations, the software design becomes an obstacle to performance and cost-effectiveness. In order to obtain the required flexibility, the software must support a certain degree of separation of computing and storage so that more computing resources can be added when the workload requires it.
在认识到这一阻碍后,一些数据库供应商已经开始重新设计云计算的数据仓库。Azure SQL数据仓库是Microsoft Azure云上可用的大规模数据仓库服务。Azure SQL基于Microsoft SQL服务器,适用于支持关系数据和非关系数据。Azure SQL通过Azure Blob存储服务存储和访问其所有数据。由于物理分离,不仅存储和计算独立,而且计算也可以暂停,以便用户只需为存储付费。After recognizing this obstacle, some database vendors have begun to redesign cloud computing data warehouses. Azure SQL data warehouse is a large-scale data warehouse service available on Microsoft Azure cloud. Azure SQL is based on Microsoft SQL server and is suitable for supporting relational and non-relational data. Azure SQL stores and accesses all its data through Azure Blob storage service. Due to the physical separation, not only storage and computing are independent, but computing can also be suspended so that users only pay for storage.
发明内容Summary of the invention
本发明实施例提供了一种解耦合的弹性数据仓库架构,以至少解决现有数据仓库无法将数据管理和数据计算进行分离的技术问题。The embodiment of the present invention provides a decoupled elastic data warehouse architecture to at least solve the technical problem that the existing data warehouse cannot separate data management and data calculation.
根据本发明的实施例,提供了一种解耦合的弹性数据仓库架构,包括:According to an embodiment of the present invention, a decoupled elastic data warehouse architecture is provided, including:
数据仓库前端,用于使用PostgreSQL作为数据仓库前端的基础,处理进出数据、提供控制和查询用户界面及管理底层存储;其中在管理底层存储中在名为xschema的PostgreSQL扩展中实现所有分片逻辑;The data warehouse front-end is used to use PostgreSQL as the basis of the data warehouse front-end to process incoming and outgoing data, provide control and query user interfaces, and manage the underlying storage; among them, all sharding logic is implemented in the PostgreSQL extension called xschema in the management of the underlying storage;
数据仓库后端，用于可扩展和弹性的资源管理、单个或并发查询；其中资源分配分两个阶段进行，第一个阶段是在安装时，总资源最初是从云端分配；第二个阶段是在集群设置之后，用户可以在会话启动时传递参数；并使用Spark SQL作为底层查询引擎；Data warehouse back end, used for scalable and elastic resource management and for single or concurrent queries; resource allocation is carried out in two stages: the first stage is at installation time, when total resources are initially allocated from the cloud; the second stage is after the cluster is set up, when users can pass parameters at session start; Spark SQL is used as the underlying query engine;
分离数据仓库中间件,用于使用Dex协调数据仓库前端和数据仓库后端之间的消息传递和数据传输。Separate data warehouse middleware and use Dex to coordinate message and data transmission between the front end of the data warehouse and the back end of the data warehouse.
进一步地,Dex中间件包括:Dex服务器、PostgreSQL适配器和Spark适配器,并通过Dex通信API运行,Dex通信API提供了一个中间层,其中:Further, Dex middleware includes: Dex server, PostgreSQL adapter and Spark adapter, and runs through the Dex communication API. The Dex communication API provides an intermediate layer, including:
PostgreSQL适配器用于转换数据库查询,与Dex服务器通信,然后从后端群集转换返回的响应;The PostgreSQL adapter is used to convert database queries, communicate with Dex servers, and then convert the returned response from the backend cluster;
Dex服务器用于维护查询上下文,监视会话状态转换并通过Dex通信API提供Dex服务;The Dex server is used to maintain the query context, monitor the transition of the session state and provide Dex services through the Dex communication API;
Spark适配器用于接受并解析Dex requests,将Dex请求转换为Spark计算任务,一旦响应准备好,就将其发送回Dex服务器。The Spark adapter is used to accept and parse Dex requests, convert Dex requests into Spark computing tasks, and send them back to the Dex server once the response is ready.
进一步地，Dex互操作是在Dex服务器中围绕Dex Context管理的有状态服务，内部工作由PostgreSQL和Spark之间交换的消息驱动；Dex Context分别支持单后端和多后端两种设置，对于单后端设置，Dex Context通过Dex Context API代理PostgreSQL后端进程与Spark之间的通信；启动新的会话时，客户端应用程序首先通过向Dex服务器提交连接请求来创建或重用Dex Context实例，一旦设置了Dex Context，客户端应用程序就可以使用Dex Context API开始调用服务；Dex Context还支持多个后端，当在单个会话中连接多个后端时，Dex服务器引用了Dex上下文管理器，其中的每个会话都将一个Dex Context指定给一个后端；Furthermore, Dex interoperability is a stateful service managed around the Dex Context in the Dex server, whose internal work is driven by messages exchanged between PostgreSQL and Spark; the Dex Context supports two settings, single back end and multiple back ends; for the single-backend setting, the Dex Context proxies the communication between the PostgreSQL back-end process and Spark through the Dex Context API; when starting a new session, the client application first creates or reuses a Dex Context instance by submitting a connection request to the Dex server, and once the Dex Context is set up, the client application can start calling services through the Dex Context API; the Dex Context also supports multiple back ends: when multiple back ends are connected in a single session, the Dex server refers to the Dex context manager, in which each session assigns one Dex Context to one back end;
PostgreSQL适配器在PostgreSQL扩展中实现,提供Dex通信API接口的客户端库,还包括用于将数据库队列转换为Dex请求以及将结果转换回PostgreSQL数据记录的内部函数;The PostgreSQL adapter is implemented in the PostgreSQL extension, providing a client library for the Dex communication API interface, and also includes internal functions for converting database queues into Dex requests and converting results back to PostgreSQL data records;
Spark适配器将Dex请求解析为相应的Spark函数,开始执行任务并返回 最终结果。The Spark adapter parses the Dex request into the corresponding Spark function, starts to execute the task and returns the final result.
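A minimal sketch of the Dex Context management just described, in plain Python. Class and method names here are illustrative assumptions, not the patent's actual Dex API; the point is only the create-or-reuse behavior and the one-context-per-backend rule.

```python
class DexContext:
    """Stateful per-session service proxying PostgreSQL <-> Spark messages."""
    def __init__(self, session_id, backend):
        self.session_id = session_id
        self.backend = backend  # each context is pinned to one backend

class DexContextManager:
    """Referenced by the Dex server when one session talks to several backends."""
    def __init__(self):
        self._contexts = {}  # (session_id, backend) -> DexContext

    def connect(self, session_id, backend):
        # Create or reuse a context, as on session start-up.
        key = (session_id, backend)
        if key not in self._contexts:
            self._contexts[key] = DexContext(session_id, backend)
        return self._contexts[key]

mgr = DexContextManager()
ctx1 = mgr.connect("sess-1", "spark-cluster-A")
ctx2 = mgr.connect("sess-1", "spark-cluster-A")  # reused, not recreated
ctx3 = mgr.connect("sess-1", "spark-cluster-B")  # second backend, new context
print(ctx1 is ctx2, ctx1 is ctx3)  # True False
```

In the single-backend setting only the first `connect` path is exercised; the manager becomes necessary once a single session spans multiple back ends.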
进一步地,数据仓库前端处理进出数据中,对于数据集成,选择支持各种数据源,包括本地和网络文件系统、关系数据库和非关系数据库;并通过数据摄取驱动程序来处理特定类型的数据源;Further, in the data warehouse front-end processing incoming and outgoing data, for data integration, choose to support various data sources, including local and network file systems, relational databases and non-relational databases; and process specific types of data sources through data ingestion drivers;
数据仓库前端提供控制和查询用户界面中,用户界面继承了SQL语法,并作为系统控制和交互查询的统一门户;The front end of the data warehouse provides a control and query user interface. The user interface inherits the SQL syntax and serves as a unified portal for system control and interactive query;
数据仓库前端管理底层存储中通过分片控制器在指定的主节点上运行；所有用户数据都存储在数据节点上；系统管理员通过分片注册和释放数据节点，用户为分布式事实表定义分区方案；In managing the underlying storage, the data warehouse front end runs a shard controller on a designated master node; all user data is stored on data nodes; the system administrator registers and releases data nodes through sharding, and users define the partitioning scheme for distributed fact tables;
数据仓库前端管理底层存储中通过分析服务接口供用户在后端启动分析工作负载;并通过在PostgreSQL中重新进行解析和规划。The front-end management of the data warehouse uses the analysis service interface in the underlying storage for users to start the analysis workload on the back-end; and re-analyze and plan in PostgreSQL.
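The shard-placement role of the front end can be illustrated with a small sketch in pure Python. The hash-based partitioning shown is an assumed example only: the patent leaves the concrete scheme to the user-defined partition plan, and the node names are invented.

```python
import hashlib

# Data nodes registered with the shard controller by the administrator.
DATA_NODES = ["node-0", "node-1", "node-2"]

def shard_for(partition_key: str, nodes=DATA_NODES) -> str:
    # Deterministically map a fact-table row to one data node
    # (one possible user-defined partitioning scheme).
    digest = hashlib.sha256(partition_key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

keys = ["order-1001", "order-1002", "order-1003"]
placement = {k: shard_for(k) for k in keys}
print(placement)
```

Whatever scheme the user defines, the essential property is the same: placement is a pure function of the partition key, so every front-end process agrees on which data node holds a given row.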
进一步地,数据仓库后端为由软件栈管理的计算机集群,数据仓库层被指定为不同的功能,包括资源分配、任务调度和查询组合。Further, the back end of the data warehouse is a computer cluster managed by a software stack, and the data warehouse layer is designated for different functions, including resource allocation, task scheduling, and query combination.
进一步地,在数据仓库后端中对于单个查询,执行效率由查询优化器和执行框架共同决定;对于并发运行的许多查询,总体执行效率要求涉及任务调度程序。Furthermore, for a single query in the backend of the data warehouse, the execution efficiency is determined by the query optimizer and the execution framework; for many concurrently running queries, the overall execution efficiency requirement involves the task scheduler.
进一步地,数据仓库后端使用Spark SQL作为底层查询引擎中,数据仓库后端为特定服务器群集,只有一个由YARN管理的相应Spark安装,连接到同一后端的多个会话是独立维护的;单独的Spark会话由不同的Spark作业提供服务。Further, the backend of the data warehouse uses Spark SQL as the underlying query engine. The backend of the data warehouse is a specific server cluster. There is only one corresponding Spark installation managed by YARN. Multiple sessions connected to the same backend are maintained independently; Spark sessions are served by different Spark jobs.
进一步地,数据仓库有多个Spark集群作为后端;Spark集群具有不同的大小和配置;在Spark集群中,任何会话可以指定自己的资源需求;Furthermore, the data warehouse has multiple Spark clusters as backends; Spark clusters have different sizes and configurations; in a Spark cluster, any session can specify its own resource requirements;
数据仓库后端的资源管理中将所有后端信息记录在共享目录中，该共享目录存储在PostgreSQL主服务器上，并且可以被所有Dex Context引用；可以根据特权用户的请求添加、删除或选择数据仓库后端；In back-end resource management, the data warehouse records all back-end information in a shared catalog, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts; data warehouse back ends can be added, deleted, or selected at the request of privileged users;
数据仓库后端包括后端会话管理器，后端会话管理器与主目录共享公共信息，该主目录维护所有活动会话的超集；后端会话管理器上的每个会话仅存储会话特定的元数据；该元数据包括有关数据仓库前端、数据仓库后端和网络连接的重要事实；元数据将用于请求处理，一旦接收并解析了前端请求，就会生成ReqStruct类型的请求数据结构，此时需要查找会话元数据以识别请求中出现的输入和功能；数据仓库前端和数据仓库后端之间的通信是使用高速异步网络I/O库的ZeroMQ来实现。The data warehouse back end includes a back-end session manager, which shares common information with the master catalog that maintains a superset of all active sessions; each session on the back-end session manager stores only session-specific metadata; this metadata includes important facts about the data warehouse front end, the data warehouse back end, and the network connection; the metadata is used for request processing: once a front-end request is received and parsed, a request data structure of type ReqStruct is generated, at which point the session metadata must be looked up to identify the inputs and functions appearing in the request; communication between the data warehouse front end and the data warehouse back end is implemented using ZeroMQ, a high-speed asynchronous network I/O library.
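A hypothetical sketch of the shared back-end catalog and per-session metadata lookup described above, in plain Python. All field names and values are illustrative assumptions, and the real message transport would use ZeroMQ sockets rather than these in-process calls.

```python
# Shared catalog: conceptually stored on the PostgreSQL master
# and referenced by every Dex Context.
catalog = {}  # backend name -> connection info

def add_backend(name, info):
    # Privileged-user operation: register a back end.
    catalog[name] = info

def remove_backend(name):
    # Privileged-user operation: release a back end.
    catalog.pop(name, None)

# Per-session metadata kept by the back-end session manager.
sessions = {
    "sess-1": {"frontend": "pg-master:5432", "backend": "spark-A", "net": "zmq"},
}

def handle_request(session_id, req_struct):
    # After a front-end request is parsed into a ReqStruct, look up the
    # session metadata to resolve its inputs and target back end.
    meta = sessions[session_id]
    backend = catalog[meta["backend"]]
    return f"dispatch {req_struct} to {backend['url']}"

add_backend("spark-A", {"url": "spark://a:7077"})
print(handle_request("sess-1", "ReqStruct(SQL)"))
```

The split mirrors the text: global facts live in one shared catalog, while each session carries only the metadata needed to route its own requests.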
进一步地，Spark适配器负责处理来自Dex中间件的请求；当收到前端请求时，该请求会被转换为Spark命令，包括已解除的SQL查询或Spark函数，以及定义辅助RDD的pro-logue和epilogue命令；Duo SQL支持两种类型的分析请求：一种是SQL查询，另一种是UDF调用；每个Dex请求在其标头中对其类型进行编码。Furthermore, the Spark adapter is responsible for processing requests from the Dex middleware; when a front-end request is received, it is converted into Spark commands, including translated SQL queries or Spark functions, as well as prologue and epilogue commands that define auxiliary RDDs; Duo SQL supports two types of analysis requests: SQL queries and UDF calls; each Dex request encodes its type in its header.
进一步地，在Spark SQL中，JdbcRDD用于从远程数据库导入数据的标准API；JdbcRDD允许并行连接到单个表的多个分区，增强了对分片数据库的支持，是一个shard的JdbcRDD；使用DuoRDD，Spark执行程序可以并行地从多个分片节点加载。Furthermore, in Spark SQL, JdbcRDD is the standard API for importing data from remote databases; JdbcRDD allows parallel connections to multiple partitions of a single table; DuoRDD enhances support for sharded databases and is, in effect, a sharded JdbcRDD; using DuoRDD, Spark executors can load data from multiple shard nodes in parallel.
本发明实施例中的解耦合的弹性数据仓库架构,通过将数据管理和数据计算功能相分离,以实现独立的可扩展性,探索了云的弹性数据仓库体系结构。数据仓库前端接收数据、管理存储并提供高可用性。数据仓库后端用于数据分析的查询。通过分离数据管理和数据计算,本发明可以在单一的数据仓库中获得弹性。The decoupled elastic data warehouse architecture in the embodiment of the present invention explores the cloud elastic data warehouse architecture by separating data management and data computing functions to achieve independent scalability. The front end of the data warehouse receives data, manages storage, and provides high availability. The data warehouse backend is used for data analysis queries. By separating data management and data calculation, the present invention can obtain flexibility in a single data warehouse.
附图说明Description of the drawings
图1为本发明中DuoSQL的体系结构示意图;Figure 1 is a schematic diagram of the architecture of DuoSQL in the present invention;
图2为本发明中Dex作为中间件进行数据协调的结构示意图;2 is a schematic diagram of the structure of Dex as middleware for data coordination in the present invention;
图3为本发明中新Spark API扩展了JdbcRDD的过程图;Figure 3 is a process diagram in which the new Spark API of the present invention extends JdbcRDD;
图4为本发明中Duo SQL系统执行流程图。Figure 4 is a flowchart of the execution of the Duo SQL system in the present invention.
具体实施方式Detailed ways
为了使本技术领域的人员更好地理解本发明方案,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分的实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本发明保护的范围。In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only It is a part of the embodiments of the present invention, not all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
需要说明的是,本发明的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本发明的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。It should be noted that the terms "first" and "second" in the specification and claims of the present invention and the above-mentioned drawings are used to distinguish similar objects, and not necessarily used to describe a specific sequence or sequence. It should be understood that the data used in this way can be interchanged under appropriate circumstances so that the embodiments of the present invention described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations of them are intended to cover non-exclusive inclusions. For example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the clearly listed Those steps or units may include other steps or units that are not clearly listed or are inherent to these processes, methods, products, or equipment.
根据本发明的实施例,提供了一种解耦合的弹性数据仓库架构,包括:According to an embodiment of the present invention, a decoupled elastic data warehouse architecture is provided, including:
数据仓库前端,用于使用PostgreSQL作为数据仓库前端的基础,处理进出数据、提供控制和查询用户界面及管理底层存储;其中在管理底层存储中在名为xschema的PostgreSQL扩展中实现所有分片逻辑;The data warehouse front-end is used to use PostgreSQL as the basis of the data warehouse front-end to process incoming and outgoing data, provide control and query user interfaces, and manage the underlying storage; among them, all sharding logic is implemented in the PostgreSQL extension called xschema in the management of the underlying storage;
数据仓库后端，用于可扩展和弹性的资源管理、单个或并发查询；其中资源分配分两个阶段进行，第一个阶段是在安装时，总资源最初是从云端分配；第二个阶段是在集群设置之后，用户可以在会话启动时传递参数；并使用Spark SQL作为底层查询引擎；Data warehouse back end, used for scalable and elastic resource management and for single or concurrent queries; resource allocation is carried out in two stages: the first stage is at installation time, when total resources are initially allocated from the cloud; the second stage is after the cluster is set up, when users can pass parameters at session start; Spark SQL is used as the underlying query engine;
分离数据仓库中间件,用于使用Dex协调数据仓库前端和数据仓库后端之间的消息传递和数据传输。Separate data warehouse middleware and use Dex to coordinate message and data transmission between the front end of the data warehouse and the back end of the data warehouse.
本发明实施例中的解耦合的弹性数据仓库架构,通过将数据管理和数据计 算功能相分离,以实现独立的可扩展性,探索了云的弹性数据仓库体系结构。数据仓库前端接收数据、管理存储并提供高可用性。数据仓库后端用于数据分析的查询。通过分离数据管理和数据计算,本发明可以在单一的数据仓库中获得弹性。The decoupled elastic data warehouse architecture in the embodiment of the present invention explores the cloud elastic data warehouse architecture by separating data management and data calculation functions to achieve independent scalability. The front end of the data warehouse receives data, manages storage, and provides high availability. The data warehouse backend is used for data analysis queries. By separating data management and data calculation, the present invention can obtain flexibility in a single data warehouse.
作为优选的技术方案中,Dex中间件包括:Dex服务器、PostgreSQL适配器和Spark适配器,并通过Dex通信API运行,Dex通信API提供了一个中间层,其中:As a preferred technical solution, Dex middleware includes: Dex server, PostgreSQL adapter and Spark adapter, and runs through the Dex communication API. The Dex communication API provides an intermediate layer, including:
PostgreSQL适配器用于转换数据库查询,与Dex服务器通信,然后从后端群集转换返回的响应;The PostgreSQL adapter is used to convert database queries, communicate with Dex servers, and then convert the returned response from the backend cluster;
Dex服务器用于维护查询上下文,监视会话状态转换并通过Dex通信API提供Dex服务;The Dex server is used to maintain the query context, monitor the transition of the session state and provide Dex services through the Dex communication API;
Spark适配器用于接受并解析Dex requests,将Dex请求转换为Spark计算任务,一旦响应准备好,就将其发送回Dex服务器。The Spark adapter is used to accept and parse Dex requests, convert Dex requests into Spark computing tasks, and send them back to the Dex server once the response is ready.
作为优选的技术方案中，Dex互操作是在Dex服务器中围绕Dex Context管理的有状态服务，内部工作由PostgreSQL和Spark之间交换的消息驱动；Dex Context分别支持单后端和多后端两种设置，对于单后端设置，Dex Context通过Dex Context API代理PostgreSQL后端进程与Spark之间的通信；启动新的会话时，客户端应用程序首先通过向Dex服务器提交连接请求来创建或重用Dex Context实例，一旦设置了DexContext，客户端应用程序就可以使用Dex Context API开始调用服务；Dex Context还支持多个后端，当在单个会话中连接多个后端时，Dex服务器引用了Dex上下文管理器，其中的每个会话都将一个Dex Context指定给一个后端；As a preferred technical solution, Dex interoperability is a stateful service managed around the Dex Context in the Dex server, whose internal work is driven by messages exchanged between PostgreSQL and Spark; the Dex Context supports two settings, single back end and multiple back ends; for the single-backend setting, the Dex Context proxies the communication between the PostgreSQL back-end process and Spark through the Dex Context API; when starting a new session, the client application first creates or reuses a Dex Context instance by submitting a connection request to the Dex server, and once the Dex Context is set up, the client application can start calling services through the Dex Context API; the Dex Context also supports multiple back ends: when multiple back ends are connected in a single session, the Dex server refers to the Dex context manager, in which each session assigns one Dex Context to one back end;
PostgreSQL适配器在PostgreSQL扩展中实现,提供Dex通信API接口的客户端库,还包括用于将数据库队列转换为Dex请求以及将结果转换回PostgreSQL数据记录的内部函数;The PostgreSQL adapter is implemented in the PostgreSQL extension, providing a client library for the Dex communication API interface, and also includes internal functions for converting database queues into Dex requests and converting results back to PostgreSQL data records;
Spark适配器将Dex请求解析为相应的Spark函数,开始执行任务并返回最终结果。The Spark adapter parses the Dex request into the corresponding Spark function, starts to execute the task and returns the final result.
作为优选的技术方案中,数据仓库前端处理进出数据中,对于数据集成,选择支持各种数据源,包括本地和网络文件系统、关系数据库和非关系数据库;并通过数据摄取驱动程序来处理特定类型的数据源;As a preferred technical solution, the front-end of the data warehouse processes incoming and outgoing data. For data integration, it chooses to support various data sources, including local and network file systems, relational databases and non-relational databases; and handles specific types through data ingestion drivers Data source;
数据仓库前端提供控制和查询用户界面中,用户界面继承了SQL语法,并作为系统控制和交互查询的统一门户;The front end of the data warehouse provides a control and query user interface. The user interface inherits the SQL syntax and serves as a unified portal for system control and interactive query;
数据仓库前端管理底层存储中通过分片控制器在指定的主节点上运行；所有用户数据都存储在数据节点上；系统管理员通过分片注册和释放数据节点，用户为分布式事实表定义分区方案；In managing the underlying storage, the data warehouse front end runs a shard controller on a designated master node; all user data is stored on data nodes; the system administrator registers and releases data nodes through sharding, and users define the partitioning scheme for distributed fact tables;
数据仓库前端管理底层存储中通过分析服务接口供用户在后端启动分析工作负载;并通过在PostgreSQL中重新进行解析和规划。The front-end management of the data warehouse uses the analysis service interface in the underlying storage for users to start the analysis workload on the back-end; and re-analyze and plan in PostgreSQL.
作为优选的技术方案中,数据仓库后端为由软件栈管理的计算机集群,数据仓库层被指定为不同的功能,包括资源分配、任务调度和查询组合。As a preferred technical solution, the back end of the data warehouse is a computer cluster managed by a software stack, and the data warehouse layer is designated for different functions, including resource allocation, task scheduling, and query combination.
作为优选的技术方案中,在数据仓库后端中对于单个查询,执行效率由查询优化器和执行框架共同决定;对于并发运行的许多查询,总体执行效率要求涉及任务调度程序。As a preferred technical solution, for a single query in the data warehouse backend, the execution efficiency is jointly determined by the query optimizer and the execution framework; for many concurrently running queries, the overall execution efficiency requirement involves the task scheduler.
作为优选的技术方案中，数据仓库后端使用Spark SQL作为底层查询引擎中，数据仓库后端为特定服务器群集，只有一个由YARN管理的相应Spark安装，连接到同一后端的多个会话是独立维护的；单独的Spark会话由不同的Spark作业提供服务。As a preferred technical solution, the data warehouse back end uses Spark SQL as the underlying query engine; the back end is a specific server cluster with only one corresponding Spark installation managed by YARN; multiple sessions connected to the same back end are maintained independently, and separate Spark sessions are served by different Spark jobs.
In a preferred technical solution, the data warehouse has multiple Spark clusters as back ends; the Spark clusters may have different sizes and configurations; within a Spark cluster, any session can specify its own resource requirements;
In the back-end resource management of the data warehouse, all back-end information is recorded in a shared catalog, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts; back ends can be added, removed, or selected at the request of privileged users;
The data warehouse back end includes a back-end session manager, which shares common information with the master catalog that maintains a superset of all active sessions. Each session on the back-end session manager stores only session-specific metadata, covering key facts about the data warehouse front end, the data warehouse back end, and the network connections. This metadata is used for request processing: once a front-end request is received and parsed, a request data structure of type ReqStruct is generated, at which point the session metadata is looked up to identify the inputs and functions appearing in the request. Communication between the data warehouse front end and back end is implemented with ZeroMQ, a high-speed asynchronous network I/O library.
In a preferred technical solution, the Spark adapter handles requests from the Dex middleware. When a front-end request is received, it is translated into Spark commands, including the unpacked SQL query or Spark function together with prologue and epilogue commands that define auxiliary RDDs. Duo SQL supports two types of analysis request — SQL queries and UDF calls — and each Dex request encodes its type in its header.
In a preferred technical solution, JdbcRDD is the standard API in Spark SQL for importing data from remote databases; JdbcRDD allows parallel connections to multiple partitions of a single table. DuoRDD enhances this with support for sharded databases, acting in effect as a sharded JdbcRDD; using DuoRDD, Spark executors can load from multiple shard nodes in parallel.
In a specific embodiment, the present invention explores an architecture that separates data management from data computation. By splitting these two parts, the system gains more elasticity and adaptability. To realize the architecture, a prototype system, DuoSQL, was built on PostgreSQL and Spark and validated with the TPC-H benchmark. Experimental results show that the decoupled design has great performance potential.
By separating the data management and data computation functions to achieve independent scalability, the present invention explores an elastic data warehouse architecture for the cloud. The data warehouse front end (the data management unit) ingests data, manages storage, and provides high availability. The data warehouse back end (the data computation unit) serves analytical queries. By separating data management from data computation, the invention obtains elasticity within a single data warehouse — a property existing systems lack; by contrast, the Microsoft Azure cloud database only allows elastic processing across multiple data warehouses. This makes the data warehouse more adaptable to constantly changing workload demands. The architecture is implemented with an RDBMS and an in-memory cluster computing engine with SQL support; specifically, a prototype system named DuoSQL is built on PostgreSQL and Spark. In summary, the invention first proposes a cloud data warehouse architecture that decouples data management and data computation, achieving elasticity within a single data warehouse; second, it builds the DuoSQL prototype on PostgreSQL and Spark, whose experimental results show strong performance potential.
The present invention discloses a decoupled elastic data warehouse architecture. To make the purpose and technical solution of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
The elastic data warehouse architecture is based on elastic configuration, starting mainly from the elastic configuration of the storage and computation platforms, assisted by communication middleware. The elastic data warehouse is introduced in the following aspects:
1.1 Architecture:
Figure 1 shows the architecture of DuoSQL. The overall architecture combines a data management front end with a data computation back end. The architecture supports a variety of front-end and back-end subsystems, provided the chosen subsystems satisfy the design goals. To integrate the front end and the back end, an interoperation middleware is needed to manage network connections, user sessions, request brokering, query translation, and data transmission.
1.2 Internal design:
1.2.1 Front end
As shown on the left side of Figure 1, the front end is the data management component of the data warehouse. Its main responsibilities are as follows:
(1) Handling incoming and outgoing data. For data integration, the front end may choose to support various data sources, such as local and network file systems, relational databases, and non-relational databases. Data ingestion drivers may be required for specific types of data source.
(2) Providing control and query user interfaces. The user interface inherits SQL syntax and serves as a unified portal for system control and interactive queries.
(3) Managing the underlying storage. When the managed data is too large for local storage, sharding is unavoidable. For an OLAP database, sharding severely complicates query planning and execution; the complexity of query planning over sharded data is therefore shifted from the data management unit to the data computation unit.
1.2.2 Back end
The back end is where the actual analytical computation takes place. It is a computer cluster managed entirely by a software stack, to guarantee efficiency and flexibility. In this software stack, the data warehouse layers are assigned different functions, such as resource allocation, task scheduling, and query composition. Together, these layers serve the following two purposes:
(1) Scalable and elastic resource management. Resource allocation proceeds in two stages. The first stage is at installation time, when the total resources are initially allocated from the cloud. The second stage is after cluster setup, when users can pass parameters at session start, such as the number of workers, cores, and total memory. By allowing two-stage allocation, the system provides both coarse-grained and fine-grained resource elasticity.
(2) Query efficiency. The central task of the back end is to complete submitted queries as quickly as possible. For a single query, execution efficiency is determined jointly by the query optimizer and the execution framework; for many concurrently running queries, overall execution efficiency also involves the task scheduler. A great deal of work has been done to improve query efficiency in big-data analytics, and the prototype system here is built on that existing technology.
In short, the back end should ensure efficient processing of both single and concurrently running queries. With this goal, many modern distributed computing frameworks, such as Apache Spark SQL and Apache Calcite, can serve as the software foundation of the back end with the help of a capable resource manager such as Apache YARN.
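The two-stage resource allocation in (1) above can be modeled with a minimal sketch. This is an illustrative Python stand-in, not part of the DuoSQL implementation; the names (`ClusterResources`, `ResourceAllocator`, the specific capacities) are hypothetical. Stage one fixes a coarse-grained pool at install time; stage two grants fine-grained resources from that pool when a session starts.

```python
from dataclasses import dataclass

@dataclass
class ClusterResources:
    """Stage 1: coarse-grained pool allocated from the cloud at install time."""
    total_cores: int
    total_memory_gb: int

@dataclass
class SessionGrant:
    """Stage 2: fine-grained grant requested via session-start parameters."""
    workers: int
    cores: int
    memory_gb: int

class ResourceAllocator:
    """Hypothetical two-stage allocator: sessions draw from the install-time pool."""

    def __init__(self, cluster: ClusterResources):
        self.cluster = cluster
        self.used_cores = 0
        self.used_memory_gb = 0

    def start_session(self, workers: int, cores: int, memory_gb: int) -> SessionGrant:
        # Refuse grants that would exceed the remaining install-time pool.
        if (self.used_cores + cores > self.cluster.total_cores or
                self.used_memory_gb + memory_gb > self.cluster.total_memory_gb):
            raise RuntimeError("session request exceeds remaining cluster resources")
        self.used_cores += cores
        self.used_memory_gb += memory_gb
        return SessionGrant(workers, cores, memory_gb)

# Stage 1: install-time allocation from the cloud.
alloc = ResourceAllocator(ClusterResources(total_cores=64, total_memory_gb=256))
# Stage 2: per-session parameters (number of workers, cores, total memory).
grant = alloc.start_session(workers=4, cores=16, memory_gb=64)
```

A session that over-asks is rejected rather than queued in this sketch; a real resource manager such as YARN would instead schedule it against the cluster's capacity.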
1.2.3 Middleware
The middleware is the key component supporting a separated data warehouse. At the upper level, it provides the interfaces and semantic abstractions through which the front end and back end communicate with each other. At the lower level, it directs message exchange and data transmission between client and server. The design and implementation of the middleware should address the following issues:
(1) Data and interface abstraction. RDBMSs manage structured data with SQL, while most big-data platforms process unstructured data through imperative language interfaces. Communicating and interoperating across such heterogeneous systems first requires a common abstraction of the data model and the query interface.
(2) Large data sets. To manage storage in a scalable way and facilitate queries over related tables, data sets are usually sharded or partitioned on key columns and distributed across multiple servers. The implication for the middleware is that data partitioning (possibly with different partitioning schemes) must be handled cooperatively at every layer, from data abstraction and communication protocol down to the underlying data transmission mechanism.
(3) Data transmission. Large data sets challenge not only storage management but also data transmission over the network. During analytical computation, the back end needs to transfer data from the front-end cluster, and network I/O can easily become the bottleneck in this process.
The present invention is described in detail below through a specific implementation process.
2.1 Front end
Duo SQL uses PostgreSQL as the foundation of the front end. The advantage that makes PostgreSQL stand out is its excellent support for writing database extensions: in fact, most of Duo SQL's front-end logic is implemented in PostgreSQL extensions, and most user interface functions take the form of UDFs.
Sharding: as mentioned earlier, managed storage of large data sets requires sharding support. Although PostgreSQL is strictly a relational database and does not support sharding natively, several open-source sharding extensions can serve as references; specifically, the present solution is based on pg_shardman. All sharding logic is implemented in a PostgreSQL extension named xschema, inside which a set of shard controller functions and shard catalog tables are defined as follows.
FUNCTIONS
xschema.add_data_node
xschema.remove_data_node
xschema.partition_table
xschema.rebalance_partitions
CATALOG
xschema.data_nodes
xschema.data_tables
xschema.data_partitions
xschema.data_replicas
The shard controller runs on a designated master node. All user data is stored on the data nodes. Through sharding, system administrators can register and release data nodes for data placement, while users can define partitioning schemes for distributed fact tables.
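The division of labor above — the administrator registers data nodes, the user defines a partition scheme, and the catalog records where each partition lives — can be sketched as a toy model. This Python class only mirrors the roles of the xschema functions and catalog tables listed above; the hash routing and round-robin placement are illustrative assumptions, not the patent's actual logic.

```python
import hashlib

class ShardController:
    """Hypothetical model of the xschema shard controller and its catalogs."""

    def __init__(self):
        self.data_nodes = []        # mirrors xschema.data_nodes
        self.data_tables = {}       # mirrors xschema.data_tables: table -> (key_col, n_parts)
        self.data_partitions = {}   # mirrors xschema.data_partitions: (table, part) -> node

    def add_data_node(self, node: str):        # admin registers a node
        self.data_nodes.append(node)

    def remove_data_node(self, node: str):     # admin releases a node
        self.data_nodes.remove(node)

    def partition_table(self, table: str, key_column: str, n_partitions: int):
        """User-defined partition scheme for a distributed fact table."""
        self.data_tables[table] = (key_column, n_partitions)
        # Assign partitions to data nodes round-robin; rebalance_partitions
        # would redistribute these assignments as nodes come and go.
        for p in range(n_partitions):
            self.data_partitions[(table, p)] = self.data_nodes[p % len(self.data_nodes)]

    def route(self, table: str, key) -> str:
        """Hash the key column value to find the owning data node."""
        _, n = self.data_tables[table]
        part = int(hashlib.md5(str(key).encode()).hexdigest(), 16) % n
        return self.data_partitions[(table, part)]

ctl = ShardController()
for n in ("dn1", "dn2", "dn3"):
    ctl.add_data_node(n)
ctl.partition_table("lineitem", key_column="l_orderkey", n_partitions=6)
node = ctl.route("lineitem", 4711)  # the data node holding this key's partition
```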
Analysis service interface: an interface is also needed through which users launch analysis workloads on the back end. Ultimately, re-parsing and re-planning in PostgreSQL can make the service calls transparent to users. Currently, the Duo SQL analysis service is invoked through a set of UDFs, including functions that connect to a back end, run SQL queries, and call remote functions.
2.2 Middleware
The present invention uses Dex, shown in Figure 2, as the middleware coordinating message passing and data transmission between the front end and the back end. Dex was originally designed as an interoperability framework connecting heterogeneous data platforms such as PostgreSQL and Spark. In the prior art, Dex supports neither a sharded front end nor full use of Spark's SQL query engine; in Duo SQL it is adapted to support both.
The Dex middleware consists of three main components — the Dex server, the PostgreSQL adapter, and the Spark adapter — running over the Dex communication API. The PostgreSQL adapter translates database queries, communicates with the Dex server, and then translates the responses returned from the back-end cluster. The Dex server maintains query contexts, monitors session state transitions, and provides Dex services through the Dex communication API. The Spark adapter accepts and parses Dex requests, converts each into a series of Spark computation tasks, and sends the response back to the Dex server once it is ready. The Dex communication API provides an intermediate layer that lets the end systems communicate with clean isolation and abstraction.
Dex Context: Dex interoperation is a stateful service managed around Dex Contexts in the Dex server. The internal workings are driven by the messages exchanged between PostgreSQL and Spark. Dex Contexts support both single-back-end and multi-back-end settings. In the single-back-end setting, the Dex Context proxies the communication between a PostgreSQL back-end process and Spark through the Dex Context API. To start a new session, a client application must first create or reuse a Dex Context instance by submitting a connection request to the Dex server; once the Dex Context is set up, the client application can start invoking services through the Dex Context API. Dex Contexts also support multiple back ends. Compared with the single-back-end case, when multiple back ends are connected within a single session, the Dex server must be able to maintain all of their states. This introduces the Dex context manager, which ensures that each session assigns one Dex Context to one back end.
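The create-or-reuse behavior and the one-context-per-back-end invariant of the Dex context manager can be captured in a few lines. This is a hedged sketch: `DexContextManager` and its keying scheme are illustrative guesses at the bookkeeping the text describes, not the real Dex implementation.

```python
import itertools

class DexContext:
    """One interoperation context: proxies one session's traffic to one back end."""
    _ids = itertools.count(1)

    def __init__(self, backend: str):
        self.context_id = next(DexContext._ids)
        self.backend = backend

class DexContextManager:
    """Hypothetical context manager: each (session, backend) pair maps to
    exactly one DexContext, reused when the same session reconnects."""

    def __init__(self):
        self._contexts = {}  # (session_id, backend) -> DexContext

    def connect(self, session_id: str, backend: str) -> DexContext:
        key = (session_id, backend)
        if key not in self._contexts:      # create a new context ...
            self._contexts[key] = DexContext(backend)
        return self._contexts[key]         # ... or reuse the existing one

mgr = DexContextManager()
c1 = mgr.connect("sess-1", "spark-cluster-A")
c2 = mgr.connect("sess-1", "spark-cluster-A")   # same pair: context reused
c3 = mgr.connect("sess-1", "spark-cluster-B")   # second back end: new context
```

Keying on the `(session, backend)` pair is what lets one session hold several back ends while the server still maintains each connection's state independently.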
PostgreSQL Adapter: the PostgreSQL adapter is implemented as a PostgreSQL extension. It serves as a client library providing the Dex communication API interface, and also includes internal functions for translating database queries into Dex requests and converting results back into PostgreSQL data records.
Spark Adapter: the Spark adapter is a module in Spark that parses Dex requests into corresponding Spark functions, launches the execution of tasks, and returns the final result.
2.3 Back end
Duo SQL's back end uses Spark SQL as the underlying query engine. Here, a back end always refers to a specific server cluster, whether virtually or physically allocated. In a back-end cluster, there is exactly one corresponding Spark installation, managed by YARN. However, multiple sessions connected to the same back end are maintained independently: in Spark terms, separate Spark sessions are served by different Spark jobs.
Elasticity: back-end elasticity is provided at multiple levels. First, the data warehouse can have multiple Spark clusters as back ends. Second, Spark clusters can have different sizes and configurations. Third, within a Spark cluster, any session can also specify its own resource requirements, such as total executor memory and number of cores.
Back-end management: Duo SQL records all back-end information in a shared catalog, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts. Since Duo SQL is a decoupled system, back ends can be added, removed, or selected at the request of privileged users.
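The shared back-end catalog described above — privileged users add, remove, or select back ends; Dex Contexts only read it — can be sketched as follows. The class, the `"admin"` privilege check, and the config fields are all hypothetical stand-ins; the real catalog lives on the PostgreSQL master and would consult database roles.

```python
class BackendCatalog:
    """Hypothetical shared back-end catalog referenced by all Dex Contexts."""

    def __init__(self):
        self._backends = {}  # backend name -> configuration dict

    def add(self, user: str, name: str, config: dict):
        self._require_privilege(user)
        self._backends[name] = config

    def remove(self, user: str, name: str):
        self._require_privilege(user)
        del self._backends[name]

    def select(self, name: str) -> dict:
        """Any context may look a back end up; no privilege needed to read."""
        return self._backends[name]

    @staticmethod
    def _require_privilege(user: str):
        # Stand-in check; a real system would consult PostgreSQL roles.
        if user != "admin":
            raise PermissionError("only privileged users may modify backends")

catalog = BackendCatalog()
catalog.add("admin", "spark-A", {"executors": 8, "executor_mem_gb": 16})
cfg = catalog.select("spark-A")
```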
Session management: the back-end session manager shares some common information with the master catalog, which maintains a superset of all active sessions. Each session on the back end stores only session-specific metadata. This metadata includes key facts about the front end, the back end, and the network connections — for example, shard nodes, database names, sharded tables, partitions, functions available on Spark, database connection strings, and ZeroMQ handlers. The metadata is used for request processing: once a front-end request is received and parsed, a request data structure of type ReqStruct is generated, at which point the session metadata is looked up to identify the inputs and functions appearing in the request. Communication between the front end and the back end is implemented with ZeroMQ, a high-speed asynchronous network I/O library.
Request processing: the Spark adapter is responsible for handling requests from the Dex middleware. When a front-end request is received, it may be translated into a series of Spark commands, possibly including the unpacked SQL query or Spark functions, together with prologue and epilogue commands that define auxiliary RDDs. Duo SQL supports two types of analysis request: SQL queries and UDF calls. Each Dex request encodes its type in its header, and the request processor uses this information to decide how to translate the request content.
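One plausible shape for "each Dex request encodes its type in its header" is a type tag followed by a length-prefixed payload, from which the adapter builds a ReqStruct-like structure and dispatches on the type. The wire format below (one type byte, a 4-byte length, then UTF-8 text) is an illustrative assumption, not the actual Dex protocol.

```python
import struct

REQ_SQL, REQ_UDF = 1, 2  # the two analysis request types Duo SQL supports

def encode_request(req_type: int, body: str) -> bytes:
    """Hypothetical wire format: 1-byte type tag + 4-byte big-endian length
    + UTF-8 payload, so the type is readable from the header alone."""
    payload = body.encode("utf-8")
    return struct.pack("!BI", req_type, len(payload)) + payload

def parse_request(msg: bytes) -> dict:
    """Build a ReqStruct-like dict; the adapter dispatches on req['type']."""
    req_type, length = struct.unpack("!BI", msg[:5])
    body = msg[5:5 + length].decode("utf-8")
    return {"type": "sql" if req_type == REQ_SQL else "udf", "body": body}

msg = encode_request(REQ_SQL, "SELECT count(*) FROM lineitem")
req = parse_request(msg)  # {'type': 'sql', 'body': 'SELECT count(*) FROM lineitem'}
```

With the type resolved, a SQL request would be handed to Spark SQL directly, while a UDF request would be mapped to the corresponding Spark function.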
Parallel data transmission: a decoupled system like Duo SQL almost always relies on bulk data transmission over the network, and for analyzing large data sets, parallel transmission can significantly reduce the total execution time. In Spark SQL, JdbcRDD is the standard API for importing data from remote databases. One feature of JdbcRDD is that it allows parallel connections to multiple partitions of a single table, enabling parallel data transmission; however, this parallel connection feature does not work for sharded databases. To overcome this challenge, as shown in Figure 3, a new Spark API named DuoRDD was developed. It extends JdbcRDD with support for sharded databases — in effect, a sharded JdbcRDD. Using DuoRDD, Spark executors can load from multiple shard nodes in parallel.
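The gain DuoRDD targets — fetching partitions from several shard nodes concurrently instead of one node at a time — can be illustrated with a thread-pool sketch. This is not DuoRDD (which is a Scala RDD inside Spark); `fetch_partition` is a stand-in for the per-partition JDBC read each executor would perform.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_partition(node: str, table: str, part: int) -> list:
    """Stand-in for a JDBC fetch of one partition from one shard node;
    a real DuoRDD partition would open a database connection here."""
    return [f"{table}:{node}:p{part}:row{i}" for i in range(3)]

def parallel_load(assignments: list) -> list:
    """Load all (node, table, partition) assignments concurrently, mimicking
    how DuoRDD lets Spark executors pull from multiple shard nodes at once."""
    with ThreadPoolExecutor(max_workers=len(assignments)) as pool:
        chunks = pool.map(lambda a: fetch_partition(*a), assignments)
    rows = []
    for chunk in chunks:
        rows.extend(chunk)
    return rows

# Three partitions of one table, each living on a different shard node.
rows = parallel_load([("dn1", "orders", 0), ("dn2", "orders", 1), ("dn3", "orders", 2)])
```

Because each fetch is network-bound, overlapping them this way is what turns total transfer time from the sum of the per-node times into roughly the maximum of them.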
Figure 4 illustrates how a user request is executed in the system. Step 1: the user submits SQL containing a PostgreSQL request. Step 2: the interface invokes the Context API to execute the corresponding request; if the user submits further requests while one is executing, Duo SQL creates another context for them. Step 3: the middleware adapter analyzes the user's request and generates different ReqStructs according to the different functions. Step 4: the Spark adapter analyzes the request sent by the middleware adapter, executes it, and returns the result; this also covers the data Spark requests from the PostgreSQL cluster. On the Spark adapter side, Duo SQL launches a long-lived Spark job on the YARN-managed cluster, and all ReqStructs are processed within this one Spark job to reduce Spark job startup time. Finally, the execution result is returned level by level until it reaches the user. The interaction between the system's components is reflected in this process: by the design of the middleware, there is no direct interaction between PostgreSQL and Spark.
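The four-step flow just described, and the fact that PostgreSQL and Spark never call each other directly, can be shown as a toy call chain. All three functions are hypothetical stand-ins for the real components; the point is the shape: every hop passes through the middleware layer.

```python
def frontend_submit(sql: str) -> str:
    """Steps 1-2: the PostgreSQL UDF layer hands the query to a Dex Context;
    the front end never talks to Spark directly."""
    return middleware_forward({"type": "sql", "body": sql})

def middleware_forward(reqstruct: dict) -> str:
    """Step 3: the middleware builds a ReqStruct and routes it to the
    Spark adapter — the only path between the two end systems."""
    return spark_execute(reqstruct)

def spark_execute(reqstruct: dict) -> str:
    """Step 4: the adapter runs the request in the long-lived Spark job and
    returns the result, which then bubbles back level by level to the user."""
    return f"result-of({reqstruct['body']})"

answer = frontend_submit("SELECT 1")
```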
The innovative technical points and beneficial effects of the present invention are at least the following:
1. The combined use of decoupling, elastic resource configuration, and back-end diversity. In the system, the front end and back end communicate through message middleware, while resources on both sides are configured elastically: on the front end through the distributed database management system, and on the back end through the YARN resource manager. Choosing different back-end systems makes different back-end features available; at present, Spark's features can be used.
2. The present invention proposes an elastic data warehouse architecture by decoupling data management and computation. Using the Dex interoperation middleware, a prototype system, Duo SQL, is built with a sharded PostgreSQL database as the front end and a Spark cluster as the back end. The performance potential of Duo SQL is evaluated by comparing it with standalone PostgreSQL, with and without parallel query support, in test experiments with different workloads and input types. The results show that Duo SQL not only has a clear performance advantage but also excellent robustness.
The decoupled elastic data warehouse architecture proposed by the present invention is distinguished by three important properties: decoupling, elasticity, and diversity. Decoupling: the design separates computation and storage completely. Elasticity — the ability to scale system components independently and adaptively — is the invention's foremost consideration and the main property a cloud data warehouse should support. In addition, back-end diversity can be exploited to diversify the data warehouse: for example, with the Spark back end used here, the experiments benefit from Spark's in-memory iterative computation, and users may choose among different back ends.
The effectiveness of the system architecture is verified with the OLAP analysis benchmark TPC-H and with machine learning algorithms. In the TPC-H experiments, the invention is compared with an architecture without storage-compute separation, covering scale factors 30, 50, and 75. For machine learning, a clustering-algorithm comparison is made against PostgreSQL under the Apache MADlib framework, using the UCI skin_noskin and KEGG data sets. The experimental results are essentially better than those of the existing architecture.
The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the superiority of any embodiment.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The system embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, units, or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment's solution.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention — in essence, the part contributing to the prior art, or all or part of the solution — may be embodied in the form of a software product stored in a storage medium and including several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage media include media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art can make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as within the protection scope of the present invention.

Claims (10)

  1. A decoupled elastic data warehouse architecture, characterized in that it comprises:
    a data warehouse front end that uses PostgreSQL as its foundation to handle incoming and outgoing data, provide control and query user interfaces, and manage the underlying storage, wherein, for managing the underlying storage, all sharding logic is implemented in a PostgreSQL extension named xschema;
    a data warehouse back end for scalable and elastic resource management and for single or concurrent queries, wherein resource allocation proceeds in two stages: in the first stage, at installation time, the total resources are initially allocated from the cloud; in the second stage, after the cluster is set up, the user passes parameters at session start; and Spark SQL is used as the underlying query engine; and
    decoupled data warehouse middleware that uses Dex to coordinate message passing and data transfer between the data warehouse front end and the data warehouse back end.
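The decoupling recited in claim 1 can be illustrated with a minimal sketch in which the middleware is the only channel between the front end and the back end. All class and method names below (`FrontEnd`, `DexMiddleware`, `BackEnd`, `forward`) are hypothetical illustrations, not identifiers from the patent:

```python
# Hypothetical sketch of the decoupled three-tier flow: the front end never
# talks to the back end directly; every query passes through the middleware.

class BackEnd:
    """Stands in for the Spark SQL cluster: executes a request, returns rows."""
    def execute(self, request):
        # A real back end would run a Spark job; here we echo a canned result.
        return {"query": request["sql"], "rows": [(1, "ok")]}

class DexMiddleware:
    """Stands in for Dex: coordinates message passing between the tiers."""
    def __init__(self, backend):
        self.backend = backend
    def forward(self, request):
        # Coordination (routing, session bookkeeping, ...) would live here.
        return self.backend.execute(request)

class FrontEnd:
    """Stands in for the PostgreSQL front end: accepts SQL, delegates execution."""
    def __init__(self, middleware):
        self.middleware = middleware
    def query(self, sql):
        return self.middleware.forward({"sql": sql})

frontend = FrontEnd(DexMiddleware(BackEnd()))
result = frontend.query("SELECT 1")
```

Because the front end holds only a reference to the middleware, either tier can be replaced or scaled independently, which is the point of the decoupling.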
  2. The decoupled elastic data warehouse architecture according to claim 1, characterized in that the Dex middleware comprises a Dex server, a PostgreSQL adapter, and a Spark adapter, and operates through the Dex communication API, which provides an intermediate layer, wherein:
    the PostgreSQL adapter translates database queries, communicates with the Dex server, and then translates the responses returned from the back-end cluster;
    the Dex server maintains query contexts, monitors session state transitions, and provides Dex services through the Dex communication API; and
    the Spark adapter accepts and parses Dex requests, converts them into Spark computing tasks, and sends each response back to the Dex server once it is ready.
  3. The decoupled elastic data warehouse architecture according to claim 2, characterized in that Dex interoperation is a stateful service managed around the Dex Context in the Dex server, whose internal work is driven by messages exchanged between PostgreSQL and Spark; the Dex Context supports both single-back-end and multi-back-end settings; in the single-back-end setting, the Dex Context proxies the communication between the PostgreSQL back-end process and Spark through the Dex Context API; when a new session is started, the client application first creates or reuses a Dex Context instance by submitting a connection request to the Dex server, and once the Dex Context is set up, the client application begins invoking services through the Dex Context API; the Dex Context also supports multiple back ends: when multiple back ends are connected within a single session, the Dex server refers to a Dex context manager, in which each session assigns one Dex Context to one back end;
    the PostgreSQL adapter is implemented as a PostgreSQL extension, provides a client library for the Dex communication API, and also includes internal functions for converting database queries into Dex requests and converting the results back into PostgreSQL data records; and
    the Spark adapter parses Dex requests into the corresponding Spark functions, starts executing the tasks, and returns the final results.
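The session handling described in claim 3 — create or reuse a Dex Context when a connection request arrives and, with multiple back ends, pin each session's context to one back end — can be sketched roughly as follows. The `DexContextManager` API below is a hypothetical illustration, not the patent's actual interface:

```python
# Hypothetical sketch of Dex Context management: one context per session,
# created on first connection and reused afterwards; in the multi-back-end
# case each context is bound to exactly one back end.

class DexContext:
    def __init__(self, session_id, backend):
        self.session_id = session_id
        self.backend = backend  # the single back end this context proxies to

class DexContextManager:
    def __init__(self, backends):
        self.backends = backends   # catalog of available back ends
        self.contexts = {}         # session_id -> DexContext

    def connect(self, session_id, backend_name):
        # Create-or-reuse: a repeated connection returns the same context
        # rather than building a new one.
        if session_id not in self.contexts:
            ctx = DexContext(session_id, self.backends[backend_name])
            self.contexts[session_id] = ctx
        return self.contexts[session_id]

manager = DexContextManager({"spark-a": "cluster-A", "spark-b": "cluster-B"})
ctx1 = manager.connect("sess-1", "spark-a")
ctx2 = manager.connect("sess-1", "spark-a")  # reused, not recreated
ctx3 = manager.connect("sess-2", "spark-b")  # different session, other back end
```

The create-or-reuse step keeps the service stateful across calls within a session, which matches the claim's description of the Dex server maintaining context rather than treating each request independently.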
  4. The decoupled elastic data warehouse architecture according to claim 1, characterized in that, in handling incoming and outgoing data, the data warehouse front end supports a variety of data sources for data integration, including local and network file systems, relational databases, and non-relational databases, and handles each specific type of data source through a data ingestion driver;
    the data warehouse front end provides a control and query user interface, which inherits SQL syntax and serves as a unified portal for system control and interactive queries;
    the data warehouse front end manages the underlying storage through a shard controller running on a designated master node; all user data is stored on data nodes; the system administrator registers and releases data nodes for sharding, and users define partitioning schemes for distributed fact tables; and
    the data warehouse front end manages the underlying storage through an analysis service interface that lets users launch analysis workloads on the back end, with parsing and planning redone in PostgreSQL.
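Claim 4's per-source-type ingestion drivers suggest a simple dispatch scheme: each supported source kind (local file, relational database, non-relational database, ...) registers a driver, and the front end routes each ingest call to the matching one. A hedged sketch with invented names follows; the registry and driver functions are illustrations, not the patent's implementation:

```python
# Hypothetical sketch of data-ingestion drivers keyed by source type.

INGESTION_DRIVERS = {}

def driver(source_type):
    """Register a function as the ingestion driver for one source type."""
    def register(fn):
        INGESTION_DRIVERS[source_type] = fn
        return fn
    return register

@driver("local_file")
def ingest_local_file(location):
    # A real driver would read and stage the file's contents.
    return f"loaded file {location}"

@driver("relational")
def ingest_relational(location):
    # A real driver would pull rows over a database connection.
    return f"loaded table {location}"

def ingest(source_type, location):
    # The front end dispatches to the driver for this kind of source.
    if source_type not in INGESTION_DRIVERS:
        raise ValueError(f"no ingestion driver for {source_type!r}")
    return INGESTION_DRIVERS[source_type](location)

out = ingest("relational", "sales.fact_orders")
```

Adding support for a new source type then means registering one more driver, without touching the dispatch logic.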
  5. The decoupled elastic data warehouse architecture according to claim 1, characterized in that the data warehouse back end is a computer cluster managed by a software stack, and the data warehouse layers are assigned distinct functions, including resource allocation, task scheduling, and query composition.
  6. The decoupled elastic data warehouse architecture according to claim 1, characterized in that, in the data warehouse back end, the execution efficiency of a single query is determined jointly by the query optimizer and the execution framework, while for multiple concurrently running queries the overall execution efficiency also involves the task scheduler.
  7. The decoupled elastic data warehouse architecture according to claim 3, characterized in that the data warehouse back end uses Spark SQL as the underlying query engine; a data warehouse back end is a specific server cluster with exactly one corresponding Spark installation managed by YARN; multiple sessions connected to the same back end are maintained independently; and separate Spark sessions are served by different Spark jobs.
  8. The decoupled elastic data warehouse architecture according to claim 7, characterized in that the data warehouse has multiple Spark clusters as back ends; the Spark clusters have different sizes and configurations; within a Spark cluster, any session can specify its own resource requirements;
    in the resource management of the data warehouse back end, all back-end information is recorded in a shared catalog that is stored on the PostgreSQL master server and can be referenced by all Dex Contexts; data warehouse back ends can be added, removed, or selected at the request of privileged users; and
    the data warehouse back end includes a back-end session manager, which shares common information with the master catalog that maintains a superset of all active sessions; each session on the back-end session manager stores only session-specific metadata, which includes essential facts about the data warehouse front end, the data warehouse back end, and the network connection; the metadata is used for request processing: once a front-end request is received and parsed, a request data structure of type ReqStruct is generated, at which point the session metadata is looked up to identify the inputs and functions appearing in the request; communication between the data warehouse front end and the data warehouse back end is implemented using ZeroMQ, a high-speed asynchronous network I/O library.
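Claim 8's request processing — parse a front-end request into a `ReqStruct`, then consult session-specific metadata to resolve the inputs it names — can be approximated as below. The wire format and field names are assumptions made for illustration; only the `ReqStruct` name comes from the claim:

```python
# Hypothetical sketch of request processing: a parsed front-end request
# becomes a ReqStruct, and session metadata resolves the names it refers to.
from dataclasses import dataclass

@dataclass
class ReqStruct:
    req_type: str   # type encoded in the request header (cf. claim 9)
    body: str       # the query or function-call payload
    inputs: list    # inputs resolved via session metadata

# Session-specific metadata: facts the session manager keeps per session.
SESSION_METADATA = {
    "sess-1": {"tables": {"orders": "shard://node1/orders"}},
}

def parse_request(session_id, raw):
    # Assumed wire format "TYPE|body|input1,input2": header first, then payload.
    req_type, body, inputs = raw.split("|")
    meta = SESSION_METADATA[session_id]   # session metadata lookup
    resolved = [meta["tables"][name] for name in inputs.split(",")]
    return ReqStruct(req_type, body, resolved)

req = parse_request("sess-1", "SQL|SELECT count(*) FROM orders|orders")
```

Keeping only session-specific metadata on the back end, as the claim describes, means this lookup never has to consult the full front-end catalog.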
  9. The decoupled elastic data warehouse architecture according to claim 8, characterized in that the Spark adapter is responsible for handling requests from the Dex middleware; when a front-end request is received, it is converted into Spark commands, including the offloaded SQL query or Spark function as well as prologue and epilogue commands that define auxiliary RDDs; Duo SQL supports two types of analysis requests: SQL queries and UDF calls; and each Dex request encodes its type in its header.
  10. The decoupled elastic data warehouse architecture according to claim 9, characterized in that, in Spark SQL, JdbcRDD is the standard API for importing data from a remote database; JdbcRDD allows parallel connections to multiple partitions of a single table; support for sharded databases is enhanced with a sharded JdbcRDD, DuoRDD; and, using DuoRDD, Spark executors load from multiple shard nodes in parallel.
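The DuoRDD behavior in claim 10 — executors loading from multiple shard nodes in parallel and combining the partitions — can be mimicked outside Spark with a thread pool, one worker per shard. This is an illustrative analogy only; `load_shard`, `load_all`, and the shard contents are invented:

```python
# Hypothetical analogy to DuoRDD: fetch every shard concurrently and
# concatenate the partitions, as parallel executors would.
from concurrent.futures import ThreadPoolExecutor

SHARDS = {
    "node1": [(1, "a"), (2, "b")],
    "node2": [(3, "c")],
    "node3": [(4, "d"), (5, "e")],
}

def load_shard(node):
    # A real loader would open a JDBC connection to this shard node.
    return SHARDS[node]

def load_all(nodes):
    # One task per shard node, run in parallel; map() preserves node order,
    # so the concatenated result is deterministic.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partitions = list(pool.map(load_shard, nodes))
    return [row for part in partitions for row in part]

rows = load_all(["node1", "node2", "node3"])
```

In the actual architecture each partition would stay distributed as an RDD partition rather than being collected; the point of the sketch is only the one-connection-per-shard parallelism.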
PCT/CN2019/130535 2019-04-30 2019-12-31 Decoupling elastic data warehouse architecture WO2020220717A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910362554.9A CN110162515A (en) 2019-04-30 2019-04-30 A kind of uncoupled elastic data warehouse schema
CN201910362554.9 2019-04-30

Publications (1)

Publication Number Publication Date
WO2020220717A1 true WO2020220717A1 (en) 2020-11-05

Family

ID=67633159

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130535 WO2020220717A1 (en) 2019-04-30 2019-12-31 Decoupling elastic data warehouse architecture

Country Status (2)

Country Link
CN (1) CN110162515A (en)
WO (1) WO2020220717A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162515A (en) * 2019-04-30 2019-08-23 中国科学院深圳先进技术研究院 A kind of uncoupled elastic data warehouse schema
CN111414381B (en) * 2020-03-04 2021-09-14 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN111639062B (en) * 2020-05-29 2023-07-28 京东方科技集团股份有限公司 Method, system and storage medium for one-key construction of data warehouse
CN111966727A (en) * 2020-08-12 2020-11-20 北京海致网聚信息技术有限公司 Spark and Hive based distributed OLAP (on-line analytical processing) ad hoc query method
CN114490842B (en) * 2021-12-28 2022-11-11 航天科工智慧产业发展有限公司 Interface data query method and data query engine for multi-source data
CN116401254A (en) * 2023-04-17 2023-07-07 广东数果科技有限公司 Unified storage method and device for index result data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7590623B2 (en) * 2005-01-06 2009-09-15 International Business Machines Corporation Automated management of software images for efficient resource node building within a grid environment
CN106339760A (en) * 2016-08-31 2017-01-18 湖北既济电力集团有限公司科技信息分公司 Communication cable maintenance management information system
CN106685737A (en) * 2017-02-17 2017-05-17 国网山东省电力公司信息通信公司 IMS fault analysis operation and maintenance system and method based on IP telephones and servers
CN110162515A (en) * 2019-04-30 2019-08-23 中国科学院深圳先进技术研究院 A kind of uncoupled elastic data warehouse schema

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100538650C (en) * 2008-02-01 2009-09-09 清华大学 Based on the mutual exchange method of the grid middleware of assembly
CN101546325B (en) * 2008-12-23 2012-04-18 重庆邮电大学 Grid heterogeneous data integrating method based on SOA
US10817528B2 (en) * 2015-12-15 2020-10-27 Futurewei Technologies, Inc. System and method for data warehouse engine
CN105608758B (en) * 2015-12-17 2018-03-27 山东鲁能软件技术有限公司 A kind of big data analysis platform device and method calculated based on algorithm configuration and distributed stream
US10565199B2 (en) * 2017-04-10 2020-02-18 Sap Se Massively parallel processing database middleware connector


Also Published As

Publication number Publication date
CN110162515A (en) 2019-08-23


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19926843

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19926843

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14.06.2022)
