WO2020220717A1 - Decoupling elastic data warehouse architecture - Google Patents

Authority: WIPO (PCT)
Prior art keywords: data, data warehouse, dex, spark, postgresql
Application number: PCT/CN2019/130535
Other languages: French (fr), Chinese (zh)
Inventors: 伍浩文, 白童心, 须成忠
Original assignee: 中国科学院深圳先进技术研究院 (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2020220717A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Definitions

  • The invention relates to the field of data warehouses, and in particular to a decoupled elastic data warehouse architecture.
  • Data warehouses have existed for decades, and their main architecture has evolved from the symmetric multi-processor (SMP) to the massively parallel processor (MPP).
  • Current MPP data warehouses are statically installed on a few non-shared computer nodes. This architecture cannot exploit the versatility and power of the cloud for planning and resource allocation, which prevents users and cloud providers from achieving their expected performance, quality-of-service, and budget-control goals.
  • Elasticity, that is, the ability to independently and adaptively scale system components, is the main attribute that cloud data warehouses should support.
  • Building an elastic data warehouse is not a simple task, owing to limitations in current data warehouse software design. That design assumes a symmetrical model in which each node is attached to local storage and all nodes are homogeneous. In MPP settings, this strongly coupled model benefits performance, but in cloud configurations the software design becomes an obstacle to performance and cost-effectiveness. To obtain the required elasticity, the software must support some degree of separation between computing and storage, so that more computing resources can be added when the workload requires it.
  • Azure SQL Data Warehouse is a large-scale data warehouse service available on the Microsoft Azure cloud.
  • It is based on Microsoft SQL Server and supports both relational and non-relational data.
  • It stores and accesses all its data through the Azure Blob storage service. Because of this physical separation, storage and computing are independent, and computing can even be suspended so that users pay only for storage.
  • The embodiment of the present invention provides a decoupled elastic data warehouse architecture, to at least solve the technical problem that existing data warehouses cannot separate data management from data computation.
  • A decoupled elastic data warehouse architecture, including:
  • a data warehouse front-end, which uses PostgreSQL as its basis to process incoming and outgoing data, provide control and query user interfaces, and manage the underlying storage; in the underlying-storage management, all sharding logic is implemented in a PostgreSQL extension called xschema;
  • a data warehouse back-end, used for scalable and elastic resource management and for single or concurrent queries; resource allocation is carried out in two stages: in the first stage, at installation time, the total resources are initially allocated from the cloud; in the second stage, after the cluster is set up, users can pass parameters when a session starts; Spark SQL is used as the underlying query engine;
  • Dex middleware, which includes a Dex server, a PostgreSQL adapter, and a Spark adapter, all interacting through the Dex communication API.
  • The Dex communication API provides an intermediate layer, in which:
  • the PostgreSQL adapter converts database queries, communicates with the Dex server, and then converts the responses returned from the back-end cluster;
  • the Dex server maintains the query context, monitors session state transitions, and provides Dex services through the Dex communication API;
  • the Spark adapter accepts and parses Dex requests, converts them into Spark computing tasks, and sends the results back to the Dex server once the response is ready.
  • Dex interoperability is a stateful service managed by the Dex Context in the Dex server.
  • Its internal work is driven by messages exchanged between PostgreSQL and Spark.
  • The Dex Context supports both single-backend and multi-backend settings.
  • In the single-backend setting, the Dex Context proxies communication between the PostgreSQL back-end process and Spark through the Dex Context API; when starting a new session, the client application first creates or reuses a Dex Context instance by submitting a connection request to the Dex server.
  • Once the Dex Context is set up, the client application can use the Dex Context API to start calling services.
  • The Dex Context also supports multiple back-ends: when multiple back-ends are connected in a single session, the Dex server refers to the Dex context manager, and each session assigns one Dex Context per back-end.
  • The PostgreSQL adapter is implemented as a PostgreSQL extension, providing a client library for the Dex communication API; it also includes internal functions for converting database queries into Dex requests and converting the results back into PostgreSQL data records.
  • The Spark adapter parses each Dex request into the corresponding Spark function, executes the task, and returns the final result.
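As a minimal illustrative sketch, the message flow among the three components above can be modeled as follows. All function and field names here are assumptions for illustration, not the patent's actual Dex API.

```python
# Hypothetical sketch of the Dex message flow; names are illustrative.

def pg_adapter_convert(sql):
    # PostgreSQL adapter: wrap a database query as a Dex request.
    return {"type": "SQL", "body": sql}

class DexServer:
    # Dex server: keeps per-session query context and brokers requests
    # between the front-end adapter and the back-end adapter.
    def __init__(self, backend):
        self.backend = backend
        self.context = {"state": "idle"}

    def handle(self, request):
        self.context["state"] = "running"
        response = self.backend(request)
        self.context["state"] = "idle"
        return response

def spark_adapter(request):
    # Spark adapter: parse the Dex request into a compute task and
    # return the response once it is ready (execution stubbed here).
    assert request["type"] in ("SQL", "UDF")
    return {"status": "ok", "rows": [("r1",), ("r2",)]}

server = DexServer(spark_adapter)
resp = server.handle(pg_adapter_convert("SELECT * FROM t"))
```

The real system exchanges these messages over the network; the stub keeps everything in-process to show only the division of responsibilities.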
  • When processing incoming and outgoing data, the data warehouse front-end can support various data sources, including local and network file systems and relational and non-relational databases, and handles specific types of data sources through data-ingestion drivers.
  • The data warehouse front-end provides a control and query user interface.
  • The user interface inherits SQL syntax and serves as a unified portal for system control and interactive queries.
  • In the underlying storage, the front-end runs a shard controller on a designated master node; all user data is stored on data nodes; through sharding, system administrators register and release data nodes, and users define partitioning schemes for distributed fact tables.
  • The front-end exposes an analysis service interface over the underlying storage so that users can start analysis workloads on the back-end; service calls are re-analyzed and re-planned in PostgreSQL.
  • The data warehouse back-end is a computer cluster managed by a software stack, with data warehouse layers designated for different functions, including resource allocation, task scheduling, and query composition.
  • Execution efficiency is determined jointly by the query optimizer and the execution framework; for many concurrently running queries, the overall efficiency requirement also involves the task scheduler.
  • The data warehouse back-end uses Spark SQL as the underlying query engine.
  • The back-end is a specific server cluster with a single Spark installation managed by YARN; multiple sessions connected to the same back-end are maintained independently, and separate Spark sessions are served by different Spark jobs.
  • The data warehouse can have multiple Spark clusters as back-ends; Spark clusters can have different sizes and configurations, and within a Spark cluster any session can specify its own resource requirements.
  • All back-end information is recorded in a shared directory, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts; back-ends can be added, deleted, or selected at the request of privileged users.
  • The data warehouse back-end includes a back-end session manager.
  • The back-end session manager shares common information with the shared directory, which maintains a superset of all active sessions; each session on the back-end session manager stores only session-specific metadata. This metadata includes important facts about the data warehouse front-end, the data warehouse back-end, and the network connections, and is used for request processing.
  • Once a front-end request is received and parsed, a request data structure of type ReqStruct is generated.
  • The session metadata is then searched to identify the inputs and functions appearing in the request; communication between the front-end and the back-end is implemented using ZeroMQ, a high-speed asynchronous network I/O library.
  • The Spark adapter is responsible for processing requests from the Dex middleware; when a front-end request is received, it is converted into Spark commands, including the issued SQL query or Spark function, together with prologue and epilogue commands that define auxiliary RDDs.
  • DuoSQL supports two types of analysis requests: SQL queries and UDF calls; each Dex request encodes its type in its header.
  • JdbcRDD serves as a standard API for importing data from remote databases; it allows parallel connections to multiple partitions of a single table. DuoRDD, a sharded JdbcRDD, enhances support for sharded databases; using DuoRDD, Spark executors can load data from multiple shard nodes in parallel.
  • The decoupled elastic data warehouse architecture in the embodiment of the present invention explores a cloud elastic data warehouse architecture by separating data management and data computing functions to achieve independent scalability.
  • The data warehouse front-end receives data, manages storage, and provides high availability.
  • The data warehouse back-end is used for data analysis queries. By separating data management and data computation, the present invention obtains elasticity within a single data warehouse.
  • Figure 1 is a schematic diagram of the architecture of DuoSQL in the present invention.
  • Figure 2 is a schematic diagram of the structure of Dex as middleware for data coordination in the present invention.
  • Figure 3 is a process diagram in which the new Spark API of the present invention extends JdbcRDD.
  • Figure 4 is a flowchart of the execution of the DuoSQL system in the present invention.
  • The present invention explores an architecture that separates data management and data computation. By separating these two parts, the system gains more flexibility and adaptability.
  • The present invention constructs a prototype system, DuoSQL, based on PostgreSQL and Spark.
  • The present invention uses the TPC-H benchmark to verify the system. Experimental results show that the decoupled architecture has great performance potential.
  • The present invention separates data management and data computing functions to achieve independent scalability, exploring a cloud elastic data warehouse architecture.
  • The data warehouse front-end (the data management unit) receives data, manages storage, and provides high availability.
  • The data warehouse back-end (the data computing unit) is used for data analysis queries.
  • The present invention obtains elasticity within a single data warehouse, a characteristic that existing systems do not have.
  • The Microsoft Azure cloud database only allows elastic processing across multiple data warehouses.
  • The invention enables the data warehouse to adapt better to constantly changing workload requirements.
  • The invention implements the architecture using an RDBMS and an in-memory cluster computing engine with SQL support.
  • The present invention builds a prototype system named DuoSQL based on PostgreSQL and Spark.
  • The present invention proposes a cloud data warehouse architecture that decouples data management and data computation, thereby achieving elasticity within a single data warehouse.
  • The prototype system DuoSQL, built on PostgreSQL and Spark, shows good performance potential in experiments.
  • The present invention discloses a decoupled elastic data warehouse architecture.
  • The present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.
  • The elastic data warehouse architecture is built on elastic configuration, mainly of the storage and computing platforms, assisted by communication middleware.
  • The elastic data warehouse is introduced from the following aspects:
  • Figure 1 shows the architecture of DuoSQL.
  • The overall system structure combines a data management front-end with a data computing back-end.
  • The architecture supports various front-end and back-end subsystems.
  • An internal operation middleware is needed to manage network connections, user sessions, request proxying, query translation, and data transmission.
  • The front-end is the data management component of the data warehouse, and its main responsibilities are as follows:
  • The front-end can support various data sources, such as local and network file systems, relational databases, and non-relational databases.
  • A data ingestion driver may be required to handle specific types of data sources.
  • The user interface inherits SQL syntax and serves as a unified portal for system control and interactive queries.
  • The back-end is where the actual analysis and computation are performed. It is a computer cluster fully managed by the software stack to ensure efficiency and elasticity.
  • Data warehouse layers are designated for different functions, such as resource allocation, task scheduling, and query composition. Together, these layers serve two purposes:
  • The central task of the back-end is to complete submitted queries as quickly as possible.
  • Execution efficiency is determined by the query optimizer and the execution framework.
  • The overall efficiency requirement also involves the task scheduler. Much work has been done to improve query efficiency for big-data analytics; the present invention builds its prototype system on this existing technology.
  • The back-end should ensure efficient processing of both single and concurrently running queries.
  • Many modern distributed computing frameworks, such as Apache Spark SQL and Apache Calcite, can be used as the back-end software foundation with the help of a qualified resource manager such as Apache YARN.
  • Middleware is the key component that supports the separation of the data warehouse. At the upper level, it provides interfaces and semantic abstractions through which the front-end and back-end communicate with each other. At the lower level, it directs message exchange and data transmission between client and server.
  • The design and implementation of the middleware should solve the following problems:
  • An RDBMS uses SQL to manage structured data, while most big data platforms use imperative language interfaces to process unstructured data. To communicate and interoperate across heterogeneous systems, a general abstraction of data models and query interfaces must first be developed.
  • DuoSQL uses PostgreSQL as the basis of the front-end.
  • The advantage that makes PostgreSQL stand out is its excellent support for writing database extensions.
  • Most of the front-end logic of DuoSQL is implemented in PostgreSQL extensions, and most user interface functions take the form of UDFs.
  • The shard controller runs on a designated master node. All user data is stored on data nodes. Through sharding, system administrators can register and release data nodes for data distribution, and users can define partitioning schemes for distributed fact tables.
  • The present invention also needs an interface for users to start analysis workloads on the back-end. Finally, by re-analyzing and re-planning in PostgreSQL, service calls can be made transparent to users.
  • The DuoSQL analysis service is invoked through a set of UDFs, including functions that connect to the back-end, run SQL queries, and call remote functions.
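The user-defined partitioning scheme mentioned above can be sketched as a simple placement function. The patent implements the real sharding logic in its xschema PostgreSQL extension, whose internals are not disclosed; the hash-based scheme below is an assumption for illustration only.

```python
# Illustrative placement of fact-table partitions onto data nodes.
# This is NOT the xschema extension's actual logic.

DATA_NODES = ["node0", "node1", "node2"]  # nodes registered via the shard controller

def stable_hash(key):
    # Simple deterministic string hash; Python's built-in hash() is
    # salted per process, so it is unsuitable for stable placement.
    h = 0
    for ch in str(key):
        h = (h * 31 + ord(ch)) % (2 ** 32)
    return h

def node_for(partition_key, nodes=DATA_NODES):
    # Map a fact-table partition key to the data node that stores it.
    return nodes[stable_hash(partition_key) % len(nodes)]

placement = {k: node_for(k) for k in ["2019-01", "2019-02", "2019-03"]}
```

A deterministic placement function of this kind lets every front-end process compute, without coordination, which data node holds a given partition.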
  • The present invention uses Dex, as shown in Figure 2, as middleware to coordinate message passing and data transmission between the front-end and the back-end.
  • Dex was originally designed as an interoperability framework connecting heterogeneous data platforms such as PostgreSQL and Spark.
  • Dex does not support a sharded front-end, nor does it fully utilize Spark's SQL query engine.
  • The present invention adapts it in DuoSQL to support both.
  • The Dex middleware is composed of three main components, the Dex server, the PostgreSQL adapter, and the Spark adapter, which interact through the Dex communication API.
  • The PostgreSQL adapter converts the database query, communicates with the Dex server, and then converts the response returned from the back-end cluster.
  • The Dex server maintains the query context, monitors session state transitions, and provides Dex services through the Dex communication API.
  • The Spark adapter accepts and parses Dex requests, converts each Dex request into a series of Spark computing tasks, and sends the result back to the Dex server once the response is ready.
  • The Dex communication API provides an intermediate layer that enables the end systems to communicate with clean isolation and abstraction.
  • Dex interoperability is a stateful service managed around the Dex Context in the Dex server. The internal work is driven by messages exchanged between PostgreSQL and Spark.
  • The Dex Context supports two settings: single backend and multiple backends. For the single-backend setting, the Dex Context uses the Dex Context API to proxy communication between the PostgreSQL back-end process and Spark. To start a new session, the client application must first create or reuse a Dex Context instance by submitting a connection request to the Dex server. Once the Dex Context is set up, the client application can use the Dex Context API to start calling services. The Dex Context also supports multiple backends.
  • The Dex server needs to be able to maintain all session states.
  • This introduces the Dex context manager, which ensures that each session assigns one Dex Context per backend.
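The create-or-reuse behavior of the Dex context manager can be sketched as a map keyed by session and backend. Class and method names here are assumptions for illustration, not the actual Dex implementation.

```python
# Hypothetical sketch of a Dex context manager: one Dex Context per
# (session, backend) pair, created on first use and reused afterwards.

class DexContext:
    def __init__(self, session_id, backend):
        self.session_id = session_id
        self.backend = backend
        self.state = "connected"

class DexContextManager:
    def __init__(self):
        self._contexts = {}

    def get_or_create(self, session_id, backend):
        # Reuse an existing context for this session/backend pair,
        # otherwise create one (mirrors "create or reuse" in the text).
        key = (session_id, backend)
        if key not in self._contexts:
            self._contexts[key] = DexContext(session_id, backend)
        return self._contexts[key]

mgr = DexContextManager()
ctx_a = mgr.get_or_create("s1", "spark-cluster-a")
ctx_b = mgr.get_or_create("s1", "spark-cluster-b")   # same session, second backend
ctx_a2 = mgr.get_or_create("s1", "spark-cluster-a")  # reused, not recreated
```

Keying on the pair rather than on the session alone is what allows a single session to hold independent contexts for multiple back-ends.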
  • The PostgreSQL adapter is implemented as a PostgreSQL extension. It serves as a client library providing the Dex communication API interface, and also includes internal functions for converting database queries into Dex requests and converting the results back into PostgreSQL data records.
  • The Spark adapter is a module in Spark that parses Dex requests into the corresponding Spark functions, executes the tasks, and returns the final results.
  • The back-end of DuoSQL uses Spark SQL as the underlying query engine.
  • The present invention always refers to the back-end as a specific server cluster, whether virtually or physically distributed.
  • For a back-end cluster there is only one corresponding Spark installation, managed by YARN. However, multiple sessions connected to the same back-end are maintained independently. On the Spark side, separate Spark sessions are served by different Spark jobs.
  • Back-end elasticity is provided at multiple levels.
  • The data warehouse can have multiple Spark clusters as back-ends.
  • Spark clusters can have different sizes and configurations.
  • DuoSQL records all back-end information in a shared directory, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts. Since DuoSQL is a decoupled system, back-ends can be added, deleted, or selected at the request of privileged users.
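The shared back-end directory can be sketched as a small catalog supporting the three privileged operations named above. All names and the permission check are illustrative assumptions; the real directory lives on the PostgreSQL master server.

```python
# Hypothetical sketch of the shared back-end directory: a catalog of
# Spark back-ends that privileged users may add, delete, or select.

class BackendDirectory:
    def __init__(self):
        self._backends = {}

    def add(self, name, config, privileged):
        if not privileged:
            raise PermissionError("only privileged users may modify backends")
        self._backends[name] = config

    def delete(self, name, privileged):
        if not privileged:
            raise PermissionError("only privileged users may modify backends")
        self._backends.pop(name, None)

    def select(self, name):
        # Selection is a read and needs no privilege in this sketch.
        return self._backends[name]

    def names(self):
        return sorted(self._backends)

directory = BackendDirectory()
directory.add("small", {"executors": 2}, privileged=True)
directory.add("large", {"executors": 16}, privileged=True)
chosen = directory.select("large")
directory.delete("small", privileged=True)
```

Because every Dex Context reads the same directory, a newly added cluster becomes visible to all sessions without restarting the front-end.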
  • Session management: the back-end session manager shares some common information with the shared directory, which maintains a superset of all active sessions. Each session on the back-end stores only session-specific metadata. This metadata includes important facts about the front-end, the back-end, and the network connections, such as shard nodes, database names, shard tables, partitions, functions available on Spark, database connection strings, and ZeroMQ handlers. The metadata is used for request processing. Once a front-end request is received and parsed, a request data structure of type ReqStruct is generated. At this point, the session metadata is looked up to identify the inputs and functions that appear in the request. Communication between the front-end and the back-end is implemented using ZeroMQ, a high-speed asynchronous network I/O library.
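A possible shape for the ReqStruct record and the metadata lookup described above is sketched below. The field names and the validation rule are assumptions; the patent does not disclose ReqStruct's actual layout.

```python
# Hypothetical ReqStruct built from a parsed front-end request and
# resolved against session-specific metadata.

from dataclasses import dataclass, field

SESSION_METADATA = {
    "shard_nodes": ["node0", "node1"],
    "database": "tpch",
    "functions": {"kmeans"},  # functions available on Spark
}

@dataclass
class ReqStruct:
    req_type: str                 # "SQL" or "UDF"
    body: str
    inputs: list = field(default_factory=list)

def build_reqstruct(req_type, body, metadata=SESSION_METADATA):
    req = ReqStruct(req_type, body)
    # Look up session metadata to validate functions named in the request.
    if req_type == "UDF" and body not in metadata["functions"]:
        raise ValueError(f"unknown function: {body}")
    # Record which shard nodes supply the request's inputs.
    req.inputs = metadata["shard_nodes"]
    return req

udf_req = build_reqstruct("UDF", "kmeans")
sql_req = build_reqstruct("SQL", "SELECT count(*) FROM lineitem")
```

Keeping only session-specific metadata in each ReqStruct keeps request processing independent of other sessions, matching the isolation described in the text.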
  • The Spark adapter is responsible for processing requests from the Dex middleware. When a front-end request is received, it may be converted into a series of Spark commands, which may include issued SQL queries or Spark functions, as well as prologue and epilogue commands that define auxiliary RDDs. DuoSQL supports two types of analysis requests: SQL queries and UDF calls. Each Dex request encodes its type in its header, and the request processor uses this information to determine how to translate the request content.
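Encoding the request type in the header, as the text describes, can be sketched with a one-byte tag. The wire format below is invented for illustration; the patent does not specify Dex's actual encoding.

```python
# Hypothetical Dex wire format: a one-byte type tag followed by the
# UTF-8 payload. The request processor dispatches on the tag.

REQ_SQL, REQ_UDF = 0x01, 0x02

def encode_request(req_type, payload):
    header = bytes([REQ_SQL if req_type == "SQL" else REQ_UDF])
    return header + payload.encode("utf-8")

def decode_request(data):
    req_type = "SQL" if data[0] == REQ_SQL else "UDF"
    return req_type, data[1:].decode("utf-8")

msg = encode_request("SQL", "SELECT count(*) FROM lineitem")
kind, body = decode_request(msg)
```

Reading the tag before the payload lets the Spark adapter pick the translation path (SQL query vs. UDF call) without parsing the body first.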
  • JdbcRDD is a standard API for importing data from remote databases.
  • One feature of JdbcRDD is that it allows parallel connections to multiple partitions of a single table, thereby achieving parallel data transmission.
  • However, JdbcRDD's parallel-connection feature does not apply to sharded databases.
  • The present invention therefore develops a new Spark API named DuoRDD, which extends JdbcRDD and enhances support for sharded databases; it is a sharded JdbcRDD.
  • Using DuoRDD, Spark executors can load data from multiple shard nodes in parallel.
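DuoRDD's core idea, loading from multiple shard nodes in parallel, can be modeled without Spark by fanning one loader out over a thread pool. The shard contents and function names below are stand-ins; the real DuoRDD extends Spark's JdbcRDD and opens a JDBC connection per shard.

```python
# Simplified model of DuoRDD-style parallel shard loading. A thread
# pool stands in for Spark executors; shard data is stubbed in memory.

from concurrent.futures import ThreadPoolExecutor

SHARDS = {
    "node0": [(1, "a"), (2, "b")],
    "node1": [(3, "c")],
    "node2": [(4, "d"), (5, "e")],
}

def load_shard(node):
    # In the real system this would open a JDBC connection to one shard.
    return SHARDS[node]

def duo_load(nodes):
    # Load all shards concurrently and flatten the partitions.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        parts = list(pool.map(load_shard, nodes))
    return [row for part in parts for row in part]

rows = duo_load(sorted(SHARDS))
```

Because `pool.map` preserves input order, the flattened result is deterministic even though the shard fetches run concurrently.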
  • In step 1, the user submits an SQL request to PostgreSQL.
  • In step 2, the interface calls the context API to execute the corresponding request. If the user submits further requests while one is executing, DuoSQL creates another context for the other requests.
  • In step 3, the middleware adapter analyzes the request initiated by the user and generates different ReqStructs according to the different functions.
  • In step 4, the Spark adapter analyzes the request sent by the middleware adapter, executes it, and returns the result; this also includes Spark requesting data from the PostgreSQL cluster.
  • DuoSQL launches the Spark jobs under the cluster management.
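The four steps above can be condensed into one illustrative pipeline. Every function here is a stub standing in for a DuoSQL component; the names are assumptions, not the system's actual API.

```python
# Hypothetical end-to-end sketch of the DuoSQL execution flow.

def context_api(request_id):
    # Step 2: create (or reuse) a context for this request.
    return {"context_id": request_id}

def middleware_adapter(sql):
    # Step 3: analyze the user's request and build a ReqStruct.
    return {"type": "SQL", "body": sql}

def spark_adapter_execute(reqstruct):
    # Step 4: execute the request on the back-end and return the result.
    return {"status": "ok", "echo": reqstruct["body"]}

def execute(sql, request_id=1):
    # Step 1: the user submits an SQL request.
    ctx = context_api(request_id)
    reqstruct = middleware_adapter(sql)
    result = spark_adapter_execute(reqstruct)
    result["context_id"] = ctx["context_id"]
    return result

out = execute("SELECT count(*) FROM orders")
```

In the real system each stub crosses a process or network boundary (PostgreSQL, Dex server, Spark cluster); the sketch shows only the order of the hand-offs.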
  • The present invention proposes an elastic data warehouse architecture by decoupling data management and computation.
  • The present invention constructs a prototype system, DuoSQL, using the Dex interoperability middleware, a sharded PostgreSQL database as the front-end, and a Spark cluster as the back-end.
  • The present invention evaluates the performance potential of DuoSQL by comparing it with standalone PostgreSQL, both with and without parallel query support.
  • The invention runs test experiments with different workloads and input types. The results show that DuoSQL not only has clear performance advantages but also excellent robustness.
  • The decoupled elastic data warehouse architecture proposed by the present invention has three important characteristics: decoupling, elasticity, and diversity.
  • Decoupling: the design of the present invention completely separates computation from storage.
  • Elasticity is the foremost consideration of the present invention.
  • Elasticity, that is, the ability to independently and adaptively scale system components, is the main property that cloud data warehouses should support.
  • The architecture can also exploit back-end diversity to diversify the data warehouse.
  • The experiments make use of Spark features such as in-memory iterative computation.
  • The back end to use can be determined by the user; different back ends are supported.
  • The OLAP data analysis benchmark TPC-H and machine learning algorithms verify the effectiveness of the system architecture of the present invention.
  • The architecture of the present invention was compared against the case without storage-compute separation, including TPC-H experiments at scale factors 30, 50, and 75.
  • For machine learning, the present invention is compared with PostgreSQL clustering algorithms under the Apache MADlib framework, using the skin_noskin and KGEE data sets from UCI.
  • The experimental results are, in general, better than those of the existing architecture.
  • the disclosed technical content can be implemented in other ways.
  • the system embodiment described above is only illustrative.
  • the division of units may be a logical function division, and there may be other divisions in actual implementation.
  • Multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • The technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods in the various embodiments of the present invention.
  • The aforementioned storage media include: USB flash disk, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, optical disk, and other media that can store program code.


Abstract

Provided is a decoupling elastic data warehouse architecture. The decoupling elastic data warehouse architecture comprises: a data warehouse front end for using PostgreSQL as a basis for the data warehouse front end, processing incoming and outgoing data, providing a control and query user interface, and managing underlying storage; a data warehouse back end for performing extensible and elastic resource management and performing single or concurrent querying; and separation data warehouse middleware for coordinating, by using Dex, message transfer and data transmission between the data warehouse front end and the data warehouse back end. By means of separating a data management function from a data calculation function, independent extensibility is achieved, and a cloud elastic data warehouse system structure is explored. The data warehouse front end receives data, manages storage and provides high availability. The data warehouse back end is used for data analysis querying. By means of separating data management from data calculation, the present data warehouse architecture can acquire elasticity from a single data warehouse.

Description

一种解耦合的弹性数据仓库架构A decoupled elastic data warehouse architecture 技术领域Technical field
本发明涉及数据仓库领域,具体而言,涉及一种解耦合的弹性数据仓库架构。The invention relates to the field of data warehouses, in particular to a decoupled elastic data warehouse architecture.
背景技术Background technique
随着云在提供共享和管理的IT基础设施方面越来越受欢迎，如今的公司迫切希望将其数据平台资产转移到云上，以减少设备、公用设施和维护支出。将数据仓库移动到云端是当今公司考虑的一种经济高效的数据管理趋势。为了充分实现经济目标，云数据仓库系统应该能够调整其资源配置，以适应不断变化的工作负载需求。然而，传统的数据仓库体系结构不够灵活，不允许按需资源控制，这严重限制了云提供商和用户的优化总成本和保持期望所需的服务质量。As the cloud becomes more and more popular in providing shared and managed IT infrastructure, companies today are eager to transfer their data platform assets to the cloud to reduce equipment, utilities, and maintenance expenses. Moving data warehouses to the cloud is a cost-effective data management trend considered by companies today. In order to fully achieve economic goals, a cloud data warehouse system should be able to adjust its resource allocation to adapt to changing workload requirements. However, the traditional data warehouse architecture is not flexible enough to allow on-demand resource control, which severely limits the ability of cloud providers and users to optimize total cost and maintain the desired quality of service.
数据仓库已经存在几十年了,它的主要架构已经从对称多处理器(SMP)转变为大规模并行处理器(MPP)。然而,云计算和大数据的出现需要一种新的范式变革,这种变革比以前的范式更为紧迫和具有破坏性。目前的MPP数据仓库静态地安装在少数不共享的计算机节点上。这种架构无法利用云的多功能和强大的功能进行计划和资源分配,从而阻碍用户和云提供商实现预期的性能、服务质量和预算控制目标。The data warehouse has existed for decades, and its main architecture has changed from symmetric multi-processor (SMP) to massively parallel processor (MPP). However, the emergence of cloud computing and big data requires a new paradigm change, which is more urgent and destructive than the previous paradigm. The current MPP data warehouse is statically installed on a few computer nodes that are not shared. This architecture cannot make use of the multi-function and powerful functions of the cloud for planning and resource allocation, which prevents users and cloud providers from achieving the expected performance, service quality, and budget control goals.
传统上,由于良好的可扩展性,MPP在数据仓库中得到了普及。但是,这种可扩展性几乎只在安装时提供。在安装之前,工作负载和数据量的类型是非常清楚的。但对于MPP数据仓库,很难支持异构工作负载,计算密集型算法和细粒度资源管理。有了云,现代业务工作流程的波动性质是不可避免的。首先,工作流程必须处理多样化的数据源。传统上,数据仓库设置为分析从内部数据源集成的数据。在云时代,分析数据的可能性越来越大,来自各种各样 的应用程序,并且速度差别很大。此外,分析请求由外部客户按需进行。当在短时间内提交大量请求时,系统承受着很大的压力来处理它们。在这种情况下,扩展计算资源的能力对于保证服务质量至关重要。其次,分析将采用更复杂和迭代的算法,这些算法比传统的分析工作负载需要更多的计算能力。数据挖掘和机器学习方面的现代算法已经成功地实现了识别模式和发现商业数据中的有效数据,现代分析使用高级算法来推动应用,例如个性化推荐,欺诈检测和业务决策。因此,数据仓库中将运行比以往更多的CPU密集型工作负载。Traditionally, due to good scalability, MPP has gained popularity in data warehouses. However, this scalability is almost only provided at installation time. Before installation, the type of workload and data volume is very clear. But for MPP data warehouses, it is difficult to support heterogeneous workloads, computationally intensive algorithms and fine-grained resource management. With the cloud, the volatile nature of modern business workflow is inevitable. First, the workflow must deal with diverse data sources. Traditionally, data warehouses are set up to analyze data integrated from internal data sources. In the cloud era, the possibility of analyzing data is increasing, coming from a variety of applications, and the speed varies greatly. In addition, analysis requests are made on demand by external customers. When a large number of requests are submitted in a short period of time, the system is under great pressure to process them. In this case, the ability to expand computing resources is essential to ensure service quality. Second, the analysis will use more complex and iterative algorithms that require more computing power than traditional analysis workloads. Modern algorithms in data mining and machine learning have succeeded in identifying patterns and discovering valid data in business data. Modern analytics use advanced algorithms to drive applications, such as personalized recommendations, fraud detection, and business decision-making. 
As a result, more CPU-intensive workloads will run in the data warehouse than ever before.
基于上述观察,可以认为弹性即能够独立自适应地扩展系统组件的能力,是云数据仓库应该支持的主要属性。但是建立一个弹性数据仓库对于当前数据仓库软件设计的限制并不是一个简单的任务,它假设一个对称模型,其中每个节点连接一个本地存储,所有节点都是同构的。在MPP设置中,使用强耦合模型处理系统会对性能产生积极影响。但在云配置中,该软件设计成为性能和成本效益的障碍。为了获得所需的弹性,软件必须支持一定程度的计算和存储分离,以便在工作负载需要时可以添加更多的计算资源。Based on the above observations, it can be considered that elasticity, that is, the ability to independently and adaptively expand system components, is the main attribute that cloud data warehouses should support. However, the limitation of the current data warehouse software design to establish an elastic data warehouse is not a simple task. It assumes a symmetrical model in which each node is connected to a local storage, and all nodes are homogeneous. In MPP settings, using a strongly coupled model to process the system will have a positive impact on performance. But in cloud configurations, the software design becomes an obstacle to performance and cost-effectiveness. In order to obtain the required flexibility, the software must support a certain degree of separation of computing and storage so that more computing resources can be added when the workload requires it.
在认识到这一阻碍后,一些数据库供应商已经开始重新设计云计算的数据仓库。Azure SQL数据仓库是Microsoft Azure云上可用的大规模数据仓库服务。Azure SQL基于Microsoft SQL服务器,适用于支持关系数据和非关系数据。Azure SQL通过Azure Blob存储服务存储和访问其所有数据。由于物理分离,不仅存储和计算独立,而且计算也可以暂停,以便用户只需为存储付费。After recognizing this obstacle, some database vendors have begun to redesign cloud computing data warehouses. Azure SQL data warehouse is a large-scale data warehouse service available on Microsoft Azure cloud. Azure SQL is based on Microsoft SQL server and is suitable for supporting relational and non-relational data. Azure SQL stores and accesses all its data through Azure Blob storage service. Due to the physical separation, not only storage and computing are independent, but computing can also be suspended so that users only pay for storage.
发明内容Summary of the invention
本发明实施例提供了一种解耦合的弹性数据仓库架构,以至少解决现有数据仓库无法将数据管理和数据计算进行分离的技术问题。The embodiment of the present invention provides a decoupled elastic data warehouse architecture to at least solve the technical problem that the existing data warehouse cannot separate data management and data calculation.
根据本发明的实施例,提供了一种解耦合的弹性数据仓库架构,包括:According to an embodiment of the present invention, a decoupled elastic data warehouse architecture is provided, including:
数据仓库前端,用于使用PostgreSQL作为数据仓库前端的基础,处理进出数据、提供控制和查询用户界面及管理底层存储;其中在管理底层存储中在名为xschema的PostgreSQL扩展中实现所有分片逻辑;The data warehouse front-end is used to use PostgreSQL as the basis of the data warehouse front-end to process incoming and outgoing data, provide control and query user interfaces, and manage the underlying storage; among them, all sharding logic is implemented in the PostgreSQL extension called xschema in the management of the underlying storage;
数据仓库后端，用于可扩展和弹性的资源管理、单个或并发查询；其中资源分配分两个阶段进行，第一个阶段是在安装时，总资源最初是从云端分配；第二个阶段是在集群设置之后，用户可以在会话启动时传递参数；并使用Spark SQL作为底层查询引擎；Data warehouse back end, used for scalable and elastic resource management and for single or concurrent queries; resource allocation is carried out in two stages: the first stage is at installation time, when total resources are initially allocated from the cloud; the second stage is after the cluster is set up, when users can pass parameters at session start; Spark SQL is used as the underlying query engine;
分离数据仓库中间件,用于使用Dex协调数据仓库前端和数据仓库后端之间的消息传递和数据传输。Separate data warehouse middleware and use Dex to coordinate message and data transmission between the front end of the data warehouse and the back end of the data warehouse.
进一步地,Dex中间件包括:Dex服务器、PostgreSQL适配器和Spark适配器,并通过Dex通信API运行,Dex通信API提供了一个中间层,其中:Further, Dex middleware includes: Dex server, PostgreSQL adapter and Spark adapter, and runs through the Dex communication API. The Dex communication API provides an intermediate layer, including:
PostgreSQL适配器用于转换数据库查询,与Dex服务器通信,然后从后端群集转换返回的响应;The PostgreSQL adapter is used to convert database queries, communicate with Dex servers, and then convert the returned response from the backend cluster;
Dex服务器用于维护查询上下文,监视会话状态转换并通过Dex通信API提供Dex服务;The Dex server is used to maintain the query context, monitor the transition of the session state and provide Dex services through the Dex communication API;
Spark适配器用于接受并解析Dex requests,将Dex请求转换为Spark计算任务,一旦响应准备好,就将其发送回Dex服务器。The Spark adapter is used to accept and parse Dex requests, convert Dex requests into Spark computing tasks, and send them back to the Dex server once the response is ready.
进一步地，Dex互操作是在Dex服务器中围绕Dex Context管理的有状态服务，内部工作由PostgreSQL和Spark之间交换的消息驱动；Dex Context分别支持单后端和多后端两种设置，对于单后端设置，Dex Context通过Dex Context API代理PostgreSQL后端进程与Spark之间的通信；启动新的会话时，客户端应用程序首先通过向Dex服务器提交连接请求来创建或重用Dex Context实例，一旦设置了Dex Context，客户端应用程序就可以使用Dex Context API开始调用服务；Dex Context还支持多个后端，当在单个会话中连接多个后端时，Dex服务器引用了Dex上下文管理器，其中的每个会话都将一个Dex Context指定给一个后端；Furthermore, Dex interoperability is a stateful service managed around the Dex Context in the Dex server, whose internal work is driven by messages exchanged between PostgreSQL and Spark; the Dex Context supports two settings, single back end and multiple back ends; for the single-backend setting, the Dex Context proxies the communication between the PostgreSQL back-end process and Spark through the Dex Context API; when starting a new session, the client application first creates or reuses a Dex Context instance by submitting a connection request to the Dex server, and once the Dex Context is set up, the client application can start calling services through the Dex Context API; the Dex Context also supports multiple back ends: when multiple back ends are connected in a single session, the Dex server refers to the Dex context manager, in which each session assigns one Dex Context to one back end;
PostgreSQL适配器在PostgreSQL扩展中实现,提供Dex通信API接口的客户端库,还包括用于将数据库队列转换为Dex请求以及将结果转换回PostgreSQL数据记录的内部函数;The PostgreSQL adapter is implemented in the PostgreSQL extension, providing a client library for the Dex communication API interface, and also includes internal functions for converting database queues into Dex requests and converting results back to PostgreSQL data records;
Spark适配器将Dex请求解析为相应的Spark函数,开始执行任务并返回 最终结果。The Spark adapter parses the Dex request into the corresponding Spark function, starts to execute the task and returns the final result.
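A minimal sketch of the Dex Context management just described, in plain Python. Class and method names here are illustrative assumptions, not the patent's actual Dex API; the point is only the create-or-reuse behavior and the one-context-per-backend rule.

```python
class DexContext:
    """Stateful per-session service proxying PostgreSQL <-> Spark messages."""
    def __init__(self, session_id, backend):
        self.session_id = session_id
        self.backend = backend  # each context is pinned to one backend

class DexContextManager:
    """Referenced by the Dex server when one session talks to several backends."""
    def __init__(self):
        self._contexts = {}  # (session_id, backend) -> DexContext

    def connect(self, session_id, backend):
        # Create or reuse a context, as on session start-up.
        key = (session_id, backend)
        if key not in self._contexts:
            self._contexts[key] = DexContext(session_id, backend)
        return self._contexts[key]

mgr = DexContextManager()
ctx1 = mgr.connect("sess-1", "spark-cluster-A")
ctx2 = mgr.connect("sess-1", "spark-cluster-A")  # reused, not recreated
ctx3 = mgr.connect("sess-1", "spark-cluster-B")  # second backend, new context
print(ctx1 is ctx2, ctx1 is ctx3)  # True False
```

In the single-backend setting only the first `connect` path is exercised; the manager becomes necessary once a single session spans multiple back ends.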
进一步地,数据仓库前端处理进出数据中,对于数据集成,选择支持各种数据源,包括本地和网络文件系统、关系数据库和非关系数据库;并通过数据摄取驱动程序来处理特定类型的数据源;Further, in the data warehouse front-end processing incoming and outgoing data, for data integration, choose to support various data sources, including local and network file systems, relational databases and non-relational databases; and process specific types of data sources through data ingestion drivers;
数据仓库前端提供控制和查询用户界面中,用户界面继承了SQL语法,并作为系统控制和交互查询的统一门户;The front end of the data warehouse provides a control and query user interface. The user interface inherits the SQL syntax and serves as a unified portal for system control and interactive query;
数据仓库前端管理底层存储中通过分片控制器在指定的主节点上运行；所有用户数据都存储在数据节点上；系统管理员通过分片注册和释放数据节点，用户为分布式事实表定义分区方案；In managing the underlying storage, the data warehouse front end runs a shard controller on a designated master node; all user data is stored on data nodes; the system administrator registers and releases data nodes through sharding, and users define the partitioning scheme for distributed fact tables;
数据仓库前端管理底层存储中通过分析服务接口供用户在后端启动分析工作负载;并通过在PostgreSQL中重新进行解析和规划。The front-end management of the data warehouse uses the analysis service interface in the underlying storage for users to start the analysis workload on the back-end; and re-analyze and plan in PostgreSQL.
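The shard-placement role of the front end can be illustrated with a small sketch in pure Python. The hash-based partitioning shown is an assumed example only: the patent leaves the concrete scheme to the user-defined partition plan, and the node names are invented.

```python
import hashlib

# Data nodes registered with the shard controller by the administrator.
DATA_NODES = ["node-0", "node-1", "node-2"]

def shard_for(partition_key: str, nodes=DATA_NODES) -> str:
    # Deterministically map a fact-table row to one data node
    # (one possible user-defined partitioning scheme).
    digest = hashlib.sha256(partition_key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

keys = ["order-1001", "order-1002", "order-1003"]
placement = {k: shard_for(k) for k in keys}
print(placement)
```

Whatever scheme the user defines, the essential property is the same: placement is a pure function of the partition key, so every front-end process agrees on which data node holds a given row.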
进一步地,数据仓库后端为由软件栈管理的计算机集群,数据仓库层被指定为不同的功能,包括资源分配、任务调度和查询组合。Further, the back end of the data warehouse is a computer cluster managed by a software stack, and the data warehouse layer is designated for different functions, including resource allocation, task scheduling, and query combination.
进一步地,在数据仓库后端中对于单个查询,执行效率由查询优化器和执行框架共同决定;对于并发运行的许多查询,总体执行效率要求涉及任务调度程序。Furthermore, for a single query in the backend of the data warehouse, the execution efficiency is determined by the query optimizer and the execution framework; for many concurrently running queries, the overall execution efficiency requirement involves the task scheduler.
进一步地,数据仓库后端使用Spark SQL作为底层查询引擎中,数据仓库后端为特定服务器群集,只有一个由YARN管理的相应Spark安装,连接到同一后端的多个会话是独立维护的;单独的Spark会话由不同的Spark作业提供服务。Further, the backend of the data warehouse uses Spark SQL as the underlying query engine. The backend of the data warehouse is a specific server cluster. There is only one corresponding Spark installation managed by YARN. Multiple sessions connected to the same backend are maintained independently; Spark sessions are served by different Spark jobs.
进一步地,数据仓库有多个Spark集群作为后端;Spark集群具有不同的大小和配置;在Spark集群中,任何会话可以指定自己的资源需求;Furthermore, the data warehouse has multiple Spark clusters as backends; Spark clusters have different sizes and configurations; in a Spark cluster, any session can specify its own resource requirements;
数据仓库后端的资源管理中将所有后端信息记录在共享目录中，该共享目录存储在PostgreSQL主服务器上，并且可以被所有Dex Context引用；可以根据特权用户的请求添加、删除或选择数据仓库后端；In back-end resource management, the data warehouse records all back-end information in a shared catalog, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts; data warehouse back ends can be added, deleted, or selected at the request of privileged users;
数据仓库后端包括后端会话管理器，后端会话管理器与主目录共享公共信息，该主目录维护所有活动会话的超集；后端会话管理器上的每个会话仅存储会话特定的元数据；该元数据包括有关数据仓库前端、数据仓库后端和网络连接的重要事实；元数据将用于请求处理，一旦接收并解析了前端请求，就会生成ReqStruct类型的请求数据结构，此时需要查找会话元数据以识别请求中出现的输入和功能；数据仓库前端和数据仓库后端之间的通信是使用高速异步网络I/O库的ZeroMQ来实现。The data warehouse back end includes a back-end session manager, which shares common information with the master catalog that maintains a superset of all active sessions; each session on the back-end session manager stores only session-specific metadata; this metadata includes important facts about the data warehouse front end, the data warehouse back end, and the network connection; the metadata is used for request processing: once a front-end request is received and parsed, a request data structure of type ReqStruct is generated, at which point the session metadata must be looked up to identify the inputs and functions appearing in the request; communication between the data warehouse front end and the data warehouse back end is implemented using ZeroMQ, a high-speed asynchronous network I/O library.
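A hypothetical sketch of the shared back-end catalog and per-session metadata lookup described above, in plain Python. All field names and values are illustrative assumptions, and the real message transport would use ZeroMQ sockets rather than these in-process calls.

```python
# Shared catalog: conceptually stored on the PostgreSQL master
# and referenced by every Dex Context.
catalog = {}  # backend name -> connection info

def add_backend(name, info):
    # Privileged-user operation: register a back end.
    catalog[name] = info

def remove_backend(name):
    # Privileged-user operation: release a back end.
    catalog.pop(name, None)

# Per-session metadata kept by the back-end session manager.
sessions = {
    "sess-1": {"frontend": "pg-master:5432", "backend": "spark-A", "net": "zmq"},
}

def handle_request(session_id, req_struct):
    # After a front-end request is parsed into a ReqStruct, look up the
    # session metadata to resolve its inputs and target back end.
    meta = sessions[session_id]
    backend = catalog[meta["backend"]]
    return f"dispatch {req_struct} to {backend['url']}"

add_backend("spark-A", {"url": "spark://a:7077"})
print(handle_request("sess-1", "ReqStruct(SQL)"))
```

The split mirrors the text: global facts live in one shared catalog, while each session carries only the metadata needed to route its own requests.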
进一步地，Spark适配器负责处理来自Dex中间件的请求；当收到前端请求时，该请求会被转换为Spark命令，包括已解除的SQL查询或Spark函数，以及定义辅助RDD的pro-logue和epilogue命令；Duo SQL支持两种类型的分析请求：一种是SQL查询，另一种是UDF调用；每个Dex请求在其标头中对其类型进行编码。Furthermore, the Spark adapter is responsible for processing requests from the Dex middleware; when a front-end request is received, it is converted into Spark commands, including translated SQL queries or Spark functions, as well as prologue and epilogue commands that define auxiliary RDDs; Duo SQL supports two types of analysis requests: SQL queries and UDF calls; each Dex request encodes its type in its header.
进一步地，在Spark SQL中，JdbcRDD用于从远程数据库导入数据的标准API；JdbcRDD允许并行连接到单个表的多个分区，增强了对分片数据库的支持，是一个shard的JdbcRDD；使用DuoRDD，Spark执行程序可以并行地从多个分片节点加载。Furthermore, in Spark SQL, JdbcRDD is the standard API for importing data from remote databases; JdbcRDD allows parallel connections to multiple partitions of a single table; DuoRDD enhances support for sharded databases and is, in effect, a sharded JdbcRDD; using DuoRDD, Spark executors can load data from multiple shard nodes in parallel.
本发明实施例中的解耦合的弹性数据仓库架构,通过将数据管理和数据计算功能相分离,以实现独立的可扩展性,探索了云的弹性数据仓库体系结构。数据仓库前端接收数据、管理存储并提供高可用性。数据仓库后端用于数据分析的查询。通过分离数据管理和数据计算,本发明可以在单一的数据仓库中获得弹性。The decoupled elastic data warehouse architecture in the embodiment of the present invention explores the cloud elastic data warehouse architecture by separating data management and data computing functions to achieve independent scalability. The front end of the data warehouse receives data, manages storage, and provides high availability. The data warehouse backend is used for data analysis queries. By separating data management and data calculation, the present invention can obtain flexibility in a single data warehouse.
附图说明Description of the drawings
图1为本发明中DuoSQL的体系结构示意图;Figure 1 is a schematic diagram of the architecture of DuoSQL in the present invention;
图2为本发明中Dex作为中间件进行数据协调的结构示意图;2 is a schematic diagram of the structure of Dex as middleware for data coordination in the present invention;
图3为本发明中新Spark API扩展了JdbcRDD的过程图;Figure 3 is a process diagram in which the new Spark API of the present invention extends JdbcRDD;
图4为本发明中Duo SQL系统执行流程图。Figure 4 is a flowchart of the execution of the Duo SQL system in the present invention.
具体实施方式Detailed ways
为了使本技术领域的人员更好地理解本发明方案,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分的实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本发明保护的范围。In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only It is a part of the embodiments of the present invention, not all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
需要说明的是,本发明的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本发明的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。It should be noted that the terms "first" and "second" in the specification and claims of the present invention and the above-mentioned drawings are used to distinguish similar objects, and not necessarily used to describe a specific sequence or sequence. It should be understood that the data used in this way can be interchanged under appropriate circumstances so that the embodiments of the present invention described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations of them are intended to cover non-exclusive inclusions. For example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the clearly listed Those steps or units may include other steps or units that are not clearly listed or are inherent to these processes, methods, products, or equipment.
根据本发明的实施例,提供了一种解耦合的弹性数据仓库架构,包括:According to an embodiment of the present invention, a decoupled elastic data warehouse architecture is provided, including:
数据仓库前端,用于使用PostgreSQL作为数据仓库前端的基础,处理进出数据、提供控制和查询用户界面及管理底层存储;其中在管理底层存储中在名为xschema的PostgreSQL扩展中实现所有分片逻辑;The data warehouse front-end is used to use PostgreSQL as the basis of the data warehouse front-end to process incoming and outgoing data, provide control and query user interfaces, and manage the underlying storage; among them, all sharding logic is implemented in the PostgreSQL extension called xschema in the management of the underlying storage;
数据仓库后端，用于可扩展和弹性的资源管理、单个或并发查询；其中资源分配分两个阶段进行，第一个阶段是在安装时，总资源最初是从云端分配；第二个阶段是在集群设置之后，用户可以在会话启动时传递参数；并使用Spark SQL作为底层查询引擎；Data warehouse back end, used for scalable and elastic resource management and for single or concurrent queries; resource allocation is carried out in two stages: the first stage is at installation time, when total resources are initially allocated from the cloud; the second stage is after the cluster is set up, when users can pass parameters at session start; Spark SQL is used as the underlying query engine;
分离数据仓库中间件,用于使用Dex协调数据仓库前端和数据仓库后端之间的消息传递和数据传输。Separate data warehouse middleware and use Dex to coordinate message and data transmission between the front end of the data warehouse and the back end of the data warehouse.
本发明实施例中的解耦合的弹性数据仓库架构,通过将数据管理和数据计 算功能相分离,以实现独立的可扩展性,探索了云的弹性数据仓库体系结构。数据仓库前端接收数据、管理存储并提供高可用性。数据仓库后端用于数据分析的查询。通过分离数据管理和数据计算,本发明可以在单一的数据仓库中获得弹性。The decoupled elastic data warehouse architecture in the embodiment of the present invention explores the cloud elastic data warehouse architecture by separating data management and data calculation functions to achieve independent scalability. The front end of the data warehouse receives data, manages storage, and provides high availability. The data warehouse backend is used for data analysis queries. By separating data management and data calculation, the present invention can obtain flexibility in a single data warehouse.
作为优选的技术方案中,Dex中间件包括:Dex服务器、PostgreSQL适配器和Spark适配器,并通过Dex通信API运行,Dex通信API提供了一个中间层,其中:As a preferred technical solution, Dex middleware includes: Dex server, PostgreSQL adapter and Spark adapter, and runs through the Dex communication API. The Dex communication API provides an intermediate layer, including:
PostgreSQL适配器用于转换数据库查询,与Dex服务器通信,然后从后端群集转换返回的响应;The PostgreSQL adapter is used to convert database queries, communicate with Dex servers, and then convert the returned response from the backend cluster;
Dex服务器用于维护查询上下文,监视会话状态转换并通过Dex通信API提供Dex服务;The Dex server is used to maintain the query context, monitor the transition of the session state and provide Dex services through the Dex communication API;
Spark适配器用于接受并解析Dex requests,将Dex请求转换为Spark计算任务,一旦响应准备好,就将其发送回Dex服务器。The Spark adapter is used to accept and parse Dex requests, convert Dex requests into Spark computing tasks, and send them back to the Dex server once the response is ready.
作为优选的技术方案中，Dex互操作是在Dex服务器中围绕Dex Context管理的有状态服务，内部工作由PostgreSQL和Spark之间交换的消息驱动；Dex Context分别支持单后端和多后端两种设置，对于单后端设置，Dex Context通过Dex Context API代理PostgreSQL后端进程与Spark之间的通信；启动新的会话时，客户端应用程序首先通过向Dex服务器提交连接请求来创建或重用Dex Context实例，一旦设置了DexContext，客户端应用程序就可以使用Dex Context API开始调用服务；Dex Context还支持多个后端，当在单个会话中连接多个后端时，Dex服务器引用了Dex上下文管理器，其中的每个会话都将一个Dex Context指定给一个后端；As a preferred technical solution, Dex interoperability is a stateful service managed around the Dex Context in the Dex server, whose internal work is driven by messages exchanged between PostgreSQL and Spark; the Dex Context supports two settings, single back end and multiple back ends; for the single-backend setting, the Dex Context proxies the communication between the PostgreSQL back-end process and Spark through the Dex Context API; when starting a new session, the client application first creates or reuses a Dex Context instance by submitting a connection request to the Dex server, and once the Dex Context is set up, the client application can start calling services through the Dex Context API; the Dex Context also supports multiple back ends: when multiple back ends are connected in a single session, the Dex server refers to the Dex context manager, in which each session assigns one Dex Context to one back end;
PostgreSQL适配器在PostgreSQL扩展中实现,提供Dex通信API接口的客户端库,还包括用于将数据库队列转换为Dex请求以及将结果转换回PostgreSQL数据记录的内部函数;The PostgreSQL adapter is implemented in the PostgreSQL extension, providing a client library for the Dex communication API interface, and also includes internal functions for converting database queues into Dex requests and converting results back to PostgreSQL data records;
Spark适配器将Dex请求解析为相应的Spark函数,开始执行任务并返回最终结果。The Spark adapter parses the Dex request into the corresponding Spark function, starts to execute the task and returns the final result.
作为优选的技术方案中,数据仓库前端处理进出数据中,对于数据集成,选择支持各种数据源,包括本地和网络文件系统、关系数据库和非关系数据库;并通过数据摄取驱动程序来处理特定类型的数据源;As a preferred technical solution, the front-end of the data warehouse processes incoming and outgoing data. For data integration, it chooses to support various data sources, including local and network file systems, relational databases and non-relational databases; and handles specific types through data ingestion drivers Data source;
数据仓库前端提供控制和查询用户界面中,用户界面继承了SQL语法,并作为系统控制和交互查询的统一门户;The front end of the data warehouse provides a control and query user interface. The user interface inherits the SQL syntax and serves as a unified portal for system control and interactive query;
数据仓库前端管理底层存储中通过分片控制器在指定的主节点上运行；所有用户数据都存储在数据节点上；系统管理员通过分片注册和释放数据节点，用户为分布式事实表定义分区方案；In managing the underlying storage, the data warehouse front end runs a shard controller on a designated master node; all user data is stored on data nodes; the system administrator registers and releases data nodes through sharding, and users define the partitioning scheme for distributed fact tables;
数据仓库前端管理底层存储中通过分析服务接口供用户在后端启动分析工作负载;并通过在PostgreSQL中重新进行解析和规划。The front-end management of the data warehouse uses the analysis service interface in the underlying storage for users to start the analysis workload on the back-end; and re-analyze and plan in PostgreSQL.
作为优选的技术方案中,数据仓库后端为由软件栈管理的计算机集群,数据仓库层被指定为不同的功能,包括资源分配、任务调度和查询组合。As a preferred technical solution, the back end of the data warehouse is a computer cluster managed by a software stack, and the data warehouse layer is designated for different functions, including resource allocation, task scheduling, and query combination.
作为优选的技术方案中,在数据仓库后端中对于单个查询,执行效率由查询优化器和执行框架共同决定;对于并发运行的许多查询,总体执行效率要求涉及任务调度程序。As a preferred technical solution, for a single query in the data warehouse backend, the execution efficiency is jointly determined by the query optimizer and the execution framework; for many concurrently running queries, the overall execution efficiency requirement involves the task scheduler.
作为优选的技术方案中，数据仓库后端使用Spark SQL作为底层查询引擎中，数据仓库后端为特定服务器群集，只有一个由YARN管理的相应Spark安装，连接到同一后端的多个会话是独立维护的；单独的Spark会话由不同的Spark作业提供服务。As a preferred technical solution, the data warehouse back end uses Spark SQL as the underlying query engine; the back end is a specific server cluster with only one corresponding Spark installation managed by YARN; multiple sessions connected to the same back end are maintained independently, and separate Spark sessions are served by different Spark jobs.
In a preferred technical solution, the data warehouse has multiple Spark clusters as back ends; the Spark clusters may have different sizes and configurations; within a Spark cluster, any session can specify its own resource requirements;
In the back-end resource management of the data warehouse, all back-end information is recorded in a shared catalog, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts; back ends can be added, removed, or selected at the request of privileged users;
The data warehouse back end includes a back-end session manager, which shares common information with the master catalog that maintains a superset of all active sessions. Each session on the back-end session manager stores only session-specific metadata, covering key facts about the data warehouse front end, the data warehouse back end, and the network connections. This metadata is used for request processing: once a front-end request is received and parsed, a request data structure of type ReqStruct is generated, at which point the session metadata is looked up to identify the inputs and functions appearing in the request. Communication between the data warehouse front end and back end is implemented with ZeroMQ, a high-speed asynchronous network I/O library.
In a preferred technical solution, the Spark adapter handles requests from the Dex middleware. When a front-end request is received, it is translated into Spark commands, including the unpacked SQL query or Spark function together with prologue and epilogue commands that define auxiliary RDDs. Duo SQL supports two types of analysis request — SQL queries and UDF calls — and each Dex request encodes its type in its header.
In a preferred technical solution, JdbcRDD is the standard API in Spark SQL for importing data from remote databases; JdbcRDD allows parallel connections to multiple partitions of a single table. DuoRDD enhances this with support for sharded databases, acting in effect as a sharded JdbcRDD; using DuoRDD, Spark executors can load from multiple shard nodes in parallel.
In a specific embodiment, the present invention explores an architecture that separates data management from data computation. By splitting these two parts, the system gains more elasticity and adaptability. To realize the architecture, a prototype system, DuoSQL, was built on PostgreSQL and Spark and validated with the TPC-H benchmark. Experimental results show that the decoupled design has great performance potential.
By separating the data management and data computation functions to achieve independent scalability, the present invention explores an elastic data warehouse architecture for the cloud. The data warehouse front end (the data management unit) ingests data, manages storage, and provides high availability. The data warehouse back end (the data computation unit) serves analytical queries. By separating data management from data computation, the invention obtains elasticity within a single data warehouse — a property existing systems lack; by contrast, the Microsoft Azure cloud database only allows elastic processing across multiple data warehouses. This makes the data warehouse more adaptable to constantly changing workload demands. The architecture is implemented with an RDBMS and an in-memory cluster computing engine with SQL support; specifically, a prototype system named DuoSQL is built on PostgreSQL and Spark. In summary, the invention first proposes a cloud data warehouse architecture that decouples data management and data computation, achieving elasticity within a single data warehouse; second, it builds the DuoSQL prototype on PostgreSQL and Spark, whose experimental results show strong performance potential.
The present invention discloses a decoupled elastic data warehouse architecture. To make the purpose and technical solution of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
The elastic data warehouse architecture is based on elastic configuration, starting mainly from the elastic configuration of the storage and computation platforms, assisted by communication middleware. The elastic data warehouse is introduced in the following aspects:
1.1 Architecture:
Figure 1 shows the architecture of DuoSQL. The overall architecture combines a data management front end with a data computation back end. The architecture supports a variety of front-end and back-end subsystems, provided the chosen subsystems satisfy the design goals. To integrate the front end and the back end, an interoperation middleware is needed to manage network connections, user sessions, request brokering, query translation, and data transmission.
1.2 Internal design:
1.2.1 Front end
As shown on the left side of Figure 1, the front end is the data management component of the data warehouse. Its main responsibilities are as follows:
(1) Handling incoming and outgoing data. For data integration, the front end may choose to support various data sources, such as local and network file systems, relational databases, and non-relational databases. Data ingestion drivers may be required for specific types of data source.
(2) Providing control and query user interfaces. The user interface inherits SQL syntax and serves as a unified portal for system control and interactive queries.
(3) Managing the underlying storage. When the managed data is too large for local storage, sharding is unavoidable. For an OLAP database, sharding severely complicates query planning and execution; the complexity of query planning over sharded data is therefore shifted from the data management unit to the data computation unit.
1.2.2 Back end
The back end is where the actual analytical computation takes place. It is a computer cluster managed entirely by a software stack, to guarantee efficiency and flexibility. In this software stack, the data warehouse layers are assigned different functions, such as resource allocation, task scheduling, and query composition. Together, these layers serve the following two purposes:
(1) Scalable and elastic resource management. Resource allocation proceeds in two stages. The first stage is at installation time, when the total resources are initially allocated from the cloud. The second stage is after cluster setup, when users can pass parameters at session start, such as the number of workers, cores, and total memory. By allowing two-stage allocation, the system provides both coarse-grained and fine-grained resource elasticity.
(2) Query efficiency. The central task of the back end is to complete submitted queries as quickly as possible. For a single query, execution efficiency is determined jointly by the query optimizer and the execution framework; for many concurrently running queries, overall execution efficiency also involves the task scheduler. A great deal of work has been done to improve query efficiency in big-data analytics, and the prototype system here is built on that existing technology.
In short, the back end should ensure efficient processing of both single and concurrently running queries. With this goal, many modern distributed computing frameworks, such as Apache Spark SQL and Apache Calcite, can serve as the software foundation of the back end with the help of a capable resource manager such as Apache YARN.
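The two-stage resource allocation in (1) above can be modeled with a minimal sketch. This is an illustrative Python stand-in, not part of the DuoSQL implementation; the names (`ClusterResources`, `ResourceAllocator`, the specific capacities) are hypothetical. Stage one fixes a coarse-grained pool at install time; stage two grants fine-grained resources from that pool when a session starts.

```python
from dataclasses import dataclass

@dataclass
class ClusterResources:
    """Stage 1: coarse-grained pool allocated from the cloud at install time."""
    total_cores: int
    total_memory_gb: int

@dataclass
class SessionGrant:
    """Stage 2: fine-grained grant requested via session-start parameters."""
    workers: int
    cores: int
    memory_gb: int

class ResourceAllocator:
    """Hypothetical two-stage allocator: sessions draw from the install-time pool."""

    def __init__(self, cluster: ClusterResources):
        self.cluster = cluster
        self.used_cores = 0
        self.used_memory_gb = 0

    def start_session(self, workers: int, cores: int, memory_gb: int) -> SessionGrant:
        # Refuse grants that would exceed the remaining install-time pool.
        if (self.used_cores + cores > self.cluster.total_cores or
                self.used_memory_gb + memory_gb > self.cluster.total_memory_gb):
            raise RuntimeError("session request exceeds remaining cluster resources")
        self.used_cores += cores
        self.used_memory_gb += memory_gb
        return SessionGrant(workers, cores, memory_gb)

# Stage 1: install-time allocation from the cloud.
alloc = ResourceAllocator(ClusterResources(total_cores=64, total_memory_gb=256))
# Stage 2: per-session parameters (number of workers, cores, total memory).
grant = alloc.start_session(workers=4, cores=16, memory_gb=64)
```

A session that over-asks is rejected rather than queued in this sketch; a real resource manager such as YARN would instead schedule it against the cluster's capacity.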
1.2.3 Middleware
The middleware is the key component supporting a separated data warehouse. At the upper level, it provides the interfaces and semantic abstractions through which the front end and back end communicate with each other. At the lower level, it directs message exchange and data transmission between client and server. The design and implementation of the middleware should address the following issues:
(1) Data and interface abstraction. RDBMSs manage structured data with SQL, while most big-data platforms process unstructured data through imperative language interfaces. Communicating and interoperating across such heterogeneous systems first requires a common abstraction of the data model and the query interface.
(2) Large data sets. To manage storage in a scalable way and facilitate queries over related tables, data sets are usually sharded or partitioned on key columns and distributed across multiple servers. The implication for the middleware is that data partitioning (possibly with different partitioning schemes) must be handled cooperatively at every layer, from data abstraction and communication protocol down to the underlying data transmission mechanism.
(3) Data transmission. Large data sets challenge not only storage management but also data transmission over the network. During analytical computation, the back end needs to transfer data from the front-end cluster, and network I/O can easily become the bottleneck in this process.
The present invention is described in detail below through a specific implementation process.
2.1 Front end
Duo SQL uses PostgreSQL as the foundation of the front end. The advantage that makes PostgreSQL stand out is its excellent support for writing database extensions: in fact, most of Duo SQL's front-end logic is implemented in PostgreSQL extensions, and most user interface functions take the form of UDFs.
Sharding: as mentioned earlier, managed storage of large data sets requires sharding support. Although PostgreSQL is strictly a relational database and does not support sharding natively, several open-source sharding extensions can serve as references; specifically, the present solution is based on pg_shardman. All sharding logic is implemented in a PostgreSQL extension named xschema, inside which a set of shard controller functions and shard catalog tables are defined as follows.
FUNCTIONS
xschema.add_data_node
xschema.remove_data_node
xschema.partition_table
xschema.rebalance_partitions
CATALOG
xschema.data_nodes
xschema.data_tables
xschema.data_partitions
xschema.data_replicas
The shard controller runs on a designated master node. All user data is stored on the data nodes. Through sharding, system administrators can register and release data nodes for data placement, while users can define partitioning schemes for distributed fact tables.
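The division of labor above — the administrator registers data nodes, the user defines a partition scheme, and the catalog records where each partition lives — can be sketched as a toy model. This Python class only mirrors the roles of the xschema functions and catalog tables listed above; the hash routing and round-robin placement are illustrative assumptions, not the patent's actual logic.

```python
import hashlib

class ShardController:
    """Hypothetical model of the xschema shard controller and its catalogs."""

    def __init__(self):
        self.data_nodes = []        # mirrors xschema.data_nodes
        self.data_tables = {}       # mirrors xschema.data_tables: table -> (key_col, n_parts)
        self.data_partitions = {}   # mirrors xschema.data_partitions: (table, part) -> node

    def add_data_node(self, node: str):        # admin registers a node
        self.data_nodes.append(node)

    def remove_data_node(self, node: str):     # admin releases a node
        self.data_nodes.remove(node)

    def partition_table(self, table: str, key_column: str, n_partitions: int):
        """User-defined partition scheme for a distributed fact table."""
        self.data_tables[table] = (key_column, n_partitions)
        # Assign partitions to data nodes round-robin; rebalance_partitions
        # would redistribute these assignments as nodes come and go.
        for p in range(n_partitions):
            self.data_partitions[(table, p)] = self.data_nodes[p % len(self.data_nodes)]

    def route(self, table: str, key) -> str:
        """Hash the key column value to find the owning data node."""
        _, n = self.data_tables[table]
        part = int(hashlib.md5(str(key).encode()).hexdigest(), 16) % n
        return self.data_partitions[(table, part)]

ctl = ShardController()
for n in ("dn1", "dn2", "dn3"):
    ctl.add_data_node(n)
ctl.partition_table("lineitem", key_column="l_orderkey", n_partitions=6)
node = ctl.route("lineitem", 4711)  # the data node holding this key's partition
```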
Analysis service interface: an interface is also needed through which users launch analysis workloads on the back end. Ultimately, re-parsing and re-planning in PostgreSQL can make the service calls transparent to users. Currently, the Duo SQL analysis service is invoked through a set of UDFs, including functions that connect to a back end, run SQL queries, and call remote functions.
2.2 Middleware
The present invention uses Dex, shown in Figure 2, as the middleware coordinating message passing and data transmission between the front end and the back end. Dex was originally designed as an interoperability framework connecting heterogeneous data platforms such as PostgreSQL and Spark. In the prior art, Dex supports neither a sharded front end nor full use of Spark's SQL query engine; in Duo SQL it is adapted to support both.
The Dex middleware consists of three main components — the Dex server, the PostgreSQL adapter, and the Spark adapter — running over the Dex communication API. The PostgreSQL adapter translates database queries, communicates with the Dex server, and then translates the responses returned from the back-end cluster. The Dex server maintains query contexts, monitors session state transitions, and provides Dex services through the Dex communication API. The Spark adapter accepts and parses Dex requests, converts each into a series of Spark computation tasks, and sends the response back to the Dex server once it is ready. The Dex communication API provides an intermediate layer that lets the end systems communicate with clean isolation and abstraction.
Dex Context: Dex interoperation is a stateful service managed around Dex Contexts in the Dex server. The internal workings are driven by the messages exchanged between PostgreSQL and Spark. Dex Contexts support both single-back-end and multi-back-end settings. In the single-back-end setting, the Dex Context proxies the communication between a PostgreSQL back-end process and Spark through the Dex Context API. To start a new session, a client application must first create or reuse a Dex Context instance by submitting a connection request to the Dex server; once the Dex Context is set up, the client application can start invoking services through the Dex Context API. Dex Contexts also support multiple back ends. Compared with the single-back-end case, when multiple back ends are connected within a single session, the Dex server must be able to maintain all of their states. This introduces the Dex context manager, which ensures that each session assigns one Dex Context to one back end.
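The create-or-reuse behavior and the one-context-per-back-end invariant of the Dex context manager can be captured in a few lines. This is a hedged sketch: `DexContextManager` and its keying scheme are illustrative guesses at the bookkeeping the text describes, not the real Dex implementation.

```python
import itertools

class DexContext:
    """One interoperation context: proxies one session's traffic to one back end."""
    _ids = itertools.count(1)

    def __init__(self, backend: str):
        self.context_id = next(DexContext._ids)
        self.backend = backend

class DexContextManager:
    """Hypothetical context manager: each (session, backend) pair maps to
    exactly one DexContext, reused when the same session reconnects."""

    def __init__(self):
        self._contexts = {}  # (session_id, backend) -> DexContext

    def connect(self, session_id: str, backend: str) -> DexContext:
        key = (session_id, backend)
        if key not in self._contexts:      # create a new context ...
            self._contexts[key] = DexContext(backend)
        return self._contexts[key]         # ... or reuse the existing one

mgr = DexContextManager()
c1 = mgr.connect("sess-1", "spark-cluster-A")
c2 = mgr.connect("sess-1", "spark-cluster-A")   # same pair: context reused
c3 = mgr.connect("sess-1", "spark-cluster-B")   # second back end: new context
```

Keying on the `(session, backend)` pair is what lets one session hold several back ends while the server still maintains each connection's state independently.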
PostgreSQL Adapter: the PostgreSQL adapter is implemented as a PostgreSQL extension. It serves as a client library providing the Dex communication API interface, and also includes internal functions for translating database queries into Dex requests and converting results back into PostgreSQL data records.
Spark Adapter: the Spark adapter is a module in Spark that parses Dex requests into corresponding Spark functions, launches the execution of tasks, and returns the final result.
2.3 Back end
Duo SQL's back end uses Spark SQL as the underlying query engine. Here, a back end always refers to a specific server cluster, whether virtually or physically allocated. In a back-end cluster, there is exactly one corresponding Spark installation, managed by YARN. However, multiple sessions connected to the same back end are maintained independently: in Spark terms, separate Spark sessions are served by different Spark jobs.
Elasticity: back-end elasticity is provided at multiple levels. First, the data warehouse can have multiple Spark clusters as back ends. Second, Spark clusters can have different sizes and configurations. Third, within a Spark cluster, any session can also specify its own resource requirements, such as total executor memory and number of cores.
Back-end management: Duo SQL records all back-end information in a shared catalog, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts. Since Duo SQL is a decoupled system, back ends can be added, removed, or selected at the request of privileged users.
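The shared back-end catalog described above — privileged users add, remove, or select back ends; Dex Contexts only read it — can be sketched as follows. The class, the `"admin"` privilege check, and the config fields are all hypothetical stand-ins; the real catalog lives on the PostgreSQL master and would consult database roles.

```python
class BackendCatalog:
    """Hypothetical shared back-end catalog referenced by all Dex Contexts."""

    def __init__(self):
        self._backends = {}  # backend name -> configuration dict

    def add(self, user: str, name: str, config: dict):
        self._require_privilege(user)
        self._backends[name] = config

    def remove(self, user: str, name: str):
        self._require_privilege(user)
        del self._backends[name]

    def select(self, name: str) -> dict:
        """Any context may look a back end up; no privilege needed to read."""
        return self._backends[name]

    @staticmethod
    def _require_privilege(user: str):
        # Stand-in check; a real system would consult PostgreSQL roles.
        if user != "admin":
            raise PermissionError("only privileged users may modify backends")

catalog = BackendCatalog()
catalog.add("admin", "spark-A", {"executors": 8, "executor_mem_gb": 16})
cfg = catalog.select("spark-A")
```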
Session management: the back-end session manager shares some common information with the master catalog, which maintains a superset of all active sessions. Each session on the back end stores only session-specific metadata. This metadata includes key facts about the front end, the back end, and the network connections — for example, shard nodes, database names, sharded tables, partitions, functions available on Spark, database connection strings, and ZeroMQ handlers. The metadata is used for request processing: once a front-end request is received and parsed, a request data structure of type ReqStruct is generated, at which point the session metadata is looked up to identify the inputs and functions appearing in the request. Communication between the front end and the back end is implemented with ZeroMQ, a high-speed asynchronous network I/O library.
Request processing: the Spark adapter is responsible for handling requests from the Dex middleware. When a front-end request is received, it may be translated into a series of Spark commands, possibly including the unpacked SQL query or Spark functions, together with prologue and epilogue commands that define auxiliary RDDs. Duo SQL supports two types of analysis request: SQL queries and UDF calls. Each Dex request encodes its type in its header, and the request processor uses this information to decide how to translate the request content.
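One plausible shape for "each Dex request encodes its type in its header" is a type tag followed by a length-prefixed payload, from which the adapter builds a ReqStruct-like structure and dispatches on the type. The wire format below (one type byte, a 4-byte length, then UTF-8 text) is an illustrative assumption, not the actual Dex protocol.

```python
import struct

REQ_SQL, REQ_UDF = 1, 2  # the two analysis request types Duo SQL supports

def encode_request(req_type: int, body: str) -> bytes:
    """Hypothetical wire format: 1-byte type tag + 4-byte big-endian length
    + UTF-8 payload, so the type is readable from the header alone."""
    payload = body.encode("utf-8")
    return struct.pack("!BI", req_type, len(payload)) + payload

def parse_request(msg: bytes) -> dict:
    """Build a ReqStruct-like dict; the adapter dispatches on req['type']."""
    req_type, length = struct.unpack("!BI", msg[:5])
    body = msg[5:5 + length].decode("utf-8")
    return {"type": "sql" if req_type == REQ_SQL else "udf", "body": body}

msg = encode_request(REQ_SQL, "SELECT count(*) FROM lineitem")
req = parse_request(msg)  # {'type': 'sql', 'body': 'SELECT count(*) FROM lineitem'}
```

With the type resolved, a SQL request would be handed to Spark SQL directly, while a UDF request would be mapped to the corresponding Spark function.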
Parallel data transmission: a decoupled system like Duo SQL almost always relies on bulk data transmission over the network, and for analyzing large data sets, parallel transmission can significantly reduce the total execution time. In Spark SQL, JdbcRDD is the standard API for importing data from remote databases. One feature of JdbcRDD is that it allows parallel connections to multiple partitions of a single table, enabling parallel data transmission; however, this parallel connection feature does not work for sharded databases. To overcome this challenge, as shown in Figure 3, a new Spark API named DuoRDD was developed. It extends JdbcRDD with support for sharded databases — in effect, a sharded JdbcRDD. Using DuoRDD, Spark executors can load from multiple shard nodes in parallel.
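The gain DuoRDD targets — fetching partitions from several shard nodes concurrently instead of one node at a time — can be illustrated with a thread-pool sketch. This is not DuoRDD (which is a Scala RDD inside Spark); `fetch_partition` is a stand-in for the per-partition JDBC read each executor would perform.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_partition(node: str, table: str, part: int) -> list:
    """Stand-in for a JDBC fetch of one partition from one shard node;
    a real DuoRDD partition would open a database connection here."""
    return [f"{table}:{node}:p{part}:row{i}" for i in range(3)]

def parallel_load(assignments: list) -> list:
    """Load all (node, table, partition) assignments concurrently, mimicking
    how DuoRDD lets Spark executors pull from multiple shard nodes at once."""
    with ThreadPoolExecutor(max_workers=len(assignments)) as pool:
        chunks = pool.map(lambda a: fetch_partition(*a), assignments)
    rows = []
    for chunk in chunks:
        rows.extend(chunk)
    return rows

# Three partitions of one table, each living on a different shard node.
rows = parallel_load([("dn1", "orders", 0), ("dn2", "orders", 1), ("dn3", "orders", 2)])
```

Because each fetch is network-bound, overlapping them this way is what turns total transfer time from the sum of the per-node times into roughly the maximum of them.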
Figure 4 illustrates how a user request is executed in the system. Step 1: the user submits SQL containing a PostgreSQL request. Step 2: the interface invokes the Context API to execute the corresponding request; if the user submits further requests while one is executing, Duo SQL creates another context for them. Step 3: the middleware adapter analyzes the user's request and generates different ReqStructs according to the different functions. Step 4: the Spark adapter analyzes the request sent by the middleware adapter, executes it, and returns the result; this also covers the data Spark requests from the PostgreSQL cluster. On the Spark adapter side, Duo SQL launches a long-lived Spark job on the YARN-managed cluster, and all ReqStructs are processed within this one Spark job to reduce Spark job startup time. Finally, the execution result is returned level by level until it reaches the user. The interaction between the system's components is reflected in this process: by the design of the middleware, there is no direct interaction between PostgreSQL and Spark.
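The four-step flow just described, and the fact that PostgreSQL and Spark never call each other directly, can be shown as a toy call chain. All three functions are hypothetical stand-ins for the real components; the point is the shape: every hop passes through the middleware layer.

```python
def frontend_submit(sql: str) -> str:
    """Steps 1-2: the PostgreSQL UDF layer hands the query to a Dex Context;
    the front end never talks to Spark directly."""
    return middleware_forward({"type": "sql", "body": sql})

def middleware_forward(reqstruct: dict) -> str:
    """Step 3: the middleware builds a ReqStruct and routes it to the
    Spark adapter — the only path between the two end systems."""
    return spark_execute(reqstruct)

def spark_execute(reqstruct: dict) -> str:
    """Step 4: the adapter runs the request in the long-lived Spark job and
    returns the result, which then bubbles back level by level to the user."""
    return f"result-of({reqstruct['body']})"

answer = frontend_submit("SELECT 1")
```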
The innovative technical points and beneficial effects of the present invention are at least the following:
1. The combined use of decoupling, elastic resource configuration, and back-end diversity. In the system, the front end and back end communicate through message middleware, while resources on both sides are configured elastically: on the front end through the distributed database management system, and on the back end through the YARN resource manager. Choosing different back-end systems makes different back-end features available; at present, Spark's features can be used.
2. The present invention proposes an elastic data warehouse architecture by decoupling data management and computation. Using the Dex interoperation middleware, a prototype system, Duo SQL, is built with a sharded PostgreSQL database as the front end and a Spark cluster as the back end. The performance potential of Duo SQL is evaluated by comparing it with standalone PostgreSQL, with and without parallel query support, in test experiments with different workloads and input types. The results show that Duo SQL not only has a clear performance advantage but also excellent robustness.
The decoupled elastic data warehouse architecture proposed by the present invention is distinguished by three important properties: decoupling, elasticity, and diversity. Decoupling: the design separates computation and storage completely. Elasticity — the ability to scale system components independently and adaptively — is the invention's foremost consideration and the main property a cloud data warehouse should support. In addition, back-end diversity can be exploited to diversify the data warehouse: for example, with the Spark back end used here, the experiments benefit from Spark's in-memory iterative computation, and users may choose among different back ends.
The effectiveness of the system architecture is verified with the OLAP analysis benchmark TPC-H and with machine learning algorithms. In the TPC-H experiments, the invention is compared with an architecture without storage-compute separation, covering scale factors 30, 50, and 75. For machine learning, a clustering-algorithm comparison is made against PostgreSQL under the Apache MADlib framework, using the UCI skin_noskin and KEGG data sets. The experimental results are essentially better than those of the existing architecture.
The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the superiority of any embodiment.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The system embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, units, or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment's solution.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention — in essence, the part contributing to the prior art, or all or part of the solution — may be embodied in the form of a software product stored in a storage medium and including several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage media include media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art can make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as within the protection scope of the present invention.

Claims (10)

  1. A decoupled elastic data warehouse architecture, characterized in that it comprises:
    a data warehouse front end that uses PostgreSQL as its foundation to handle incoming and outgoing data, provide control and query user interfaces, and manage the underlying storage, wherein, for managing the underlying storage, all sharding logic is implemented in a PostgreSQL extension named xschema;
    a data warehouse back end for scalable and elastic resource management and for single or concurrent queries, wherein resource allocation proceeds in two stages: in the first stage, at installation time, the total resources are initially allocated from the cloud; in the second stage, after the cluster is set up, the user passes parameters at session start; and Spark SQL is used as the underlying query engine; and
    decoupled data warehouse middleware that uses Dex to coordinate message passing and data transfer between the data warehouse front end and the data warehouse back end.
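The decoupling recited in claim 1 can be illustrated with a minimal sketch in which the middleware is the only channel between the front end and the back end. All class and method names below (`FrontEnd`, `DexMiddleware`, `BackEnd`, `forward`) are hypothetical illustrations, not identifiers from the patent:

```python
# Hypothetical sketch of the decoupled three-tier flow: the front end never
# talks to the back end directly; every query passes through the middleware.

class BackEnd:
    """Stands in for the Spark SQL cluster: executes a request, returns rows."""
    def execute(self, request):
        # A real back end would run a Spark job; here we echo a canned result.
        return {"query": request["sql"], "rows": [(1, "ok")]}

class DexMiddleware:
    """Stands in for Dex: coordinates message passing between the tiers."""
    def __init__(self, backend):
        self.backend = backend
    def forward(self, request):
        # Coordination (routing, session bookkeeping, ...) would live here.
        return self.backend.execute(request)

class FrontEnd:
    """Stands in for the PostgreSQL front end: accepts SQL, delegates execution."""
    def __init__(self, middleware):
        self.middleware = middleware
    def query(self, sql):
        return self.middleware.forward({"sql": sql})

frontend = FrontEnd(DexMiddleware(BackEnd()))
result = frontend.query("SELECT 1")
```

Because the front end holds only a reference to the middleware, either tier can be replaced or scaled independently, which is the point of the decoupling.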
  2. The decoupled elastic data warehouse architecture according to claim 1, characterized in that the Dex middleware comprises a Dex server, a PostgreSQL adapter, and a Spark adapter, and operates through the Dex communication API, which provides an intermediate layer, wherein:
    the PostgreSQL adapter translates database queries, communicates with the Dex server, and then translates the responses returned from the back-end cluster;
    the Dex server maintains query contexts, monitors session state transitions, and provides Dex services through the Dex communication API; and
    the Spark adapter accepts and parses Dex requests, converts them into Spark computing tasks, and sends each response back to the Dex server once it is ready.
  3. The decoupled elastic data warehouse architecture according to claim 2, characterized in that Dex interoperation is a stateful service managed around the Dex Context in the Dex server, whose internal work is driven by messages exchanged between PostgreSQL and Spark; the Dex Context supports both single-back-end and multi-back-end settings; in the single-back-end setting, the Dex Context proxies the communication between the PostgreSQL back-end process and Spark through the Dex Context API; when a new session is started, the client application first creates or reuses a Dex Context instance by submitting a connection request to the Dex server, and once the Dex Context is set up, the client application begins invoking services through the Dex Context API; the Dex Context also supports multiple back ends: when multiple back ends are connected within a single session, the Dex server refers to a Dex context manager, in which each session assigns one Dex Context to one back end;
    the PostgreSQL adapter is implemented as a PostgreSQL extension, provides a client library for the Dex communication API, and also includes internal functions for converting database queries into Dex requests and converting the results back into PostgreSQL data records; and
    the Spark adapter parses Dex requests into the corresponding Spark functions, starts executing the tasks, and returns the final results.
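The session handling described in claim 3 — create or reuse a Dex Context when a connection request arrives and, with multiple back ends, pin each session's context to one back end — can be sketched roughly as follows. The `DexContextManager` API below is a hypothetical illustration, not the patent's actual interface:

```python
# Hypothetical sketch of Dex Context management: one context per session,
# created on first connection and reused afterwards; in the multi-back-end
# case each context is bound to exactly one back end.

class DexContext:
    def __init__(self, session_id, backend):
        self.session_id = session_id
        self.backend = backend  # the single back end this context proxies to

class DexContextManager:
    def __init__(self, backends):
        self.backends = backends   # catalog of available back ends
        self.contexts = {}         # session_id -> DexContext

    def connect(self, session_id, backend_name):
        # Create-or-reuse: a repeated connection returns the same context
        # rather than building a new one.
        if session_id not in self.contexts:
            ctx = DexContext(session_id, self.backends[backend_name])
            self.contexts[session_id] = ctx
        return self.contexts[session_id]

manager = DexContextManager({"spark-a": "cluster-A", "spark-b": "cluster-B"})
ctx1 = manager.connect("sess-1", "spark-a")
ctx2 = manager.connect("sess-1", "spark-a")  # reused, not recreated
ctx3 = manager.connect("sess-2", "spark-b")  # different session, other back end
```

The create-or-reuse step keeps the service stateful across calls within a session, which matches the claim's description of the Dex server maintaining context rather than treating each request independently.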
  4. The decoupled elastic data warehouse architecture according to claim 1, characterized in that, in handling incoming and outgoing data, the data warehouse front end supports a variety of data sources for data integration, including local and network file systems, relational databases, and non-relational databases, and handles each specific type of data source through a data ingestion driver;
    the data warehouse front end provides a control and query user interface, which inherits SQL syntax and serves as a unified portal for system control and interactive queries;
    the data warehouse front end manages the underlying storage through a shard controller running on a designated master node; all user data is stored on data nodes; the system administrator registers and releases data nodes for sharding, and users define partitioning schemes for distributed fact tables; and
    the data warehouse front end manages the underlying storage through an analysis service interface that lets users launch analysis workloads on the back end, with parsing and planning redone in PostgreSQL.
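Claim 4's per-source-type ingestion drivers suggest a simple dispatch scheme: each supported source kind (local file, relational database, non-relational database, ...) registers a driver, and the front end routes each ingest call to the matching one. A hedged sketch with invented names follows; the registry and driver functions are illustrations, not the patent's implementation:

```python
# Hypothetical sketch of data-ingestion drivers keyed by source type.

INGESTION_DRIVERS = {}

def driver(source_type):
    """Register a function as the ingestion driver for one source type."""
    def register(fn):
        INGESTION_DRIVERS[source_type] = fn
        return fn
    return register

@driver("local_file")
def ingest_local_file(location):
    # A real driver would read and stage the file's contents.
    return f"loaded file {location}"

@driver("relational")
def ingest_relational(location):
    # A real driver would pull rows over a database connection.
    return f"loaded table {location}"

def ingest(source_type, location):
    # The front end dispatches to the driver for this kind of source.
    if source_type not in INGESTION_DRIVERS:
        raise ValueError(f"no ingestion driver for {source_type!r}")
    return INGESTION_DRIVERS[source_type](location)

out = ingest("relational", "sales.fact_orders")
```

Adding support for a new source type then means registering one more driver, without touching the dispatch logic.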
  5. The decoupled elastic data warehouse architecture according to claim 1, characterized in that the data warehouse back end is a computer cluster managed by a software stack, and the data warehouse layers are assigned distinct functions, including resource allocation, task scheduling, and query composition.
  6. The decoupled elastic data warehouse architecture according to claim 1, characterized in that, in the data warehouse back end, the execution efficiency of a single query is determined jointly by the query optimizer and the execution framework, while for multiple concurrently running queries the overall execution efficiency also involves the task scheduler.
  7. The decoupled elastic data warehouse architecture according to claim 3, characterized in that the data warehouse back end uses Spark SQL as the underlying query engine; a data warehouse back end is a specific server cluster with exactly one corresponding Spark installation managed by YARN; multiple sessions connected to the same back end are maintained independently; and separate Spark sessions are served by different Spark jobs.
  8. The decoupled elastic data warehouse architecture according to claim 7, characterized in that the data warehouse has multiple Spark clusters as back ends; the Spark clusters have different sizes and configurations; within a Spark cluster, any session can specify its own resource requirements;
    in the resource management of the data warehouse back end, all back-end information is recorded in a shared catalog that is stored on the PostgreSQL master server and can be referenced by all Dex Contexts; data warehouse back ends can be added, removed, or selected at the request of privileged users; and
    the data warehouse back end includes a back-end session manager, which shares common information with the master catalog that maintains a superset of all active sessions; each session on the back-end session manager stores only session-specific metadata, which includes essential facts about the data warehouse front end, the data warehouse back end, and the network connection; the metadata is used for request processing: once a front-end request is received and parsed, a request data structure of type ReqStruct is generated, at which point the session metadata is looked up to identify the inputs and functions appearing in the request; communication between the data warehouse front end and the data warehouse back end is implemented using ZeroMQ, a high-speed asynchronous network I/O library.
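Claim 8's request processing — parse a front-end request into a `ReqStruct`, then consult session-specific metadata to resolve the inputs it names — can be approximated as below. The wire format and field names are assumptions made for illustration; only the `ReqStruct` name comes from the claim:

```python
# Hypothetical sketch of request processing: a parsed front-end request
# becomes a ReqStruct, and session metadata resolves the names it refers to.
from dataclasses import dataclass

@dataclass
class ReqStruct:
    req_type: str   # type encoded in the request header (cf. claim 9)
    body: str       # the query or function-call payload
    inputs: list    # inputs resolved via session metadata

# Session-specific metadata: facts the session manager keeps per session.
SESSION_METADATA = {
    "sess-1": {"tables": {"orders": "shard://node1/orders"}},
}

def parse_request(session_id, raw):
    # Assumed wire format "TYPE|body|input1,input2": header first, then payload.
    req_type, body, inputs = raw.split("|")
    meta = SESSION_METADATA[session_id]   # session metadata lookup
    resolved = [meta["tables"][name] for name in inputs.split(",")]
    return ReqStruct(req_type, body, resolved)

req = parse_request("sess-1", "SQL|SELECT count(*) FROM orders|orders")
```

Keeping only session-specific metadata on the back end, as the claim describes, means this lookup never has to consult the full front-end catalog.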
  9. The decoupled elastic data warehouse architecture according to claim 8, characterized in that the Spark adapter is responsible for handling requests from the Dex middleware; when a front-end request is received, it is converted into Spark commands, including the offloaded SQL query or Spark function as well as prologue and epilogue commands that define auxiliary RDDs; Duo SQL supports two types of analysis requests: SQL queries and UDF calls; and each Dex request encodes its type in its header.
  10. The decoupled elastic data warehouse architecture according to claim 9, characterized in that, in Spark SQL, JdbcRDD is the standard API for importing data from a remote database; JdbcRDD allows parallel connections to multiple partitions of a single table; support for sharded databases is enhanced with a sharded JdbcRDD, DuoRDD; and, using DuoRDD, Spark executors load from multiple shard nodes in parallel.
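The DuoRDD behavior in claim 10 — executors loading from multiple shard nodes in parallel and combining the partitions — can be mimicked outside Spark with a thread pool, one worker per shard. This is an illustrative analogy only; `load_shard`, `load_all`, and the shard contents are invented:

```python
# Hypothetical analogy to DuoRDD: fetch every shard concurrently and
# concatenate the partitions, as parallel executors would.
from concurrent.futures import ThreadPoolExecutor

SHARDS = {
    "node1": [(1, "a"), (2, "b")],
    "node2": [(3, "c")],
    "node3": [(4, "d"), (5, "e")],
}

def load_shard(node):
    # A real loader would open a JDBC connection to this shard node.
    return SHARDS[node]

def load_all(nodes):
    # One task per shard node, run in parallel; map() preserves node order,
    # so the concatenated result is deterministic.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partitions = list(pool.map(load_shard, nodes))
    return [row for part in partitions for row in part]

rows = load_all(["node1", "node2", "node3"])
```

In the actual architecture each partition would stay distributed as an RDD partition rather than being collected; the point of the sketch is only the one-connection-per-shard parallelism.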
PCT/CN2019/130535 2019-04-30 2019-12-31 Decoupling elastic data warehouse architecture WO2020220717A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910362554.9A CN110162515A (en) 2019-04-30 2019-04-30 A kind of uncoupled elastic data warehouse schema
CN201910362554.9 2019-04-30

Publications (1)

Publication Number Publication Date
WO2020220717A1 true WO2020220717A1 (en) 2020-11-05

Family

ID=67633159

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130535 WO2020220717A1 (en) 2019-04-30 2019-12-31 Decoupling elastic data warehouse architecture

Country Status (2)

Country Link
CN (1) CN110162515A (en)
WO (1) WO2020220717A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162515A (en) * 2019-04-30 2019-08-23 中国科学院深圳先进技术研究院 A kind of uncoupled elastic data warehouse schema
CN111414381B (en) * 2020-03-04 2021-09-14 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN111639062B (en) * 2020-05-29 2023-07-28 京东方科技集团股份有限公司 Method, system and storage medium for one-key construction of data warehouse
CN111966727A (en) * 2020-08-12 2020-11-20 北京海致网聚信息技术有限公司 Spark and Hive based distributed OLAP (on-line analytical processing) ad hoc query method
CN114490842B (en) * 2021-12-28 2022-11-11 航天科工智慧产业发展有限公司 Interface data query method and data query engine for multi-source data
CN116401254A (en) * 2023-04-17 2023-07-07 广东数果科技有限公司 Unified storage method and device for index result data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7590623B2 (en) * 2005-01-06 2009-09-15 International Business Machines Corporation Automated management of software images for efficient resource node building within a grid environment
CN106339760A (en) * 2016-08-31 2017-01-18 湖北既济电力集团有限公司科技信息分公司 Communication cable maintenance management information system
CN106685737A (en) * 2017-02-17 2017-05-17 国网山东省电力公司信息通信公司 IMS fault analysis operation and maintenance system and method based on IP telephones and servers
CN110162515A (en) * 2019-04-30 2019-08-23 中国科学院深圳先进技术研究院 A kind of uncoupled elastic data warehouse schema

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100538650C (en) * 2008-02-01 2009-09-09 清华大学 Based on the mutual exchange method of the grid middleware of assembly
CN101546325B (en) * 2008-12-23 2012-04-18 重庆邮电大学 Grid heterogeneous data integrating method based on SOA
US10817528B2 (en) * 2015-12-15 2020-10-27 Futurewei Technologies, Inc. System and method for data warehouse engine
CN105608758B (en) * 2015-12-17 2018-03-27 山东鲁能软件技术有限公司 A kind of big data analysis platform device and method calculated based on algorithm configuration and distributed stream
US10565199B2 (en) * 2017-04-10 2020-02-18 Sap Se Massively parallel processing database middleware connector


Also Published As

Publication number Publication date
CN110162515A (en) 2019-08-23


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19926843

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19926843

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14.06.2022)
