WO2020220717A1 - Decoupling elastic data warehouse architecture - Google Patents
Decoupling elastic data warehouse architecture
- Publication number
- WO2020220717A1 (PCT/CN2019/130535)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- data warehouse
- dex
- spark
- postgresql
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/211—Schema design and management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
Definitions
- the invention relates to the field of data warehouses, in particular to a decoupled elastic data warehouse architecture.
- the data warehouse has existed for decades, and its mainstream architecture has evolved from symmetric multiprocessing (SMP) to massively parallel processing (MPP).
- the current MPP data warehouse is statically installed on a few non-shared computer nodes. This architecture cannot take advantage of the cloud's versatile and powerful facilities for planning and resource allocation, which prevents users and cloud providers from achieving their expected performance, quality-of-service, and budget-control goals.
- elasticity, that is, the ability to independently and adaptively scale system components, is the main property that cloud data warehouses should support.
- due to the limitations of current data warehouse software designs, building an elastic data warehouse is not a simple task. Such designs assume a symmetric model in which each node is attached to local storage and all nodes are homogeneous. In an MPP setting, this tightly coupled model has a positive impact on performance. But in a cloud configuration, the software design becomes an obstacle to performance and cost-effectiveness. To obtain the required elasticity, the software must support a degree of separation between computing and storage, so that more computing resources can be added when the workload requires them.
- Azure SQL data warehouse is a large-scale data warehouse service available on Microsoft Azure cloud.
- Azure SQL is based on Microsoft SQL server and is suitable for supporting relational and non-relational data.
- Azure SQL stores and accesses all its data through Azure Blob storage service. Due to the physical separation, not only storage and computing are independent, but computing can also be suspended so that users only pay for storage.
- the embodiments of the present invention provide a decoupled elastic data warehouse architecture to at least solve the technical problem that existing data warehouses cannot separate data management from data computation.
- a decoupled elastic data warehouse architecture including:
- the data warehouse front end uses PostgreSQL as its basis to process incoming and outgoing data, provide control and query user interfaces, and manage the underlying storage; in managing the underlying storage, all sharding logic is implemented in a PostgreSQL extension called xschema;
- the data warehouse backend is used for scalable and elastic resource management and for single or concurrent queries; resource allocation is carried out in two stages: in the first stage, at installation time, the total resources are initially allocated from the cloud; in the second stage, after the cluster is set up, users can pass resource parameters when a session starts; Spark SQL is used as the underlying query engine;
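The two-stage resource allocation described above can be sketched as follows. This is an illustrative model only, not the patent's actual implementation; the class and method names (`BackendResources`, `start_session`) are assumptions.

```python
class BackendResources:
    """Toy model of two-stage resource allocation for the backend cluster."""

    def __init__(self, total_cores, total_memory_gb):
        # Stage 1: total resources are allocated from the cloud at install time.
        self.total_cores = total_cores
        self.total_memory_gb = total_memory_gb
        self.used_cores = 0
        self.used_memory_gb = 0

    def start_session(self, cores, memory_gb):
        # Stage 2: each session passes its resource parameters at start time,
        # drawing from the totals fixed in stage 1.
        if (self.used_cores + cores > self.total_cores
                or self.used_memory_gb + memory_gb > self.total_memory_gb):
            raise RuntimeError("insufficient backend resources for this session")
        self.used_cores += cores
        self.used_memory_gb += memory_gb
        return {"cores": cores, "memory_gb": memory_gb}


cluster = BackendResources(total_cores=64, total_memory_gb=256)
session = cluster.start_session(cores=8, memory_gb=32)
```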
- Dex middleware includes: Dex server, PostgreSQL adapter and Spark adapter, and runs through the Dex communication API.
- the Dex communication API provides an intermediate layer, including:
- the PostgreSQL adapter is used to convert database queries, communicate with Dex servers, and then convert the returned response from the backend cluster;
- the Dex server is used to maintain the query context, monitor the transition of the session state and provide Dex services through the Dex communication API;
- the Spark adapter is used to accept and parse Dex requests, convert Dex requests into Spark computing tasks, and send them back to the Dex server once the response is ready.
- Dex interoperability is a stateful service managed by Dex Context in Dex server.
- the internal work is driven by messages exchanged between PostgreSQL and Spark;
- Dex Context supports two settings of single backend and multiple backends.
- for the single-backend setting, the Dex Context proxies the communication between the PostgreSQL backend process and Spark through the Dex Context API; when starting a new session, the client application first creates or reuses a Dex Context instance by submitting a connection request to the Dex server.
- client applications can use Dex Context API to start calling services;
- Dex Context also supports multiple backends; when multiple backends are connected in a single session, the Dex server relies on the Dex context manager, which ensures that each session assigns a Dex Context to each backend;
- the PostgreSQL adapter is implemented as a PostgreSQL extension, providing a client library for the Dex communication API, and also includes internal functions for converting database queries into Dex requests and converting the results back into PostgreSQL data records;
- the Spark adapter parses the Dex request into the corresponding Spark function, starts to execute the task and returns the final result.
- in processing incoming and outgoing data, the data warehouse front end can optionally support various data sources, including local and network file systems, relational databases, and non-relational databases; specific types of data sources are handled through data ingestion drivers;
- the front end of the data warehouse provides a control and query user interface.
- the user interface inherits the SQL syntax and serves as a unified portal for system control and interactive query;
- in managing the underlying storage, the data warehouse front end runs a shard controller on a designated master node; all user data is stored on data nodes; system administrators register and release data nodes through sharding, and users define partitioning schemes for distributed fact tables;
- in managing the underlying storage, the data warehouse front end provides an analysis service interface through which users start analysis workloads on the backend; through re-analysis and re-planning in PostgreSQL, these service calls are made transparent to users.
- the back end of the data warehouse is a computer cluster managed by a software stack, and the data warehouse layer is designated for different functions, including resource allocation, task scheduling, and query combination.
- execution efficiency is jointly determined by the query optimizer and the execution framework; for many concurrently running queries, overall execution efficiency also involves the task scheduler.
- the backend of the data warehouse uses Spark SQL as the underlying query engine.
- the data warehouse backend is a specific server cluster; there is only one corresponding Spark installation, managed by YARN; multiple sessions connected to the same backend are maintained independently; separate Spark sessions are served by different Spark jobs.
- the data warehouse has multiple Spark clusters as backends; Spark clusters have different sizes and configurations; in a Spark cluster, any session can specify its own resource requirements;
- all backend information is recorded in a shared directory, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts; backends can be added, deleted, or selected at the request of privileged users;
- the back-end of the data warehouse includes a back-end session manager.
- the backend session manager shares some common information with the home directory, which maintains a superset of all active sessions; each session on the backend session manager stores only session-specific metadata; this metadata includes important facts about the data warehouse front end, the data warehouse backend, and the network connections; the metadata is used for request processing.
- once a front-end request is received and parsed, a request data structure of type ReqStruct is generated.
- the session metadata must be searched to identify the inputs and functions appearing in the request; communication between the data warehouse front end and backend is implemented using ZeroMQ, a high-speed asynchronous network I/O library.
- the Spark adapter is responsible for processing requests from the Dex middleware; when a front-end request is received, it is converted into Spark commands, including the issued SQL query or Spark function, as well as prologue and epilogue commands that define auxiliary RDDs;
- DuoSQL supports two types of analysis requests: SQL queries and UDF calls; each Dex request encodes its type in its header.
- JdbcRDD is the standard API for importing data from remote databases; JdbcRDD allows parallel connections to multiple partitions of a single table; DuoRDD extends this with support for sharded databases and is effectively a sharded JdbcRDD; using DuoRDD, Spark executors can load data from multiple shard nodes in parallel.
- the decoupled elastic data warehouse architecture in the embodiment of the present invention explores the cloud elastic data warehouse architecture by separating data management and data computing functions to achieve independent scalability.
- the front end of the data warehouse receives data, manages storage, and provides high availability.
- the data warehouse backend is used for data analysis queries. By separating data management and data calculation, the present invention can obtain flexibility in a single data warehouse.
- Figure 1 is a schematic diagram of the architecture of DuoSQL in the present invention.
- FIG. 2 is a schematic diagram of the structure of Dex as middleware for data coordination in the present invention.
- FIG. 3 is a process diagram in which the new Spark API of the present invention extends JdbcRDD;
- FIG. 4 is a flowchart of the execution of the Duo SQL system in the present invention.
- the present invention explores an architecture that separates data management and data calculation. By separating these two parts, the system gains more flexibility and adaptability.
- the present invention constructs a prototype system DuoSQL based on PostgreSQL and Spark.
- the present invention uses the TPC-H benchmark to verify the system. Experimental results show that the decoupled architecture has great performance potential.
- the present invention separates data management and data calculation functions to achieve independent scalability, and explores the cloud elastic data warehouse system structure.
- the data warehouse front end (data management unit) receives data, manages storage, and provides high availability.
- the back end of the data warehouse (data computing unit) is used for data analysis queries.
- the present invention can obtain flexibility in a single data warehouse, which is a characteristic that the existing system does not have.
- the Microsoft Azure cloud database only allows elastic scaling across multiple data warehouses.
- the invention enables the data warehouse to have stronger adaptability in the face of constantly changing workload requirements.
- the invention uses an RDBMS and a memory cluster computing engine with SQL support to implement the system structure.
- the present invention builds a prototype system named DuoSQL based on PostgreSQL and Spark.
- the present invention proposes a cloud data warehouse architecture, which can decouple data management and data calculation, thereby achieving flexibility in a single data warehouse.
- the present invention constructs a prototype system named DuoSQL based on PostgreSQL and Spark, and the experimental results show good performance potential.
- the present invention discloses a decoupled elastic data warehouse architecture.
- the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here are only used to explain the present invention, but not to limit the present invention.
- the elastic data warehouse architecture is based on elastic configuration, primarily of the storage platform and the computing platform, assisted by communication middleware.
- the elastic data warehouse is introduced from the following aspects:
- Figure 1 shows the architecture of DuoSQL.
- the whole system structure combines the data management front-end and data computing back-end.
- the architecture supports various front-end and back-end subsystems.
- an internal operation middleware is needed to manage network connections, user sessions, request proxying, query translation, and data transmission.
- the front end is the data management component of the data warehouse, and its main responsibilities are as follows:
- the front end can choose to support various data sources, such as local and network file systems, relational databases, and non-relational databases.
- a data ingest driver may be required to handle specific types of data sources.
- the user interface inherits the SQL syntax and serves as a unified portal for system control and interactive query.
- the back end is where the actual analysis and calculations are performed. It is a computer cluster completely managed by the software stack to ensure efficiency and flexibility.
- the data warehouse layer is designated for different functions, such as resource allocation, task scheduling, and query combination. Together, these layers serve two purposes:
- the central task of the backend is to complete the submitted query as quickly as possible.
- the execution efficiency is determined by the query optimizer and execution framework.
- the overall execution efficiency requirements involve the task scheduler. Much work has been done to improve the query efficiency of big data analytics; the prototype system is built on this existing technology.
- the backend should ensure efficient processing of single and concurrently running queries.
- many modern distributed computing frameworks, such as Apache Spark SQL and Apache Calcite, can be used as the backend software foundation with the help of a qualified resource manager (such as Apache YARN).
- Middleware is a key component that supports the separation of data warehouses. At the upper level, it provides interfaces and semantic abstractions for front-end and back-end to communicate with each other. At a lower level, it directs the message exchange and data transmission between the client and the server.
- the design and implementation of middleware should solve the following problems:
- RDBMS uses SQL to manage structured data, while most big data platforms use imperative language interfaces to process unstructured data. To communicate and interoperate across heterogeneous systems, it is necessary to first develop a general abstraction of data models and query interfaces.
- DuoSQL uses PostgreSQL as the basis of the front end.
- the advantage that makes PostgreSQL stand out is its excellent support for writing database extensions.
- most of the front-end logic of DuoSQL is implemented in PostgreSQL extensions, and most user interface functions take the form of UDFs.
- the shard controller runs on the designated master node. All user data is stored on the data node. Through sharding, system administrators can register and release data nodes for data dissemination, and users can define partitioning schemes for distributed fact tables.
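The sharding responsibilities above (registering data nodes and mapping fact-table rows to them under a partitioning scheme) can be sketched as follows. This is a minimal illustration under assumed names (`ShardController`, `register_node`, `partition_for`), not the xschema extension's actual API; a simple hash partitioning scheme is used as an example.

```python
import hashlib


class ShardController:
    """Toy shard controller: tracks data nodes and assigns keys to shards."""

    def __init__(self):
        self.nodes = []  # data nodes registered by the administrator

    def register_node(self, node):
        self.nodes.append(node)

    def partition_for(self, key):
        # Example user-defined partitioning scheme: hash the distribution
        # key and map it to one of the registered data nodes.
        digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]


ctl = ShardController()
for node in ["node1", "node2", "node3"]:
    ctl.register_node(node)

assigned = ctl.partition_for(12345)  # same key always maps to the same node
```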
- the present invention also needs an interface for the user to start the analysis workload at the back end. Finally, by re-analyzing and planning in PostgreSQL, service calls can be made transparent to users.
- Duo SQL analysis service is invoked through a set of UDFs, including functions that connect to the backend, run SQL queries, and call remote functions.
- the present invention uses Dex as shown in Figure 2 as middleware to coordinate message transfer and data transmission between the front end and the back end.
- Dex was originally designed as an interoperability framework connecting heterogeneous data platforms (such as PostgreSQL and Spark).
- Dex does not support sharding front-end, nor does it fully utilize Spark's SQL query engine.
- the present invention adjusts it in Duo SQL to support both.
- Dex middleware is composed of three main components, Dex server, PostgreSQL adapter and Spark adapter, which run through the Dex communication API.
- the PostgreSQL adapter converts the database query, communicates with the Dex server, and then converts the returned response from the backend cluster.
- the Dex server maintains the query context, monitors session state transitions and provides Dex services through the Dex communication API.
- the Spark adapter accepts and parses Dex requests, converts the Dex request into a series of Spark computing tasks, and sends it back to the Dex server once the response is ready.
- the Dex communication API provides an intermediate layer that enables the end systems to communicate with clean isolation and abstraction.
- Dex interoperability is a stateful service managed around Dex Context in the Dex server. The internal work is driven by the messages exchanged between PostgreSQL and Spark.
- Dex Context supports two settings of single backend and multiple backends. For single-backend settings, Dex Context uses the Dex Context API to proxy the communication between the PostgreSQL back-end process and Spark. To start a new session, the client application must first create or reuse a Dex Context instance by submitting a connection request to the Dex server. Once the Dex Context is set, the client application can use the Dex Context API to start calling services. Dex Context also supports multiple backends.
- the Dex server needs to be able to maintain all states.
- this introduces the Dex context manager, which ensures that each session assigns a Dex Context to a backend.
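The Dex context manager's create-or-reuse behavior can be sketched as below: each (session, backend) pair gets exactly one Dex Context, created on first use and reused afterwards. All class and method names here are illustrative assumptions, not the Dex server's real interface.

```python
class DexContext:
    """Toy stand-in for a Dex Context bound to one session and one backend."""

    def __init__(self, session_id, backend):
        self.session_id = session_id
        self.backend = backend
        self.state = "connected"


class DexContextManager:
    """Ensures each session has one Dex Context per backend."""

    def __init__(self):
        self._contexts = {}

    def get_or_create(self, session_id, backend):
        key = (session_id, backend)
        if key not in self._contexts:  # create on first use, reuse afterwards
            self._contexts[key] = DexContext(session_id, backend)
        return self._contexts[key]


mgr = DexContextManager()
a = mgr.get_or_create("s1", "spark-cluster-1")
b = mgr.get_or_create("s1", "spark-cluster-1")  # reused, not recreated
c = mgr.get_or_create("s1", "spark-cluster-2")  # second backend, new context
```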
- the PostgreSQL adapter is implemented as a PostgreSQL extension. It serves as a client library providing the Dex communication API, and also includes internal functions for converting database queries into Dex requests and converting the results back into PostgreSQL data records.
- Spark Adapter is a module in Spark that parses Dex requests into corresponding Spark functions, starts executing tasks and returns the final result.
- the back end of Duo SQL uses Spark SQL as the underlying query engine.
- the present invention always refers to the backend as a specific server cluster, whether it is virtually or physically distributed.
- for a backend cluster, there is only one corresponding Spark installation, managed by YARN. However, multiple sessions connected to the same backend are maintained independently. In Spark terms, separate Spark sessions are served by different Spark jobs.
- Back-end flexibility is provided in multiple levels.
- the data warehouse can have multiple Spark clusters as backends.
- Spark clusters can have different sizes and configurations.
- DuoSQL records all backend information in a shared directory, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts. Since DuoSQL is a decoupled system, backends can be added, deleted, or selected at the request of privileged users.
- Session management: the backend session manager shares some common information with the home directory, which maintains a superset of all active sessions. Each session on the backend stores only session-specific metadata. This metadata includes important facts about the front end, the backend, and the network connections, such as shard nodes, database names, shard tables, partitions, functions available on Spark, database connection strings, and ZeroMQ handlers. The metadata is used for request processing. Once a front-end request is received and parsed, a request data structure of type ReqStruct is generated. At this point, the session metadata is looked up to identify the inputs and functions appearing in the request. Communication between the front end and the back end is implemented using ZeroMQ, a high-speed asynchronous network I/O library.
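The patent uses ZeroMQ for front-end/back-end messaging; the stdlib-only sketch below shows just the framing idea: a fixed-size length prefix followed by a JSON payload carrying session metadata. The field names (`session`, `type`, `body`) are illustrative assumptions, not DuoSQL's actual wire format.

```python
import json
import struct


def encode_request(session_id, req_type, body):
    """Frame a request as a 4-byte big-endian length prefix + JSON payload."""
    payload = json.dumps(
        {"session": session_id, "type": req_type, "body": body}
    ).encode("utf-8")
    return struct.pack("!I", len(payload)) + payload


def decode_request(frame):
    """Inverse of encode_request: read the length prefix, parse the JSON."""
    (length,) = struct.unpack("!I", frame[:4])
    return json.loads(frame[4:4 + length].decode("utf-8"))


frame = encode_request("s1", "sql", "SELECT count(*) FROM lineitem")
req = decode_request(frame)
```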
- the Spark adapter is responsible for processing requests from the Dex middleware. When a front-end request is received, it may be converted into a series of Spark commands, which may include issued SQL queries or Spark functions, as well as prologue and epilogue commands that define auxiliary RDDs. DuoSQL supports two types of analysis requests: SQL queries and UDF calls. Each Dex request encodes its type in its header. The request processor uses this information to determine how to translate the request content.
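Dispatching on the request type encoded in the header could look like the sketch below. The `ReqStruct` fields and the translated command strings are hypothetical; they only illustrate the two-way branch between SQL queries and UDF calls.

```python
from dataclasses import dataclass


@dataclass
class ReqStruct:
    req_type: str  # "sql" or "udf", taken from the Dex request header
    content: str   # the SQL text or the UDF name


def translate(req):
    """Pick a translation path based on the header-encoded request type."""
    if req.req_type == "sql":
        # SQL queries are handed to Spark SQL (shown as a command string).
        return f"spark.sql({req.content!r})"
    elif req.req_type == "udf":
        # UDF calls are routed to a remote Spark function (hypothetical name).
        return f"call_spark_function({req.content!r})"
    raise ValueError(f"unknown request type: {req.req_type}")


plan = translate(ReqStruct("sql", "SELECT 1"))
```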
- JdbcRDD is a standard API for importing data from remote databases.
- One function of JdbcRDD is to allow parallel connection to multiple partitions of a single table, thereby achieving parallel data transmission.
- the parallel connection feature of JdbcRDD is not applicable to sharded databases.
- the present invention therefore develops a new Spark API named DuoRDD, which extends JdbcRDD and adds support for sharded databases; it is effectively a sharded JdbcRDD.
- with DuoRDD, Spark executors can load data from multiple shard nodes in parallel.
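Conceptually, DuoRDD's partition layout could be sketched as below: instead of JdbcRDD's range partitions over one table, it builds partitions per shard node so that executors pull from all shards in parallel. This is an assumption-laden illustration (function name, partition fields, and JDBC URL are all hypothetical), not the actual DuoRDD implementation.

```python
def duo_rdd_partitions(shard_nodes, partitions_per_shard=2):
    """Build a partition descriptor per (shard node, intra-shard slice)."""
    partitions = []
    for node in shard_nodes:
        for p in range(partitions_per_shard):
            partitions.append({
                "node": node,    # which shard node this partition connects to
                "slice": p,      # intra-shard split, as in JdbcRDD's ranges
                # Hypothetical connection string for illustration only:
                "url": f"jdbc:postgresql://{node}/warehouse",
            })
    return partitions


# Three shard nodes, two slices each -> six partitions loaded in parallel.
parts = duo_rdd_partitions(["shard1", "shard2", "shard3"])
```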
- in step 1, the user submits a SQL request through the PostgreSQL front end.
- in step 2, the interface calls the context API to execute the corresponding request; if the user submits more requests while one is executing, DuoSQL creates additional contexts for them.
- in step 3, the middleware adapter parses the request initiated by the user and generates different ReqStructs according to the requested functions.
- in step 4, the Spark adapter parses the request sent by the middleware adapter, executes it, and returns the result; this also includes Spark fetching the requested data from the PostgreSQL cluster.
- DuoSQL launches the Spark jobs through the cluster manager.
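The execution flow above can be traced end to end with stub components; every function here is a toy stand-in (the real system spans PostgreSQL, the Dex middleware, and Spark processes), and the returned rows are fabricated placeholders.

```python
def middleware_adapter(user_request):
    # Step 3: parse the user's request into a ReqStruct-like dictionary.
    return {"type": "sql", "content": user_request}


def spark_adapter(req_struct):
    # Step 4: a stub "execution" that would normally run on the Spark cluster
    # and fetch data from the PostgreSQL shards; returns a placeholder result.
    return {"status": "ok", "rows": [[0]]}


def execute(user_request):
    # Steps 1-2: the front end receives the request and invokes the
    # context API, which routes it through the middleware to the backend.
    req = middleware_adapter(user_request)
    return spark_adapter(req)


result = execute("SELECT count(*) FROM lineitem")
```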
- the present invention proposes a flexible data warehouse architecture by decoupling data management and calculation.
- the present invention constructs a prototype system DuoSQL by using Dex interoperability middleware, a fragmented PostgreSQL database as a front end and a Spark cluster as a back end.
- the present invention evaluates the performance potential of DuoSQL by comparing it with standalone PostgreSQL, both with and without parallel query support.
- the invention runs test experiments with different workloads and input types. The results show that Duo SQL not only has obvious performance advantages, but also has excellent robustness.
- the decoupling elastic data warehouse architecture proposed by the present invention has three important characteristics: decoupling, flexibility and diversity.
- decoupling: the design of the present invention completely separates computation and storage.
- flexibility is the biggest consideration of the present invention.
- elasticity, that is, the ability to independently and adaptively scale system components, is the main attribute that cloud data warehouses should support.
- it can also use the diversity of the back-end to diversify the data warehouse.
- the experiments in the present invention exploit Spark features such as in-memory iterative computation.
- in the present invention, the choice among different back ends can be determined by the user.
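The user-selectable back ends described above imply some shared record of available clusters. A toy sketch of such a catalog, under assumed names (`BackendCatalog`, `register`, `select`) that are not from the patent, might look like:

```python
# A minimal shared catalog of back ends, mirroring the idea that
# back-end information is recorded centrally and each session binds
# to one back end chosen by the user at startup.

class BackendCatalog:
    def __init__(self):
        self._backends = {}

    def register(self, name, config):
        # A privileged user adds a back end to the catalog.
        self._backends[name] = config

    def remove(self, name):
        # A privileged user removes a back end.
        self._backends.pop(name, None)

    def select(self, name):
        # A session picks a back end by name when it starts.
        return self._backends[name]

catalog = BackendCatalog()
catalog.register("spark-small", {"executors": 2, "memory_gb": 8})
catalog.register("spark-large", {"executors": 16, "memory_gb": 128})

# Different sessions can bind to differently sized clusters.
session_backend = catalog.select("spark-large")
print(session_backend["executors"])
```

The design point is that adding, removing, or choosing a back end touches only the catalog, not the front end's storage layer.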
- the OLAP data analysis benchmark TPC-H and machine learning algorithms verify the effectiveness of the system architecture of the present invention.
- the architecture of the present invention was compared with the case in which storage and computation are not separated, including TPC-H experiments at scale factors of 30, 50, and 75.
- for the machine learning algorithms, the present invention is compared with a PostgreSQL clustering algorithm under the Apache MADlib framework, with data from the skin_noskin data set and the KGEE data set from UCI.
- the experimental results are generally better than those of the existing architecture.
- the disclosed technical content can be implemented in other ways.
- the system embodiment described above is only illustrative.
- the division of units may be a logical function division, and there may be other divisions in actual implementation.
- multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
- the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
- the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
- the technical solution of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present invention.
- the aforementioned storage media include: USB flash drives, read-only memory (ROM), random access memory (RAM), removable hard disks, magnetic disks, optical disks, and other media that can store program code.
Abstract
Description
Claims (10)
- A decoupled elastic data warehouse architecture, characterized in that it comprises: a data warehouse front end, which uses PostgreSQL as the foundation of the front end to process incoming and outgoing data, provide control and query user interfaces, and manage the underlying storage, wherein all sharding logic for the underlying storage is implemented in a PostgreSQL extension named xschema; a data warehouse back end for scalable and elastic resource management and for single or concurrent queries, wherein resource allocation proceeds in two stages: in the first stage, at installation time, total resources are initially allocated from the cloud; in the second stage, after cluster setup, the user passes parameters at session startup; and Spark SQL is used as the underlying query engine; and decoupled data warehouse middleware, which uses Dex to coordinate message passing and data transfer between the data warehouse front end and the data warehouse back end.
- The decoupled elastic data warehouse architecture according to claim 1, wherein the Dex middleware comprises a Dex server, a PostgreSQL adapter, and a Spark adapter, and runs through the Dex communication API, which provides an intermediate layer, wherein: the PostgreSQL adapter converts database queries, communicates with the Dex server, and then converts the responses returned from the back-end cluster; the Dex server maintains query contexts, monitors session state transitions, and provides Dex services through the Dex communication API; and the Spark adapter accepts and parses Dex requests, converts them into Spark computing tasks, and, once a response is ready, sends it back to the Dex server.
- The decoupled elastic data warehouse architecture according to claim 2, wherein Dex interoperability is a stateful service managed around the Dex Context in the Dex server, whose internal workings are driven by messages exchanged between PostgreSQL and Spark; the Dex Context supports both single-back-end and multi-back-end settings; in the single-back-end setting, the Dex Context proxies communication between the PostgreSQL back-end process and Spark through the Dex Context API; when a new session is started, the client application first creates or reuses a DexContext instance by submitting a connection request to the Dex server, and once the Dex Context is set up, the client application uses the Dex Context API to start invoking services; the Dex Context also supports multiple back ends: when multiple back ends are connected in a single session, the Dex server refers to a Dex context manager, in which each session assigns one Dex Context to one back end; the PostgreSQL adapter is implemented as a PostgreSQL extension, provides a client library for the Dex communication API, and also includes internal functions for converting database queries into Dex requests and converting results back into PostgreSQL data records; and the Spark adapter parses Dex requests into the corresponding Spark functions, starts executing the tasks, and returns the final results.
- The decoupled elastic data warehouse architecture according to claim 1, wherein, in processing incoming and outgoing data, the data warehouse front end supports various data sources for data integration, including local and network file systems, relational databases, and non-relational databases, and handles specific types of data sources through data ingestion drivers; the data warehouse front end provides a control and query user interface, which inherits SQL syntax and serves as a unified portal for system control and interactive queries; the data warehouse front end manages the underlying storage through a shard controller running on a designated master node; all user data is stored on the data nodes; the system administrator registers and releases data nodes for sharding, and users define partition schemes for distributed fact tables; and the data warehouse front end exposes an analysis service interface through which users launch analytical workloads on the back end, with parsing and planning redone in PostgreSQL.
- The decoupled elastic data warehouse architecture according to claim 1, wherein the data warehouse back end is a computer cluster managed by a software stack, and the data warehouse layers are assigned different functions, including resource allocation, task scheduling, and query composition.
- The decoupled elastic data warehouse architecture according to claim 1, wherein, in the data warehouse back end, the execution efficiency of a single query is jointly determined by the query optimizer and the execution framework, while the overall execution efficiency of multiple concurrently running queries also involves the task scheduler.
- The decoupled elastic data warehouse architecture according to claim 3, wherein the data warehouse back end uses Spark SQL as the underlying query engine; the data warehouse back end is a specific server cluster with exactly one corresponding Spark installation managed by YARN; multiple sessions connected to the same back end are maintained independently; and separate Spark sessions are served by different Spark jobs.
- The decoupled elastic data warehouse architecture according to claim 7, wherein the data warehouse has multiple Spark clusters as back ends; the Spark clusters have different sizes and configurations, and within a Spark cluster any session can specify its own resource requirements; in the resource management of the data warehouse back end, all back-end information is recorded in a shared catalog, which is stored on the PostgreSQL master server and can be referenced by all Dex Contexts, and back ends can be added, removed, or selected at the request of privileged users; the data warehouse back end includes a back-end session manager, which shares common information with the master catalog that maintains a superset of all active sessions; each session on the back-end session manager stores only session-specific metadata, which includes key facts about the data warehouse front end, the data warehouse back end, and the network connection; the metadata is used for request processing: once a front-end request is received and parsed, a request data structure of type ReqStruct is generated, at which point the session metadata is looked up to identify the inputs and functions appearing in the request; and communication between the data warehouse front end and the data warehouse back end is implemented with ZeroMQ, a high-speed asynchronous network I/O library.
- The decoupled elastic data warehouse architecture according to claim 8, wherein the Spark adapter is responsible for handling requests from the Dex middleware; when a front-end request is received, it is converted into Spark commands, including the released SQL query or Spark function, as well as prologue and epilogue commands that define auxiliary RDDs; DuoSQL supports two types of analysis requests, SQL queries and UDF calls; and each Dex request encodes its type in its header.
- The decoupled elastic data warehouse architecture according to claim 9, wherein, in Spark SQL, JdbcRDD is the standard API for importing data from a remote database; JdbcRDD allows parallel connections to multiple partitions of a single table; DuoRDD enhances support for sharded databases and is a sharded JdbcRDD; and, using DuoRDD, Spark executors load from multiple shard nodes in parallel.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910362554.9A CN110162515A (en) | 2019-04-30 | 2019-04-30 | A kind of uncoupled elastic data warehouse schema |
CN201910362554.9 | 2019-04-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020220717A1 true WO2020220717A1 (en) | 2020-11-05 |
Family
ID=67633159
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/130535 WO2020220717A1 (en) | 2019-04-30 | 2019-12-31 | Decoupling elastic data warehouse architecture |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110162515A (en) |
WO (1) | WO2020220717A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162515A (en) * | 2019-04-30 | 2019-08-23 | 中国科学院深圳先进技术研究院 | A kind of uncoupled elastic data warehouse schema |
CN111414381B (en) * | 2020-03-04 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Data processing method and device, electronic equipment and storage medium |
CN111639062B (en) * | 2020-05-29 | 2023-07-28 | 京东方科技集团股份有限公司 | Method, system and storage medium for one-key construction of data warehouse |
CN111966727A (en) * | 2020-08-12 | 2020-11-20 | 北京海致网聚信息技术有限公司 | Spark and Hive based distributed OLAP (on-line analytical processing) ad hoc query method |
CN114490842B (en) * | 2021-12-28 | 2022-11-11 | 航天科工智慧产业发展有限公司 | Interface data query method and data query engine for multi-source data |
CN116401254A (en) * | 2023-04-17 | 2023-07-07 | 广东数果科技有限公司 | Unified storage method and device for index result data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7590623B2 (en) * | 2005-01-06 | 2009-09-15 | International Business Machines Corporation | Automated management of software images for efficient resource node building within a grid environment |
CN106339760A (en) * | 2016-08-31 | 2017-01-18 | 湖北既济电力集团有限公司科技信息分公司 | Communication cable maintenance management information system |
CN106685737A (en) * | 2017-02-17 | 2017-05-17 | 国网山东省电力公司信息通信公司 | IMS fault analysis operation and maintenance system and method based on IP telephones and servers |
CN110162515A (en) * | 2019-04-30 | 2019-08-23 | 中国科学院深圳先进技术研究院 | A kind of uncoupled elastic data warehouse schema |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100538650C (en) * | 2008-02-01 | 2009-09-09 | 清华大学 | Based on the mutual exchange method of the grid middleware of assembly |
CN101546325B (en) * | 2008-12-23 | 2012-04-18 | 重庆邮电大学 | Grid heterogeneous data integrating method based on SOA |
US10817528B2 (en) * | 2015-12-15 | 2020-10-27 | Futurewei Technologies, Inc. | System and method for data warehouse engine |
CN105608758B (en) * | 2015-12-17 | 2018-03-27 | 山东鲁能软件技术有限公司 | A kind of big data analysis platform device and method calculated based on algorithm configuration and distributed stream |
US10565199B2 (en) * | 2017-04-10 | 2020-02-18 | Sap Se | Massively parallel processing database middleware connector |
-
2019
- 2019-04-30 CN CN201910362554.9A patent/CN110162515A/en active Pending
- 2019-12-31 WO PCT/CN2019/130535 patent/WO2020220717A1/en active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7590623B2 (en) * | 2005-01-06 | 2009-09-15 | International Business Machines Corporation | Automated management of software images for efficient resource node building within a grid environment |
CN106339760A (en) * | 2016-08-31 | 2017-01-18 | 湖北既济电力集团有限公司科技信息分公司 | Communication cable maintenance management information system |
CN106685737A (en) * | 2017-02-17 | 2017-05-17 | 国网山东省电力公司信息通信公司 | IMS fault analysis operation and maintenance system and method based on IP telephones and servers |
CN110162515A (en) * | 2019-04-30 | 2019-08-23 | 中国科学院深圳先进技术研究院 | A kind of uncoupled elastic data warehouse schema |
Also Published As
Publication number | Publication date |
---|---|
CN110162515A (en) | 2019-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020220717A1 (en) | Decoupling elastic data warehouse architecture | |
US11216485B2 (en) | Push model for scheduling query plans | |
JP7333424B2 (en) | Graph generation for distributed event processing systems | |
US11822569B2 (en) | Integrated application server and data server processes with matching data formats | |
US20200364223A1 (en) | Search time estimate in a data intake and query system | |
US9990399B2 (en) | Low latency query engine for apache hadoop | |
WO2016165525A1 (en) | Data query method in crossing-partition database, and crossing-partition query device | |
US20210004712A1 (en) | Machine Learning Performance and Workload Management | |
WO2016123921A1 (en) | Http protocol-based multiple data resource data processing method and system | |
US8914390B2 (en) | Repetitive query recognition and processing | |
WO2015030767A1 (en) | Queries involving multiple databases and execution engines | |
US20090094258A1 (en) | Off-loading star join operations to a storage server | |
US20150269234A1 (en) | User Defined Functions Including Requests for Analytics by External Analytic Engines | |
WO2024060956A1 (en) | Hybrid database management method and apparatus, hybrid database, and electronic device | |
US20220374424A1 (en) | Join queries in data virtualization-based architecture | |
Liu et al. | Towards Elastic Data Warehousing by Decoupling Data Management and Computation | |
CN112163029A (en) | Virtual database system and query method | |
Fong et al. | Toward a scale-out data-management middleware for low-latency enterprise computing | |
US20230244661A1 (en) | Intelligent parity service with database query optimization | |
Ruiz-Rohena et al. | ArcaDB: A Disaggregated Query Engine for Heterogenous Computational Environments | |
US11030241B2 (en) | Query usage based organization for very large databases | |
CN110419026B (en) | In-memory search technique | |
Aimonier-Davat et al. | SaGe: A Preemptive SPARQL Server for OnlineKnowledge Graphs | |
Berka | Distributed image retrieval on the grid using the vector space model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19926843 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19926843 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14.06.2022) |
|