CN105045929A - MPP architecture based distributed relational database

Info

Publication number: CN105045929A
Application number: CN201510547427.8A
Authority: CN
Inventors: 张宇; 杨利兵; 缪燕; 李海; 吕志来; 张学深
Original assignee: State Grid Corp of China SGCC; Beijing Xuji Electric Co Ltd
Current assignee: State Grid Corp of China SGCC; Beijing Xuji Electric Co Ltd
Priority date: 2015-08-31
Filing date: 2015-08-31
Publication date: 2015-11-11

Abstract

The present invention provides an MPP architecture based distributed relational database and relates to the fields of databases, big data and distributed computing. The database comprises four modules: a global transaction manager responsible for global transaction processing; a load balancing system responsible for load balancing across the cluster; a cluster coordination manager that coordinates work among the data nodes; and data nodes, each a relational database deployed on PowerDB. The invention builds a distributed cluster for the PowerDB relational database. By adopting an MPP architecture with Shared Nothing between nodes, a single-point failure does not affect cluster operation; the cluster can scale horizontally, achieves PB-level data storage, and supports massively parallel data writes.

Description

A distributed relational database based on an MPP architecture

Technical field

The present invention relates to the fields of databases, big data and distributed computing. It provides a mass data storage solution for big data processing based on a relational database, and supports both OLTP and OLAP application scenarios.

Background art

Information technology has developed rapidly in recent years, and the volume of information held by enterprises and society is growing geometrically. This poses a huge challenge for the storage and processing of big data. How to build a big data cluster that can store massive data is the primary problem facing the big data field.

The open-source big data framework based on Hadoop is currently the most widely adopted solution. The distributed file system provided by Hadoop solves the PB-level storage problem, and products in the Hadoop ecosystem provide columnar storage and data-warehouse-level data management, which partly solves the problem of accessing massive data. However, the storage solution provided by Hadoop has the following shortcomings. First, it does not support SQL transactions. A large number of existing business systems are developed on SQL and cannot be migrated smoothly to the Hadoop platform. Second, its single-process parallel write capability is insufficient. A single HDFS or HBase process cannot write in parallel and therefore cannot meet the needs of mass information acquisition.

Summary of the invention

PowerDB is a database system developed for a single server. The technical problem to be solved by the present invention is to build a distributed cluster for the PowerDB relational database. By adopting an MPP architecture with Shared Nothing between nodes, the failure of a single point does not affect cluster operation; the cluster can scale horizontally, achieves PB-level data storage, and supports massively parallel data writes.

To meet the need for SQL-based mass data storage, the invention provides a distributed relational database based on an MPP architecture. The database has a distributed architecture comprising four modules: the global transaction management module (Power-GTM), the load balancing system (Power-Proxy), the cluster coordination manager (Power-COORD) and the data node (Power-DataNode), wherein:

The global transaction manager is responsible for global transaction processing, the load balancing system is responsible for load balancing across the cluster, the cluster coordination manager coordinates work among the data nodes, and each data node is a relational database deployed on PowerDB.

As a further improvement of the present invention, a cluster generally has only one global transaction management module.

As a further improvement of the present invention, a single global transaction management module can be configured with a standby (StandBy) node.

As a further improvement of the present invention, a cluster can have multiple coordination managers.

As a further improvement of the present invention, the physical structure and logical structure of a data node remain fully consistent with PowerDB; that is, each server maintains a single relational database system.

Optionally, to improve cluster reliability, the GTM provides an active/standby mode. Through GTM-Standby configuration, multiple GTM nodes can be set up in the cluster, with only one node working at any time. Data is synchronized from the GTM host to the GTM-Standby by streaming replication. When the GTM host suffers a single-point failure, the GTM-Standby automatically becomes the GTM host and takes over global transaction management.
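The active/standby behaviour described above can be illustrated with a minimal Python sketch. It is only a model under stated assumptions: the class names `GtmNode` and `GtmPair`, the `healthy` flag and the transaction counter are hypothetical and are not part of PowerDB, and copying the counter stands in for streaming replication.

```python
from dataclasses import dataclass


@dataclass
class GtmNode:
    name: str
    role: str             # "active" or "standby"
    healthy: bool = True
    last_txid: int = 0    # highest transaction ID known to this node


class GtmPair:
    """Hypothetical model of one active GTM and one GTM-Standby."""

    def __init__(self, active_name, standby_name):
        self.active = GtmNode(active_name, "active")
        self.standby = GtmNode(standby_name, "standby")

    def failover(self):
        # Promote the standby; the failed node is demoted to standby.
        if not self.standby.healthy:
            raise RuntimeError("no healthy GTM node available")
        self.active.role, self.standby.role = "standby", "active"
        self.active, self.standby = self.standby, self.active

    def next_txid(self):
        # Only one node issues transaction IDs at any time.
        if not self.active.healthy:
            self.failover()
        self.active.last_txid += 1
        # Stand-in for streaming replication: copy state to the standby.
        if self.standby.healthy:
            self.standby.last_txid = self.active.last_txid
        return self.active.last_txid
```

Because the standby already holds the replicated counter, the promoted node continues the transaction ID sequence without a gap or a repeat, which is the property the active/standby mode is meant to provide.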

Optionally, the reliability of the data nodes is likewise achieved through a multi-node redundant design: each data node is given one or more standby nodes. As data is written, it is synchronized to the standby node by streaming replication. When a single-point failure occurs, the system switches over automatically, ensuring that the cluster works 7*24 hours.

Optionally, tables are created with the roundrobin algorithm so that the distributed cluster responds to data write operations as efficiently as possible. Allowing for the cluster's own overhead, a cluster of N data nodes can achieve about 0.7*N times the performance of a single-server deployment, meeting the needs of highly concurrent writes of massive data.

Optionally, an automatic installation and deployment tool for Windows and Linux is developed in Java.

Optionally, the distributed cluster management tools mainly include remote login and management, a query analysis manager, cluster monitoring tools, and so on.

Brief description of the drawings

Fig. 1 is an architecture diagram of the autonomous and controllable distributed relational database of the present invention;

Fig. 2 shows the simple and efficient distributed relational database cluster of the present invention.

Embodiment

The present invention is described in more detail below with reference to the accompanying drawings. It should be understood that the embodiments described herein only explain the present invention and do not limit it.

PowerDB is a relational database system developed on the open-source database PostgreSQL. PostgreSQL is an open-source database system that supports concurrent operation and is compatible with Oracle to a good degree. It uses the MVCC (multi-version concurrency control) mechanism to improve data write capability, and divides table space into blocks to support concurrency on a single machine. PowerDB keeps the PostgreSQL kernel unchanged and develops external tools around it to meet the needs of business system development and DBA work. It supports the SQL:2008 standard.

(1) Architecture design of the autonomous and controllable distributed relational database.

The system can be deployed on mainstream Linux and Windows environments. X86 hardware is recommended to reduce the cost of building the cluster.

For performance and manageability, installation under Linux is recommended, for example Ubuntu, Red Hat or CentOS.

The GTM places high demands on server reliability, so a commercial-grade server is recommended. Because the GTM does not store any data, it has no particular hard disk requirement, and its memory requirement is modest; 16 GB is generally sufficient.

Each data node generally runs the three functions Power-Proxy, Power-COORD and Power-DataNode at the same time so as to make maximum use of server resources. It therefore needs more memory and disk; a typical configuration is a Xeon E5 CPU, 64 GB of memory and a 6 TB hard disk.

(2) Architecture planning of the autonomous and controllable distributed relational database.

Taking 1 GTM node and 5 data nodes as an example, the plan is as follows:

Each component of the database cluster should be assigned a machine name and a port number; the planning table follows.

(3) distributed database management mechanism

The GTM manages global transactions. Database tables are split horizontally into multiple data blocks, which are stored on the corresponding data nodes (Power-DataNode). A data node operates in the same way as a stand-alone database and is responsible for services such as inserting, querying and modifying data.

(1) Data insertion. The GTM uses a distribution algorithm to work out which data node a row should be placed on. The algorithms include the hash algorithm (a hash function is generated from the field's value range and determines the storage node) and the roundrobin algorithm (rows are distributed to the data nodes at random), among others.

(2) Single-table query. The GTM distributes the query command to each data node; each node returns its result to the GTM, which assembles the result set and returns it to the user.

(3) Multi-table join query. This is organized by the Power-COORD coordination manager. A coordination manager is generally deployed on every data node. It manages the associated data fetched from other nodes and joins it with the local node's data to form the single-node query result.

(4) Index. Indexing uses a two-level structure: each data node maintains its own index, and the GTM also maintains a simple index and sends commands to the individual data nodes, which carry out the index processing.
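The insertion and single-table query mechanisms described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the PowerDB implementation: the class `Cluster` and its methods are hypothetical, CRC32 stands in for whatever hash function the system generates, and roundrobin is modelled as a cyclic counter, whereas the text describes it as random distribution.

```python
import zlib
from itertools import count


class Cluster:
    """Hypothetical in-memory model of rows spread over data nodes."""

    def __init__(self, node_names):
        self._names = list(node_names)
        self.nodes = {name: [] for name in self._names}
        self._rr = count()  # roundrobin position

    def _pick_roundrobin(self):
        # Cycle through the data nodes in order.
        return self._names[next(self._rr) % len(self._names)]

    def _pick_hash(self, key):
        # The same key always maps to the same data node.
        return self._names[zlib.crc32(str(key).encode()) % len(self._names)]

    def insert(self, row, distribute_by="roundrobin", key=None):
        if distribute_by == "hash":
            target = self._pick_hash(row[key])
        else:
            target = self._pick_roundrobin()
        self.nodes[target].append(row)
        return target

    def query(self, predicate):
        # Scatter the query to every node, then gather the partial results.
        result = []
        for name in self._names:
            result.extend(r for r in self.nodes[name] if predicate(r))
        return result


cluster = Cluster(["dn1", "dn2", "dn3"])
for i in range(6):
    cluster.insert({"id": i, "age": 20 + i})
adults = cluster.query(lambda r: r["age"] >= 23)
```

With roundrobin every node receives the same share of the writes, which favours write throughput. With hash distribution a lookup by key could in principle be sent to a single node, at the cost of possible skew between nodes.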

(4) Distributed database initialization

The application program is deployed to the relevant directory on each node, for example the /usr/local/PowerDB directory.

Install and configure SSH so that logins between cluster nodes do not require a password.

Run tools such as initgtm and initdb to initialize Power-GTM, Power-Proxy, Power-COORD and Power-DataNode.

The above process can also be carried out with the installation and deployment tool.

(5) Starting the cluster.

Step 1: start the GTM service so that the global transaction manager works normally.

Step 2: start the Power-Proxy on each data node, which associates the node with the GTM.

Step 3: start the Power-DataNode on each data node.

Step 4: start the Power-COORD on each data node.

Step 5: on each data node, set up the mapping table of the other nodes.

The above process can be carried out with the cluster installation and deployment tool.
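The startup sequence above can be written down as an ordered plan. The following Python sketch only builds the list of actions; the function name `startup_plan` and the host names are hypothetical, and a real deployment tool would execute each action instead of returning it.

```python
# Service order taken from steps 1 to 4 above.
STARTUP_ORDER = ["Power-GTM", "Power-Proxy", "Power-DataNode", "Power-COORD"]


def startup_plan(gtm_host, data_hosts):
    """Return (host, action) pairs in the order they should run."""
    # Step 1: the GTM must be running before anything else.
    plan = [(gtm_host, "Power-GTM")]
    # Steps 2 to 4: start one service type on all data nodes before the next.
    for service in STARTUP_ORDER[1:]:
        plan.extend((host, service) for host in data_hosts)
    # Step 5: every data node records the other nodes.
    for host in data_hosts:
        peers = [h for h in data_hosts if h != host]
        plan.append((host, "register peers: " + ", ".join(peers)))
    return plan


for host, action in startup_plan("gtm1", ["dn1", "dn2"]):
    print(host, action)
```

Starting each service type on every data node before moving to the next keeps the dependency order explicit: the proxies need the GTM, and the coordinators need the data nodes.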

(6) Database testing and application

Create the distributed database through the SQL query manager. The creation process is the same as for an ordinary relational database. Create database users and grant them different privileges.

Create a node group (group) to be used when creating tables.

Create a database table. A typical command is:

CREATE TABLE t1 (id int, age int) DISTRIBUTE BY ROUNDROBIN TO GROUP gp1;

The above command creates a table named t1 that uses the random (roundrobin) distribution algorithm and the node group gp1.

Run read/write tests on the database table. A single machine can generally reach a write speed of 100,000 records per second; in a distributed environment, the write speed is estimated to reach 0.7*N*100,000 records per second.
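The estimate above is simple arithmetic and can be checked with a short Python helper. The function name `estimated_write_rate` is hypothetical; the 0.7 efficiency factor and the single-machine rate of 100,000 records per second are the figures stated in the text.

```python
def estimated_write_rate(nodes, single_node_rate=100_000, efficiency=0.7):
    """Estimated cluster write rate in records per second (0.7 * N * rate)."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one data node")
    return round(efficiency * nodes * single_node_rate)


for n in (1, 5, 10):
    print(n, "data nodes:", estimated_write_rate(n), "records/second")
```

For the 5 data node plan used in this embodiment the estimate is 0.7 * 5 * 100,000 = 350,000 records per second.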

Set up the data access environment through JDBC, ODBC or OLEDB to develop application systems.

(7) Building a highly reliable distributed database system.

A master/slave node mode can be configured to make the cluster highly reliable, so that a single-node failure causes neither data loss nor cluster downtime.

For the GTM node, master/slave setup requires setting standby to on in the configuration file and configuring the machine name or IP address and the port number of the hot-standby machine.

The master/slave setup for data nodes is likewise done through the configuration file, and the basic method is the same.

The above process can be carried out with the installation and deployment tool.

(8) Maintenance of the autonomous and controllable distributed database.

When the distributed storage capacity no longer meets requirements, the cluster can be scaled horizontally by adding data nodes.

A new node needs the three modules Power-Proxy, Power-COORD and Power-DataNode configured.

Configure a standby node as required.

Start the services of the three modules.

Add the new node to the cluster's group. The new node then takes effect, and part of the newly written data is stored on it.
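The effect of adding a node to the group can be illustrated with a small Python model. It is a sketch under assumptions: `NodeGroup` and its methods are hypothetical names, writes are spread with a cyclic counter as a stand-in for the roundrobin algorithm, and existing rows are left where they are because the text only says that newly written data lands on the new node.

```python
class NodeGroup:
    """Hypothetical node group that counts rows written to each data node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.rows = {name: 0 for name in self.nodes}
        self._next = 0  # roundrobin position

    def add_node(self, name):
        # The new node takes part in all subsequent writes.
        self.nodes.append(name)
        self.rows[name] = 0

    def write(self):
        target = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        self.rows[target] += 1
        return target


group = NodeGroup(["dn1", "dn2"])
for _ in range(4):
    group.write()
group.add_node("dn3")      # scale out
for _ in range(6):
    group.write()
print(group.rows)
```

After the node is added it receives a third of the new writes, while the rows written earlier stay on the original nodes.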

When a single-point failure occurs, the faulty node must be removed from the cluster, repaired offline and then added back; it resumes work after data synchronization.

The above process can be completed with the installation and deployment tool.

(9) Shutting down the database cluster.

Step 1: send a message to users announcing that the cluster will be shut down; either wait for user work to finish or delay for a set time.

Step 2: stop the data node services.

Step 3: stop the coordinator node services.

Step 4: stop the load balancing node services.

Step 5: stop the global transaction node service.
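The shutdown steps above can be expressed as an ordered plan with an optional grace period. The Python sketch below only builds the list of actions; `shutdown_plan`, its parameters and the host names are hypothetical and serve purely as illustration.

```python
# Service order taken from steps 2 to 5 above.
SHUTDOWN_ORDER = ["Power-DataNode", "Power-COORD", "Power-Proxy", "Power-GTM"]


def shutdown_plan(active_sessions, grace_seconds, gtm_host, data_hosts):
    """Return (host, action) pairs in the order they should run."""
    # Step 1: tell users, and wait only if someone is still working.
    plan = [("all", "notify users: cluster is shutting down")]
    if active_sessions > 0:
        plan.append(
            ("all", f"wait up to {grace_seconds}s for {active_sessions} session(s)")
        )
    # Steps 2 to 4: stop each service type on all data nodes in turn.
    for service in SHUTDOWN_ORDER[:-1]:
        plan.extend((host, f"stop {service}") for host in data_hosts)
    # Step 5: the GTM goes down last.
    plan.append((gtm_host, "stop Power-GTM"))
    return plan


for host, action in shutdown_plan(3, 60, "gtm1", ["dn1", "dn2"]):
    print(host, action)
```

Stopping the GTM last means that the global transaction manager is still available while the other services wind down.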

For a person of ordinary skill in the art to which the invention belongs, any simple deduction or substitution made without departing from the concept and spirit of the present invention shall be regarded as falling within the scope of protection of the present invention.

Claims

1. A distributed relational database based on an MPP architecture, characterized in that: the database comprises four modules, namely the global transaction management module (Power-GTM), the load balancing system (Power-Proxy), the cluster coordination manager (Power-COORD) and the data node (Power-DataNode), wherein:

2. The distributed relational database based on an MPP architecture according to claim 1, characterized in that: a cluster generally has only one global transaction management module.

3. The distributed relational database based on an MPP architecture according to claim 2, characterized in that: a single global transaction management module can be configured with a standby node.

4. The distributed relational database based on an MPP architecture according to claim 1, characterized in that: a cluster has multiple coordination managers.

5. The distributed relational database based on an MPP architecture according to claim 1, characterized in that: each server maintains a single relational database system.

6. The distributed relational database based on an MPP architecture according to claim 1, characterized in that: the global transaction management module provides an active/standby mode.

7. The distributed relational database based on an MPP architecture according to claim 6, characterized in that: multiple global transaction management nodes are set up in the cluster.

8. The distributed relational database based on an MPP architecture according to claim 7, characterized in that: only one node works at any time.

9. The distributed relational database based on an MPP architecture according to claim 1, characterized in that: each data node is given one or more standby nodes; as data is written, it is synchronized to the standby node by streaming replication; and when a single-point failure occurs, the system switches over automatically.

10. The distributed relational database based on an MPP architecture according to claim 1, characterized in that: tables are created with the roundrobin algorithm so that the distributed cluster responds to data write operations as efficiently as possible.

11. The distributed relational database based on an MPP architecture according to claim 1, characterized in that: an automatic installation and deployment tool for Windows and Linux is developed in Java.

12. The distributed relational database based on an MPP architecture according to claim 1, characterized in that: the distributed cluster management tools mainly include remote login and management, a query analysis manager, cluster monitoring tools, and so on.