CN104572895A - MPP (Massively Parallel Processing) database and Hadoop cluster data interchange method, tool and implementation method - Google Patents

MPP (Massively Parallel Processing) database and Hadoop cluster data interchange method, tool and implementation method

Info

Publication number
CN104572895A
CN104572895A (application CN201410820059.5A)
Authority
CN
China
Prior art keywords
data
mpp
hadoop
importing
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410820059.5A
Other languages
Chinese (zh)
Other versions
CN104572895B (en)
Inventor
陈雨
夏旭东
崔维力
武新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd
Original Assignee
TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd
Priority to CN201410820059.5A
Publication of CN104572895A
Application granted
Publication of CN104572895B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data interchange method, a tool, and an implementation method for an MPP (Massively Parallel Processing) database and a Hadoop cluster, comprising a method for exchanging data directly between the MPP database and the Hadoop cluster with a data interchange tool, and a method for exchanging the data via TXT relay. In the direct mode, data is exported from (or imported into) the MPP database to (from) the Hadoop cluster without being staged on any storage unit outside the MPP database and the Hadoop cluster, which makes the export process more efficient. If the data needs further processing by the Hadoop cluster before loading, the TXT relay mode is chosen instead. The invention solves the problem that data cannot be exchanged between the MPP database and Hadoop workloads, and enables the combined use of the two platforms, the MPP database and the Hadoop cluster.

Description

MPP database and Hadoop cluster data interchange method, tool and implementation method
Technical field
The present invention belongs to the field of distributed databases, and in particular relates to a data interchange method, a tool, and an implementation method for an MPP database and a Hadoop cluster.
Background technology
Before the internet appeared, data was produced mainly through human-machine interaction and was predominantly structured. For such transactional data, end users mostly care about inserting, deleting, updating, and querying records; the corresponding processing is referred to as OLTP (Online Transaction Processing). Traditional relational databases (RDBMS) were designed and developed mainly for this demand and occupied a dominant position over the past thirty years. During that period data grew slowly, systems were relatively isolated, and traditional databases could essentially satisfy all kinds of application requirements.
With the appearance and rapid development of the internet, and especially of the mobile internet in recent years, data sources have changed qualitatively. Most data is now produced automatically by devices, servers, and applications of all kinds; it is largely unstructured or semi-structured, and it grows geometrically. For this category of data (so-called big data), end users rarely perform insert, delete, or update operations; what they care about is getting data out of the database as fast as possible, organizing it, analyzing it interactively, mining it in depth, and producing reports and predictions. The corresponding processing is referred to as OLAP (Online Analytical Processing).
Traditional databases are, both technically and functionally, almost helpless in the face of this kind of big data analysis. As data sources and processing requirements changed, it became clear that a single platform cannot satisfy all application demands, and users began to choose the most suitable products and technologies according to application requirements, data characteristics, and data volume. The technology landscape of data processing has likewise moved from one dominated by traditional databases (OldSQL) toward segmented development; at present OldSQL, NewSQL, and NoSQL coexist and together support the different classes of applications.
NewSQL databases mainly refer to advanced database clusters with an MPP (Massively Parallel Processing) architecture, aimed at industry big data. They adopt a Shared-Nothing architecture and combine big data processing techniques such as columnar storage and coarse-grained indexing with the efficient distributed computing model of the MPP architecture to support analytical applications. The runtime environment is mostly low-cost PC servers; with their high performance and high scalability they are widely used in enterprise analytical applications.
NoSQL mainly refers to technologies that extend and encapsulate Hadoop, and to the related big data technologies derived around Hadoop, used to store and process the semi-structured and unstructured data that traditional relational databases handle poorly. The most typical scenario today is extending and encapsulating Hadoop to support the storage and analysis of internet-scale big data. Hadoop is better suited to processing semi-structured and unstructured data, complex ETL (Extract-Transform-Load) pipelines, and complex data mining and computation models.
In summary, to address the problem that data cannot be exchanged between an MPP database and Hadoop workloads, the invention provides a method that supports data interchange between an MPP database and Hadoop. In the direct interchange mode the data transfer is highly efficient, which is one of the prerequisites for deploying the MPP database and the Hadoop cluster as a combined platform.
Summary of the invention
The problem to be solved by the present invention is that data cannot be exchanged between an MPP database and Hadoop workloads; to this end a data interchange method and a data interchange tool for an MPP database and a Hadoop cluster are proposed. To solve the above technical problem, the technical solution adopted by the present invention is a method for data interchange between an MPP database and a Hadoop cluster, comprising:
(1) using a data interchange tool to export data directly from the MPP database to the Hadoop cluster, or exporting the data from the MPP database to the Hadoop cluster via TXT relay;
(2) using the data interchange tool to import data directly from the Hadoop cluster into the MPP database, or importing the data from the Hadoop cluster into the MPP database via TXT relay.
Further, the steps of using the data interchange tool to export data directly from the MPP database to the Hadoop cluster are:
(1) the data interchange tool starts;
(2) status check: the data interchange tool sends a status-check SQL command to the MPP database cluster; after receiving it, the MPP database cluster connects to the Hadoop cluster and checks that the specified Hadoop directory is writable, and also checks the state of each of its own nodes and data shards;
(3) export metadata: the data interchange tool sends an export-metadata SQL command to the MPP database; after receiving it, the MPP database cluster exports the metadata to the specified directory of the Hadoop file system;
(4) obtain the tables to be exported from the database;
(5) export data table by table: the data interchange tool sends export SQL commands to the MPP database cluster concurrently, one table at a time; after receiving a table-export SQL command, the MPP database cluster performs the export and writes the table data directly to the specified directory on the data nodes of the Hadoop cluster;
(6) on success, the export exits normally;
(7) on failure, the execution is aborted.
Further, the steps of exporting the data from the MPP database to the Hadoop cluster via TXT relay are:
(1) the data interchange tool starts;
(2) status check: the data interchange tool sends a status-check SQL command to the MPP database cluster; after receiving it, the MPP database cluster checks the state of each of its own nodes and data shards;
(3) export metadata: the data interchange tool sends an export-metadata SQL command to the MPP database; after receiving it, the MPP database cluster exports the metadata to the specified directory of the external storage, in TXT format;
(4) obtain the tables to be exported from the database;
(5) export data table by table: the data interchange tool sends export SQL commands to the MPP database cluster concurrently, one table at a time; after receiving a table-export SQL command, the MPP database cluster performs the export and writes the table data directly to the specified directory of the external storage;
(6) Hadoop imports the data: a Hadoop client installed on the physical machine hosting the external storage executes the Hadoop -put command to load the TXT data files into the specified Hadoop directory;
(7) on success, the Hadoop import exits normally;
(8) on failure, the execution is aborted.
Further, the steps of using the data interchange tool to import data directly from the Hadoop cluster into the MPP database are:
(1) the data interchange tool starts;
(2) status check: the data interchange tool sends a status-check SQL command to the MPP database cluster; after receiving it, the MPP database cluster connects to the Hadoop cluster and checks that the specified Hadoop directory is readable, and at the same time checks the state of each of its own nodes;
(3) import metadata: the import tool sends an import-metadata SQL command to the MPP database; after receiving it, the MPP database cluster imports the metadata from the specified directory of the Hadoop file system;
(4) obtain the tables to be imported in the database;
(5) import data table by table: the data interchange tool sends import SQL commands to the MPP database cluster concurrently, one table at a time; after receiving a table-import SQL command, the MPP database cluster performs the import, accessing the data nodes of the Hadoop cluster directly and loading the data into the MPP database;
(6) on success, the import exits normally;
(7) on failure, the execution is aborted.
Further, the steps of importing the data from the Hadoop cluster into the MPP database via TXT relay are:
(1) Hadoop exports the data: a Hadoop client installed on the physical machine hosting the external storage executes the Hadoop -get command to copy the TXT data files from the specified Hadoop directory to the specified directory of the external storage;
(2) the data interchange tool starts;
(3) status check: the data interchange tool sends a status-check SQL command to the MPP database cluster; after receiving it, the MPP database cluster checks the state of each of its own nodes;
(4) import metadata: the data interchange tool sends an import-metadata SQL command to the MPP database; after receiving it, the MPP database cluster imports the metadata from the specified directory of the external storage;
(5) obtain all tables in the database;
(6) import data table by table: the data interchange tool sends import SQL commands to the MPP database cluster concurrently, one table at a time; after receiving a table-import SQL command, the MPP database cluster performs the import, loading the data from the specified directory of the external storage;
(7) on success, the import exits normally;
(8) on failure, the execution is aborted.
Further, filtered export is supported when exporting data from the MPP database; the filter is expressed by supplying an SQL statement with a WHERE condition.
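As an illustration of the filtered-export option, the minimal sketch below builds an export statement with an optional WHERE condition. The "EXPORT TABLE ... TO ... [WHERE ...]" syntax, the HDFS path, and the table name are assumptions for illustration only; the patent does not specify a concrete SQL dialect.

```python
# Minimal sketch of the filtered ("screening") export option. The EXPORT
# statement syntax is a placeholder; only the idea of appending a WHERE
# condition to restrict the exported rows comes from the patent text.
def build_filtered_export_sql(table, hdfs_dir, where=None):
    """An absent WHERE condition means the whole table is exported."""
    sql = f"EXPORT TABLE {table} TO '{hdfs_dir}/{table}'"
    if where:
        sql += f" WHERE {where}"   # filtered export: only matching rows
    return sql

# Example (hypothetical table and path):
# cursor.execute(build_filtered_export_sql(
#     "orders", "hdfs://namenode:8020/interchange",
#     where="order_date >= '2014-01-01'"))
```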
A data interchange tool for an MPP database and a Hadoop cluster comprises a main control module, a parameter parsing module, connectors, an export/import scheduler, worker threads, a logging module, an SQL construction module, and a worker thread pool. The main control module is connected to the parameter parsing module, the worker thread pool, and the export/import scheduler; the logging module is connected to the parameter parsing module, the SQL construction module, the worker threads, the worker thread pool, and the connectors; the export/import scheduler is connected to the connectors, the worker threads, the worker thread pool, and the SQL construction module; the connectors are connected to the worker threads; and the worker threads are connected to the SQL construction module.
An implementation method of the data interchange tool for an MPP database and a Hadoop cluster comprises the following steps:
(1) the user starts the tool from the command line and supplies configuration information; the main control module starts with the tool, first creates a global tool-run log instance through the logging module, and then initializes the other modules;
(2) the main control module receives the configuration information entered by the user as a character string and passes it to the parameter parsing module for further parsing;
(3) the parameter parsing module parses the user's character-string configuration into configuration information the program can use internally and returns it to the main control module;
(4) the main control module starts the export/import scheduler, which carries out the export (import) work;
(5) the export/import scheduler creates a main connector instance and connects to the MPP database through the main connector;
(6) the export/import scheduler builds the status-check SQL through the SQL construction module and executes the status check through the main connector;
(7) the export/import scheduler builds the export (import) metadata SQL through the SQL construction module and executes it through the main connector;
(8) the export/import scheduler builds, through the SQL construction module, the SQL that queries the tables to be exported (imported), executes the query through the main connector, and obtains the list of tables to be exported (imported);
(9) the export/import scheduler creates a global job-scheduling log instance through the logging module;
(10) the export/import scheduler obtains worker threads from the thread pool module, the number of which equals the configured export (import) parallelism, creates the corresponding number of working connectors, assigns one working connector to each worker thread, and then starts all jobs, with each worker thread processing export (import) jobs in parallel; the export (import) of a single table is called a job, and a job consists of: first, connecting to the MPP database through the working connector; second, building the export (import) SQL through the SQL construction module; third, executing the export (import) through the working connector (a sketch of this scheduling step is given below, after this list);
(11) the export/import scheduler collects the export (import) job results of each worker thread, consolidates them into the overall export (import) result, and returns it to the main control module, which finally returns the export (import) result to the user.
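A minimal sketch of steps (10) and (11), assuming Python's standard thread pool for the worker threads and a caller-supplied connect() callable standing in for the working connectors; the patent names the modules but not their interfaces, so everything here other than the one-job-per-table structure is an assumption.

```python
# Sketch of the per-table job scheduling in steps (10)-(11). `connect` is a
# caller-supplied factory for working connectors and `build_sql` stands in
# for the SQL construction module; both are placeholders.
from concurrent.futures import ThreadPoolExecutor

def run_job(connect, table, build_sql):
    """One job = the export (import) of a single table over its own working connector."""
    conn = connect()                        # working connector for this job
    try:
        with conn.cursor() as cur:
            cur.execute(build_sql(table))   # SQL built by the SQL construction module
        return table, "ok"
    except Exception as exc:                # a failed job is reported, not fatal here
        return table, f"failed: {exc}"
    finally:
        conn.close()

def schedule_jobs(connect, tables, build_sql, parallelism):
    """Run all jobs at the configured parallelism and gather the results."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(run_job, connect, t, build_sql) for t in tables]
        results = dict(f.result() for f in futures)
    return results                          # consolidated result for the main control module
```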
The advantages and positive effects of the present invention are: data interchange between an MPP database and a Hadoop cluster is achieved, and the export/import mode can be chosen flexibly according to actual needs: when no secondary processing by Hadoop is required, the efficient direct mode can be used; otherwise, the TXT relay mode can be used.
Brief description of the drawings
Fig. 1 is a schematic diagram of the data interchange method between an MPP database and a Hadoop cluster;
Fig. 2 is a schematic diagram of exporting data directly from the MPP database to the Hadoop cluster;
Fig. 3 is a schematic diagram of the detailed steps for exporting data directly from the MPP database to the Hadoop cluster;
Fig. 4 is a schematic diagram of exporting data from the MPP database to the Hadoop cluster via TXT relay;
Fig. 5 is a schematic diagram of the detailed steps for exporting data from the MPP database to the Hadoop cluster via TXT relay;
Fig. 6 is a schematic diagram of importing data directly from the Hadoop cluster into the MPP database;
Fig. 7 is a schematic diagram of the detailed steps for importing data directly from the Hadoop cluster into the MPP database;
Fig. 8 is a schematic diagram of importing data from the Hadoop cluster into the MPP database via TXT relay;
Fig. 9 is a schematic diagram of the detailed steps for importing data from the Hadoop cluster into the MPP database via TXT relay;
Fig. 10 is a schematic diagram of the data interchange tool.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with examples. It should be noted that, as long as they do not conflict, the embodiments of the application and the features of those embodiments may be combined with one another.
The invention provides a data interchange tool and a data interchange method for an MPP database and a Hadoop cluster, comprising a method of exchanging data directly between the MPP database and the Hadoop cluster with the data interchange tool, and a method of exchanging the data via TXT relay.
1. As shown in Fig. 2, data is exported directly from the MPP database to the Hadoop cluster: through the data interchange tool, the computing nodes of the MPP database access the data nodes of the Hadoop cluster and write the data directly to the Hadoop cluster, without staging it on any storage unit outside the MPP database and the Hadoop cluster, which makes the export process more efficient. The detailed procedure is shown in Fig. 3; a code sketch of this flow, as seen from the tool, follows the steps below:
Step 301, the data interchange tool starts;
Step 302, status check. The data interchange tool sends a status-check SQL command to the MPP database cluster. After receiving it, the MPP database cluster connects to the Hadoop cluster and checks that the specified Hadoop directory is writable, and also checks the state of each of its own nodes and data shards. If the status check fails, step 307 is executed; otherwise step 303 is executed;
Step 303, export metadata. The data interchange tool sends an export-metadata SQL command to the MPP database; after receiving it, the MPP database cluster exports the metadata to the specified directory of the Hadoop file system. If this fails, step 307 is executed; otherwise step 304 is executed;
Step 304, obtain the tables to be exported from the database. The data interchange tool sends a table-query SQL command with a WHERE condition to the MPP database (no WHERE condition means export everything). After receiving the command, the MPP database cluster runs the query for the tables matching the WHERE condition; if the query fails, step 307 is executed, otherwise the names of the matching tables in the database are returned to the data interchange tool and step 305 is executed;
Step 305, export data table by table. The data interchange tool sends export SQL commands to the MPP database cluster concurrently, one table at a time; after receiving a table-export SQL command, the MPP database cluster performs the export and writes the table data directly to the specified directory on the data nodes of the Hadoop cluster. If a single table fails to export, it is skipped and processing continues with the next table; if N consecutive tables (N specified by the user) fail to export, step 307 is executed; otherwise exporting continues until all tables are done, and then step 306 is executed;
Step 306, the export succeeds and the tool exits normally;
Step 307, the export fails and the execution is aborted.
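The sketch below drives steps 301 through 307 from the tool side, assuming a DB-API-style connection to the MPP cluster. The statements CHECK EXPORT STATUS, EXPORT METADATA and EXPORT TABLE, as well as the system_tables catalog, are placeholders, since the patent does not specify the SQL dialect; the parallel, per-table execution is shown in the scheduler sketch earlier and is collapsed into a simple loop here.

```python
# Sketch of the direct-export flow (steps 301-307) as seen from the data
# interchange tool. All SQL verbs and catalog names are placeholders.
def direct_export(conn, hdfs_dir, where=None):
    cur = conn.cursor()
    # Step 302: status check (Hadoop directory writable, node and shard states)
    cur.execute(f"CHECK EXPORT STATUS TO '{hdfs_dir}'")
    # Step 303: export the metadata to the specified HDFS directory
    cur.execute(f"EXPORT METADATA TO '{hdfs_dir}'")
    # Step 304: query the tables to export, optionally filtered by a WHERE condition
    cur.execute("SELECT table_name FROM system_tables"
                + (f" WHERE {where}" if where else ""))
    tables = [row[0] for row in cur.fetchall()]
    # Step 305: export table by table, straight to the Hadoop data nodes
    for table in tables:
        cur.execute(f"EXPORT TABLE {table} TO '{hdfs_dir}/{table}'")
    cur.close()   # steps 306/307: success or failure is reported to the caller
```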
2. As shown in Fig. 4, data is exported from the MPP database to the Hadoop cluster via TXT relay: through the data interchange tool, the MPP database exports the data to a storage unit outside the MPP database and the Hadoop cluster, and the Hadoop -put command of a Hadoop client then loads the TXT text data from the external storage into the Hadoop cluster. This allows Hadoop to apply secondary processing to the TXT text data before it is imported. The detailed procedure is shown in Fig. 5; the Hadoop -put invocation of step 506 is sketched in code after the steps below:
Step 501, the data interchange tool starts;
Step 502, status check. The data interchange tool sends a status-check SQL command to the MPP database cluster. After receiving it, the MPP database cluster checks the state of each of its own nodes and data shards. If the status check fails, step 508 is executed; otherwise step 503 is executed;
Step 503, export metadata. The data interchange tool sends an export-metadata SQL command to the MPP database; after receiving it, the MPP database cluster exports the metadata to the specified directory of the external storage, in TXT format. If this fails, step 508 is executed; otherwise step 504 is executed;
Step 504, obtain the tables to be exported from the database (according to the specified condition). The export tool sends a table-query SQL command with a WHERE condition to the MPP database (no WHERE condition means export everything). After receiving the command, the MPP database cluster runs the query for the tables matching the WHERE condition; if the query fails, step 508 is executed, otherwise the names of the matching tables in the database are returned to the export tool and step 505 is executed;
Step 505, export data table by table. The export tool sends export SQL commands to the MPP database cluster concurrently, one table at a time; after receiving a table-export SQL command, the MPP database cluster performs the export and writes the table data directly to the specified directory of the external storage. If a single table fails to export, it is skipped and processing continues with the next table; if N consecutive tables (N specified by the user) fail to export, step 508 is executed; otherwise exporting continues until all tables are done, and then step 506 is executed;
Step 506, Hadoop imports the data. A Hadoop client installed on the physical machine hosting the external storage executes the Hadoop -put command to load the TXT data files into the specified Hadoop directory. If the Hadoop import succeeds, the export of data from the MPP database to Hadoop terminates normally and step 507 is executed; otherwise step 508 is executed;
Step 507, the Hadoop import succeeds and the tool exits normally;
Step 508, the execution is aborted.
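Step 506 relies on the standard Hadoop shell. The sketch below shows a minimal invocation from Python; the local and HDFS directory names are placeholders, and only the `hadoop fs -put` command itself (with `hadoop fs -get` as its counterpart for the import direction) is taken from the Hadoop CLI.

```python
# Sketch of step 506: pushing the relay TXT files from the external storage
# into HDFS with the standard `hadoop fs -put` command. A Hadoop client must
# be installed on the machine hosting the external storage; the paths are
# placeholders.
import subprocess

def put_txt_into_hdfs(local_dir="/data/export_txt", hdfs_dir="/interchange/in"):
    result = subprocess.run(
        ["hadoop", "fs", "-put", local_dir, hdfs_dir],
        capture_output=True, text=True,
    )
    if result.returncode != 0:            # corresponds to step 508: abort on failure
        raise RuntimeError(f"hadoop fs -put failed: {result.stderr}")
    return True                           # corresponds to step 507: normal exit
```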
3. As shown in Fig. 6, data is imported directly from the Hadoop cluster into the MPP database: the data does not need to be staged on any storage unit outside the MPP database and the Hadoop cluster, and the computing nodes of the MPP database access the data nodes of the Hadoop cluster directly, which makes the import process more efficient. The detailed procedure is shown in Fig. 7; the per-table failure-handling rule of step 705 is sketched in code after the steps below:
Step 701, the data interchange tool starts;
Step 702, status check. The import tool sends a status-check SQL command to the MPP database cluster. After receiving it, the MPP database cluster connects to the Hadoop cluster and checks that the specified Hadoop directory is readable, and at the same time checks the state of each of its own nodes. If the status check fails, step 707 is executed; otherwise step 703 is executed;
Step 703, import metadata. The import tool sends an import-metadata SQL command to the MPP database; after receiving it, the MPP database cluster imports the metadata from the specified directory of the Hadoop file system. If this fails, step 707 is executed; otherwise step 704 is executed;
Step 704, obtain all tables in the database. The import tool sends a table-query SQL command to the MPP database; after receiving it, the MPP database cluster runs the query for all tables; if the query fails, step 707 is executed, otherwise all table names in the database are returned to the import tool and step 705 is executed;
Step 705, import data table by table. The import tool sends import SQL commands to the MPP database cluster concurrently, one table at a time; after receiving a table-import SQL command, the MPP database cluster performs the import, accessing the data nodes of the Hadoop cluster directly and loading the data into the MPP database. If a single table fails to import, it is skipped and processing continues with the next table; if N consecutive tables (N specified by the user) fail to import, step 707 is executed; otherwise importing continues until all tables are done, and then step 706 is executed;
Step 706, the import succeeds and the tool exits normally;
Step 707, the import fails and the execution is aborted.
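The skip-one-table / abort-after-N-consecutive-failures rule of step 705 (and of the corresponding export step 305) can be expressed compactly. In the sketch below, `process_one_table` is a caller-supplied callable standing in for the actual per-table import or export SQL, which the patent does not spell out; only the control flow follows the patent text.

```python
# Sketch of the failure-handling rule in steps 305/705: a failed table is
# skipped, but N consecutive failures (N chosen by the user) abort the run.
def process_all_tables(tables, process_one_table, max_consecutive_failures):
    consecutive = 0
    failed = []
    for table in tables:
        try:
            process_one_table(table)
            consecutive = 0                       # a success resets the counter
        except Exception:
            failed.append(table)                  # skip this table, keep going
            consecutive += 1
            if consecutive >= max_consecutive_failures:
                raise RuntimeError(               # steps 307/707: abort the run
                    f"{consecutive} consecutive tables failed, aborting")
    return failed                                 # steps 306/706: normal exit
```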
4. As shown in Fig. 8, data is imported from the Hadoop cluster into the MPP database via TXT relay: the Hadoop cluster exports the data as TXT text to a storage unit outside the MPP database and the Hadoop cluster, and the MPP database then imports the TXT text data into the MPP database. The detailed procedure is shown in Fig. 9:
Step 901, Hadoop exports the data. A Hadoop client installed on the physical machine hosting the external storage executes the Hadoop -get command to copy the TXT data files from the specified Hadoop directory to the specified directory of the external storage. If the Hadoop export fails, step 908 is executed; otherwise step 902 is executed;
Step 902, the data interchange tool starts, and step 903 is executed;
Step 903, status check. The data interchange tool sends a status-check SQL command to the MPP database cluster. After receiving it, the MPP database cluster checks the state of each of its own nodes. If the status check fails, step 908 is executed; otherwise step 904 is executed;
Step 904, import metadata. The data interchange tool sends an import-metadata SQL command to the MPP database; after receiving it, the MPP database cluster imports the metadata from the specified directory of the external storage. If this fails, step 908 is executed; otherwise step 905 is executed;
Step 905, obtain all tables in the database. The data interchange tool sends a table-query SQL command to the MPP database; after receiving it, the MPP database cluster runs the query for all tables; if the query fails, step 908 is executed, otherwise all table names in the database are returned to the data interchange tool and step 906 is executed;
Step 906, import data table by table. The data interchange tool sends import SQL commands to the MPP database cluster concurrently, one table at a time; after receiving a table-import SQL command, the MPP database cluster performs the import, loading the data from the specified directory of the external storage. If a single table fails to import, it is skipped and processing continues with the next table; if N consecutive tables (N specified by the user) fail to import, step 908 is executed; otherwise importing continues until all tables are done, and then step 907 is executed;
Step 907, the import succeeds and the tool exits normally;
Step 908, the execution is aborted.
The embodiments of the present invention have been described in detail above, but the content described is only a preferred embodiment of the present invention and cannot be regarded as limiting the scope of the invention. All equivalent changes and improvements made within the scope of the invention shall still fall within the coverage of this patent.

Claims (8)

1. A method for data interchange between an MPP database and a Hadoop cluster, characterized in that it comprises:
(1) using a data interchange tool to export data directly from the MPP database to the Hadoop cluster, or exporting the data from the MPP database to the Hadoop cluster via TXT relay;
(2) using the data interchange tool to import data directly from the Hadoop cluster into the MPP database, or importing the data from the Hadoop cluster into the MPP database via TXT relay.
2. The MPP database and Hadoop cluster data interchange method according to claim 1, characterized in that the steps of using the data interchange tool to export data directly from the MPP database to the Hadoop cluster are:
(1) the data interchange tool starts;
(2) status check: the data interchange tool sends a status-check SQL command to the MPP database cluster; after receiving it, the MPP database cluster connects to the Hadoop cluster and checks that the specified Hadoop directory is writable, and also checks the state of each of its own nodes and data shards;
(3) export metadata: the data interchange tool sends an export-metadata SQL command to the MPP database; after receiving it, the MPP database cluster exports the metadata to the specified directory of the Hadoop file system;
(4) obtain the tables to be exported from the database;
(5) export data table by table: the data interchange tool sends export SQL commands to the MPP database cluster concurrently, one table at a time; after receiving a table-export SQL command, the MPP database cluster performs the export and writes the table data directly to the specified directory on the data nodes of the Hadoop cluster;
(6) on success, the export exits normally;
(7) on failure, the execution is aborted.
3. The MPP database and Hadoop cluster data interchange method according to claim 1, characterized in that the steps of exporting the data from the MPP database to the Hadoop cluster via TXT relay are:
(1) the data interchange tool starts;
(2) status check: the data interchange tool sends a status-check SQL command to the MPP database cluster; after receiving it, the MPP database cluster checks the state of each of its own nodes and data shards;
(3) export metadata: the data interchange tool sends an export-metadata SQL command to the MPP database; after receiving it, the MPP database cluster exports the metadata to the specified directory of the external storage, in TXT format;
(4) obtain the tables to be exported from the database;
(5) export data table by table: the data interchange tool sends export SQL commands to the MPP database cluster concurrently, one table at a time; after receiving a table-export SQL command, the MPP database cluster performs the export and writes the table data directly to the specified directory of the external storage;
(6) Hadoop imports the data: a Hadoop client installed on the physical machine hosting the external storage executes the Hadoop -put command to load the TXT data files into the specified Hadoop directory;
(7) on success, the Hadoop import exits normally;
(8) on failure, the execution is aborted.
4. The MPP database and Hadoop cluster data interchange method according to claim 1, characterized in that the steps of using the data interchange tool to import data directly from the Hadoop cluster into the MPP database are:
(1) the data interchange tool starts;
(2) status check: the data interchange tool sends a status-check SQL command to the MPP database cluster; after receiving it, the MPP database cluster connects to the Hadoop cluster and checks that the specified Hadoop directory is readable, and at the same time checks the state of each of its own nodes;
(3) import metadata: the import tool sends an import-metadata SQL command to the MPP database; after receiving it, the MPP database cluster imports the metadata from the specified directory of the Hadoop file system;
(4) obtain the tables to be imported in the database;
(5) import data table by table: the data interchange tool sends import SQL commands to the MPP database cluster concurrently, one table at a time; after receiving a table-import SQL command, the MPP database cluster performs the import, accessing the data nodes of the Hadoop cluster directly and loading the data into the MPP database;
(6) on success, the import exits normally;
(7) on failure, the execution is aborted.
5. The MPP database and Hadoop cluster data interchange method according to claim 1, characterized in that the steps of importing the data from the Hadoop cluster into the MPP database via TXT relay are:
(1) Hadoop exports the data: a Hadoop client installed on the physical machine hosting the external storage executes the Hadoop -get command to copy the TXT data files from the specified Hadoop directory to the specified directory of the external storage;
(2) the data interchange tool starts;
(3) status check: the data interchange tool sends a status-check SQL command to the MPP database cluster; after receiving it, the MPP database cluster checks the state of each of its own nodes;
(4) import metadata: the data interchange tool sends an import-metadata SQL command to the MPP database; after receiving it, the MPP database cluster imports the metadata from the specified directory of the external storage;
(5) obtain all tables in the database;
(6) import data table by table: the data interchange tool sends import SQL commands to the MPP database cluster concurrently, one table at a time; after receiving a table-import SQL command, the MPP database cluster performs the import, loading the data from the specified directory of the external storage;
(7) on success, the import exits normally;
(8) on failure, the execution is aborted.
6. The MPP database and Hadoop cluster data interchange method according to claim 1, characterized in that filtered export is supported when exporting data from the MPP database, the filter being expressed by supplying an SQL statement with a WHERE condition.
7. A data interchange tool for an MPP database and a Hadoop cluster, comprising a main control module, a parameter parsing module, connectors, an export/import scheduler, worker threads, a logging module, an SQL construction module, and a worker thread pool; the main control module is connected to the parameter parsing module, the worker thread pool, and the export/import scheduler; the logging module is connected to the parameter parsing module, the SQL construction module, the worker threads, the worker thread pool, and the connectors; the export/import scheduler is connected to the connectors, the worker threads, the worker thread pool, and the SQL construction module; the connectors are connected to the worker threads; and the worker threads are connected to the SQL construction module.
8. An implementation method of the data interchange tool for an MPP database and a Hadoop cluster, characterized in that it comprises the following steps:
(1) the user starts the tool from the command line and supplies configuration information; the main control module starts with the tool, first creates a global tool-run log instance through the logging module, and then initializes the other modules;
(2) the main control module receives the configuration information entered by the user as a character string and passes it to the parameter parsing module for further parsing;
(3) the parameter parsing module parses the user's character-string configuration into configuration information the program can use internally and returns it to the main control module;
(4) the main control module starts the export/import scheduler, which carries out the export (import) work;
(5) the export/import scheduler creates a main connector instance and connects to the MPP database through the main connector;
(6) the export/import scheduler builds the status-check SQL through the SQL construction module and executes the status check through the main connector;
(7) the export/import scheduler builds the export (import) metadata SQL through the SQL construction module and executes it through the main connector;
(8) the export/import scheduler builds, through the SQL construction module, the SQL that queries the tables to be exported (imported), executes the query through the main connector, and obtains the list of tables to be exported (imported);
(9) the export/import scheduler creates a global job-scheduling log instance through the logging module;
(10) the export/import scheduler obtains worker threads from the thread pool module, the number of which equals the configured export (import) parallelism, creates the corresponding number of working connectors, assigns one working connector to each worker thread, and then starts all jobs, with each worker thread processing export (import) jobs in parallel; the export (import) of a single table is called a job, and a job consists of: first, connecting to the MPP database through the working connector; second, building the export (import) SQL through the SQL construction module; third, executing the export (import) through the working connector;
(11) the export/import scheduler collects the export (import) job results of each worker thread, consolidates them into the overall export (import) result, and returns it to the main control module, which finally returns the export (import) result to the user.
CN201410820059.5A 2014-12-24 2014-12-24 MPP database and Hadoop cluster data interchange method, tool and implementation method Active CN104572895B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410820059.5A CN104572895B (en) 2014-12-24 2014-12-24 MPP database and Hadoop cluster data interchange method, tool and implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410820059.5A CN104572895B (en) 2014-12-24 2014-12-24 MPP database and Hadoop cluster data interchange method, tool and implementation method

Publications (2)

Publication Number Publication Date
CN104572895A true CN104572895A (en) 2015-04-29
CN104572895B CN104572895B (en) 2018-02-23

Family

ID=53088957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410820059.5A Active CN104572895B (en) MPP database and Hadoop cluster data interchange method, tool and implementation method

Country Status (1)

Country Link
CN (1) CN104572895B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050044086A1 (en) * 2003-08-18 2005-02-24 Cheng-Hwa Liu Symmetry database system and method for data processing
CN101187937A (en) * 2007-10-30 2008-05-28 北京航空航天大学 Mode multiplexing isomerous database access and integration method under gridding environment
CN101944128A (en) * 2010-09-25 2011-01-12 中兴通讯股份有限公司 Data export and import method and device
US20130110799A1 (en) * 2011-10-31 2013-05-02 Sally Blue Hoppe Access to heterogeneous data sources

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
辛晃 et al.: "Research on a telecom operator network data sharing platform based on the Hadoop+MPP architecture" (基于Hadoop+MPP架构的电信运营商网络数据共享平台研究), 《电信科学》 (Telecommunications Science) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320755A (en) * 2015-10-14 2016-02-10 夏君 Secure high-speed data transmission method
CN105933446A (en) * 2016-06-28 2016-09-07 中国农业银行股份有限公司 Service dual-active implementation method and system of big data platform
CN106227862A (en) * 2016-07-29 2016-12-14 浪潮软件集团有限公司 E-commerce data integration method based on distribution
CN107402995A (en) * 2016-09-21 2017-11-28 广州特道信息科技有限公司 A kind of distributed newSQL Database Systems and method
CN107402995B (en) * 2016-09-21 2020-06-09 云润大数据服务有限公司 Distributed newSQL database system and method
CN107622094A (en) * 2017-08-30 2018-01-23 苏州朗动网络科技有限公司 A kind of high-volume data guiding system and method based on search engine
CN107679192A (en) * 2017-10-09 2018-02-09 中国工商银行股份有限公司 More cluster synergistic data processing method, system, storage medium and equipment
CN110019469A (en) * 2017-12-07 2019-07-16 中兴通讯股份有限公司 Distributed data base data processing method, device, storage medium and electronic device
WO2019109854A1 (en) * 2017-12-07 2019-06-13 中兴通讯股份有限公司 Data processing method and device for distributed database, storage medium, and electronic device
CN110019469B (en) * 2017-12-07 2022-06-21 金篆信科有限责任公司 Distributed database data processing method and device, storage medium and electronic device
US11928089B2 (en) 2017-12-07 2024-03-12 Zte Corporation Data processing method and device for distributed database, storage medium, and electronic device
CN108446145A (en) * 2018-03-21 2018-08-24 苏州提点信息科技有限公司 A kind of distributed document loads MPP data base methods automatically
CN112632114A (en) * 2019-10-08 2021-04-09 中国移动通信集团辽宁有限公司 Method and device for MPP database to quickly read data and computing equipment
CN112632114B (en) * 2019-10-08 2024-03-19 中国移动通信集团辽宁有限公司 Method, device and computing equipment for fast reading data by MPP database
CN110716802A (en) * 2019-10-11 2020-01-21 恩亿科(北京)数据科技有限公司 Cross-cluster task scheduling system and method
CN111143403A (en) * 2019-12-10 2020-05-12 跬云(上海)信息科技有限公司 SQL conversion method and device and storage medium
CN111416861A (en) * 2020-03-20 2020-07-14 中国建设银行股份有限公司 Communication management system and method
CN114138750A (en) * 2021-12-03 2022-03-04 无锡星凝互动科技有限公司 AI consultation database cluster building method and system
CN116010337A (en) * 2022-12-05 2023-04-25 广州海量数据库技术有限公司 Method for accessing ORC data by openGauss

Also Published As

Publication number Publication date
CN104572895B (en) 2018-02-23

Similar Documents

Publication Publication Date Title
CN104572895A (en) MPP (Massively Parallel Processor) database and Hadoop cluster data intercommunication method, tool and realization method
JP5298117B2 (en) Data merging in distributed computing
CN104965735B (en) Device for generating upgrading SQL scripts
US9128991B2 (en) Techniques to perform in-database computational programming
US10102039B2 (en) Converting a hybrid flow
US8682876B2 (en) Techniques to perform in-database computational programming
EP3751426A1 (en) System and method for migration of a legacy datastore
CN104133772A (en) Automatic test data generation method
US9043344B1 (en) Data mining and model generation using an in-database analytic flow generator
CN103425762A (en) Telecom operator mass data processing method based on Hadoop platform
US9563650B2 (en) Migrating federated data to multi-source universe database environment
CN106528898A (en) Method and device for converting data of non-relational database into relational database
CN106776962A (en) A kind of general Excel data import multiple database physical table methods
US20170060977A1 (en) Data preparation for data mining
CN105677687A (en) Data processing method and device
CN112214453B (en) Large-scale industrial data compression storage method, system and medium
CN108829884A (en) data mapping method and device
CN114218218A (en) Data processing method, device and equipment based on data warehouse and storage medium
CN111290813B (en) Software interface field data standardization method, device, equipment and medium
CN111126852A (en) BI application system based on big data modeling
CN108255852B (en) SQL execution method and device
CN102023859A (en) Digital development environment-oriented software integration method with reliability, maintainability and supportability
CN111125064A (en) Method and device for generating database mode definition statement
CN105653830A (en) Data analysis method based on model driving
CN109829003A (en) Database backup method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant