CN105243155A - Big data extracting and exchanging system

Info

Publication number: CN105243155A
Application number: CN201510711186.6A
Authority: CN
Inventors: 姬源; 黄育松; 谢冬; 王向东
Original assignee: Electric Power Dispatch Control Center of Guizhou Power Grid Co Ltd
Current assignee: Electric Power Dispatch Control Center of Guizhou Power Grid Co Ltd
Priority date: 2015-10-29
Filing date: 2015-10-29
Publication date: 2016-01-13

Abstract

The invention discloses a big data extraction and exchange method and system. A control and exchange center deployed on Spark is combined with a number of exchange agents to support bidirectional data flow between relational databases, unstructured documents, and sensor databases on one side and the Hive, HBase, and HDFS systems of the Hadoop platform on the other. Efficient data exchange is achieved by scheduling tasks in parallel and holding all intermediate data in memory.

Description

A big data extraction and exchange system

Technical field

The present invention relates to a method and system for big data extraction and exchange. A control and exchange center deployed on the Spark platform, combined with a number of exchange agents, supports bidirectional data flow between relational databases, unstructured documents, and sensor databases and the Hive, HBase, and HDFS systems of the Hadoop platform. Efficient data exchange is achieved by scheduling tasks in parallel and storing all intermediate data in memory.

Background technology

As enterprise data volumes keep growing, the data that computers must process has grown from the MB level to the TB or even PB level. A single server can no longer store and analyze all of an enterprise's data, so the data must be extracted and aggregated into a big data platform for analysis and processing. An enterprise's legacy systems usually hold many types of data: business data stored in relational database systems, document information and log files stored as files, and real-time monitoring data from large numbers of sensors. Collecting all of this data efficiently and in real time is the first step toward a successful big data project.

Hadoop is currently the most widely used big data platform software. Hadoop provides a runtime environment for MapReduce programs and supports distributed task execution. HDFS is a distributed file system that stores data in multiple replicas and is therefore highly fault tolerant. However, HDFS does not allow file contents to be modified; content can only be appended. Hive is a data warehouse whose data is stored in HDFS as unstructured text. Its upper layer provides an SQL-like query interface and a translation engine that automatically translates query statements into MapReduce programs for execution. Because the data is stored in HDFS, data in Hive can only be read, not modified. HBase is a column-oriented database in which data is accessed by primary key. It does not support SQL queries, but it has very high throughput and supports data modification.

Several single-purpose big data collection systems exist today. One is Sqoop, from the Hadoop ecosystem, which supports parallel data extraction from relational databases such as Oracle, SQL Server, and MySQL and runs extraction tasks in parallel through MapReduce. Another is the distributed message collection system Kafka, a high-throughput distributed publish-subscribe messaging system that can process all of the action stream data of a consumer-scale website. Such actions (page views, searches, and other user actions) are a key ingredient of many social features on the modern web. Because of the throughput requirements, this data is usually handled by processing logs and log aggregation. A further example is the distributed crawler system Nutch, which crawls data from the Internet in parallel and stores it in the Hadoop file system.

Tools for converting between relational databases are widely used in enterprises; Oracle and SQL Server, for example, both provide tools for exporting data to other databases. Informatica and IBM also have related products that support conversion between structured data such as relational databases and semi-structured data such as XML. However, no dedicated system yet supports convenient exchange between the systems in a big data platform and traditional relational databases and the like. Big data systems are numerous and still growing in number; NoSQL databases alone number in the dozens. Providing a good system architecture that lets these databases plug into an exchange system is a challenging problem.

At present these big data collection systems exist independently of one another, and Hadoop's loading mechanism is limited. For example, data extracted from a relational database can only be loaded into Hive; it cannot be loaded into HBase to serve fast queries. Moreover, once data has been loaded into Hadoop, there is no method for moving it between Hadoop's subsystems. For example, data in Hive may need large-scale cleaning, but Hive itself does not support data modification, so the data has to be moved to HBase for processing.

Summary of the invention

In view of the above defects of the prior art, the technical problem to be solved by the present invention is to provide a big data extraction and exchange system that supports bidirectional data flow between relational databases, unstructured documents, and sensor databases and the Hive, HBase, and HDFS systems of the Hadoop platform, and that achieves efficient data exchange by scheduling tasks in parallel and storing all intermediate data in memory.

To achieve the above object, the invention provides a big data extraction and exchange system comprising a control and exchange center deployed on the Spark platform. The Spark platform and the Hadoop platform are deployed in the same cluster through the Yarn resource management framework. The in-memory objects of the control and exchange center are stored by Spark, and all intermediate data and all conversion tasks between different types of data models are also handled by Spark;

The system comprises relational database systems, unstructured documents, and sensor data, all distributed across different servers;

The system comprises an independently cluster-deployed Hadoop big data platform, which includes the HDFS, HBase, and Hive subsystems and is used to load the extracted data and provide analysis functions;

The system comprises exchange agents deployed on the different data source systems or on the control and exchange center, which interact with the data sources through remote interfaces;

The system comprises control message channels and data channels between the exchange agents and the control and exchange center;

The control and exchange center comprises a task scheduling module, an in-memory object management module, and a data conversion module;

The task scheduling module schedules the exchange agents' data extraction tasks, data loading tasks, data model conversion tasks, and data transfer tasks;

The in-memory object management module manages the storage and updating of intermediate data;

The data conversion module converts between the different data models and the unified in-memory objects;

The control and exchange center notifies the exchange agent of a data source to extract data and transfer it to the control and exchange center; the control and exchange center then converts the source data model into the in-memory object model;

The control and exchange center is also responsible for the unified scheduling of tasks:

A) The data conversion tasks of the control and exchange center are developed in the programming languages that Spark provides;

B) The execution order of tasks is scheduled according to demand or according to resource utilization;

When memory space is insufficient to store newly arriving data, the control and exchange center notifies the exchange agents, according to the scheduling policy, to suspend their data extraction tasks; the tasks resume once enough memory space is available.
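The suspend-and-resume behavior described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the names `ExchangeAgent`, `MemoryGate`, `try_store`, and `release` are hypothetical and do not come from the disclosure, memory is modeled as a simple byte counter rather than Spark-managed memory, and agents are resumed as soon as any memory is freed.

```python
from dataclasses import dataclass, field


@dataclass
class ExchangeAgent:
    """Hypothetical stand-in for an agent running on a data source."""
    name: str
    paused: bool = False

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False


@dataclass
class MemoryGate:
    """Tracks memory used by intermediate data and pauses agents when full."""
    capacity_bytes: int
    used_bytes: int = 0
    agents: list = field(default_factory=list)

    def try_store(self, agent: ExchangeAgent, size: int) -> bool:
        """Accept a batch if it fits; otherwise tell the agent to pause."""
        if self.used_bytes + size > self.capacity_bytes:
            agent.pause()
            return False
        self.used_bytes += size
        return True

    def release(self, size: int) -> None:
        """Free memory after a batch is loaded, then resume paused agents.

        Simplification: agents resume whenever memory is freed, without
        checking that their next batch will actually fit.
        """
        self.used_bytes = max(0, self.used_bytes - size)
        for agent in self.agents:
            if agent.paused:
                agent.resume()


oracle = ExchangeAgent("oracle-source")
gate = MemoryGate(capacity_bytes=100, agents=[oracle])
first = gate.try_store(oracle, 80)    # fits
second = gate.try_store(oracle, 40)   # would exceed capacity, agent pauses
paused_after_reject = oracle.paused
gate.release(80)                      # target load finished, memory freed
third = gate.try_store(oracle, 40)    # extraction continues
```

A fuller scheduler would presumably apply the scheduling policy mentioned in the text when choosing which agent to suspend; the sketch pauses only the agent whose batch did not fit.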

Preferably, to handle system failures, the control and exchange center writes a log record before each operation. After a failure the system is restarted, the state before the failure is recovered, and all lost data is extracted again to rebuild the memory space.

Preferably, the control and exchange center uses a unified in-memory object model to store the intermediate data of a data exchange. The data of each data source is mapped and converted between its own data model and the in-memory object model by an exchange agent. The unified in-memory object model stores data in Spark RDD form. Data is relayed through memory and is not written to disk.

Preferably, the control and exchange center manages waiting tasks through a queue.

The beneficial effects of the invention are as follows. The invention supports data exchange between the different data types in legacy systems and the different types of systems in a big data platform, and also supports data exchange between different systems within the big data platform, so it can meet different processing demands. All exchange tasks are scheduled and controlled in a unified way, which can improve the efficiency of data exchange.

Brief description of the drawings

Fig. 1 is a structural diagram of an embodiment of the invention.

Fig. 2 shows an example of exchange from a relational database to the Hive system.

Fig. 3 shows an example of exchange from a sensor database to the HBase system.

Fig. 4 shows an example of exchange from a file system to HDFS.

Embodiment

The invention is further described below with reference to the drawings and embodiments. A big data extraction and exchange system comprises a control and exchange center deployed on the Spark platform. The Spark platform and the Hadoop platform are deployed in the same cluster through the Yarn resource management framework. The in-memory objects of the control and exchange center are stored by Spark, and all intermediate data and all conversion tasks between different types of data models are also handled by Spark;

The system comprises relational database systems, unstructured documents, and sensor data, all distributed across different servers; data extraction and loading are carried out by the exchange agents.

The in-memory object management module manages the storage and updating of intermediate data; once data has been converted to the target data model, the original data objects can be deleted so that memory space is reclaimed as early as possible.

A) The data conversion tasks of the control and exchange center are developed in the programming languages that Spark provides, and task execution uses distributed memory;

In this embodiment, a system failure may cause all in-memory data to be lost. The control and exchange center therefore writes a log record before each operation. After a failure the system is restarted, the state before the failure is recovered, and all lost data is extracted again to rebuild the memory space.
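One way to picture the log-then-recover step is the following minimal Python sketch. It rests on assumptions that are not in the disclosure: the log is a plain in-memory list standing in for durable storage, the record layout and the operation names `extract` and `release` are invented for illustration, and a batch counts as lost if it was extracted but its memory was never released after loading.

```python
import json


class OperationLog:
    """Append-only log; the list stands in for durable storage (assumption)."""

    def __init__(self):
        self.records = []

    def append(self, op: str, batch_id: str) -> None:
        # Called BEFORE the operation itself is carried out.
        self.records.append(json.dumps({"op": op, "batch": batch_id}))


def batches_to_reextract(log: OperationLog) -> list:
    """Replay the log and return batches that were in memory at the failure."""
    pending = {}
    for line in log.records:
        rec = json.loads(line)
        if rec["op"] == "extract":
            pending[rec["batch"]] = True
        elif rec["op"] == "release":
            # The batch had been loaded and its memory was being freed.
            pending.pop(rec["batch"], None)
    return list(pending)


log = OperationLog()
log.append("extract", "b1")
log.append("release", "b1")   # b1 was loaded; nothing to recover
log.append("extract", "b2")   # failure happens while b2 and b3 are in memory
log.append("extract", "b3")
lost = batches_to_reextract(log)
```

After a restart, the control center would ask the corresponding agents to extract the batches in `lost` again, which rebuilds the memory space.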

In this embodiment, the control and exchange center uses a unified in-memory object model to store the intermediate data of a data exchange. The data of each data source is mapped and converted between its own data model and the in-memory object model by an exchange agent. The unified in-memory object model stores data in Spark RDD form. Data is relayed through memory and is not written to disk.

The control and exchange center uses a large amount of memory to store intermediate data, so tasks must be scheduled to avoid running short of memory. In this embodiment, the control and exchange center manages waiting tasks through a queue.
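As an illustration of queue-managed waiting tasks, the Python sketch below admits tasks in arrival order only while their memory requirement fits. The class `WaitingQueue`, its methods, and the strict first-in-first-out policy are assumptions made for this example; the disclosure does not specify the queue discipline.

```python
from collections import deque


class WaitingQueue:
    """FIFO queue of tasks waiting for memory; all names are illustrative."""

    def __init__(self, free_bytes: int):
        self.free_bytes = free_bytes
        self.waiting = deque()   # (task_id, need_bytes) in arrival order
        self.running = []

    def submit(self, task_id: str, need_bytes: int) -> None:
        self.waiting.append((task_id, need_bytes))
        self._dispatch()

    def finish(self, task_id: str, need_bytes: int) -> None:
        self.running.remove(task_id)
        self.free_bytes += need_bytes
        self._dispatch()

    def _dispatch(self) -> None:
        # Start tasks in arrival order while the head of the queue fits.
        while self.waiting and self.waiting[0][1] <= self.free_bytes:
            task_id, need = self.waiting.popleft()
            self.free_bytes -= need
            self.running.append(task_id)


q = WaitingQueue(free_bytes=100)
q.submit("t1", 60)   # starts immediately
q.submit("t2", 60)   # waits: only 40 bytes are free
q.submit("t3", 30)   # waits behind t2 (strict FIFO)
running_before = list(q.running)
q.finish("t1", 60)   # frees memory, so t2 and then t3 start
running_after = list(q.running)
```

Strict FIFO keeps the sketch simple but lets a large task block smaller ones behind it; a policy driven by resource utilization would be an alternative.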

Fig. 1 is the overall system composition diagram. The left side of the figure represents the data resources of existing legacy systems, and the right side represents the big data platform, currently built with Hadoop. The middle part of the figure is the control and exchange center, deployed on the Spark platform, which is responsible for scheduling and executing big data extraction and exchange tasks. An exchange agent is an independent service deployed on the machine that hosts a data source or a data loading target; it handles data interaction and communication with that data source or loading target. An exchange agent can also be deployed together with the control and exchange center.

The control and exchange center records the interaction mode that the user selects between two agents, including the mapping rules for metadata and data, the time and frequency at which the data exchange runs, and whether the exchange handles the full data set or incremental data. The control center is responsible for sending messages to the agents, ordering them to perform the corresponding operations, and for scheduling tasks by priority according to user-defined rules.
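The recorded exchange settings and the priority scheduling can be sketched in Python as follows. Everything here is hypothetical: the `ExchangeJob` fields, the convention that a lower number means higher priority, and the job names are assumptions chosen to mirror the items listed in the text (mapping rules, timing and frequency, full or incremental mode).

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ExchangeJob:
    """One recorded exchange between a source agent and a target agent."""
    priority: int                                   # lower runs first (assumption)
    name: str = field(compare=False)
    source_agent: str = field(compare=False)
    target_agent: str = field(compare=False)
    mode: str = field(compare=False, default="full")        # "full" or "incremental"
    interval_minutes: int = field(compare=False, default=60)
    column_mapping: dict = field(compare=False, default_factory=dict)


def run_order(jobs: list) -> list:
    """Return job names in the order a priority scheduler would start them."""
    heap = list(jobs)
    heapq.heapify(heap)
    return [heapq.heappop(heap).name for _ in range(len(heap))]


jobs = [
    ExchangeJob(2, "meter-records-to-hive", "oracle", "hive",
                mode="full", interval_minutes=1440),
    ExchangeJob(1, "sensor-to-hbase", "kv", "hbase",
                mode="incremental", interval_minutes=5),
    ExchangeJob(3, "logs-to-hdfs", "file", "hdfs",
                mode="incremental", interval_minutes=1),
]
order = run_order(jobs)
```

Only `priority` takes part in comparisons, so two jobs with the same priority are treated as equal by the heap and may start in either order.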

All data exchanges are carried out by the control center directing the agents. Concrete data exchange tasks must be developed in advance; they can be provided by the system or custom-developed by the application side. A data exchange task is a conversion between the data model of a data source and the in-memory object model. All data exchange tasks are developed in Scala or Java, which the Spark platform supports, and are finally deployed on the Spark platform for execution, which enables parallel extraction and conversion. The purpose is to use the Spark platform's distributed memory technology to manage massive amounts of intermediate data.

Agents come in different types, and a corresponding agent is developed for each type of data conversion. For example, to move data from a relational database to HBase, we develop a corresponding agent S, responsible for extracting data from the relational database, and an agent D, responsible for loading data into HBase. The agent-plus-control-center architecture supports the extensibility of the system: for a new kind of data source, developing the corresponding exchange agent is enough to let its data be exchanged with the systems already connected.
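The extensibility argument can be made concrete with a small Python sketch of a pluggable agent interface. The text states that real exchange tasks are written in Scala or Java and run on Spark; Python is used here only for brevity. The interface names, the registry methods, and the use of plain lists and dicts in place of a real database and a real HBase table are all assumptions.

```python
from abc import ABC, abstractmethod


class SourceAgent(ABC):
    """Extraction side of an agent (hypothetical interface)."""

    @abstractmethod
    def extract(self) -> list:
        """Return records from the data source as a list of dicts."""


class TargetAgent(ABC):
    """Loading side of an agent (hypothetical interface)."""

    @abstractmethod
    def load(self, records: list) -> int:
        """Write records to the target and return how many were written."""


class RelationalSourceAgent(SourceAgent):
    """Plays the role of agent S; a list of dicts stands in for a table."""

    def __init__(self, rows: list):
        self.rows = rows

    def extract(self) -> list:
        return [dict(row) for row in self.rows]


class HBaseTargetAgent(TargetAgent):
    """Plays the role of agent D; a dict stands in for an HBase table."""

    def __init__(self):
        self.table = {}

    def load(self, records: list) -> int:
        for rec in records:
            self.table[str(rec["id"])] = rec
        return len(records)


class ControlCenter:
    """Registers agents by name and runs an exchange between two of them."""

    def __init__(self):
        self.sources = {}
        self.targets = {}

    def register_source(self, name: str, agent: SourceAgent) -> None:
        self.sources[name] = agent

    def register_target(self, name: str, agent: TargetAgent) -> None:
        self.targets[name] = agent

    def exchange(self, source: str, target: str) -> int:
        records = self.sources[source].extract()   # intermediate data stays in memory
        return self.targets[target].load(records)


center = ControlCenter()
hbase = HBaseTargetAgent()
center.register_source("oracle", RelationalSourceAgent(
    [{"id": 1, "kwh": 12.5}, {"id": 2, "kwh": 7.0}]))
center.register_target("hbase", hbase)
loaded = center.exchange("oracle", "hbase")
```

Supporting a new data source then means adding one more `SourceAgent` subclass and registering it; the control center code is unchanged.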

An agent can also be deployed at the control and exchange center. For data sources that provide a service interface, such as relational databases, an agent located at the control and exchange center can extract the data through a remote interface.

Three concrete scenarios are described next.

The first scenario is shown in Fig. 2. At a subsidiary of a national power grid company, a large volume of customer electricity consumption records stored in a relational database needs to be analyzed, but the traditional database cannot meet the demand for high-performance query and analysis, so the data must be loaded into the Hive system of the Hadoop platform for analysis.

First, an exchange agent for the corresponding database is selected, for example an Oracle exchange agent, to extract the data and transfer it to the control and exchange center. The control and exchange center converts the relational data into the in-memory object model, in which each table corresponds to a class. The in-memory objects are stored in a Spark RDD, that is, in distributed memory. The control and exchange center then converts the in-memory objects in the RDD into the Hive data model and hands them to the Hive exchange agent, which writes the data into the Hive system.
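The two conversions in this scenario (relational rows to in-memory objects, then in-memory objects to the Hive data model) can be sketched in Python as follows. The sketch makes several assumptions: the table and its columns are invented, a plain Python list stands in for the Spark RDD, and the target Hive table is assumed to be a text table whose fields are separated by tabs.

```python
from dataclasses import dataclass, astuple


@dataclass
class MeterReading:
    """One class per relational table; this table is hypothetical."""
    meter_id: int
    read_date: str
    kwh: float


def rows_to_objects(rows: list) -> list:
    """Convert relational rows (tuples) into in-memory objects."""
    return [MeterReading(*row) for row in rows]


def to_hive_line(obj: MeterReading, delimiter: str = "\t") -> str:
    """Render one object as a delimited text line for a Hive text table.

    Assumes the Hive table was declared with the same field delimiter.
    """
    return delimiter.join(str(value) for value in astuple(obj))


rows = [(1001, "2015-10-01", 12.5), (1002, "2015-10-01", 7.0)]
objects = rows_to_objects(rows)          # list stands in for the Spark RDD
hive_lines = [to_hive_line(o) for o in objects]
```

In the described system the same two steps would run as Spark tasks over an RDD, so they could be applied to partitions in parallel.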

Before analysis in Hive, a large amount of dirty data is found and must be cleaned. Hive does not support modifying data, so part of the data is moved to HBase for cleaning and then written back to Hive. This requires data transfer in both directions between Hive and HBase; as the figure shows, the arrows between the agents, the control and exchange center, and the big data systems are all bidirectional. A Hive agent and an HBase agent are selected. Because the Hive and HBase data models differ, the user must define rules that select which Hive tables are converted to HBase, which column serves as the HBase row key, and which columns are placed, and in what form, into HBase column families. The invention does not list the concrete mapping methods in detail; they can be implemented in several ways.
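Since the text leaves the mapping method open, the following Python sketch shows just one possible form of a user-defined rule. The rule structure `HBaseMappingRule`, the `family:column` cell naming, and the example columns are assumptions; dicts of strings stand in for Hive rows and HBase cells.

```python
from dataclasses import dataclass


@dataclass
class HBaseMappingRule:
    """User-defined rule: which column is the key, which go in which family."""
    key_column: str
    families: dict   # column family name -> list of Hive column names


def hive_row_to_hbase(row: dict, rule: HBaseMappingRule):
    """Map one Hive row to an HBase row key plus 'family:column' cells."""
    row_key = str(row[rule.key_column])
    cells = {
        f"{family}:{col}": str(row[col])
        for family, cols in rule.families.items()
        for col in cols
    }
    return row_key, cells


def hbase_to_hive_row(row_key: str, cells: dict, rule: HBaseMappingRule) -> dict:
    """Reverse mapping, used when cleaned data is written back to Hive."""
    row = {rule.key_column: row_key}
    for name, value in cells.items():
        _family, col = name.split(":", 1)
        row[col] = value
    return row


rule = HBaseMappingRule(key_column="meter_id",
                        families={"d": ["read_date", "kwh"]})
dirty = {"meter_id": 1001, "read_date": "2015-10-01", "kwh": -5.0}
key, cells = hive_row_to_hbase(dirty, rule)
cells["d:kwh"] = "0.0"                      # cleaning step done in HBase
cleaned = hbase_to_hive_row(key, cells, rule)
```

Values come back as strings because HBase stores untyped bytes; restoring the original Hive column types would need the table schema, which the sketch omits.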

To write modified data back to Hive, the corresponding original data table can be deleted first and the modified data then written; alternatively, the modified data can be written to a new table that coexists with the original data in the Hive system.

The second scenario is shown in Fig. 3. In the concrete deployment environment, sensors are installed on a large number of power devices and in their surroundings to collect information such as device operating parameters, temperature, and humidity in real time, and this information is stored in a key-value database on the sensor server. Because the accumulated data volume is very large, the data now needs to be migrated to HBase. A key-value database exchange agent can be selected for data extraction and an HBase exchange agent for data loading. The key-value data type maps easily onto the HBase data model; HBase is in effect reduced to a key-value database.

The third scenario is shown in Fig. 4. Some servers continuously produce large numbers of files, which may come from web crawlers or from server logs. Key information must be extracted from these files in real time and stored in HDFS at high speed. To improve data throughput, a special agent can be designed. This agent contains a hook program that captures the operating system's file messages. When a server is about to perform a file write operation, the agent synchronously parses the message, obtains the file content, and places it in memory. The content is then sent by the exchange agent to the control and exchange center and handed to the HDFS exchange agent for writing. In this way all intermediate data resides in memory, which greatly reduces disk I/O and improves data transfer efficiency.
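The in-memory path from a captured file write to the HDFS agent can be sketched in Python as follows. The sketch does not hook the operating system; `on_write` is simply called by hand where a real hook would fire. The class names, the choice of `ERROR` as the marker for key information, and the dict standing in for HDFS are assumptions.

```python
import queue


def extract_key_lines(content: bytes, marker: bytes = b"ERROR") -> bytes:
    """Keep only the lines that contain the marker (illustrative rule)."""
    return b"\n".join(line for line in content.splitlines() if marker in line)


class FileWriteHook:
    """Hypothetical hook: receives file-write events, keeps content in memory."""

    def __init__(self, outbox: queue.Queue):
        self.outbox = outbox

    def on_write(self, path: str, content: bytes) -> None:
        # No temporary file is written; the extract goes straight to the queue.
        self.outbox.put((path, extract_key_lines(content)))


class InMemoryHdfsAgent:
    """Stand-in for the HDFS agent; a dict of byte strings replaces HDFS."""

    def __init__(self):
        self.files = {}

    def drain(self, inbox: queue.Queue) -> int:
        """Append every queued item to its target file; return the count."""
        written = 0
        while True:
            try:
                path, data = inbox.get_nowait()
            except queue.Empty:
                return written
            self.files[path] = self.files.get(path, b"") + data  # append only
            written += 1


channel = queue.Queue()          # plays the role of the data channel
hook = FileWriteHook(channel)
agent = InMemoryHdfsAgent()
hook.on_write("/var/log/app.log", b"INFO ok\nERROR disk full\nERROR retry\n")
count = agent.drain(channel)
```

The queue is the only place the extracted content lives between capture and write, which mirrors the claim that intermediate data never touches disk.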

The preferred embodiments of the invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning, or limited experimentation under the concept of the invention shall fall within the scope of protection defined by the claims.

Claims

1. A big data extraction and exchange system, characterized in that:

the system comprises a control and exchange center deployed on the Spark platform; the Spark platform and the Hadoop platform are deployed in the same cluster through the Yarn resource management framework; the in-memory objects of the control and exchange center are stored by Spark, and all intermediate data and all conversion tasks between different types of data models are also handled by Spark;

2. The big data extraction and exchange system as claimed in claim 1, characterized in that: to handle system failures, the control and exchange center writes a log record before each operation; after a failure the system is restarted, the state before the failure is recovered, and all lost data is extracted again to rebuild the memory space.

3. The big data extraction and exchange system as claimed in claim 1, characterized in that: the control and exchange center uses a unified in-memory object model to store the intermediate data of a data exchange; the data of each data source is mapped and converted between its own data model and the in-memory object model by an exchange agent; the unified in-memory object model stores data in Spark RDD form; data is relayed through memory and is not written to disk.

4. The big data extraction and exchange system as claimed in claim 1, 2, or 3, characterized in that: the control and exchange center manages waiting tasks through a queue.