CN108256046A

CN108256046A - Method for implementing a unified access path for source data of big data processing frameworks

Info

Publication number: CN108256046A
Application number: CN201810029082.0A
Authority: CN
Inventors: 卞信铨
Original assignee: Fujian Star Software Co Ltd
Current assignee: Fujian Star Software Co Ltd
Priority date: 2018-01-12
Filing date: 2018-01-12
Publication date: 2018-07-06

Abstract

The present invention provides a method for implementing a unified access path for the source data of big data processing frameworks. Multiple big data processing frameworks are interfaced in a task execution module, and a unified access path is provided in a data acquisition module, the unified access path interfacing with multiple data source channels. The task execution module receives a task through a big data processing framework interface and then requests source data from the data acquisition module. On receiving the request, the data acquisition module accesses the source data storage medium through the unified access path via the matched data source channel, and obtains the source data for the task execution module to use in executing the task. The invention classifies source data access by data format and provides a unified source data access path shared across the big data platform, improving efficiency.

Description

Method for implementing a unified access path for source data of big data processing frameworks

Technical field

The present invention relates to methods for accessing big data, and more particularly to a method for implementing a unified access path for the source data of big data processing frameworks.

Background art

Big data processing is responsible for computing over (managing and processing) the data in a big data system. Source data includes data read from persistent storage and data that enters the system through message queues and similar means; computation is the process of extracting information from that data. To serve different businesses and scenarios such as DB, SQL, NoSQL, MPP, Search, Streaming, Graph, Machine Learning, and ETL, today's mainstream big data processing frameworks include Spark, Flink, Hive, Pig, GraphLab, Cassandra, MongoDB, Impala, Greenplum, HAWQ, Storm, ElasticSearch, Solr, Hbase, MySQL, and others. Each big data processing framework can also serve as a data source for other processing frameworks, and each supports multiple data source storage methods and access methods behind it. Spark, for example, supports reading data stored in storage media such as HDFS, local files, S3, Hive, Hbase, Tachyon, and RDBMS. These storage media can be classified as traditional relational data, NoSQL data, distributed storage data, in-memory distributed storage data, cloud platform data, and data from other big data platform frameworks. Each storage method can in turn be divided into different data storage formats. General-purpose storage formats alone include Json, SequenceFile, TextFile, Parquet, CSV, OrcFile, Avro, and so on, and each storage medium also has its own proprietary formats; every RDBMS and NoSQL store, for instance, is different. Counting only the common ones there are no fewer than several dozen, and one or two hundred in total. Spark supports not only reading these but also writing data to storage, because Spark can itself serve as a data source for other frameworks. A single task of one big data processing framework on a big data platform may therefore need to access multiple data sources, and developing data source access for the platform that supports all of the above data sources is a huge engineering effort. The more common approach today is for an enterprise to restrict itself to a fixed handful of data storage methods and data storage formats. This relieves some development pressure, but the performance and efficiency of the application systems suffer greatly. Another approach is to use ETL to convert the data into the designated storage method and storage format before performing business computation. This not only hurts timeliness; the additional steps also increase complexity and raise the probability of errors.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for implementing a unified access path for the source data of big data processing frameworks, which classifies source data access by data format and provides a unified source data access path shared across the big data platform, improving efficiency.

The invention is realized as follows: a method for implementing a unified access path for the source data of big data processing frameworks, comprising:

interfacing multiple big data processing frameworks in a task execution module, and providing a unified access path in a data acquisition module, the unified access path interfacing with multiple data source channels;

the task execution module receiving a task through a big data processing framework interface and then requesting source data from the data acquisition module;

when the data acquisition module receives the request, accessing the source data storage medium through the unified access path via the matched data source channel, and obtaining the source data for the task execution module to use in executing the task.

Further, when the big data processing framework requests source data from the data acquisition module, it need only pass the IP address of the specific RDBMS server to be accessed, the username and password, and the object to be accessed to the unified access path, and the unified access path obtains the source data via the matched data source channel.

Further, an access mode is also passed to the unified access path. If the access mode is parallel access, the unified access path provides two access modes:

(1) a field used for parallel partitioning is supplied, together with the maximum value, minimum value, and degree of parallelism for that field, and the data is automatically partitioned and fetched in parallel;

(2) a predicate is supplied for each parallel fetch of source data, and the data is automatically partitioned and fetched in parallel.
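
The request described above can be pictured as a small value object. The following is a minimal sketch in Python, not the patent's implementation: the class names `SourceRequest` and `RangeSplit`, the field names, the mode strings, and the rule that a request carries at most one of the two parallel options are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RangeSplit:
    """Mode (1): a partition field with its bounds and degree of parallelism."""
    column: str
    lower: int
    upper: int
    num_partitions: int


@dataclass(frozen=True)
class SourceRequest:
    """Hypothetical request handed to the unified access path."""
    host: str        # IP address of the RDBMS server
    username: str
    password: str
    target: str      # object to access, for example a table name
    range_split: Optional[RangeSplit] = None        # mode (1)
    predicates: Optional[Sequence[str]] = None      # mode (2)

    def access_mode(self) -> str:
        # Assumption: the two parallel modes are mutually exclusive.
        if self.range_split is not None and self.predicates is not None:
            raise ValueError("choose either a range split or predicates, not both")
        if self.range_split is not None:
            return "parallel-range"
        if self.predicates is not None:
            return "parallel-predicates"
        return "serial"
```

A caller would fill in only the connection details and, optionally, one of the two parallel options; everything about the concrete channel stays behind the unified access path.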

Further, the big data processing frameworks interfaced in the task execution module include Spark, Flink, Hive, Pig, GraphLab, Cassandra, MongoDB, Impala, Greenplum, HAWQ, Storm, ElasticSearch, Solr, Hbase, and MySQL.

Further, the kinds of data source channel interfaced by the unified access path include a JDBC channel, a Json channel, a TextFile channel, a Parquet channel, a SequenceFile channel, a CSV channel, an OrcFile channel, an Avro channel, and an other channel.

Further, after the task execution module receives a task through the big data processing framework interface, it obtains the corresponding execution parameters from the big data processing framework information and the task information, builds the task context of the framework from the execution parameters, and then requests source data from the data acquisition module within the task context.

Further, the data source of the source data is a big data storage framework or another storage framework, or may itself be a big data processing framework.

The invention has the following advantages: source data access is classified by data format and a unified access path is provided, through which the various data source channels are accessed, for example the JDBC, Json, TextFile, Parquet, SequenceFile, CSV, OrcFile, and Avro channels. The unified access path is shared across the big data platform, so each big data processing framework no longer needs its own implementation for every data source it accesses. Independent implementations not only involve a great deal of repetitive development but also bloat programs and lower their running efficiency.

Description of the drawings

The present invention is further described below with reference to the accompanying drawings and embodiments.

Fig. 1 is a flow chart of the execution of the method of the present invention.

Detailed description of the embodiments

Referring to Fig. 1, the overall flow for accessing the source data of a big data processing framework according to the present invention is as follows: the big data processing framework interface receives a task; the parameters required by the task are requested; the task context is built; the source data required by the task is requested and obtained; the task is executed; the result is output; and the result set is encapsulated as needed. The modules involved in the overall flow are the task execution module, the parameter adaptation module, the data acquisition module, and the result set encapsulation module.
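
The overall flow can be read as a fixed sequence of calls into the four modules. The sketch below is one possible reading in Python; the function `run_task` and its callable parameters are hypothetical names introduced here, and each callable merely stands in for a module the patent names without specifying its interface.

```python
from typing import Any, Callable, Optional


def run_task(
    task: Any,
    adapt_parameters: Callable[[Any], Any],          # parameter adaptation module
    build_context: Callable[[Any], Any],             # task execution module
    acquire_source_data: Callable[[Any, Any], Any],  # data acquisition module
    execute: Callable[[Any, Any, Any], Any],         # task execution module
    encapsulate: Optional[Callable[[Any], Any]] = None,  # result set encapsulation
) -> Any:
    """Run one task through the stages in the order the text lists them."""
    params = adapt_parameters(task)              # request the required parameters
    context = build_context(params)              # build the task context
    source = acquire_source_data(context, task)  # request and obtain source data
    result = execute(context, task, source)      # execute the task
    # Encapsulation happens only "as needed", so it is optional here.
    return encapsulate(result) if encapsulate is not None else result
```

Passing the stages in as callables keeps the sequence itself independent of any particular framework, which is the property the unified design relies on.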

Referring again to Fig. 1, the method of the present invention for implementing a unified access path for the source data of big data processing frameworks comprises:

Multiple big data processing frameworks are interfaced in the task execution module, including Spark, Flink, Hive, Pig, GraphLab, Cassandra, MongoDB, Impala, Greenplum, HAWQ, Storm, ElasticSearch, Solr, Hbase, and MySQL. Through messages or parameter settings, each task sent over is scheduled to the corresponding big data processing framework for execution.

A unified access path is provided in the data acquisition module, and the unified access path interfaces with multiple data source channels, including a JDBC channel, a Json channel, a TextFile channel, a Parquet channel, a SequenceFile channel, a CSV channel, an OrcFile channel, an Avro channel, and an other channel. For example, the JDBC channel interfaces with every RDBMS data source that provides a JDBC access interface. The other channel is a general-purpose channel: when no available channel can be matched from the task information, the source data is obtained through this channel using the data source access interface provided by the original framework. The task information is carried in the form of a JAR package, which contains the big data environment in which the task must run (distributed, pseudo-distributed, or standalone), the data sources the task needs, the task's specific business logic, how the task's execution results are output, and other related information.
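
Channel matching with a general-purpose fallback can be sketched as a small registry. This is an illustrative Python sketch only: the class `UnifiedAccessPath`, its `register` and `fetch` methods, and the idea of keying channels by a format string are assumptions, since the patent does not specify how matching against the task information is performed.

```python
from typing import Callable, Dict

# A channel reader takes a location (table name, file path, ...) and returns data.
Reader = Callable[[str], str]


class UnifiedAccessPath:
    """Hypothetical registry mapping a data format to its data source channel."""

    def __init__(self, fallback: Reader) -> None:
        self._channels: Dict[str, Reader] = {}
        # The "other" channel: used when no registered channel matches.
        self._fallback = fallback

    def register(self, fmt: str, reader: Reader) -> None:
        self._channels[fmt.lower()] = reader

    def fetch(self, fmt: str, location: str) -> str:
        # Look up the matching channel; fall back to the general-purpose one.
        reader = self._channels.get(fmt.lower(), self._fallback)
        return reader(location)
```

For instance, after `path.register("jdbc", read_jdbc)`, a request for format `"JDBC"` goes to `read_jdbc`, while a request for an unregistered format is served by the fallback reader, mirroring the role of the other channel.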

The data source of the source data is a big data storage framework or another storage framework, from which the data acquisition module obtains the data. It may also be a big data processing framework; in that case the big data processing framework itself becomes the data source, and the big data platform must be called recursively to obtain that data source before the corresponding task is finally executed and the result set is output.
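
The recursive case can be illustrated with a toy model. In the Python sketch below, `StorageSource`, `FrameworkSource`, and `resolve` are hypothetical names, and a framework source is simplified to the concatenation of its upstream data; a real platform would run the upstream framework's own task at that point.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class StorageSource:
    """A storage framework that holds data directly."""
    name: str
    rows: List[int]


@dataclass
class FrameworkSource:
    """A processing framework acting as a data source; it has its own inputs."""
    name: str
    inputs: List["Source"] = field(default_factory=list)


Source = Union[StorageSource, FrameworkSource]


def resolve(source: Source) -> List[int]:
    """Obtain data for a source, recursing when the source is a framework."""
    if isinstance(source, StorageSource):
        return list(source.rows)
    rows: List[int] = []
    for upstream in source.inputs:
        # Simplification: the upstream "task" just passes its input rows through.
        rows.extend(resolve(upstream))
    return rows
```

The recursion ends when every branch reaches a storage source, which corresponds to the point where the data acquisition module can read the data directly.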

The task execution module receives a task through the big data processing framework interface and obtains the corresponding execution parameters from the big data processing framework information and the task information. These execution parameters can specify the resource configuration required by the process that executes the task, the degree of parallelism of task execution, the heartbeat interval with the process executing the task, the ports required by the task execution process, the options of the underlying JVM, and other related information. The task context of the framework is then built from the execution parameters, and source data is requested from the data acquisition module within that task context. The context here can be understood as a big data platform program running on the client. It connects to the big data platform, distributes the task to specific nodes of the platform for execution, requests resources from the specific execution nodes according to the parameter information, monitors task execution, and provides functions such as task high availability (HA) and aggregation of the task result set.
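
As an illustration of how execution parameters might feed a task context, the Python sketch below flattens a parameter object into string settings. The class `ExecutionParameters`, the function `build_context_settings`, and every settings key are invented for this example; they do not correspond to the configuration keys of any real framework.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ExecutionParameters:
    """Hypothetical container for the execution parameters listed above."""
    framework: str
    memory_mb: int            # resource configuration for the task process
    cores: int
    parallelism: int          # degree of parallelism of task execution
    heartbeat_seconds: int    # heartbeat interval with the task process
    port: int                 # port required by the task execution process
    jvm_options: List[str] = field(default_factory=list)


def build_context_settings(params: ExecutionParameters) -> Dict[str, str]:
    """Turn execution parameters into flat settings for a task context."""
    if params.parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return {
        "framework": params.framework,
        "resources.memory_mb": str(params.memory_mb),
        "resources.cores": str(params.cores),
        "task.parallelism": str(params.parallelism),
        "task.heartbeat_seconds": str(params.heartbeat_seconds),
        "task.port": str(params.port),
        "jvm.options": " ".join(params.jvm_options),
    }
```

A client-side context would then be constructed from these settings before any source data is requested.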

The technical characteristic of the unified access path is as follows. Previously, when different big data processing frameworks accessed different data sources, for example an RDBMS, they had to call different APIs or obtain the data through the RDBMS JDBC protocol as wrapped by each big data processing framework. In the present invention, when the big data processing framework requests source data from the data acquisition module, it need only pass the IP address of the specific RDBMS server to be accessed, the username and password, and the object to be accessed to the unified access path, and the unified access path obtains the source data via the matched data source channel. An access mode is also passed to the unified access path. If the access mode is parallel access, the unified access path provides two access modes:

(1) A field used for parallel partitioning is supplied, together with the maximum value, minimum value, and degree of parallelism for that field, and the data is automatically partitioned and fetched in parallel. Sample code is as follows:

"columnname":"id",
"lowerbound":"1",
"upperbound":"50000",
"numpartitions":"6"

(2) A predicate is supplied for each parallel fetch of source data, and the data is automatically partitioned and fetched in parallel. Sample code is as follows:

"predicates": Array[String]("id<=2", "id>=4 and id<=5"),

"predicates": Array[String]("id in (2,4,6,8)", "id in (1,3,5,7)"),

Each Array element defines one parallel partition.
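
The two modes are closely related: a range split as in mode (1) can be expanded into one predicate per partition, which is the form mode (2) takes directly. The Python sketch below shows one simple way to do this with an even split over inclusive bounds. The function `range_predicates` is hypothetical, and the even-split rule and the inclusive treatment of the bounds are assumptions; the patent does not state how the automatic partitioning is computed.

```python
from typing import List


def range_predicates(column: str, lower: int, upper: int, num_partitions: int) -> List[str]:
    """Split the inclusive range [lower, upper] into per-partition predicates."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if upper < lower:
        raise ValueError("upper must not be smaller than lower")
    total = upper - lower + 1
    # Never create more partitions than there are distinct values.
    parts = min(num_partitions, total)
    base, extra = divmod(total, parts)
    predicates: List[str] = []
    start = lower
    for i in range(parts):
        # The first `extra` partitions take one additional value each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        predicates.append(f"{column} >= {start} AND {column} <= {end}")
        start = end + 1
    return predicates
```

With the sample values above (`id`, 1, 50000, 6), this yields six non-overlapping predicates that together cover every id from 1 to 50000, each of which could be fetched by a separate parallel reader.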

In summary, the present invention classifies source data access by data format and provides a unified access path, through which the various data source channels are accessed, for example the JDBC, Json, TextFile, Parquet, SequenceFile, CSV, OrcFile, and Avro channels. The unified access path is shared across the big data platform, so each big data processing framework no longer needs its own implementation for every data source it accesses. Independent implementations not only involve a great deal of repetitive development but also bloat programs and lower their running efficiency.

Although specific embodiments of the present invention have been described above, those skilled in the art should understand that the specific embodiments described are merely illustrative and are not intended to limit the scope of the present invention. Equivalent modifications and variations made by those skilled in the art in accordance with the spirit of the present invention shall all fall within the scope of protection claimed by the present invention.

Claims

1. A method for implementing a unified access path for the source data of big data processing frameworks, characterized by comprising:

interfacing multiple big data processing frameworks in a task execution module, and providing a unified access path in a data acquisition module, the unified access path interfacing with multiple data source channels;

the task execution module receiving a task through a big data processing framework interface and then requesting source data from the data acquisition module;

when the data acquisition module receives the request, accessing the source data storage medium through the unified access path via the matched data source channel, and obtaining the source data for the task execution module to use in executing the task.

2. The method for implementing a unified access path for the source data of big data processing frameworks according to claim 1, characterized in that: when the big data processing framework requests source data from the data acquisition module, it need only pass the IP address of the specific RDBMS server to be accessed, the username and password, and the object to be accessed to the unified access path, and the unified access path obtains the source data via the matched data source channel.

3. The method for implementing a unified access path for the source data of big data processing frameworks according to claim 2, characterized in that: an access mode is also passed to the unified access path, and if the access mode is parallel access, the unified access path provides two access modes:

(1) a field used for parallel partitioning is supplied, together with the maximum value, minimum value, and degree of parallelism for that field, and the data is automatically partitioned and fetched in parallel;

(2) a predicate is supplied for each parallel fetch of source data, and the data is automatically partitioned and fetched in parallel.

4. The method for implementing a unified access path for the source data of big data processing frameworks according to claim 1, characterized in that: the big data processing frameworks interfaced in the task execution module include Spark, Flink, Hive, Pig, GraphLab, Cassandra, MongoDB, Impala, Greenplum, HAWQ, Storm, ElasticSearch, Solr, Hbase, and MySQL.

5. The method for implementing a unified access path for the source data of big data processing frameworks according to claim 1, characterized in that: the kinds of data source channel interfaced by the unified access path include a JDBC channel, a Json channel, a TextFile channel, a Parquet channel, a SequenceFile channel, a CSV channel, an OrcFile channel, an Avro channel, and an other channel.

6. The method for implementing a unified access path for the source data of big data processing frameworks according to claim 1, characterized in that: after the task execution module receives a task through the big data processing framework interface, it obtains the corresponding execution parameters from the big data processing framework information and the task information, builds the task context of the framework from the execution parameters, and then requests source data from the data acquisition module within the task context.

7. The method for implementing a unified access path for the source data of big data processing frameworks according to claim 1 or 4, characterized in that: the data source of the source data is a big data storage framework or another storage framework, or may itself be a big data processing framework.