CN108256046A - Method for implementing a unified access channel for source data of big data processing frameworks - Google Patents

Info

Publication number
CN108256046A
Authority
CN
China
Prior art keywords
data
access path
source
big data
channels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810029082.0A
Other languages
Chinese (zh)
Inventor
卞信铨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Star Software Co Ltd
Original Assignee
Fujian Star Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-01-12
Filing date
2018-01-12
Publication date
2018-07-06
Application filed by Fujian Star Software Co Ltd
Priority to CN201810029082.0A
Publication of CN108256046A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/34 Network arrangements or protocols for supporting network services or applications involving the movement of software or configuration parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method for implementing a unified access channel for the source data of big data processing frameworks. Multiple big data processing frameworks are connected to a task execution module, and a unified access channel, which connects to multiple data source channels, is provided in a data acquisition module. The task execution module receives a task through a big data processing framework interface and then requests source data from the data acquisition module. On receiving the request, the data acquisition module uses the unified access channel to access the source data storage medium through the matched data source channel and obtains the source data for the task execution module to use in executing the task. The invention classifies source data access by data format and provides a unified source data access channel that is shared across the big data platform, which improves efficiency.

Description

Method for implementing a unified access channel for source data of big data processing frameworks
Technical field
The present invention relates to methods for accessing big data, and in particular to a method for implementing a unified access channel for the source data of big data processing frameworks.
Background art
Big data processing is responsible for computing over (managing and processing) the data in a big data system. Source data includes data read from persistent storage and data brought into the system through mechanisms such as message queues; computation is the process of extracting information from that data. To serve different businesses and scenarios such as DB, SQL, NoSQL, MPP, Search, Streaming, Graph, Machine Learning, and ETL, today's mainstream big data processing frameworks include Spark, Flink, Hive, Pig, GraphLab, Cassandra, MongoDB, Impala, Greenplum, HAWQ, Storm, ElasticSearch, Solr, Hbase, MySQL, and others. Each big data processing framework can also act as a data source for other processing frameworks, and each supports multiple data source storage modes and access modes. Spark, for example, can read data stored in HDFS, local files, S3, Hive, Hbase, Tachyon, RDBMSs, and other storage media. These storage media can be classified as traditional relational data, NoSQL data, distributed storage data, in-memory distributed storage data, cloud platform data, and data from other big data platform frameworks. Each storage mode can be further divided by data storage format. Common general-purpose formats include Json, SequenceFile, TextFile, Parquet, CSV, OrcFile, and Avro, and each storage medium also has its own proprietary formats; every RDBMS and NoSQL store differs, for example. There are at least several dozen commonly used formats, so in total Spark supports reading one to two hundred types. Spark also supports writing data to storage, because it can itself serve as a data source for other frameworks. A single task of one big data processing framework on a big data platform may therefore need to access multiple data sources, and supporting access to all of the data sources above makes data source access development on a big data platform a huge engineering effort. The more common approach today is for an enterprise to restrict itself to a few fixed data storage modes and data storage formats. This relieves some development pressure, but the performance and efficiency of the application system suffer considerably. Another approach is to first use ETL to convert the data into a designated storage mode and storage format and then carry out the business computation. This approach hurts timeliness, and the extra stages increase both complexity and the likelihood of failures.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for implementing a unified access channel for the source data of big data processing frameworks. The method classifies source data access by data format and provides a unified source data access channel that is shared across the big data platform, improving efficiency.
The invention is realized as follows. A method for implementing a unified access channel for the source data of big data processing frameworks comprises:
Connecting multiple big data processing frameworks to a task execution module, and providing in a data acquisition module a unified access channel that connects to multiple data source channels;
The task execution module receiving a task through a big data processing framework interface and then requesting source data from the data acquisition module;
The data acquisition module, on receiving the request, accessing the source data storage medium through the unified access channel according to the matched data source channel, and obtaining the source data for the task execution module to use in executing the task.
Further, when a big data processing framework requests source data from the data acquisition module, it only needs to pass the IP address of the specific RDBMS server to be accessed, the username and password, and the object to be accessed to the unified access channel, which obtains the source data through the matched data source channel.
Further, the information passed to the unified access channel also includes an access mode. If the access mode is parallel access, the unified access channel provides two access modes:
(1) The caller specifies the field used for parallel partitioning together with the maximum value, minimum value, and degree of parallelism for that field, and the channel automatically partitions the access and obtains the data in parallel;
(2) The caller supplies one predicate for each parallel fetch of source data, and the channel automatically partitions the access and obtains the data in parallel.
Further, the big data processing frameworks connected to the task execution module include Spark, Flink, Hive, Pig, GraphLab, Cassandra, MongoDB, Impala, Greenplum, HAWQ, Storm, ElasticSearch, Solr, Hbase, and MySQL.
Further, the types of data source channel connected to the unified access channel include a JDBC channel, a Json channel, a TextFile channel, a Parquet channel, a SequenceFile channel, a CSV channel, an OrcFile channel, an Avro channel, and an "other" channel.
Further, after the task execution module receives a task through the big data processing framework interface, it obtains the corresponding execution parameters from the big data processing framework information and the task information, builds the framework's task context from the execution parameters, and then, within the task context, requests source data from the data acquisition module.
Further, the data source of the source data is a big data storage framework, another storage framework, or a big data processing framework.
The advantages of the invention are as follows. Source data access is classified by data format and a unified access channel is provided; the various data source channels, such as JDBC, Json, TextFile, Parquet, SequenceFile, CSV, OrcFile, and Avro, are then reached through the unified access channel. The unified access channel is shared across the big data platform, so each big data processing framework no longer needs its own implementation for every data source it uses. Independent implementations not only involve a large amount of repeated development but also bloat the program and lower its running efficiency.
Brief description of the drawings
The present invention is further described below with reference to the accompanying drawings and embodiments.
Fig. 1 is a flow chart of the execution of the method of the present invention.
Detailed description of the embodiments
Referring to Fig. 1, the overall flow of source data access for big data processing frameworks in the present invention is as follows: the big data processing framework interface receives the task; the parameters required by the task are requested; the task context is built; the source data required by the task is requested and obtained; the task is executed and the result is output; and the result set is encapsulated as needed. The modules involved in the overall flow are the task execution module, the parameter adaptation module, the data acquisition module, and the result set encapsulation module.
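The overall flow can be pictured with a minimal Python sketch. Every name here (`Task`, `adapt_parameters`, `acquire_data`, `encapsulate`, `run_task`) is hypothetical and chosen only for illustration; the patent names the four modules but does not define their interfaces, and the data acquisition step is a stand-in that simply returns rows carried in the request.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Task:
    framework: str                      # hypothetical framework identifier, e.g. "spark"
    source: Dict[str, Any]              # description of the source data the task needs
    logic: Callable[[List[Any]], Any]   # business logic applied to the source data


def adapt_parameters(task: Task) -> Dict[str, Any]:
    """Parameter adaptation module: derive execution parameters (placeholder values)."""
    return {"framework": task.framework, "parallelism": 1}


def acquire_data(source: Dict[str, Any]) -> List[Any]:
    """Data acquisition module: stand-in that returns rows embedded in the request."""
    return list(source.get("rows", []))


def encapsulate(result: Any, context: Dict[str, Any]) -> Dict[str, Any]:
    """Result set encapsulation module."""
    return {"framework": context["params"]["framework"], "result": result}


def run_task(task: Task) -> Dict[str, Any]:
    """Task execution module: the steps of the overall flow, in order."""
    params = adapt_parameters(task)        # request the parameters the task needs
    context = {"params": params}           # build the task context
    rows = acquire_data(task.source)       # request and obtain the source data
    result = task.logic(rows)              # execute the task
    return encapsulate(result, context)    # encapsulate the result set
```

For example, `run_task(Task("spark", {"rows": [1, 2, 3]}, sum))` walks through all five steps and returns the encapsulated sum.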
Referring again to Fig. 1, the method of the present invention for implementing a unified access channel for the source data of big data processing frameworks comprises the following.
Multiple big data processing frameworks are connected to the task execution module, including Spark, Flink, Hive, Pig, GraphLab, Cassandra, MongoDB, Impala, Greenplum, HAWQ, Storm, ElasticSearch, Solr, Hbase, and MySQL. According to a message or a parameter setting, each incoming task is scheduled onto the corresponding big data processing framework for execution.
A unified access channel is provided in the data acquisition module, and the unified access channel connects to multiple data source channels, including a JDBC channel, a Json channel, a TextFile channel, a Parquet channel, a SequenceFile channel, a CSV channel, an OrcFile channel, an Avro channel, and an "other" channel. The JDBC channel, for example, handles access to every RDBMS data source that provides a JDBC access interface. The "other" channel is a general-purpose channel: when no available channel can be matched from the task information, the source data is obtained through the "other" channel using the data source access interface provided by the original framework. The task information is carried in a JAR package, which contains the big data environment in which the task must run (distributed, pseudo-distributed, or standalone), the data sources the task needs, the task's specific business logic, the output of the task's execution result, and other related information.
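One way to picture channel matching with the general-purpose fallback is a small registry keyed by data format. This is only a sketch under assumptions: the class `UnifiedAccessChannel`, the request keys `format` and `payload`, and the three reader functions are hypothetical, and the readers parse inline strings instead of reaching a real storage medium.

```python
import csv
import io
import json
from typing import Any, Callable, Dict, List

Reader = Callable[[Dict[str, Any]], List[Any]]


class UnifiedAccessChannel:
    """Routes each request to the data source channel registered for its format."""

    def __init__(self, fallback: Reader) -> None:
        self._channels: Dict[str, Reader] = {}
        self._fallback = fallback  # plays the role of the general-purpose "other" channel

    def register(self, fmt: str, reader: Reader) -> None:
        self._channels[fmt.lower()] = reader

    def match(self, request: Dict[str, Any]) -> str:
        """Return the matched channel name, or "other" when nothing matches."""
        fmt = str(request.get("format", "")).lower()
        return fmt if fmt in self._channels else "other"

    def read(self, request: Dict[str, Any]) -> List[Any]:
        reader = self._channels.get(self.match(request), self._fallback)
        return reader(request)


def read_csv(request: Dict[str, Any]) -> List[Any]:
    # Hypothetical CSV channel: parses inline text rather than a stored file.
    return list(csv.reader(io.StringIO(request["payload"])))


def read_json(request: Dict[str, Any]) -> List[Any]:
    # Hypothetical Json channel.
    data = json.loads(request["payload"])
    return data if isinstance(data, list) else [data]


def read_with_framework_api(request: Dict[str, Any]) -> List[Any]:
    # Stand-in for the original framework's own data source access interface.
    return [request["payload"]]


channel = UnifiedAccessChannel(fallback=read_with_framework_api)
channel.register("csv", read_csv)
channel.register("json", read_json)
```

A request whose format has no registered channel, such as `{"format": "xml", "payload": "<a/>"}`, is matched to "other" and handled by the fallback reader.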
The data source of the source data may be a big data storage framework or another storage framework, in which case the data is obtained through the data acquisition module. It may also be a big data processing framework. In that case the big data processing framework itself becomes the data source, and the big data platform must be called recursively to obtain the data before the corresponding task is finally executed and the result set is output.
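The recursive case can be sketched as follows. The dictionary layout (`kind`, `rows`, `task`, `input`) and the depth limit are assumptions made for this illustration; the patent only states that the platform is called recursively when the data source is itself a processing framework.

```python
from typing import Any, Dict, List


def resolve_source(source: Dict[str, Any], depth: int = 0, max_depth: int = 8) -> List[Any]:
    """Return the rows for a source, running upstream tasks first when needed."""
    if depth > max_depth:
        # Guard added for the sketch: stop on very long or cyclic source chains.
        raise RecursionError("data source chain is too deep")
    kind = source["kind"]
    if kind == "storage":
        # Plain storage framework: the rows come straight from data acquisition.
        return list(source["rows"])
    if kind == "framework":
        # The source is another processing framework: obtain its input
        # recursively, then run its task to produce the rows we need.
        upstream_rows = resolve_source(source["input"], depth + 1, max_depth)
        return list(source["task"](upstream_rows))
    raise ValueError(f"unknown source kind: {kind!r}")
```

For instance, a source of kind "framework" whose task doubles every value, fed by a storage source holding 1, 2, 3, resolves to 2, 4, 6.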
The task execution module receives a task through the big data processing framework interface and obtains the corresponding execution parameters from the big data processing framework information and the task information. The execution parameters can specify the resource configuration required by the process that executes the task, the degree of parallelism of the task, the heartbeat interval with the process that executes the task, the ports required by the task execution process, the options of the underlying JVM, and other related information. The task context of the framework is then built from the execution parameters, and source data is requested from the data acquisition module within the task context. The context can be understood as a big data platform program running on the client. It connects to the big data platform, distributes the task to specific nodes of the platform for execution, requests resources from specific execution nodes according to the parameter information, monitors task execution, provides high availability (HA) for the task, and aggregates the task's result sets.
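The step from framework information and task information to execution parameters and a task context might look like the sketch below. The field names, default values, and the `FRAMEWORK_DEFAULTS` table are all hypothetical; the patent lists the kinds of information an execution parameter can carry but not a concrete schema.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ExecutionParameters:
    executor_memory_mb: int = 1024       # resource configuration for the executing process
    parallelism: int = 1                 # degree of parallelism of the task
    heartbeat_seconds: int = 10          # heartbeat interval with the executing process
    port: int = 0                        # 0 means "let the platform choose" in this sketch
    jvm_options: Tuple[str, ...] = ()    # options of the underlying JVM


# Hypothetical per-framework defaults (the framework information).
FRAMEWORK_DEFAULTS: Dict[str, ExecutionParameters] = {
    "spark": ExecutionParameters(parallelism=4),
    "flink": ExecutionParameters(parallelism=2),
}


def resolve_parameters(framework: str, task_info: Dict[str, Any]) -> ExecutionParameters:
    """Start from the framework defaults and apply recognised task-level overrides."""
    base = FRAMEWORK_DEFAULTS.get(framework, ExecutionParameters())
    known = {f.name for f in fields(ExecutionParameters)}
    overrides = {k: v for k, v in task_info.items() if k in known}
    return replace(base, **overrides)


@dataclass
class TaskContext:
    framework: str
    params: ExecutionParameters

    def request_source_data(
        self, acquire: Callable[[Dict[str, Any]], List[Any]], request: Dict[str, Any]
    ) -> List[Any]:
        """Ask the data acquisition module (passed in as a callable) for source data."""
        return acquire(request)
```

Keys in the task information that are not known parameters are ignored here, which is a design choice of the sketch rather than something the patent specifies.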
On receiving the request, the data acquisition module accesses the source data storage medium through the unified access channel according to the matched data source channel, and obtains the source data for the task execution module to use in executing the task.
The technical characteristic of the unified access channel is as follows. Previously, when different big data processing frameworks accessed different RDBMSs, for example, they had to call different APIs or obtain the data through the RDBMS JDBC protocol as wrapped by the big data processing framework. In the present invention, when a big data processing framework requests source data from the data acquisition module, it only needs to pass the IP address of the specific RDBMS server to be accessed, the username and password, and the object to be accessed to the unified access channel, which obtains the source data through the matched data source channel. The information passed to the unified access channel also includes an access mode. If the access mode is parallel access, the unified access channel provides two access modes:
(1) The caller specifies the field used for parallel partitioning together with the maximum value, minimum value, and degree of parallelism for that field, and the channel automatically partitions the access and obtains the data in parallel. Sample code is as follows:
"columnname": "id",
"lowerbound": "1",
"upperbound": "50000",
"numpartitions": "6"
(2) The caller supplies one predicate for each parallel fetch of source data, and the channel automatically partitions the access and obtains the data in parallel. Sample code is as follows:
"predicates": Array[String]("id<=2", "id>=4 and id<=5"),
"predicates": Array[String]("id in (2,4,6,8)", "id in (1,3,5,7)"),
Each Array element defines one parallel partition.
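The two parallel access modes can be illustrated with a short Python sketch. The patent does not state how the range between the minimum and maximum value is split, so the even, inclusive split below is an assumption, and the function names `range_predicates`, `build_queries`, and `fetch_parallel` are hypothetical. The `fetch` callable stands in for whatever actually runs a query against the data source.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


def range_predicates(column: str, lower: int, upper: int, num_partitions: int) -> List[str]:
    """Mode (1): split [lower, upper] into contiguous, non-overlapping ranges."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    # Never create more partitions than there are distinct values.
    n = min(num_partitions, upper - lower + 1)
    stride = (upper - lower + 1) // n
    predicates = []
    start = lower
    for i in range(n):
        # The last partition absorbs the remainder so the whole range is covered.
        end = upper if i == n - 1 else start + stride - 1
        predicates.append(f"{column} >= {start} AND {column} <= {end}")
        start = end + 1
    return predicates


def build_queries(table: str, predicates: List[str]) -> List[str]:
    """Mode (2), also reused by mode (1): one query per predicate, one partition each.

    The strings are interpolated directly, so the table name and predicates must
    come from trusted configuration, never from untrusted input.
    """
    return [f"SELECT * FROM {table} WHERE {p}" for p in predicates]


def fetch_parallel(queries: List[str], fetch: Callable[[str], List[Any]]) -> List[List[Any]]:
    """Run one fetch per query concurrently; results keep the order of the queries."""
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        return list(pool.map(fetch, queries))
```

With the sample values above (`id`, 1, 50000, 6), `range_predicates` yields six ranges of 8333 values each, except the last, which also takes the remaining two values. In predicate mode, the caller passes its own list, such as `["id<=2", "id>=4 and id<=5"]`, straight to `build_queries`.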
In summary, the present invention classifies source data access by data format and provides a unified access channel; the various data source channels, such as JDBC, Json, TextFile, Parquet, SequenceFile, CSV, OrcFile, and Avro, are then reached through the unified access channel. The unified access channel is shared across the big data platform, so each big data processing framework no longer needs its own implementation for every data source it uses. Independent implementations not only involve a large amount of repeated development but also bloat the program and lower its running efficiency.
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that the described embodiments are merely illustrative and do not limit the scope of the present invention. Equivalent modifications and variations made by those skilled in the art in accordance with the spirit of the present invention shall all fall within the scope of protection claimed by the present invention.

Claims (7)

1. A method for implementing a unified access channel for the source data of big data processing frameworks, characterized by comprising:
Connecting multiple big data processing frameworks to a task execution module, and providing in a data acquisition module a unified access channel that connects to multiple data source channels;
The task execution module receiving a task through a big data processing framework interface and then requesting source data from the data acquisition module;
The data acquisition module, on receiving the request, accessing the source data storage medium through the unified access channel according to the matched data source channel, and obtaining the source data for the task execution module to use in executing the task.
2. The method for implementing a unified access channel for the source data of big data processing frameworks according to claim 1, characterized in that: when a big data processing framework requests source data from the data acquisition module, it only needs to pass the IP address of the specific RDBMS server to be accessed, the username and password, and the object to be accessed to the unified access channel, which obtains the source data through the matched data source channel.
3. The method for implementing a unified access channel for the source data of big data processing frameworks according to claim 2, characterized in that: the information passed to the unified access channel also includes an access mode, and if the access mode is parallel access, the unified access channel provides two access modes:
(1) The caller specifies the field used for parallel partitioning together with the maximum value, minimum value, and degree of parallelism for that field, and the channel automatically partitions the access and obtains the data in parallel;
(2) The caller supplies one predicate for each parallel fetch of source data, and the channel automatically partitions the access and obtains the data in parallel.
4. The method for implementing a unified access channel for the source data of big data processing frameworks according to claim 1, characterized in that: the big data processing frameworks connected to the task execution module include Spark, Flink, Hive, Pig, GraphLab, Cassandra, MongoDB, Impala, Greenplum, HAWQ, Storm, ElasticSearch, Solr, Hbase, and MySQL.
5. The method for implementing a unified access channel for the source data of big data processing frameworks according to claim 1, characterized in that: the types of data source channel connected to the unified access channel include a JDBC channel, a Json channel, a TextFile channel, a Parquet channel, a SequenceFile channel, a CSV channel, an OrcFile channel, an Avro channel, and an "other" channel.
6. The method for implementing a unified access channel for the source data of big data processing frameworks according to claim 1, characterized in that: after the task execution module receives a task through the big data processing framework interface, it obtains the corresponding execution parameters from the big data processing framework information and the task information, builds the framework's task context from the execution parameters, and then, within the task context, requests source data from the data acquisition module.
7. The method for implementing a unified access channel for the source data of big data processing frameworks according to claim 1 or 4, characterized in that: the data source of the source data is a big data storage framework, another storage framework, or a big data processing framework.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810029082.0A CN108256046A (en) 2018-01-12 2018-01-12 Method for implementing a unified access channel for source data of big data processing frameworks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810029082.0A CN108256046A (en) 2018-01-12 2018-01-12 Method for implementing a unified access channel for source data of big data processing frameworks

Publications (1)

Publication Number Publication Date
CN108256046A true CN108256046A (en) 2018-07-06

Family

ID=62727148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810029082.0A Pending CN108256046A (en) 2018-01-12 2018-01-12 Method for implementing a unified access channel for source data of big data processing frameworks

Country Status (1)

Country Link
CN (1) CN108256046A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944040A (en) * 2010-09-15 2011-01-12 复旦大学 Predicate-based automatic parallel optimizing method
WO2015157048A1 (en) * 2014-04-09 2015-10-15 Microsoft Technology Licensing, Llc Device policy manager
CN105045607A (en) * 2015-09-02 2015-11-11 广东创我科技发展有限公司 Method for achieving uniform interface of multiple big data calculation frames
CN106547766A (en) * 2015-09-18 2017-03-29 华为技术有限公司 A kind of data access method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376154A (en) * 2018-10-26 2019-02-22 杭州玳数科技有限公司 Reading data, wiring method and reading data, writing system
CN109684399A (en) * 2018-12-24 2019-04-26 成都四方伟业软件股份有限公司 Data bank access method, database access device and Data Analysis Platform
CN111125013A (en) * 2019-12-26 2020-05-08 北京锐安科技有限公司 Data warehousing method, device, equipment and medium
CN111125013B (en) * 2019-12-26 2023-03-17 北京锐安科技有限公司 Data warehousing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination