CN108256046A - Method for implementing a unified access path for big data processing framework source data - Google Patents
Method for implementing a unified access path for big data processing framework source data
- Publication number
- CN108256046A (application CN201810029082.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- access path
- source
- big data
- channels
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/34—Network arrangements or protocols for supporting network services or applications involving the movement of software or configuration parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/60—Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a method for implementing a unified access path for the source data of big data processing frameworks. Multiple big data processing frameworks are docked in a task execution module, and a unified access path, which docks multiple data source channels, is set in a data acquisition module. The task execution module receives a task through the big data processing framework bridge and then requests source data from the data acquisition module. When the data acquisition module receives the request, the unified access path accesses the source data storage medium through the matched data source channel and obtains the source data for the task execution module to use in executing the task. The invention classifies source data access by data format and provides a unified source data access path shared across the big data platform, which improves efficiency.
Description
Technical field
The present invention relates to methods for accessing big data, and in particular to a method for implementing a unified access path for the source data of big data processing frameworks.
Background art
Big data processing is responsible for computing on (managing and processing) the data in a big data system. Source data includes data read from persistent storage and data fed into the system through mechanisms such as message queues; computation is the process of extracting information from that data. To serve different businesses and scenarios such as DB, SQL, NoSQL, MPP, Search, Streaming, Graph, Machine Learning and ETL, the mainstream big data processing frameworks today include Spark, Flink, Hive, Pig, GraphLab, Cassandra, MongoDB, Impala, Greenplum, HAWQ, Storm, ElasticSearch, Solr, Hbase and MySQL. Each of these frameworks can also serve as a data source for other frameworks, and each supports multiple data source storage modes and access modes.
Take Spark as an example. It supports reading from storage media such as HDFS, local files, S3, Hive, Hbase, Tachyon and RDBMSs. These storage media can be classified as traditional relational data, NoSQL data, distributed storage data, in-memory distributed storage data, cloud platform data, and data from other big data platform frameworks. Each storage mode can in turn use different data storage formats. Common general-purpose formats include Json, SequenceFile, TextFile, Parquet, CSV, OrcFile and Avro, and each storage medium also has its own proprietary format; every RDBMS and NoSQL store, for instance, is different. Counting only the common ones gives no fewer than several dozen, and Spark supports reading one to two hundred types. Spark also supports writing data to storage, because it can itself act as a data source for other frameworks.
A single task of one big data processing framework in a big data platform may therefore need to access multiple data sources, and developing access support for all of the data sources above is a major engineering effort. The more common approach today is for an enterprise to restrict itself to a fixed set of data storage modes and storage formats. This relieves some development pressure, but the performance and efficiency of the application system suffer considerably. Another approach is to use ETL to convert the data to a predetermined storage mode and storage format before carrying out the business computation. This harms timeliness, and the additional steps increase both complexity and the probability of errors.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for implementing a unified access path for big data processing framework source data. The method classifies source data access by data format and provides a unified source data access path shared across the big data platform, thereby improving efficiency.
The invention is realized as follows. A method for implementing a unified access path for big data processing framework source data comprises:
Docking multiple big data processing frameworks in a task execution module, and setting a unified access path in a data acquisition module, the unified access path docking multiple data source channels;
The task execution module receiving a task through the big data processing framework bridge and then requesting source data from the data acquisition module;
When the data acquisition module receives the request, the unified access path accessing the source data storage medium through the matched data source channel and obtaining the source data for the task execution module to use in executing the task.
Further, when the big data processing framework requests source data from the data acquisition module, it only needs to pass the IP address of the specific RDBMS server to be accessed, the username and password, and the object to be accessed to the unified access path; the unified access path then obtains the source data through the matched data source channel.
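The request described above can be pictured as a small value object. The following is a minimal sketch, not the patent's implementation: the class name, the `dialect` and `port` defaults, and the mapping function are hypothetical, and the option keys (`url`, `user`, `password`, `dbtable`) are borrowed from the Spark JDBC data source purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RdbmsAccessRequest:
    """The four inputs the text says a framework must supply (names are hypothetical)."""
    host: str      # IP address of the RDBMS server
    username: str
    password: str
    target: str    # object to access, for example a table name

    def describe(self) -> str:
        # Omit the password so the request can be logged safely.
        return f"{self.username}@{self.host}/{self.target}"


def to_channel_options(req: RdbmsAccessRequest, dialect: str = "mysql", port: int = 3306) -> dict:
    """Translate the request into options for a JDBC-style channel.

    The dialect and port defaults are assumptions; the source text does not give them.
    """
    return {
        "url": f"jdbc:{dialect}://{req.host}:{port}",
        "user": req.username,
        "password": req.password,
        "dbtable": req.target,
    }
```

The point of the design is that the caller never names a driver or a framework-specific API; everything beyond the four fields is decided behind the unified access path.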
Further, an access mode is also passed to the unified access path. If the access mode is parallel access, the unified access path provides two access modes:
(1) The caller provides the field used for parallel partitioning, together with the maximum value, minimum value and degree of parallelism for this field; the access is then partitioned automatically and the data is obtained in parallel;
(2) The caller provides the predicate that each parallel access uses to obtain source data; the access is then partitioned automatically and the data is obtained in parallel.
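Mode (1) amounts to turning a field, its bounds and a degree of parallelism into one range condition per partition. The sketch below shows one way this could be done, assuming an integer field and inclusive bounds; the function name and the even-split strategy are assumptions, and real engines may split differently (for example, by also assigning out-of-range rows to the first or last partition).

```python
def range_predicates(column: str, lower: int, upper: int, num_partitions: int) -> list:
    """Split the inclusive range [lower, upper] into contiguous WHERE-clause predicates."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if upper < lower:
        raise ValueError("upper must not be smaller than lower")

    total = upper - lower + 1
    # Never create more partitions than there are distinct values.
    count = min(num_partitions, total)
    base, extra = divmod(total, count)

    predicates = []
    start = lower
    for index in range(count):
        # The first `extra` partitions take one additional value each.
        size = base + (1 if index < extra else 0)
        end = start + size - 1
        predicates.append(f"{column} >= {start} AND {column} <= {end}")
        start = end + 1
    return predicates
```

Called as `range_predicates("id", 1, 50000, 6)`, it yields six non-overlapping conditions that together cover ids 1 through 50000, so each parallel reader fetches a disjoint slice.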
Further, the big data processing frameworks docked in the task execution module include Spark, Flink, Hive, Pig, GraphLab, Cassandra, MongoDB, Impala, Greenplum, HAWQ, Storm, ElasticSearch, Solr, Hbase and MySQL.
Further, the types of data source channel docked by the unified access path include the JDBC channel, Json channel, TextFile channel, Parquet channel, SequenceFile channel, CSV channel, OrcFile channel, Avro channel and other channels.
Further, after the task execution module receives a task through the big data processing framework bridge, it obtains the corresponding execution parameters according to the big data processing framework information and the task information, builds the framework's task context from the execution parameters, and then requests source data from the data acquisition module within the task context.
Further, the data source of the source data is a big data storage framework or another storage framework, or alternatively a big data processing framework.
The invention has the following advantages. Source data access is classified by data format and a unified access path is provided; the various data source channels, such as the JDBC, Json, TextFile, Parquet, SequenceFile, CSV, OrcFile and Avro channels, are then accessed through the unified access path. This unified access path is shared across the big data platform, so no big data processing framework needs a separate implementation for each data source it requires. Separate implementations not only involve a large amount of repetitive development but also make programs bloated and inefficient to run.
Description of the drawings
The present invention is further described below through embodiments with reference to the accompanying drawing.
Fig. 1 is the execution flow chart of the method of the present invention.
Specific embodiment
Referring to Fig. 1, the overall flow of source data access for big data processing frameworks in the present invention is as follows: the big data processing framework bridge receives a task, requests the parameters the task requires, builds the task context, requests the source data the task requires, obtains the source data, executes the task, outputs the result, and then encapsulates the result set as needed. The modules involved in the overall flow are the task execution module, the parameter adaptation module, the data acquisition module and the result set encapsulation module.
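The overall flow can be sketched as a short pipeline in which each module is a pluggable callable. This is only an illustration of the ordering described above; the function and parameter names are hypothetical, and the source does not specify how the modules are wired together.

```python
from typing import Any, Callable, Optional


def run_task(
    task: Any,
    adapt_parameters: Callable[[Any], Any],          # parameter adaptation module
    build_context: Callable[[Any], Any],             # builds the task context
    acquire_source_data: Callable[[Any, Any], Any],  # data acquisition module
    execute: Callable[[Any, Any, Any], Any],         # task execution module
    encapsulate: Optional[Callable[[Any], Any]] = None,  # result set encapsulation module
) -> Any:
    """Run one task through the stages in the order the text describes."""
    params = adapt_parameters(task)
    context = build_context(params)
    source_data = acquire_source_data(context, task)
    result = execute(context, task, source_data)
    # Encapsulation is optional: it only happens "as needed".
    return encapsulate(result) if encapsulate is not None else result
```

Keeping the stages as separate callables mirrors the module split in the text: the data acquisition step can be swapped or shared without touching task execution.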
Referring again to Fig. 1, the method of the present invention for implementing a unified access path for big data processing framework source data comprises:
Multiple big data processing frameworks are docked in the task execution module, including Spark, Flink, Hive, Pig, GraphLab, Cassandra, MongoDB, Impala, Greenplum, HAWQ, Storm, ElasticSearch, Solr, Hbase and MySQL. Through messages or parameter settings, each incoming task is scheduled onto the corresponding big data processing framework for execution.
A unified access path is set in the data acquisition module, and it docks multiple data source channels, including the JDBC channel, Json channel, TextFile channel, Parquet channel, SequenceFile channel, CSV channel, OrcFile channel, Avro channel and other channels. For example, the JDBC channel handles access to every RDBMS data source that provides a JDBC access interface. The "other" channel is a general-purpose channel: when no available channel can be matched from the task information, the source data is obtained through this channel using the data source access interface provided by the original framework. The task information is carried in the form of a JAR package, which contains information such as the big data environment in which the task must run (distributed, pseudo-distributed or standalone), the data sources the task needs, the task's specific business logic, and the output of the task execution result.
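The channel matching with a general-purpose fallback can be sketched as a registry keyed by data format. The class and method names below are hypothetical, and the two readers parse in-memory text with the Python standard library only to keep the example self-contained; a real channel would read from the storage medium itself.

```python
import csv
import io
import json
from typing import Any, Callable, Dict


class UnifiedAccessPath:
    """Routes a request to the channel registered for its format, else to the fallback."""

    def __init__(self, fallback: Callable[[str], Any]) -> None:
        self._channels: Dict[str, Callable[[str], Any]] = {}
        self._fallback = fallback  # plays the role of the "other" channel

    def register(self, data_format: str, reader: Callable[[str], Any]) -> None:
        self._channels[data_format.lower()] = reader

    def fetch(self, data_format: str, payload: str) -> Any:
        # Unmatched formats fall through to the general-purpose channel.
        reader = self._channels.get(data_format.lower(), self._fallback)
        return reader(payload)


def read_json(text: str) -> Any:
    return json.loads(text)


def read_csv(text: str) -> list:
    return list(csv.DictReader(io.StringIO(text)))


path = UnifiedAccessPath(fallback=lambda text: ("other", text))
path.register("Json", read_json)
path.register("CSV", read_csv)
```

Adding support for a new format then means registering one more reader, rather than changing every framework that consumes the data.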
The data source of the source data may be a big data storage framework or another storage framework, in which case the data is obtained through the data acquisition module. It may also be a big data processing framework. That framework then acts as the data source, so the big data platform must be called recursively to obtain the data, after which the corresponding task is executed and the result set is output.
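The recursive case can be illustrated with two kinds of source: a storage source that returns rows directly, and a framework source whose rows are computed from its own upstream sources. This is a toy sketch under stated assumptions; the class names, the `transform` callable and the depth limit are all hypothetical and are not described in the source.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StorageSource:
    """A storage framework: data is read directly."""
    rows: list


@dataclass
class FrameworkSource:
    """A processing framework acting as a data source for another task."""
    inputs: list                       # upstream StorageSource or FrameworkSource objects
    transform: Callable[[List[list]], list]


def resolve(source, depth: int = 0, max_depth: int = 16) -> list:
    """Obtain rows from a source, recursing when the source is itself a framework."""
    if depth > max_depth:
        # Guard against cyclic or runaway dependency chains.
        raise RecursionError("data source nesting is too deep")
    if isinstance(source, StorageSource):
        return list(source.rows)
    upstream = [resolve(item, depth + 1, max_depth) for item in source.inputs]
    return source.transform(upstream)
```

For example, a framework source that doubles the rows of a storage source can itself feed a second framework source, and `resolve` walks the chain from the outermost task down to storage.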
The task execution module receives a task through the big data processing framework bridge and obtains the corresponding execution parameters according to the big data processing framework information and the task information. The execution parameters can specify information such as the resource configuration required by the processes that execute the task, the degree of parallelism of task execution, the heartbeat interval with the processes executing the task, the ports required by those processes, and the options of the underlying JVM. The framework's task context is then built from the execution parameters, and source data is requested from the data acquisition module within the task context. The context here can be understood as a big data platform program running on the client. It connects to the big data platform, distributes the task to specific nodes of the platform for execution, and, according to the parameter information, provides functions such as requesting resources from specific execution nodes of the platform, monitoring task execution, task high availability (HA), and aggregating task result sets.
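One way to picture the step from execution parameters to task context is shown below. It is a minimal sketch: the field names, the per-framework defaults and their values are invented for illustration, since the source lists the kinds of parameters but not their names or defaults.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable


@dataclass(frozen=True)
class ExecutionParams:
    """Hypothetical fields covering the parameter kinds listed in the text."""
    memory_mb: int = 1024         # resource configuration for executing processes
    parallelism: int = 1          # degree of parallelism of task execution
    heartbeat_seconds: int = 10   # heartbeat interval with executing processes
    port: int = 0                 # port required by the executing process
    jvm_options: tuple = ()       # options of the underlying JVM


# Assumed per-framework defaults; the values are illustrative only.
FRAMEWORK_DEFAULTS = {"spark": ExecutionParams(memory_mb=2048, parallelism=4)}


def adapt_parameters(framework: str, task_overrides: dict) -> ExecutionParams:
    """Combine framework information with task information into execution parameters."""
    base = FRAMEWORK_DEFAULTS.get(framework.lower(), ExecutionParams())
    # replace() raises TypeError for unknown keys, which catches misspelled overrides.
    return replace(base, **task_overrides)


@dataclass
class TaskContext:
    framework: str
    params: ExecutionParams

    def request_source_data(self, acquire: Callable[[Any, int], Any], request: Any) -> Any:
        # Source data is requested from inside the context, carrying its parallelism.
        return acquire(request, self.params.parallelism)
```

Because the context owns the parameters, the data acquisition module can honor the task's degree of parallelism without the caller passing it again.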
When the data acquisition module receives the request, the unified access path accesses the source data storage medium through the matched data source channel and obtains the source data for the task execution module to use in executing the task.
The technical characteristic of the unified access path is as follows. Previously, for different big data processing frameworks to access different RDBMSs, they had to call different APIs or obtain the data through the RDBMS JDBC protocol wrapped by each big data processing framework. In the present invention, when the big data processing framework requests source data from the data acquisition module, it only needs to pass the IP address of the specific RDBMS server to be accessed, the username and password, and the object to be accessed to the unified access path, which obtains the source data through the matched data source channel. An access mode is also passed to the unified access path. If the access mode is parallel access, the unified access path provides two access modes:
(1) The caller provides the field used for parallel partitioning, together with the maximum value, minimum value and degree of parallelism for this field; the access is then partitioned automatically and the data is obtained in parallel. Sample code is as follows:
"columnname":"id",
"lowerbound":"1",
"upperbound":"50000",
"numpartitions":"6"
(2) The caller provides the predicate that each parallel access uses to obtain source data; the access is then partitioned automatically and the data is obtained in parallel. Sample code is as follows:
"predicates": Array[String]("id<=2", "id>=4 and id<=5"),
"predicates": Array[String]("id in (2,4,6,8)", "id in (1,3,5,7)"),
Each Array element is one parallel partition.
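Mode (2) can be sketched as one query per predicate, run concurrently. In the example below, SQLite (via Python's standard `sqlite3` module) stands in for the RDBMS only so that the code is runnable as-is; the function names, the `id` column and the thread-pool approach are assumptions rather than the patent's implementation.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def fetch_partition(db_path: str, table: str, predicate: str) -> list:
    """Read the rows of one partition; each call opens its own connection."""
    # The predicate is trusted SQL here. Never build it from untrusted input.
    conn = sqlite3.connect(db_path)
    try:
        query = f"SELECT id FROM {table} WHERE {predicate} ORDER BY id"
        return conn.execute(query).fetchall()
    finally:
        conn.close()


def fetch_by_predicates(db_path: str, table: str, predicates: list) -> list:
    """Run one query per predicate in parallel; results keep the predicate order."""
    workers = max(1, len(predicates))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: fetch_partition(db_path, table, p), predicates))


def demo() -> list:
    """Build a throwaway table with ids 1..8 and read it as two partitions."""
    handle, db_path = tempfile.mkstemp(suffix=".db")
    os.close(handle)
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
        conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 9)])
        conn.commit()
        conn.close()
        return fetch_by_predicates(db_path, "t", ["id in (2,4,6,8)", "id in (1,3,5,7)"])
    finally:
        os.remove(db_path)
```

With this approach the caller is responsible for the predicates: rows that match no predicate are never read, and rows that match more than one are read more than once.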
In summary, the present invention classifies source data access by data format and provides a unified access path, through which the various data source channels, such as the JDBC, Json, TextFile, Parquet, SequenceFile, CSV, OrcFile and Avro channels, are accessed. This unified access path is shared across the big data platform, so no big data processing framework needs a separate implementation for each data source it requires. Separate implementations not only involve a large amount of repetitive development but also make programs bloated and inefficient to run.
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that the specific embodiments described are merely illustrative and do not limit the scope of the present invention. Equivalent modifications and variations made by those skilled in the art in accordance with the spirit of the present invention shall all fall within the scope of protection claimed by the present invention.
Claims (7)
1. A method for implementing a unified access path for big data processing framework source data, characterized in that it comprises:
docking multiple big data processing frameworks in a task execution module, and setting a unified access path in a data acquisition module, the unified access path docking multiple data source channels;
the task execution module receiving a task through the big data processing framework bridge and then requesting source data from the data acquisition module;
when the data acquisition module receives the request, the unified access path accessing the source data storage medium through the matched data source channel and obtaining the source data for the task execution module to use in executing the task.
2. The method for implementing a unified access path for big data processing framework source data according to claim 1, characterized in that: when the big data processing framework requests source data from the data acquisition module, it only needs to pass the IP address of the specific RDBMS server to be accessed, the username and password, and the object to be accessed to the unified access path, and the unified access path obtains the source data through the matched data source channel.
3. The method for implementing a unified access path for big data processing framework source data according to claim 2, characterized in that: an access mode is also passed to the unified access path, and if the access mode is parallel access, the unified access path provides two access modes:
(1) the field used for parallel partitioning is provided, together with the maximum value, minimum value and degree of parallelism for this field, and the access is partitioned automatically to obtain the data in parallel;
(2) the predicate that each parallel access uses to obtain source data is provided, and the access is partitioned automatically to obtain the data in parallel.
4. The method for implementing a unified access path for big data processing framework source data according to claim 1, characterized in that: the big data processing frameworks docked in the task execution module include Spark, Flink, Hive, Pig, GraphLab, Cassandra, MongoDB, Impala, Greenplum, HAWQ, Storm, ElasticSearch, Solr, Hbase and MySQL.
5. The method for implementing a unified access path for big data processing framework source data according to claim 1, characterized in that: the types of data source channel docked by the unified access path include the JDBC channel, Json channel, TextFile channel, Parquet channel, SequenceFile channel, CSV channel, OrcFile channel, Avro channel and other channels.
6. The method for implementing a unified access path for big data processing framework source data according to claim 1, characterized in that: after the task execution module receives a task through the big data processing framework bridge, it obtains the corresponding execution parameters according to the big data processing framework information and the task information, builds the framework's task context from the execution parameters, and then requests source data from the data acquisition module within the task context.
7. The method for implementing a unified access path for big data processing framework source data according to claim 1 or 4, characterized in that: the data source of the source data is a big data storage framework or another storage framework, or alternatively a big data processing framework.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810029082.0A CN108256046A (en) | 2018-01-12 | 2018-01-12 | The implementation method of the unified access path of big data processing frame source data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108256046A true CN108256046A (en) | 2018-07-06 |
Family
ID=62727148
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810029082.0A Pending CN108256046A (en) | 2018-01-12 | 2018-01-12 | The implementation method of the unified access path of big data processing frame source data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108256046A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101944040A (en) * | 2010-09-15 | 2011-01-12 | 复旦大学 | Predicate-based automatic parallel optimizing method |
WO2015157048A1 (en) * | 2014-04-09 | 2015-10-15 | Microsoft Technology Licensing, Llc | Device policy manager |
CN105045607A (en) * | 2015-09-02 | 2015-11-11 | 广东创我科技发展有限公司 | Method for achieving uniform interface of multiple big data calculation frames |
CN106547766A (en) * | 2015-09-18 | 2017-03-29 | 华为技术有限公司 | A kind of data access method and device |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109376154A (en) * | 2018-10-26 | 2019-02-22 | 杭州玳数科技有限公司 | Reading data, wiring method and reading data, writing system |
CN109684399A (en) * | 2018-12-24 | 2019-04-26 | 成都四方伟业软件股份有限公司 | Data bank access method, database access device and Data Analysis Platform |
CN111125013A (en) * | 2019-12-26 | 2020-05-08 | 北京锐安科技有限公司 | Data warehousing method, device, equipment and medium |
CN111125013B (en) * | 2019-12-26 | 2023-03-17 | 北京锐安科技有限公司 | Data warehousing method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||