CN105205105A - Data ETL (Extract Transform Load) system based on storm and treatment method based on storm - Google Patents
- Publication number
- CN105205105A (application number CN201510533323.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- storm
- etl
- controller module
- calculation engine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Stored Programmes (AREA)
- Refuse Collection And Transfer (AREA)
- Processing Of Solid Wastes (AREA)
Abstract
The invention discloses a Storm-based data ETL (Extract, Transform, Load) system and a Storm-based processing method, belonging to the technical field of data ETL management. The system is divided into a controller module, a connector module, and a distributed computing engine. The controller module receives a user command, parses the command settings, and starts a data ETL task. The connector module has built-in connection drivers for relational databases, HBase, and HDFS (Hadoop Distributed File System), which the distributed computing engine calls when it connects to the data source and the target data store. Storm serves as the distributed computing engine: it receives the parameters set by the controller module and carries out the data ETL task. The user does not need to write Storm code and only needs to enter a command; the controller module parses the command, configures Storm automatically, and launches the ETL task. The connection drivers for all supported data sources and target data stores are packaged in the connector, and the controller selects and calls them automatically.
Description
Technical field
The present invention discloses a Storm-based data ETL system and processing method, belonging to the technical field of data ETL management.
Background
Data integration is the logical or physical consolidation of data that differs in source, format, and characteristics, so as to provide comprehensive data sharing. It is an important component of enterprise business intelligence and data warehousing. ETL is the primary solution for enterprise data integration; the three letters stand for Extract, Transform, and Load. Data extraction is the process of pulling data out of a data source. As enterprise data volumes keep growing, traditional relational databases can no longer meet user demand, and data must be migrated to horizontally scalable data warehouses, for example platforms built on Hadoop or an MPP architecture. Data is generally extracted from a database either by directly exporting backup data or by reading it through an interface such as JDBC (Java DataBase Connectivity). Reading through an interface such as JDBC is comparatively flexible, but it is very inefficient unless multiple threads run in parallel, particularly in the big data era, when database tables holding more than a hundred million rows often have to be extracted. For reasons of cost and efficiency, mainstream big data platforms currently use ETL tools to migrate data, but no mature ETL tool based on a stream data processing framework exists yet.
The present invention provides a Storm-based data ETL system and processing method that uses Storm as the computing engine for data ETL and uses a controller to interact with the user. The user does not need to write code and only needs to enter a command to run a data ETL operation on Storm; the controller configures Storm automatically and launches the ETL task. The connection drivers for all supported data sources and target data stores are encapsulated in the connector, from which the controller selects and calls them automatically. The system and processing method are well suited to using a stream data framework to integrate data from relational databases into a data warehouse.
Summary of the invention
No mature ETL tool based on a stream data processing framework currently exists, so the migration of large volumes of stream data is not well solved. To address this problem, the present invention provides a Storm-based data ETL system and processing method, which are well suited to using a stream data framework to integrate data from relational databases into a data warehouse.
The specific scheme proposed by the present invention is as follows:
A Storm-based data ETL system comprises a controller module, a connector module, and a distributed computing engine.
The controller module receives the user's input, selects the connectors for the data source and the target data store on behalf of the distributed computing engine, configures the ETL topology of the distributed computing engine, and, once configuration is complete, calls the distributed computing engine to launch the ETL task.
The connector module has built-in connection drivers for relational databases, HBase, and HDFS, which the distributed computing engine calls when it connects to the data source and the target data store.
The distributed computing engine uses the distributed stream data processing framework Storm to perform data ETL. Its tasks run concurrently as configured by the controller module; it is responsible for extracting, cleaning, and loading the data, and writes the data to the target data store.
The topology comprises the number of execution threads, the partition of the data source that each thread extracts, the data fields to be cleaned, and so on.
A Storm-based data ETL processing method uses the Storm-based data ETL system described above. The controller module parses the user's input and selects the connection drivers for the data source and the target store from the connector module. The controller module then configures the topology of the distributed computing engine, assigns the connection drivers, and starts the ETL task. The distributed computing engine extracts data from the data source, cleans it, and writes it to the target data store.
The user's input comprises the SQL statement, the data source table name, the data source connection string, the connection string of the target data store, the number of ETL execution threads, the field separator and line separator of the target data, and so on.
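The inputs listed above can be gathered by an ordinary command-line parser. The following is a minimal sketch in Python, not the patent's implementation: the patent does not specify a command syntax, so the option names (`--sql`, `--source-url`, and so on), the defaults, and the `EtlJobConfig` class are all assumptions made for illustration.

```python
import argparse
from dataclasses import dataclass


@dataclass
class EtlJobConfig:
    """Settings for one ETL job; every field name here is hypothetical."""
    sql: str
    source_table: str
    source_url: str
    target_url: str
    threads: int
    field_sep: str
    line_sep: str


def parse_command(argv):
    """Parse a list of command-line tokens into an EtlJobConfig."""
    parser = argparse.ArgumentParser(prog="etl")
    parser.add_argument("--sql", required=True)
    parser.add_argument("--source-table", required=True)
    parser.add_argument("--source-url", required=True)
    parser.add_argument("--target-url", required=True)
    # Assumed defaults: one thread, comma-separated fields, newline-terminated rows.
    parser.add_argument("--threads", type=int, default=1)
    parser.add_argument("--field-sep", default=",")
    parser.add_argument("--line-sep", default="\n")
    args = parser.parse_args(argv)
    if args.threads < 1:
        parser.error("--threads must be at least 1")
    return EtlJobConfig(
        sql=args.sql,
        source_table=args.source_table,
        source_url=args.source_url,
        target_url=args.target_url,
        threads=args.threads,
        field_sep=args.field_sep,
        line_sep=args.line_sep,
    )
```

A controller built this way would pass the resulting object on to the topology-configuration step, so that no later step has to re-read raw command text.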
The controller module configures the topology of the distributed computing engine as follows. The controller module queries the total number of rows in the data source table, rewrites the SQL statement, and sets the data boundaries of each ETL thread according to the thread count entered by the user. The controller module sets up the spout that extracts data from the relational database, the bolt that cleans the data, and the bolt that writes data to the target data store, and assigns the connection drivers to the spout and to the bolt that writes the target data. This completes the configuration of the Storm spout and bolt topology.
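The row-count, boundary, and SQL-rewrite steps can be sketched as below. This is an illustration under stated assumptions: the patent does not say how boundaries are computed or how the SQL is rewritten, so the even split, the `LIMIT ... OFFSET ...` wrapping (MySQL/PostgreSQL-style syntax), and the function names `split_ranges` and `rewrite_sql` are all hypothetical.

```python
def split_ranges(total_rows, threads):
    """Split rows [0, total_rows) into contiguous (offset, limit) slices, one per thread."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    base, extra = divmod(total_rows, threads)
    ranges = []
    offset = 0
    for i in range(threads):
        # The first `extra` threads take one additional row each.
        limit = base + (1 if i < extra else 0)
        ranges.append((offset, limit))
        offset += limit
    return ranges


def rewrite_sql(sql, offset, limit):
    """Wrap the user's query so that it returns only one thread's slice."""
    inner = sql.rstrip().rstrip(";")
    return f"SELECT * FROM ({inner}) AS src LIMIT {limit} OFFSET {offset}"


# Usage: one rewritten statement per ETL thread.
statements = [
    rewrite_sql("SELECT id, name FROM orders;", offset, limit)
    for offset, limit in split_ranges(10, 3)
]
```

Offset-based slicing only yields disjoint slices if the query has a stable ordering; a real implementation would likely add an `ORDER BY` on a key column or partition by key ranges instead.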
The beneficial effects of the present invention are as follows:
The present invention provides a Storm-based data ETL system and processing method. The system is divided into a controller module, a connector module, and a distributed computing engine. The controller module receives the user's command, parses the command settings, and starts the data ETL task. The connector module has built-in connection drivers for relational databases, HBase, and HDFS, which the distributed computing engine calls when it connects to the data source and the target data store. Storm is used as the distributed computing engine; it receives the parameters set by the controller module and carries out the data ETL task. The user does not need to write Storm code and only needs to enter a command: the controller module parses the user command, configures Storm automatically, and launches the ETL task. The connection drivers for all supported data sources and target data stores are encapsulated in the connector, from which the controller selects and calls them automatically. Together, the system and the processing method are well suited to using a stream data framework to integrate data from relational databases into a data warehouse.
Brief description of the drawings
Fig. 1 is a schematic diagram of the system architecture of the present invention.
Embodiment
The present invention is further described below with reference to the accompanying drawing.
A Storm-based data ETL system comprises a controller module, a connector module, and a distributed computing engine.
The controller module receives the user's input, selects the connectors for the data source and the target data store on behalf of the distributed computing engine, configures the ETL topology of the distributed computing engine, and, once configuration is complete, calls the distributed computing engine to launch the ETL task.
The topology comprises the number of execution threads, the partition of the data source that each thread extracts, the data fields to be cleaned, and so on.
The connector module has built-in connection drivers for relational databases, HBase, and HDFS, which the distributed computing engine calls when it connects to the data source and the target data store.
The distributed computing engine uses the distributed stream data processing framework Storm to perform data ETL. Its tasks run concurrently as configured by the controller module; it is responsible for extracting, cleaning, and loading the data, and writes the data to the target data store. In the Storm framework, the topology of a Storm task is made up of spouts and bolts. A spout receives the data stream and passes it to a bolt; a bolt cleans the data and sends it on to the next bolt or writes it to the target data store. Multiple spouts and bolts can be configured and combined into a complex DAG. The spouts and bolts form the topology of the Storm task, and the controller module can generate them automatically from the user's command.
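The spout-to-bolt data flow described above can be illustrated without Storm itself. The sketch below uses plain Python generators to stand in for a spout, a cleaning bolt, and a writing bolt; it does not use the Storm API, runs in a single process, and the cleaning rule (trim whitespace, drop rows with an empty required field) is an assumption chosen only to make the example concrete.

```python
def spout(rows):
    """Emit source rows one at a time (stands in for a Storm spout)."""
    for row in rows:
        yield row


def cleaning_bolt(stream, fields):
    """Trim the given fields and drop any row where one of them is missing or blank."""
    for row in stream:
        cleaned = dict(row)
        keep = True
        for name in fields:
            value = cleaned.get(name)
            if value is None or not str(value).strip():
                keep = False
                break
            cleaned[name] = str(value).strip()
        if keep:
            yield cleaned


def writer_bolt(stream, columns, field_sep=",", line_sep="\n"):
    """Serialize rows as delimited text (stands in for the bolt that writes the target store)."""
    return "".join(
        field_sep.join(str(row[c]) for c in columns) + line_sep
        for row in stream
    )


# Usage: chain the three stages the way a linear topology would.
source_rows = [
    {"id": 1, "name": " alice "},
    {"id": 2, "name": ""},
    {"id": 3, "name": "carol"},
]
output = writer_bolt(cleaning_bolt(spout(source_rows), ["name"]), ["id", "name"], "|", "\n")
```

In a real Storm topology each stage would run as many parallel tasks across a cluster, and the writer would call a connection driver rather than return a string.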
A Storm-based data ETL processing method uses the system described above. The controller module parses the user's input, which comprises the SQL statement, the data source table name, the data source connection string, the connection string of the target data store, the number of ETL execution threads, the field separator and line separator of the target data, and so on.
The controller module selects the connection drivers for the data source and the target store from the connector module.
The controller module configures the topology of the distributed computing engine, assigns the connection drivers, and starts the ETL task.
The controller module configures the topology of the distributed computing engine as follows. The controller module queries the total number of rows in the data source table, rewrites the SQL statement, and sets the data boundaries of each ETL thread according to the thread count entered by the user. The controller module sets up the spout that extracts data from the relational database, the bolt that cleans the data, and the bolt that writes data to the target data store, and assigns the connection drivers to the spout and to the bolt that writes the target data. This completes the configuration of the Storm spout and bolt topology.
The controller starts the Storm task, and Storm begins the data ETL operation: the distributed computing engine extracts data from the data source, cleans it, and writes it to the target data store.
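The connector-selection step in the method above could be driven by the form of the connection string. The sketch below is one hedged way to do that: the registry, the connector names, and the `hbase://` prefix are hypothetical, while `jdbc:` and `hdfs://` are the conventional prefixes for JDBC URLs and HDFS URIs. The patent does not state how the controller chooses a driver.

```python
# Hypothetical registry mapping a connection-string prefix to a connector name.
CONNECTORS = {
    "jdbc:": "RelationalConnector",
    "hbase://": "HBaseConnector",  # assumed prefix, for illustration only
    "hdfs://": "HdfsConnector",
}


def select_connector(connection_string):
    """Return the name of the connector whose prefix matches the connection string."""
    lowered = connection_string.lower()
    for prefix, name in CONNECTORS.items():
        if lowered.startswith(prefix):
            return name
    raise ValueError(f"no connector for {connection_string!r}")


# Usage: pick one connector for the source and one for the target.
source_connector = select_connector("jdbc:mysql://db:3306/sales")
target_connector = select_connector("hdfs://namenode:8020/warehouse/orders")
```

Keeping the mapping in one table means a new store type can be supported by registering one more entry, which matches the patent's idea of encapsulating all drivers in the connector module.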
Claims (5)
1. A Storm-based data ETL system, characterized in that it comprises a controller module, a connector module, and a distributed computing engine;
the controller module receives the user's input, selects the connectors for the data source and the target data store on behalf of the distributed computing engine, configures the ETL topology of the distributed computing engine, and, once configuration is complete, calls the distributed computing engine to launch the ETL task;
the connector module has built-in connection drivers for relational databases, HBase, and HDFS, which the distributed computing engine calls when it connects to the data source and the target data store;
the distributed computing engine uses the distributed stream data processing framework Storm to perform data ETL, runs its tasks concurrently as configured by the controller module, is responsible for extracting, cleaning, and loading the data, and writes the data to the target data store.
2. The Storm-based data ETL system according to claim 1, characterized in that the topology comprises the number of execution threads, the partition of the data source that each thread extracts, and the data fields to be cleaned.
3. A Storm-based data ETL processing method, characterized in that it uses the Storm-based data ETL system according to claim 1 or 2: the controller module parses the user's input and selects the connection drivers for the data source and the target store from the connector module; the controller module configures the topology of the distributed computing engine, assigns the connection drivers, and starts the ETL task; the distributed computing engine extracts data from the data source, cleans it, and writes it to the target data store.
4. The Storm-based data ETL processing method according to claim 3, characterized in that the user's input comprises the SQL statement, the data source table name, the data source connection string, the connection string of the target data store, the number of ETL execution threads, and the field separator and line separator of the target data.
5. The Storm-based data ETL processing method according to claim 4, characterized in that the controller module configures the topology of the distributed computing engine as follows: the controller module queries the total number of rows in the data source table, rewrites the SQL statement, and sets the data boundaries of each ETL thread according to the thread count entered by the user; the controller module sets up the spout that extracts data from the relational database, the bolt that cleans the data, and the bolt that writes data to the target data store, and assigns the connection drivers to the spout and to the bolt that writes the target data, completing the configuration of the Storm spout and bolt topology.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510533323.1A CN105205105B (en) | 2015-08-27 | 2015-08-27 | A kind of ETL process system and processing method based on storm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510533323.1A CN105205105B (en) | 2015-08-27 | 2015-08-27 | A kind of ETL process system and processing method based on storm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105205105A (en) | 2015-12-30 |
CN105205105B (en) | 2019-04-16 |
Family
ID=54952789
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510533323.1A Active CN105205105B (en) | 2015-08-27 | 2015-08-27 | A kind of ETL process system and processing method based on storm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105205105B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103714151A (en) * | 2013-12-26 | 2014-04-09 | 北京锐安科技有限公司 | One-way optical gate and method for carrying out data synchronizing between heterogeneous databases |
CN103955502A (en) * | 2014-04-24 | 2014-07-30 | 科技谷(厦门)信息技术有限公司 | Visualized on-line analytical processing (OLAP) application realizing method and system |
CN104036025A (en) * | 2014-06-27 | 2014-09-10 | 蓝盾信息安全技术有限公司 | Distribution-base mass log collection system |
CN104317928A (en) * | 2014-10-31 | 2015-01-28 | 北京思特奇信息技术股份有限公司 | Service ETL (extraction-transformation-loading) method and service ETL system both based on distributed database |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105824892A (en) * | 2016-03-11 | 2016-08-03 | 广东电网有限责任公司电力科学研究院 | Method for synchronizing and processing data by data pool |
CN107545014A (en) * | 2016-06-28 | 2018-01-05 | 国网天津市电力公司 | Stream calculation instant disposal system for treating based on Storm |
CN106250571A (en) * | 2016-10-11 | 2016-12-21 | 北京集奥聚合科技有限公司 | The method and system that a kind of ETL data process |
CN106777933A (en) * | 2016-12-02 | 2017-05-31 | 郑州云海信息技术有限公司 | A kind of collecting method, apparatus and system |
CN106611046A (en) * | 2016-12-16 | 2017-05-03 | 武汉中地数码科技有限公司 | Big data technology-based space data storage processing middleware framework |
CN106649119A (en) * | 2016-12-28 | 2017-05-10 | 深圳市华傲数据技术有限公司 | Stream computing engine testing method and device |
CN106649119B (en) * | 2016-12-28 | 2019-09-20 | 深圳市华傲数据技术有限公司 | The test method and device of stream calculation engine |
WO2018184418A1 (en) * | 2017-04-06 | 2018-10-11 | 平安科技(深圳)有限公司 | Data cleaning method, terminal and computer readable storage medium |
WO2019000628A1 (en) * | 2017-06-25 | 2019-01-03 | 平安科技(深圳)有限公司 | Source table structure parsing method and system, application server and computer-readable storage medium |
CN107688598A (en) * | 2017-06-25 | 2018-02-13 | 平安科技(深圳)有限公司 | Source table structure analysis method, application server and computer-readable recording medium |
CN107688598B (en) * | 2017-06-25 | 2021-02-09 | 平安科技(深圳)有限公司 | Source table structure analysis method, application server and computer readable storage medium |
CN107678852A (en) * | 2017-10-26 | 2018-02-09 | 携程旅游网络技术(上海)有限公司 | Method, system, equipment and the storage medium calculated in real time based on flow data |
CN107678852B (en) * | 2017-10-26 | 2021-06-22 | 携程旅游网络技术(上海)有限公司 | Method, system, equipment and storage medium based on stream data real-time calculation |
CN108256045A (en) * | 2018-01-12 | 2018-07-06 | 福建星瑞格软件有限公司 | The structuring parsing of real-time streaming data, the method and computer equipment of stream calculation |
CN109522004A (en) * | 2018-11-09 | 2019-03-26 | 福建南威软件有限公司 | A kind of method that ETL process is run in distributed structure/architecture |
CN110442602A (en) * | 2019-07-02 | 2019-11-12 | 新华三大数据技术有限公司 | Data query method, apparatus, server and storage medium |
CN110471977A (en) * | 2019-08-22 | 2019-11-19 | 杭州数梦工场科技有限公司 | A kind of method for interchanging data, device, equipment, medium |
CN110471977B (en) * | 2019-08-22 | 2022-04-22 | 杭州数梦工场科技有限公司 | Data exchange method, device, equipment and medium |
CN112700622A (en) * | 2020-12-21 | 2021-04-23 | 中铁二院工程集团有限责任公司 | Storm-based railway geological disaster monitoring big data preprocessing method and system |
CN112700622B (en) * | 2020-12-21 | 2022-05-17 | 中铁二院工程集团有限责任公司 | Storm-based railway geological disaster monitoring big data preprocessing method and system |
CN114048195A (en) * | 2022-01-13 | 2022-02-15 | 合肥臻谱防务科技有限公司 | Data migration method and system and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN105205105B (en) | 2019-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105205105A (en) | Data ETL (Extract Transform Load) system based on storm and treatment method based on storm | |
CN102033748B (en) | Method for generating data processing flow codes | |
CN102663114B (en) | Database inquiry processing method facing concurrency OLAP (On Line Analytical Processing) | |
US10515118B2 (en) | Processing a data flow graph of a hybrid flow | |
CN106126601A (en) | A kind of social security distributed preprocess method of big data and system | |
CN107958057A (en) | A kind of code generating method and device for being used for Data Migration in heterogeneous database | |
CN103425762A (en) | Telecom operator mass data processing method based on Hadoop platform | |
CN102752372A (en) | File based database synchronization method | |
CN105550268A (en) | Big data process modeling analysis engine | |
WO2012151149A4 (en) | Managing data queries | |
CN103246549B (en) | A kind of method and system of data conversion storage | |
US20140101167A1 (en) | Creation of Inverted Index System, and Data Processing Method and Apparatus | |
CN103430144A (en) | Data source analytics | |
CN103186541A (en) | Generation method and device for mapping relationship | |
CN107301214A (en) | Data migration method, device and terminal device in HIVE | |
CN104915414A (en) | Data extraction method and device | |
CN112214453B (en) | Large-scale industrial data compression storage method, system and medium | |
CN104462351A (en) | Data query model and method for MapReduce pattern | |
US20140101105A1 (en) | Method and apparatus for data migration from hierarchical database of mainframe system to rehosting solution database of open system | |
CN103617273A (en) | SOL script objectification method and system | |
CN112379884A (en) | Spark and parallel memory computing-based process engine implementation method and system | |
CN104182436A (en) | Method and device for cleaning databases | |
CN102663020A (en) | CDC data distribution method and device thereof | |
CN106227799A (en) | A kind of sql statement processing method based on distributed data base | |
CN104881461A (en) | Rapid data storage method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20230320. Address after: 250000 building S02, No. 1036, Langchao Road, high tech Zone, Jinan City, Shandong Province. Patentee after: Shandong Inspur Scientific Research Institute Co.,Ltd. Address before: No. 1036, Shun Ya Road, Ji'nan high tech Zone, Shandong Province. Patentee before: INSPUR GROUP Co.,Ltd. |