CN106897411A - ETL system and its method based on Spark technologies - Google Patents
Classifications
- G06F16/25 — Integrating or interfacing systems involving database management systems
- G06F16/254 — Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
- G06F16/258 — Data format conversion from or to a database
Abstract
The present invention discloses an ETL system based on Spark technology, comprising a data extraction module, a data processing module, a data integration module, a data output module, a metadata management module and a data storage module. The data storage module comprises a staging data repository, an integrated data repository and a metadata control file. The data extraction module extracts source data, dynamically generates multiple Spark RDDs on distributed nodes, and processes them in parallel. The data processing module reads the Spark RDDs generated by the data extraction module and stores the data in the staging data repository after metadata matching checks and data conversion. The data integration module merges the current day's staging data with the previous day's integrated data and stores the result in the integrated data repository. The data output module performs format conversion and output on the current day's integrated data. Based on Spark technology, the present invention scales linearly and smoothly, runs fast, requires no manual intervention, and is easy to manage and maintain.
Description
Technical field
The present invention relates to an ETL system and method, and more particularly to an ETL system and method based on Spark technology.
Background art
With the development of big data, enterprises pay increasing attention to data-related development and applications in order to gain market opportunities. Big data applications cannot do without the cleaning and processing of massive data, and enterprises generally use mainstream ETL (extract, transform and load) products, or code data processing directly as database stored procedures.
At present, mainstream ETL products are mostly based on single-machine architectures. When processing massive data, I/O throughput and system resources become bottlenecks, and scaling is difficult and expensive. On the other hand, ETL products focus on the ease of use of the operation interface, with each data processing step designed graphically, but metadata management is separated from the ETL product, so the related designs must be changed manually whenever the data structure changes. Database stored-procedure coding has the same problem, and is inefficient to develop and difficult to maintain.
With the appearance of Hadoop, the linear scalability and low cost of the distributed platform make up well for the deficiencies of single-machine data platforms. Using Hive on the Hadoop platform for data processing has gained wide acceptance, but Hive is based on the MapReduce distributed computing framework, which has an inherent deficiency: every MapReduce step must read and write disk, making it a high-latency iterative process.
In summary, it is necessary to design an ETL system and method based on Spark technology to remedy the above drawbacks.
Summary of the invention
The present invention proposes an ETL system and method based on Spark technology, which solves the prior-art defects that, when processing massive data, I/O throughput and system resources become bottlenecks and scaling is difficult and expensive. Based on Spark technology, the present invention scales linearly and smoothly, runs fast without manual intervention, and is easy to manage and maintain, fully meeting the ETL processing needs of all industries, particularly large enterprises.
The technical solution of the invention is realized as follows:
The present invention discloses an ETL system based on Spark technology, comprising a data extraction module, a data processing module, a data integration module, a data output module, a metadata management module and a data storage module. The data storage module comprises a staging data repository, an integrated data repository and a metadata control file. The data extraction module extracts source data, dynamically generates multiple Spark RDDs on distributed nodes according to a data partitioning rule, and then starts multiple threads from a thread pool to call the data processing module, processing each Spark RDD in parallel. The data processing module reads the Spark RDDs generated by the data extraction module, obtains processed data through metadata matching checks and a series of data conversions, and stores it in the staging data repository. The data integration module performs full-volume data integration or historical data integration on the current day's staging data and the previous day's integrated data, obtains the current day's integrated data, and stores it in the integrated data repository. The data output module performs format conversion and output on the current day's integrated data according to the data format requirements of the data application system. The metadata management module parameterizes the definition and management of the various elements of the system.
Wherein, the data extraction module comprises a data access module and a first distributed dataset generation module. The data access module directly reads compressed data files by means of offline extraction and online extraction and sends them to the first distributed dataset generation module for subsequent processing. The first distributed dataset generation module reads the source data through the data access module and dynamically generates multiple Spark RDDs on distributed nodes according to the data partitioning rule for further processing by the data processing module.
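The thread-pool dispatch described above (one worker per generated Spark RDD) can be sketched in plain Python, with lists standing in for RDD partitions; `process_partition` is a hypothetical stand-in for the data processing module, not code from the patent:

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    # Stand-in for the data processing module: uppercase each
    # record to show independent per-partition work.
    return [rec.upper() for rec in partition]

def parallel_process(partitions, max_workers=4):
    # One task per partition, mirroring "one thread per Spark RDD";
    # pool.map preserves the original partition order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_partition, partitions))
```

In the real system each task would invoke Spark transformations rather than a local function; the sketch only shows the fan-out pattern.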
Wherein, the data processing module comprises a data review module and a data conversion module. The data review module checks the data against the metadata and generates a data audit report; the data conversion module cleans and converts the data.
Wherein, the data integration module comprises a second distributed dataset generation module, a full-volume data integration module and a historical data integration module. The second distributed dataset generation module reads the current day's staging data and the previous day's integrated data respectively, generating corresponding Spark RDDs on distributed nodes. The full-volume data integration module reads the Spark RDDs generated by the second distributed dataset generation module and, according to the insert, delete and update flags in the staging data, inserts, deletes or updates the records with matching key values in the previous day's integrated-data Spark RDD; when processing is complete, the resulting staging-data Spark RDD is stored in the integrated data repository. The historical data integration module reads the current day's staging-data Spark RDD and, according to the insert, delete and update flags in the staging data, applies the corresponding handling to the records with matching key values in the previous day's integrated-data Spark RDD; when processing is complete, the result is stored in the integrated data repository.
Wherein, the data output module comprises a third distributed dataset generation module and a target data output module. The third distributed dataset generation module reads the current day's integrated data and generates corresponding Spark RDDs on distributed nodes. The target data output module reads the Spark RDDs generated by the third distributed dataset generation module and performs format conversion and output on the current day's integrated data.
Wherein, the metadata management module comprises a metadata definition module, a metadata check module and a metadata export module. The metadata definition module defines and maintains metadata directly in Excel, the metadata including: data source information, source data structure, target data format, target data structure, data conversion rules and expressions. The metadata check module checks the metadata against a series of metadata specifications and outputs a metadata audit report. The metadata export module exports the metadata in Excel as a metadata control file.
The invention also discloses a method for the ETL system based on Spark technology, comprising the following steps: (S01) extracting source data, dynamically generating multiple Spark RDDs on distributed nodes according to the data partitioning rule, and then starting multiple threads from a thread pool to call the data processing module, processing each Spark RDD in parallel; (S02) reading the Spark RDDs generated by the data extraction module, obtaining processed data through metadata matching checks and a series of data conversions, and storing it in the staging data repository; (S03) performing full-volume data integration or historical data integration on the current day's staging data and the previous day's integrated data, obtaining the current day's integrated data, and storing it in the integrated data repository; (S04) performing format conversion and output on the current day's integrated data according to the data format requirements of the data application system; (S05) parameterizing the definition and management of the various elements of the system.
Wherein, step (S01) comprises the following steps: (S11) directly reading compressed data files by means of offline extraction and online extraction, and sending them to the first distributed dataset generation module for subsequent processing; (S12) reading the source data and dynamically generating multiple Spark RDDs on distributed nodes according to the data partitioning rule for further processing by the data processing module.
Wherein, step (S02) comprises the following steps: (S21) checking the data against the metadata and generating a data audit report; (S22) cleaning and converting the data using the rules or expressions defined in the metadata.
Wherein, step (S03) comprises the following steps: (S31) reading the current day's staging data and the previous day's integrated data respectively, generating corresponding Spark RDDs on distributed nodes; (S32) reading the Spark RDDs generated in step (S31) and, according to the insert, delete and update flags in the staging data, inserting, deleting or updating the records with matching key values in the previous day's integrated-data Spark RDD; when processing is complete, storing the resulting staging-data Spark RDD in the integrated data repository; (S33) reading the current day's staging-data Spark RDD formed in step (S32) and, according to the insert, delete and update flags in the staging data, applying the corresponding handling to the records with matching key values in the previous day's integrated-data Spark RDD; when processing is complete, storing the result in the integrated data repository.
Wherein, step (S04) comprises the following steps: (S41) reading the current day's integrated data and generating corresponding Spark RDDs on distributed nodes; (S42) reading the Spark RDDs generated in step (S41) and performing format conversion and output on the current day's integrated data.
Wherein, step (S05) comprises the following steps: (S51) defining and maintaining metadata directly in Excel, the metadata including: data source information, source data structure, target data format, target data structure, data conversion rules and expressions; (S52) checking the metadata against a series of metadata specifications and outputting a metadata audit report; (S53) exporting the metadata in Excel as a metadata control file.
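Steps (S01) through (S04) can be illustrated end to end with a minimal, self-contained Python sketch. All function names and the toy comma-separated record format are assumptions for illustration only, not the patent's implementation:

```python
def extract(source):
    # (S01) split source records into fixed-size partitions (toy rule).
    size = 2
    return [source[i:i + size] for i in range(0, len(source), size)]

def transform(partitions):
    # (S02) metadata check + conversion: expect exactly 2 fields,
    # strip whitespace, drop records that fail the check.
    staged = []
    for part in partitions:
        for rec in part:
            fields = [f.strip() for f in rec.split(",")]
            if len(fields) == 2:
                staged.append(fields)
    return staged

def integrate(staged, previous):
    # (S03) merge today's staged data into yesterday's integrated data by key.
    merged = dict(previous)
    for key, value in staged:
        merged[key] = value
    return merged

def output(integrated):
    # (S04) format conversion for the consuming application.
    return sorted(f"{k}|{v}" for k, v in integrated.items())

def run_etl(source, previous):
    return output(integrate(transform(extract(source)), previous))
```

The real system runs each stage on Spark RDDs across distributed nodes; the sketch only shows how the four stages chain together.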
Compared with the prior art, the present invention has the following advantages:
1. Metadata management that is easy to use and maintain. The present invention manages and configures metadata in easy-to-use Excel tables; metadata changes are maintained directly in the Excel tables and are clear at a glance.
2. No programming, out of the box, automatic operation. The present invention deploys rapidly, works out of the box, and provides a mature, complete ETL toolbox covering common bank-data ETL requirements. Once the metadata is set, the system automatically pipelines data extraction, data processing, data integration, data output and the other modules without manual intervention. When the source data changes, only the corresponding metadata needs to be changed, without programming.
3. In-memory computation, multiplied performance, linear scaling. The present invention uses the Scala programming language for a perfect fit with Spark; using Spark's distributed in-memory parallel computing, intermediate results are cached in memory, reducing disk I/O; running processing jobs concurrently with multiple threads improves ETL performance and resource utilization. Compared with single-machine ETL tools and Hadoop MapReduce, the present invention runs several times faster.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative work.
Fig. 1 is a structural block diagram of the ETL system based on Spark technology of the present invention.
Fig. 2 is a structural block diagram of the data extraction module of the present invention.
Fig. 3 is a structural block diagram of the data processing module of the present invention.
Fig. 4 is a structural block diagram of the data integration module of the present invention.
Fig. 5 is a structural block diagram of the data output module of the present invention.
Fig. 6 is a structural block diagram of the metadata management module of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
To aid and clarify the description of the subsequent embodiments, some terms are explained before the specific embodiments of the present invention are described in detail; these explanations apply to this specification and the claims.
ETL, as used in the present invention, is the abbreviation of Extract-Transform-Load, describing the process by which data is extracted, transformed and loaded from a source to a destination. The term ETL is most often used in connection with data warehouses, but its object is not limited to them. ETL is an important link in building a data warehouse: the user extracts the required data from the data source, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model.
RDD stands for Resilient Distributed Dataset: a read-only, partitionable distributed dataset, all or part of which can be cached in memory and reused across multiple computations.
Any other English terms appearing in the present invention are code identifiers with no other meaning.
Referring to Figs. 1 to 6, the present invention discloses an ETL system based on Spark technology, comprising a data extraction module, a data processing module, a data integration module, a data output module, a metadata management module and a data storage module. The data storage module comprises a staging data repository, an integrated data repository and a metadata control file. In the present invention the staging data repository and the integrated data repository use a columnar storage format (such as Parquet), which substantially reduces storage space and improves query performance; the metadata control file uses an XML file to store the metadata, making metadata management simple and universal.
The data extraction module of the present invention supports relational databases, structured data files (optionally compressed), big data sources (such as HDFS, Hive, JSON, Parquet) and other heterogeneous data sources. It extracts source data, dynamically generates multiple Spark RDDs on distributed nodes according to the data partitioning rule, and then starts multiple threads from a thread pool to call the data processing module, processing each Spark RDD in parallel. The data processing module reads the Spark RDDs generated by the data extraction module, obtains processed data through metadata matching checks and a series of data conversions, and stores it in the staging data repository. The data integration module performs full-volume data integration or historical data integration on the current day's staging data and the previous day's integrated data, obtains the current day's integrated data, and stores it in the integrated data repository. The data output module performs format conversion and output on the current day's integrated data according to the data format requirements of the data application system; the supported output formats include structured data files (optionally compressed), relational databases, Hive and the like.
The metadata management module of the present invention parameterizes the definition and management of the various elements of the system, which include data source information, source data structure, target data format, target data structure, data conversion rules, expressions and so on. Once the metadata of the present invention is set, the system automatically pipelines data extraction, data processing, data integration, data output and the other modules without manual intervention. When the source data changes, only the corresponding metadata needs to be changed, without reprogramming.
Wherein, the data extraction module comprises a data access module and a first distributed dataset generation module. The data access module directly reads compressed data files by means of offline extraction and online extraction and sends them to the first distributed dataset generation module for subsequent processing. The first distributed dataset generation module reads the source data through the data access module and dynamically generates multiple Spark RDDs on distributed nodes according to the data partitioning rule for further processing by the data processing module.
The data access module of the present invention is the interface layer, the unified data channel through which the present invention connects to data sources; through this channel, source data can be extracted with high concurrency and high reliability. The data access module supports relational databases, structured data files (optionally compressed), big data sources (such as HDFS, Hive, JSON, Parquet) and other heterogeneous data sources. It supports two data extraction modes, offline extraction and online extraction. In offline extraction, the origin system generates a data file, compresses it, sends it to this system through file transfer software (such as FTP), and this system then loads it through data access; in online extraction, source data is extracted to this system online in real time through data access. The data access module of the present invention also supports directly reading data files in compressed format and performing the subsequent processing, with all data cached in memory and never written to disk. Compared with the traditional method of first decompressing the file and then reading the data file, this gives an obvious speed improvement while also reducing the network transmission of data files.
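Reading a compressed file directly, decompressing in memory rather than unpacking it to disk first, can be sketched with Python's standard gzip module (the patent does not name a specific compression format; gzip here is an assumption):

```python
import gzip

def write_compressed(path, lines):
    # Origin-system side: emit a gzip-compressed data file.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("\n".join(lines))

def read_compressed(path):
    # Read the compressed file directly; records are decompressed
    # in memory and never written out as a plain-text file first.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read().splitlines()
```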
The general principle of the data partitioning rule of the present invention is to calculate dynamically from the total data volume and the data record size, obtain the number of Spark RDDs, and load data into each Spark RDD by rule. The data partitioning rule remedies the deficiency of Spark's default generation of Spark RDDs from the HDFS block-size parameter: it ensures that each data record is assigned intact and undivided to a single Spark RDD, thereby reducing cross-RDD computation and improving data processing performance.
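A minimal sketch of such a partitioning rule, assuming a 128 MB target partition size and round-robin assignment (both are assumptions; the patent gives no concrete figures):

```python
def plan_partitions(total_bytes, record_bytes, target_partition_bytes=128 * 1024 * 1024):
    # Dynamically size the partition count from the total data volume:
    # at least one partition, never more partitions than records.
    records = max(1, total_bytes // record_bytes)
    count = max(1, -(-total_bytes // target_partition_bytes))  # ceiling division
    return min(count, records)

def assign(records, n_partitions):
    # Round-robin whole records into partitions: a record is never
    # split across two partitions (unlike raw HDFS block boundaries).
    parts = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        parts[i % n_partitions].append(rec)
    return parts
```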
Wherein, the data processing module comprises a data review module and a data conversion module. The data review module checks the data against the metadata and generates a data audit report; the data conversion module cleans and converts the data. The data audit report is supplied to the data administrator for an understanding of source data quality, and can also be supplied to the system administrator of the origin system for reference.
When the data review module performs the metadata matching check, it inspects the data being read against the predefined metadata, covering the number of data fields, field data types, field lengths, field data formats, and conformance to business check expressions. After the check it outputs a data audit report recording the problematic records and their timestamps, along with the overall inspection figures, such as the total number of records checked and how many of them are problematic.
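The field-level checks and the summary figures might look like the following plain-Python sketch; the metadata schema used here (`name`, `type`, `max_len`) is a hypothetical simplification of the patent's metadata:

```python
def check_record(record, metadata):
    # Validate one record against the metadata: field count, type, length.
    problems = []
    if len(record) != len(metadata):
        return ["field count mismatch"]
    for value, spec in zip(record, metadata):
        if spec["type"] == "int" and not value.lstrip("-").isdigit():
            problems.append(f"{spec['name']}: not an integer")
        if len(value) > spec["max_len"]:
            problems.append(f"{spec['name']}: too long")
    return problems

def audit(records, metadata):
    # Overall figures for the audit report: how many records were
    # checked in total, and how many of them are problematic.
    bad = [r for r in records if check_record(r, metadata)]
    return {"total": len(records), "problem": len(bad)}
```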
The functions of the data conversion module include: character-encoding conversion, data formatting, field addition, and conversion by expression. Character-encoding conversion supports converting GBK, UTF-16 and ANSI encodings to UTF-8. Data formatting converts data to the target format according to the data format conversion requirements defined in the metadata; for example, a date in MMDDYYYY format is converted to YYYY-MM-DD. Field addition generates new fields according to the new-field requirements defined in the metadata; the value of a new field is either a fixed value or is calculated from combinations of other fields, for example a newly added age field whose value is calculated from the birth-date field. Conversion by expression means that the data conversion function calls the expression engine, which parses the expressions defined in the metadata and performs the data conversion with the parsed expressions. The expressions support TRIM, SUBSTRING, CONCAT, REPLACE, IF-ELSE-FI decision logic, basic mathematical operations, regular-expression handling functions and so on.
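Two of these conversions, the MMDDYYYY to YYYY-MM-DD date reformat and a derived age field computed from a birth date, can be sketched as follows (the field names are illustrative, not taken from the patent):

```python
from datetime import date, datetime

def reformat_date(value):
    # MMDDYYYY -> YYYY-MM-DD, per a metadata-defined format rule.
    return datetime.strptime(value, "%m%d%Y").strftime("%Y-%m-%d")

def add_age_field(record, today):
    # Derived field: compute an age field from the birth-date field,
    # subtracting one if this year's birthday has not yet occurred.
    born = datetime.strptime(record["birth_date"], "%Y-%m-%d").date()
    age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    return {**record, "age": age}
```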
Wherein, the data integration module comprises a second distributed dataset generation module, a full-volume data integration module and a historical data integration module. The second distributed dataset generation module reads the current day's staging data and the previous day's integrated data respectively, generating corresponding Spark RDDs on distributed nodes. The full-volume data integration module reads the Spark RDDs generated by the second distributed dataset generation module and, according to the insert, delete and update flags in the staging data, inserts, deletes or updates the records with matching key values in the previous day's integrated-data Spark RDD; deletions are logical only, setting the delete flag to '1'; when processing is complete, the resulting staging-data Spark RDD is stored in the integrated data repository. The historical data integration module reads the current day's staging-data Spark RDD and, according to the insert, delete and update flags in the staging data, applies the corresponding handling to the records with matching key values in the previous day's integrated-data Spark RDD; when processing is complete, the result is stored in the integrated data repository. For a delete flag, the record's delete flag is set to '1' and its expiration date is set to the day before the source data date. For an insert flag, a new record is added with its effective date set to the source data date and its expiration date set to '9999-01-01'. For an update flag, the most recent record's expiration date is set to the day before the source data date, and a new record is added with its effective date set to the source data date and its expiration date set to '9999-01-01'.
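This effective-date/expiration-date handling amounts to slowly-changing-dimension (Type 2 style) maintenance. A plain-Python sketch, with staging rows carrying an I/U/D flag; the row layout and field names are assumptions for illustration:

```python
from datetime import date, timedelta

FAR_FUTURE = "9999-01-01"

def integrate_history(history, staged, load_date):
    # history rows: dicts with key, value, effective, expiry, deleted.
    # staged rows: (flag, key, value) with flag in {"I", "U", "D"}.
    prev_day = (load_date - timedelta(days=1)).isoformat()
    out = [dict(row) for row in history]
    for flag, key, value in staged:
        if flag in ("U", "D"):
            for row in out:
                if row["key"] == key and row["expiry"] == FAR_FUTURE:
                    row["expiry"] = prev_day        # close the current version
                    if flag == "D":
                        row["deleted"] = "1"        # logical delete only
        if flag in ("I", "U"):
            out.append({"key": key, "value": value,
                        "effective": load_date.isoformat(),
                        "expiry": FAR_FUTURE, "deleted": "0"})
    return out
```

An update thus closes the current version with yesterday's date and opens a new open-ended version, so the full change history of each key is retained.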
Wherein, the data output module comprises a third distributed dataset generation module and a target data output module. The third distributed dataset generation module reads the current day's integrated data and generates corresponding Spark RDDs on distributed nodes. The target data output module reads the Spark RDDs generated by the third distributed dataset generation module and performs format conversion and output on the current day's integrated data according to the data format requirements of the data application system; the supported output formats include structured data files (optionally compressed), relational databases, Hive and the like.
Wherein, the metadata management module comprises a metadata definition module, a metadata check module and a metadata export module. The metadata definition module defines and maintains metadata directly in Excel, which is easy to use and clear at a glance; the metadata includes: data source information, source data structure, target data format, target data structure, data conversion rules and expressions. The metadata check module checks the metadata against a series of metadata specifications and outputs a metadata audit report. The metadata export module exports the metadata in Excel as a metadata control file, which uses an XML file to store the metadata, making metadata management simple and universal.
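Exporting Excel-maintained field definitions into an XML control file could look like the following sketch using Python's standard ElementTree; the element and attribute names are illustrative, since the patent does not specify the XML schema:

```python
import xml.etree.ElementTree as ET

def export_control_file(fields):
    # Serialize a list of field-definition dicts into an XML
    # metadata control file (element/attribute names are assumed).
    root = ET.Element("metadata")
    for field in fields:
        el = ET.SubElement(root, "field")
        for key, value in field.items():
            el.set(key, str(value))
    return ET.tostring(root, encoding="unicode")
```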
The present invention is an ETL product built on a big data platform and distributed in-memory parallel computing. It uses the distributed big data platform Hadoop as its support and storage platform, builds its data processing framework on the Spark core components, and uses Spark's advanced DAG execution engine and its powerful memory-based multi-round iterative computing to process source data in depth.
The present invention is programmed in Scala, a statically typed, object-oriented functional programming language running on the JVM, with features such as speed, a concise API, and easy integration with Hadoop, YARN and Spark. The Spark kernel is developed in Scala, so the present invention fits Spark perfectly and reaches the Spark kernel directly, improving programming efficiency and big data processing performance while guaranteeing the high fault tolerance and high scalability of the system.
The present invention defines and manages the various elements of ETL with metadata; uses the Scala programming language for a perfect fit with Spark; uses Spark's distributed in-memory parallel computing, caching intermediate results in memory and reducing disk I/O; and runs processing jobs concurrently with multiple threads, improving ETL performance and resource utilization. The present invention therefore runs several times faster than conventional single-machine ETL products and Hadoop MapReduce.
Invention additionally discloses a kind of method of the ETL system based on Spark technologies, it comprises the following steps:(S01)Extract
Source data, and multiple Spark RDD are dynamically generated on distribution node according to deblocking rule, then started by thread pool many
Individual thread carries out parallel processing calling data processing module to each Spark RDD;(S02)Read data extraction module generation
Spark RDD, changed by meta data match inspection and volume of data, the data after being processed, and be stored in transfer
In data repository;(S03)The integral data of interim data and upper one day to the same day carries out full dose Data Integration or history number
According to integration, data after same day integration are obtained, and be stored in integral data thesaurus;(S04)According to Data application system logarithm
According to the requirement of form, data enter row format and change and export after being integrated to the same day;(S05)The various key elements of system are parameterized
Definition and management.
Wherein, step(S01)Comprise the following steps:(S11)Directly read by way of offline extraction and in line extraction
The data file of compressed format, and be sent to the first distributed data collection generation module and carry out subsequent treatment;(S12)Reading source number
According to, and multiple Spark RDD are dynamically generated on distribution node according to deblocking rule, pending data processing module is further located
Reason.
Wherein, step(S02)Comprise the following steps:(S21)By meta data match data are checked and are generated with number
According to audit report;(S22)Data are cleared up and changed using the rule or expression formula of metadata definition.
Step (S03) comprises the following steps: (S31) reading the current day's staging data and the previous day's integrated data, respectively, and generating the corresponding Spark RDDs on distributed nodes; (S32) reading the Spark RDDs generated in step (S31) and, according to the insert, delete, and update flags in the staging data, performing the corresponding insert, delete, and update operations on the records with matching key values in the previous day's integrated data Spark RDD; after processing completes, the current day's staging data Spark RDD is formed and stored in the integrated data repository; (S33) reading the current day's staging data Spark RDD formed in step (S32) and, according to the insert, delete, and update flags in the staging data, applying the corresponding processing to the records with matching key values in the previous day's integrated data Spark RDD; after processing completes, the result is stored in the integrated data repository.
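The key-value merge described in steps (S31)-(S33) — applying the staging data's insert/delete/update flags to the previous day's integrated data — can be sketched with ordinary dictionaries standing in for the Spark RDDs. The flag letters I/D/U are an assumption for the example; the patent only states that staging records carry insert, delete, and update marks.

```python
def integrate(previous, staging):
    """Full-volume integration sketch.

    previous: dict mapping key -> value (the previous day's integrated data).
    staging:  list of (flag, key, value) records, where flag is
              'I' (insert), 'U' (update) or 'D' (delete).
    Returns the current day's integrated data without mutating `previous`.
    """
    current = dict(previous)  # start from the previous day's integrated data
    for flag, key, value in staging:
        if flag == "I":
            current[key] = value           # insert a new record
        elif flag == "U" and key in current:
            current[key] = value           # update the record with a matching key
        elif flag == "D":
            current.pop(key, None)         # delete the record with a matching key
    return current
```

In Spark this merge would typically be expressed as a keyed join between the two RDDs; the dictionary version shows only the per-key decision logic.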
Step (S04) comprises the following steps: (S41) reading the current day's integrated data and generating the corresponding Spark RDDs on distributed nodes; (S42) reading the Spark RDDs generated in step (S41), converting the format of the current day's integrated data, and exporting it.
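Steps (S41)-(S42) amount to reformatting the integrated records into whatever layout the downstream application expects. A sketch targeting a delimited text export follows; the pipe delimiter and the fixed column order are illustrative assumptions, since the patent leaves the target format to the metadata.

```python
import csv
import io


def export_integrated(records, columns, delimiter="|"):
    """Convert integrated records (dicts) into a delimited text export."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)                        # header row in the target format
    for rec in records:
        writer.writerow(rec[c] for c in columns)    # fixed column order per metadata
    return buf.getvalue()
```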
Step (S05) comprises the following steps: (S51) defining and maintaining metadata directly in Excel, wherein the metadata includes: data source information, source data structures, target data formats, target data structures, and data conversion rules and expressions; (S52) checking the metadata against a set of metadata specifications and exporting a metadata check report; (S53) exporting the metadata in Excel as a metadata control file.
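The export of step (S53) — turning the Excel metadata definition into a machine-readable control file — might look like the following, with a plain dict standing in for the parsed spreadsheet and JSON as an assumed control-file format (the patent does not specify the format of the control file):

```python
import json


def export_control_file(metadata_sheet):
    """Serialize metadata (as parsed from the Excel definition) into a
    control file that the pipeline modules can read at run time."""
    required = {"data_source", "source_structure", "target_format",
                "target_structure", "conversion_rules"}
    missing = required - metadata_sheet.keys()
    if missing:  # step (S52): check against the metadata specification
        raise ValueError(f"metadata check failed, missing: {sorted(missing)}")
    return json.dumps(metadata_sheet, indent=2, sort_keys=True)
```

Because the downstream modules read only this control file, a change in the source data structure requires editing the spreadsheet and re-exporting, not reprogramming, which matches the maintenance claim of the invention.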
The present invention is an ETL product built on the Hadoop base data platform and the Spark distributed in-memory parallel computing framework. Its computation consists of multiple rounds of in-memory iterative calculation, so its running speed is several times faster than that of conventional single-machine ETL products and Hadoop MapReduce. As the cluster and its memory are smoothly scaled out, the ETL performance of the invention improves linearly. In addition, the invention embeds a metadata management component: once the metadata is configured, the system automatically runs the data extraction, data processing, data integration, and data output modules in a pipelined fashion, without manual intervention. When the source data structure changes, only the corresponding metadata needs to be modified, with no reprogramming, making the system easy to manage and maintain.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit it; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. An ETL system based on Spark technology, characterized in that it comprises a data extraction module, a data processing module, a data integration module, a data output module, a metadata management module, and a data storage module; the data storage module comprises a staging data repository, an integrated data repository, and a metadata control file;
the data extraction module is configured to extract source data, dynamically generate multiple Spark RDDs on distributed nodes according to a data-partitioning rule, and then start multiple threads through a thread pool to call the data processing module to process each Spark RDD in parallel;
the data processing module is configured to read the Spark RDDs generated by the data extraction module, perform metadata-matching checks and a series of data conversions to obtain the processed data, and store it in the staging data repository;
the data integration module is configured to perform full-volume data integration or historical data integration on the current day's staging data and the previous day's integrated data to obtain the current day's integrated data, and store it in the integrated data repository;
the data output module is configured to convert the format of the current day's integrated data and export it according to the data format requirements of the data application system;
the metadata management module is configured to parametrically define and manage the various elements of the system.
2. The ETL system based on Spark technology as claimed in claim 1, characterized in that the data extraction module comprises a data access module and a first distributed dataset generation module;
the data access module directly reads data files in compressed format by means of offline extraction and online extraction, and sends them to the first distributed dataset generation module for subsequent processing;
the first distributed dataset generation module reads the source data through the data access module and dynamically generates multiple Spark RDDs on distributed nodes according to the data-partitioning rule, for further processing by the data processing module.
3. The ETL system based on Spark technology as claimed in claim 2, characterized in that the data processing module comprises a data check module and a data conversion module;
the data check module checks the data through metadata matching and generates a data check report;
the data conversion module is configured to cleanse and convert the data.
4. The ETL system based on Spark technology as claimed in claim 3, characterized in that the data integration module comprises a second distributed dataset generation module, a full-volume data integration module, and a historical data integration module;
the second distributed dataset generation module is configured to read the current day's staging data and the previous day's integrated data, respectively, and generate the corresponding Spark RDDs on distributed nodes;
the full-volume data integration module is configured to read the Spark RDDs generated by the second distributed dataset generation module and, according to the insert, delete, and update flags in the staging data, perform the corresponding insert, delete, and update operations on the records with matching key values in the previous day's integrated data Spark RDD; after processing completes, a staging data Spark RDD is formed and stored in the integrated data repository;
the historical data integration module is configured to read the current day's staging data Spark RDD and, according to the insert, delete, and update flags in the staging data, apply the corresponding processing to the records with matching key values in the previous day's integrated data Spark RDD; after processing completes, the result is stored in the integrated data repository.
5. The ETL system based on Spark technology as claimed in claim 4, characterized in that the data output module comprises a third distributed dataset generation module and a target data output module;
the third distributed dataset generation module is configured to read the current day's integrated data and generate the corresponding Spark RDDs on distributed nodes;
the target data output module is configured to read the Spark RDDs generated by the third distributed dataset generation module, convert the format of the current day's integrated data, and export it.
6. The ETL system based on Spark technology as claimed in claim 5, characterized in that the metadata management module comprises a metadata definition module, a metadata check module, and a metadata export module;
the metadata definition module is configured to define and maintain metadata directly in Excel, wherein the metadata includes: data source information, source data structures, target data formats, target data structures, and data conversion rules and expressions;
the metadata check module is configured to check the metadata against a set of metadata specifications and export a metadata check report;
the metadata export module is configured to export the metadata in Excel as a metadata control file.
7. A method for the ETL system based on Spark technology as claimed in any one of claims 1-6, characterized in that it comprises the following steps:
(S01) extracting source data, dynamically generating multiple Spark RDDs on distributed nodes according to the data-partitioning rule, and then starting multiple threads through a thread pool to call the data processing module to process each Spark RDD in parallel;
(S02) reading the Spark RDDs generated by the data extraction module, performing metadata-matching checks and a series of data conversions to obtain the processed data, and storing it in the staging data repository;
(S03) performing full-volume data integration or historical data integration on the current day's staging data and the previous day's integrated data to obtain the current day's integrated data, and storing it in the integrated data repository;
(S04) converting the format of the current day's integrated data and exporting it according to the data format requirements of the data application system;
(S05) parametrically defining and managing the various elements of the system.
8. The method for the ETL system based on Spark technology as claimed in claim 7, characterized in that step (S01) comprises the following steps:
(S11) directly reading data files in compressed format by means of offline extraction and online extraction, and sending them to the first distributed dataset generation module for subsequent processing;
(S12) reading the source data and dynamically generating multiple Spark RDDs on distributed nodes according to the data-partitioning rule, for further processing by the data processing module;
and step (S02) comprises the following steps:
(S21) checking the data through metadata matching and generating a data check report;
(S22) cleansing and converting the data using the rules or expressions defined in the metadata.
9. The method for the ETL system based on Spark technology as claimed in claim 8, characterized in that step (S03) comprises the following steps:
(S31) reading the current day's staging data and the previous day's integrated data, respectively, and generating the corresponding Spark RDDs on distributed nodes;
(S32) reading the Spark RDDs generated in step (S31) and, according to the insert, delete, and update flags in the staging data, performing the corresponding insert, delete, and update operations on the records with matching key values in the previous day's integrated data Spark RDD; after processing completes, forming the current day's staging data Spark RDD and storing it in the integrated data repository;
(S33) reading the current day's staging data Spark RDD formed in step (S32) and, according to the insert, delete, and update flags in the staging data, applying the corresponding processing to the records with matching key values in the previous day's integrated data Spark RDD; after processing completes, storing the result in the integrated data repository;
and step (S04) comprises the following steps:
(S41) reading the current day's integrated data and generating the corresponding Spark RDDs on distributed nodes;
(S42) reading the Spark RDDs generated in step (S41), converting the format of the current day's integrated data, and exporting it.
10. The method for the ETL system based on Spark technology as claimed in claim 9, characterized in that step (S05) comprises the following steps:
(S51) defining and maintaining metadata directly in Excel, wherein the metadata includes: data source information, source data structures, target data formats, target data structures, and data conversion rules and expressions;
(S52) checking the metadata against a set of metadata specifications and exporting a metadata check report;
(S53) exporting the metadata in Excel as a metadata control file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710088150.6A CN106897411A (en) | 2017-02-20 | 2017-02-20 | ETL system and its method based on Spark technologies |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106897411A true CN106897411A (en) | 2017-06-27 |
Family
ID=59184001
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106897411A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103092980A (en) * | 2013-01-31 | 2013-05-08 | 中国科学院自动化研究所 | Method and system of data automatic conversion and storage |
CN105243155A (en) * | 2015-10-29 | 2016-01-13 | 贵州电网有限责任公司电力调度控制中心 | Big data extracting and exchanging system |
CN105468770A (en) * | 2015-12-09 | 2016-04-06 | 合一网络技术(北京)有限公司 | Data processing method and system |
CN106202569A (en) * | 2016-08-09 | 2016-12-07 | 北京北信源软件股份有限公司 | A kind of cleaning method based on big data quantity |
CN106326457A (en) * | 2016-08-29 | 2017-01-11 | 山大地纬软件股份有限公司 | Construction method and system of human society person portfolio database on the basis of big data |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107315726A (en) * | 2017-07-12 | 2017-11-03 | 广东奡风科技股份有限公司 | A kind of method that big data ETL overall processes based on Excel are defined |
CN107609008A (en) * | 2017-07-26 | 2018-01-19 | 郑州云海信息技术有限公司 | A kind of data importing device and method from relevant database to Kafka based on Apache Sqoop |
CN107707903A (en) * | 2017-08-22 | 2018-02-16 | 贵阳朗玛信息技术股份有限公司 | The determination method and device of user video communication quality |
CN107871013A (en) * | 2017-11-23 | 2018-04-03 | 安徽科创智慧知识产权服务有限公司 | A kind of mass data efficient decimation method |
CN108052574A (en) * | 2017-12-08 | 2018-05-18 | 南京中新赛克科技有限责任公司 | Slave ftp server based on Kafka technologies imports the ETL system and implementation method of mass data |
CN107908797A (en) * | 2017-12-18 | 2018-04-13 | 上海中畅数据技术有限公司 | A kind of ETL data stream treatment technology method and systems in real time |
CN108304538A (en) * | 2018-01-30 | 2018-07-20 | 广东奡风科技股份有限公司 | A kind of ETL system and its method based entirely on distributed memory calculating |
CN108763948B (en) * | 2018-03-16 | 2020-07-24 | 北京明朝万达科技股份有限公司 | Automatic document approval method and system for data leakage prevention system |
CN108763948A (en) * | 2018-03-16 | 2018-11-06 | 北京明朝万达科技股份有限公司 | A kind of automatic measures and procedures for the examination and approval of file and system of data-oriented anti-disclosure system |
CN109254989B (en) * | 2018-08-27 | 2020-11-20 | 望海康信(北京)科技股份公司 | Elastic ETL (extract transform load) architecture design method and device based on metadata drive |
CN109254989A (en) * | 2018-08-27 | 2019-01-22 | 北京东软望海科技有限公司 | A kind of method and device of the elastic ETL architecture design based on metadata driven |
CN109408586A (en) * | 2018-09-03 | 2019-03-01 | 中新网络信息安全股份有限公司 | A kind of polynary isomeric data fusion method of distribution |
CN111211993B (en) * | 2018-11-21 | 2023-08-11 | 百度在线网络技术(北京)有限公司 | Incremental persistence method, device and storage medium for stream computation |
CN111211993A (en) * | 2018-11-21 | 2020-05-29 | 百度在线网络技术(北京)有限公司 | Incremental persistence method and device for streaming computation |
CN109800092A (en) * | 2018-12-17 | 2019-05-24 | 华为技术有限公司 | A kind of processing method of shared data, device and server |
US11445004B2 (en) | 2018-12-17 | 2022-09-13 | Petal Cloud Technology Co., Ltd. | Method for processing shared data, apparatus, and server |
CN109814991A (en) * | 2018-12-25 | 2019-05-28 | 北京明略软件系统有限公司 | A kind of data administer in task management method and device |
CN109857832A (en) * | 2019-01-03 | 2019-06-07 | 中国银行股份有限公司 | A kind of preprocess method and device of payment data |
CN111914009A (en) * | 2020-07-07 | 2020-11-10 | 傲普(上海)新能源有限公司 | Pyspark-based energy storage data calculation and analysis method |
CN111914009B (en) * | 2020-07-07 | 2023-02-24 | 傲普(上海)新能源有限公司 | Pyspark-based energy storage data calculation and analysis method |
CN112115191A (en) * | 2020-09-22 | 2020-12-22 | 南京北斗创新应用科技研究院有限公司 | Branch optimization method executed by big data ETL model |
CN112115191B (en) * | 2020-09-22 | 2022-02-15 | 南京北斗创新应用科技研究院有限公司 | Branch optimization method executed by big data ETL model |
WO2022062751A1 (en) * | 2020-09-22 | 2022-03-31 | 南京北斗创新应用科技研究院有限公司 | Branch optimization method executed by big data etl model |
CN113064870A (en) * | 2021-03-22 | 2021-07-02 | 中国人民大学 | Big data processing method based on compressed data direct calculation |
CN113064870B (en) * | 2021-03-22 | 2021-11-30 | 中国人民大学 | Big data processing method based on compressed data direct calculation |
CN114490525B (en) * | 2022-02-22 | 2022-08-02 | 北京科杰科技有限公司 | System and method for analyzing and warehousing of ultra-large unstructured text files based on hadoop remote |
CN114490525A (en) * | 2022-02-22 | 2022-05-13 | 北京科杰科技有限公司 | System and method for analyzing and putting out and putting in storage of super-large unstructured text files remotely based on hadoop |
CN115357657A (en) * | 2022-10-24 | 2022-11-18 | 成都数联云算科技有限公司 | Data processing method and device, computer equipment and storage medium |
CN116860861A (en) * | 2023-09-05 | 2023-10-10 | 杭州瞬安信息科技有限公司 | ETL data management system |
CN116860861B (en) * | 2023-09-05 | 2023-12-15 | 杭州瞬安信息科技有限公司 | ETL data management system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106897411A (en) | ETL system and its method based on Spark technologies | |
US11544623B2 (en) | Consistent filtering of machine learning data | |
US11392586B2 (en) | Data protection method and device and storage medium | |
CN108304538A (en) | A kind of ETL system and its method based entirely on distributed memory calculating | |
CN111124679B (en) | Multi-source heterogeneous mass data-oriented time-limited automatic processing method | |
US11100420B2 (en) | Input processing for machine learning | |
CN106168965B (en) | Knowledge graph construction system | |
US10831747B2 (en) | Multi stage aggregation using digest order after a first stage of aggregation | |
US20150379426A1 (en) | Optimized decision tree based models | |
CN109614413B (en) | Memory flow type computing platform system | |
US11693912B2 (en) | Adapting database queries for data virtualization over combined database stores | |
CN106095878A (en) | The database manipulation device and method of table is divided based on point storehouse | |
US8688685B2 (en) | Accelerated searching of substrings | |
CN106528898A (en) | Method and device for converting data of non-relational database into relational database | |
CN105138676B (en) | Table merge querying methods are divided in point storehouse for concurrently polymerizeing calculating based on high-level language | |
CN114385760A (en) | Method and device for real-time synchronization of incremental data, computer equipment and storage medium | |
US11645281B1 (en) | Caching query plans in database systems | |
US11675515B2 (en) | Intelligent partitioning engine for cluster computing | |
CN106570151A (en) | Data collection processing method and system for mass files | |
CN116414801A (en) | Data migration method, device, computer equipment and storage medium | |
CN109829003A (en) | Database backup method and device | |
CN116760661A (en) | Data storage method, apparatus, computer device, storage medium, and program product | |
CN1897629A (en) | Mass toll-ticket fast cross rearrangement based on memory | |
CN110674173A (en) | Method and system for caching data on wind control cloud | |
US11847121B2 (en) | Compound predicate query statement transformation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170627 |