CN106897411A

CN106897411A - ETL system and method based on Spark technology

Info

Publication number: CN106897411A
Application number: CN201710088150.6A
Authority: CN
Inventors: 陈涛; 黄卓凡; 张志聪; 李笋; 林志广
Original assignee: Guangdong Strong Wind Polytron Technologies Inc
Current assignee: Guangdong Strong Wind Polytron Technologies Inc
Priority date: 2017-02-20
Filing date: 2017-02-20
Publication date: 2017-06-27

Abstract

The present invention discloses an ETL system based on Spark technology, comprising a data extraction module, a data processing module, a data integration module, a data output module, a metadata management module and a data storage module. The data storage module includes an interim data repository, an integrated data repository and a metadata control file. The data extraction module extracts source data, dynamically generates multiple Spark RDDs on distributed nodes, and has them processed in parallel. The data processing module reads the Spark RDDs generated by the data extraction module and, after a metadata matching check and data conversion, stores the result in the interim data repository. The data integration module integrates the current day's interim data with the previous day's integrated data and stores the result in the integrated data repository. The data output module converts the format of the current day's integrated data and exports it. Being based on Spark, the invention scales out smoothly and linearly, runs fast, requires no manual intervention, and is easy to manage and maintain.

Description

ETL system and method based on Spark technology

Technical field

The present invention relates to an ETL system and method, and more particularly to an ETL system and method based on Spark technology.

Background art

With the development of big data, enterprises pay increasing attention to developing and applying data in order to gain more market opportunities. Big data applications depend on cleaning and processing massive amounts of data. Enterprises generally either use a mainstream ETL (extract, transform, load) product or write database stored procedures directly to process the data.

At present, most mainstream ETL products are based on a single-machine architecture. When processing massive data, I/O throughput and system resources become bottlenecks, and scaling is difficult and expensive. In addition, ETL products emphasize an easy-to-use operating interface in which each data processing flow is designed graphically, but metadata management is separate from the ETL product, so the related designs must be changed manually whenever a data structure changes. Coding database stored procedures has the same problems, and is also inefficient to develop and hard to maintain.

With the emergence of Hadoop, the linear scalability and low cost of a distributed platform address the shortcomings of building a data platform on a single machine. Processing data with Hive on the Hadoop platform has gained wide acceptance, but Hive is based on the MapReduce distributed computing framework, which has an inherent weakness: every MapReduce step must read from and write to disk, so iterative computation has high latency.

In summary, there is a need for an ETL system and method based on Spark technology that overcomes the above drawbacks.

Summary of the invention

The present invention proposes an ETL system and method based on Spark technology. It addresses the defects of the prior art in which, when massive data is processed, I/O throughput and system resources become bottlenecks and scaling is difficult and expensive. Being based on Spark, the invention scales out smoothly and linearly, runs fast, runs without manual intervention, and is easy to manage and maintain, so it can meet the ETL processing needs of various industries, large enterprises in particular.

The technical solution of the present invention is realized as follows:

The present invention discloses an ETL system based on Spark technology, comprising a data extraction module, a data processing module, a data integration module, a data output module, a metadata management module and a data storage module. The data storage module includes an interim data repository, an integrated data repository and a metadata control file. The data extraction module extracts source data, dynamically generates multiple Spark RDDs on distributed nodes according to a data blocking rule, and then starts multiple threads from a thread pool that call the data processing module to process each Spark RDD in parallel. The data processing module reads the Spark RDDs generated by the data extraction module, applies a metadata matching check and a series of data conversions to obtain processed data, and stores it in the interim data repository. The data integration module performs full data integration or historical data integration on the current day's interim data and the previous day's integrated data to obtain the current day's integrated data, and stores it in the integrated data repository. The data output module converts the format of the current day's integrated data according to the data format required by the data application system, and exports it. The metadata management module defines and manages the various elements of the system as parameters.

The data extraction module includes a data access module and a first distributed dataset generation module. The data access module directly reads data files in compressed format, through offline extraction or online extraction, and passes them to the first distributed dataset generation module for subsequent processing. The first distributed dataset generation module reads the source data through the data access module and dynamically generates multiple Spark RDDs on distributed nodes according to the data blocking rule, ready for further processing by the data processing module.

The data processing module includes a data review module and a data conversion module. The data review module checks the data by metadata matching and generates a data audit report. The data conversion module cleans and converts the data.

The data integration module includes a second distributed dataset generation module, a full data integration module and a historical data integration module. The second distributed dataset generation module reads the current day's interim data and the previous day's integrated data separately and generates the corresponding Spark RDDs on distributed nodes. The full data integration module reads the Spark RDDs generated by the second distributed dataset generation module and, according to the insert, delete and update flags in the interim data, performs the corresponding insert, delete or update operation on the records with the same key value in the previous day's integrated data Spark RDD; when processing is complete, the resulting Spark RDD is stored in the integrated data repository. The historical data integration module reads the current day's interim data Spark RDD and, according to the insert, delete and update flags in the interim data, applies the corresponding handling to the records with the same key value in the previous day's integrated data Spark RDD; when processing is complete, the result is stored in the integrated data repository.
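As an illustration only, the full data integration described above might be sketched in plain Python as follows. Dictionaries stand in for the rows of a Spark RDD. The field names `key`, `data`, `flag` and `deleted`, the flag codes `I`, `U` and `D`, and the function name `integrate_full` are assumptions made for this sketch and are not taken from the invention. Deletion is shown as a logical delete that sets the deletion flag to '1', as the detailed description specifies.

```python
def integrate_full(previous, staging):
    """Apply insert/update/delete flags to the previous day's integrated rows.

    previous: list of dicts with 'key', 'data', 'deleted' (previous day's data)
    staging:  list of dicts with 'key', 'data', 'flag' (current day's interim data)
    """
    # Index the previous day's rows by key; copy so the input is not mutated.
    merged = {row["key"]: dict(row) for row in previous}
    for change in staging:
        key, flag = change["key"], change["flag"]
        if flag == "D":
            # Logical delete only: keep the row, mark it as deleted.
            if key in merged:
                merged[key]["deleted"] = "1"
        elif flag in ("I", "U"):
            # Insert a new row, or overwrite the row with the same key.
            merged[key] = {"key": key, "data": change["data"], "deleted": "0"}
    return list(merged.values())
```

In a Spark job the same effect would typically be obtained by joining the two datasets on the key, but that variant is not shown here.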

The data output module includes a third distributed dataset generation module and a target data output module. The third distributed dataset generation module reads the current day's integrated data and generates the corresponding Spark RDD on distributed nodes. The target data output module reads the Spark RDD generated by the third distributed dataset generation module, converts the format of the current day's integrated data, and exports it.

The metadata management module includes a metadata definition module, a metadata check module and a metadata export module. The metadata definition module defines and maintains metadata directly in Excel, where the metadata includes data source information, source data structure, target data format, target data structure, data conversion rules and expressions. The metadata check module checks the metadata against a series of metadata specifications and outputs a metadata audit report. The metadata export module exports the metadata in Excel as a metadata control file.

The invention also discloses a method for the ETL system based on Spark technology, comprising the following steps: (S01) extracting source data, dynamically generating multiple Spark RDDs on distributed nodes according to a data blocking rule, and then starting multiple threads from a thread pool that call the data processing module to process each Spark RDD in parallel; (S02) reading the Spark RDDs generated by the data extraction module, applying a metadata matching check and a series of data conversions to obtain processed data, and storing it in the interim data repository; (S03) performing full data integration or historical data integration on the current day's interim data and the previous day's integrated data to obtain the current day's integrated data, and storing it in the integrated data repository; (S04) converting the format of the current day's integrated data according to the data format required by the data application system, and exporting it; (S05) defining and managing the various elements of the system as parameters.

Step (S01) comprises the following steps: (S11) directly reading data files in compressed format, through offline extraction or online extraction, and passing them to the first distributed dataset generation module for subsequent processing; (S12) reading the source data and dynamically generating multiple Spark RDDs on distributed nodes according to the data blocking rule, ready for further processing by the data processing module.

Step (S02) comprises the following steps: (S21) checking the data by metadata matching and generating a data audit report; (S22) cleaning and converting the data using the rules or expressions defined in the metadata.

Step (S03) comprises the following steps: (S31) reading the current day's interim data and the previous day's integrated data separately and generating the corresponding Spark RDDs on distributed nodes; (S32) reading the Spark RDDs generated in step (S31) and, according to the insert, delete and update flags in the interim data, performing the corresponding insert, delete or update operation on the records with the same key value in the previous day's integrated data Spark RDD, and, when processing is complete, storing the resulting Spark RDD in the integrated data repository; (S33) reading the current day's interim data Spark RDD of step (S32) and, according to the insert, delete and update flags in the interim data, applying the corresponding handling to the records with the same key value in the previous day's integrated data Spark RDD, and, when processing is complete, storing the result in the integrated data repository.

Step (S04) comprises the following steps: (S41) reading the current day's integrated data and generating the corresponding Spark RDD on distributed nodes; (S42) reading the Spark RDD generated in step (S41), converting the format of the current day's integrated data, and exporting it.

Step (S05) comprises the following steps: (S51) defining and maintaining metadata directly in Excel, where the metadata includes data source information, source data structure, target data format, target data structure, data conversion rules and expressions; (S52) checking the metadata against a series of metadata specifications and outputting a metadata audit report; (S53) exporting the metadata in Excel as a metadata control file.

Compared with the prior art, the invention has the following advantages:

1. Metadata management that is easy to use and maintain. The invention manages and configures metadata in easy-to-use Excel sheets; metadata changes are maintained directly in the Excel sheets, where they are clear at a glance.

2. No programming, ready to use out of the box, runs automatically. The invention deploys rapidly and provides a mature, complete ETL toolbox that covers common banking data ETL requirements. Once the metadata is set up, the system automatically runs the data extraction, data processing, data integration and data output modules as a pipeline, without manual intervention. When source data changes, only the corresponding metadata needs to be modified; no programming is required.

3. In-memory computation, multiplied performance, linear scaling. The invention is written in the Scala programming language, which integrates closely with Spark. It uses Spark's distributed in-memory parallel computing, caching intermediate results in memory to reduce disk I/O, and runs processing jobs concurrently on multiple threads to improve ETL performance and resource utilization. Compared with single-machine ETL tools and Hadoop MapReduce, the invention runs several times faster.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a structural block diagram of the Spark-based ETL system of the present invention.

Fig. 2 is a structural block diagram of the data extraction module of the present invention.

Fig. 3 is a structural block diagram of the data processing module of the present invention.

Fig. 4 is a structural block diagram of the data integration module of the present invention.

Fig. 5 is a structural block diagram of the data output module of the present invention.

Fig. 6 is a structural block diagram of the metadata management module of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

To aid and clarify the description of the following embodiments, some terms are explained before the specific embodiments are described in detail. The following explanations apply to this specification and the claims.

ETL, as used in the present invention, is the abbreviation of Extract-Transform-Load and describes the process of moving data from a source, through extraction (extract), transformation (transform) and loading (load), to a destination. The term ETL is most often used in connection with data warehouses, but its scope is not limited to them. ETL is an important part of building a data warehouse: the user extracts the required data from the data sources, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model.

RDD stands for Resilient Distributed Dataset: a read-only, partitionable distributed dataset, all or part of which can be cached in memory and reused across multiple computations. In Chinese, a Spark RDD is called 弹性分布式数据集 (resilient distributed dataset).

Other English terms appearing in the present invention are code identifiers and have no other meaning.

Referring to Fig. 1 to Fig. 6, the present invention discloses an ETL system based on Spark technology, comprising a data extraction module, a data processing module, a data integration module, a data output module, a metadata management module and a data storage module. The data storage module includes an interim data repository, an integrated data repository and a metadata control file. In the present invention the interim data repository and the integrated data repository use columnar storage (such as Parquet), which greatly reduces storage space and improves query performance. The metadata control file stores the metadata as an XML file, which makes metadata management simple and general.

The data extraction module supports a variety of heterogeneous data sources, including relational databases, structured data files (optionally compressed) and big data sources (such as HDFS, Hive, JSON and Parquet). It extracts source data, dynamically generates multiple Spark RDDs on distributed nodes according to the data blocking rule, and then starts multiple threads from a thread pool that call the data processing module to process each Spark RDD in parallel. The data processing module reads the Spark RDDs generated by the data extraction module, applies a metadata matching check and a series of data conversions to obtain processed data, and stores it in the interim data repository. The data integration module performs full data integration or historical data integration on the current day's interim data and the previous day's integrated data to obtain the current day's integrated data, and stores it in the integrated data repository. The data output module converts the format of the current day's integrated data according to the data format required by the data application system, and exports it; supported output formats include structured data files (optionally compressed), relational databases and Hive.
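The thread-pool dispatch described above can be pictured with a minimal Python sketch. Plain lists stand in for Spark RDDs here, and `process_block`, `process_all` and the `max_workers` value are hypothetical names and settings chosen for illustration; the invention itself is stated to be written in Scala against Spark.

```python
from concurrent.futures import ThreadPoolExecutor


def process_block(block):
    """Stand-in for the data processing module: clean one block of records."""
    return [value.strip().upper() for value in block]


def process_all(blocks, max_workers=4):
    """Start worker threads from a pool and process every block in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so outputs line up with blocks.
        return list(pool.map(process_block, blocks))
```

The point of the sketch is only the control flow: one task per generated block, all submitted to a shared pool, with results collected once every task has finished.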

The metadata management module defines and manages the various elements of the system as parameters. These elements include data source information, source data structure, target data format, target data structure, data conversion rules and expressions. Once the metadata is set up, the system automatically runs the data extraction, data processing, data integration and data output modules as a pipeline, without manual intervention. When source data changes, only the corresponding metadata needs to be modified, without reprogramming.

The data access module is the interface layer: a unified data channel through which the invention connects to data sources and extracts source data with high concurrency and high reliability. The data access module supports a variety of heterogeneous data sources, including relational databases, structured data files (optionally compressed) and big data sources (such as HDFS, Hive, JSON and Parquet). It supports two extraction modes, offline and online. In offline extraction, the source system generates a data file, compresses it, and sends it to this system with file transfer software (such as FTP); the file is then loaded into this system through data access. In online extraction, source data is extracted into this system online in real time through data access. The data access module also supports reading data files in compressed format directly and processing them further, with the data cached in memory throughout and never written to disk. Compared with the traditional approach of first decompressing the file and then reading the data file, this is markedly faster, and it also reduces network transfer of data files.
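Reading a compressed file directly, without first writing a decompressed copy to disk, might look like the following Python sketch. It assumes gzip compression and a `|` field delimiter, neither of which is specified by the invention, and `read_compressed_records` is a hypothetical name.

```python
import gzip
import io


def read_compressed_records(compressed, encoding="utf-8", delimiter="|"):
    """Yield records from gzip-compressed bytes, decompressing in memory only."""
    # gzip.open accepts a file object, so nothing is written to disk.
    with gzip.open(io.BytesIO(compressed), mode="rt", encoding=encoding) as text:
        for line in text:
            line = line.rstrip("\r\n")
            if line:
                yield line.split(delimiter)
```

Because the function is a generator, records are decompressed and handed on one at a time rather than materialized as a full decompressed file.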

The basic principle of the data blocking rule is to compute the number of Spark RDDs dynamically from the total data size and the data record size, and to load the data into each Spark RDD according to the rule. The data blocking rule addresses a shortcoming of having Spark generate Spark RDDs based on the default HDFS block size parameter: it ensures that each data record is assigned whole, without being split, to a single Spark RDD, which reduces computation across Spark RDDs and improves data processing performance.
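The specification does not give the exact formula of the data blocking rule. One possible reading, sketched below in Python, derives the number of blocks from the total size and the record size so that every block holds a whole number of records. The target block size of 128 MiB and the names `plan_blocks` and `split_into_blocks` are assumptions made for this sketch.

```python
import math
from typing import List, Sequence, Tuple


def plan_blocks(total_bytes: int, record_bytes: int,
                target_block_bytes: int = 128 * 1024 * 1024) -> Tuple[int, int]:
    """Return (number of blocks, whole records per block)."""
    if record_bytes <= 0 or target_block_bytes <= 0:
        raise ValueError("record_bytes and target_block_bytes must be positive")
    total_records = math.ceil(max(total_bytes, 0) / record_bytes)
    # Only whole records fit in a block, and at least one, so no record is split.
    records_per_block = max(1, target_block_bytes // record_bytes)
    num_blocks = max(1, math.ceil(total_records / records_per_block))
    return num_blocks, records_per_block


def split_into_blocks(records: Sequence, records_per_block: int) -> List[list]:
    """Assign complete records to consecutive blocks."""
    return [list(records[i:i + records_per_block])
            for i in range(0, len(records), records_per_block)]
```

For example, 1000 bytes of 100-byte records with a 250-byte target gives two records per block and five blocks; the 50 bytes left over in each block are deliberately unused rather than filled with half a record.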

The data processing module includes a data review module and a data conversion module. The data review module checks the data by metadata matching and generates a data audit report. The data conversion module cleans and converts the data. The data audit report is provided to data administrators so that they can understand the quality of the source data, and can also be provided to the administrators of the source system for reference.

When the data review module performs the metadata matching check, it checks the data it reads against the predefined metadata, covering the number of data fields, field data types, field lengths, field data formats, and whether the data satisfies business check expressions. After the check it outputs a data audit report that records the problematic data and the time, together with the overall result of the check, such as the total number of records and how many of them are problematic.
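A minimal Python sketch of such a metadata matching check is given below. The metadata layout in `FIELDS`, the sample field names, and the shape of the report dictionary are all hypothetical; the sketch covers field count, type, length and date format, and leaves out business check expressions.

```python
import re
from datetime import datetime

# Hypothetical metadata: one entry per expected field, in order.
FIELDS = [
    {"name": "cust_id", "type": "int", "max_len": 10},
    {"name": "name", "type": "str", "max_len": 20},
    {"name": "birth_date", "type": "date", "format": "%m%d%Y", "max_len": 8},
]


def check_record(values, fields=FIELDS):
    """Return a list of problem descriptions for one record (empty if clean)."""
    if len(values) != len(fields):
        return ["expected %d fields, got %d" % (len(fields), len(values))]
    problems = []
    for value, field in zip(values, fields):
        if len(value) > field["max_len"]:
            problems.append("%s: longer than %d" % (field["name"], field["max_len"]))
        if field["type"] == "int" and not re.fullmatch(r"-?\d+", value):
            problems.append("%s: not an integer" % field["name"])
        if field["type"] == "date":
            try:
                datetime.strptime(value, field["format"])
            except ValueError:
                problems.append("%s: bad date format" % field["name"])
    return problems


def audit(records, fields=FIELDS):
    """Check every record and build a simple audit report."""
    bad = []
    for line_no, values in enumerate(records, start=1):
        problems = check_record(values, fields)
        if problems:
            bad.append({"line": line_no, "record": values, "problems": problems})
    return {
        "checked_at": datetime.now().isoformat(timespec="seconds"),
        "total": len(records),
        "problem_count": len(bad),
        "problems": bad,
    }
```

The report keeps the offending records, the check time and the overall counts, which mirrors the contents the description attributes to the data audit report.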

The functions of the data conversion module include encoding conversion, data formatting, adding fields, and conversion by expression. Encoding conversion supports converting GBK, UTF-16 and ANSI encodings to UTF-8. Data formatting converts data to the target format according to the format conversion requirements defined in the metadata, for example converting a date from MMDDYYYY to YYYY-MM-DD. Adding fields generates new fields according to the new-field requirements defined in the metadata; the value of a new field is either a fixed value or computed from a combination of other fields, for example a new age field whose value is computed from the date-of-birth field. In conversion by expression, the data conversion function calls the expression engine, which parses the expressions defined in the metadata and converts the data using the parsed expressions. Expressions support TRIM, SUBSTRING, CONCAT, REPLACE, IF-ELSE-FI conditional logic, basic mathematical operations, regular expression functions, and so on.
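Three of the conversions named above (encoding conversion to UTF-8, the MMDDYYYY to YYYY-MM-DD date reformat, and a derived age field) can be sketched in Python as follows. The function names and the record layout are assumptions made for this sketch, and the expression engine is not modelled; a plain `strip()` stands in for TRIM.

```python
from datetime import date, datetime


def to_utf8(raw: bytes, source_encoding: str = "gbk") -> bytes:
    """Re-encode bytes from the source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def reformat_date(value: str, src: str = "%m%d%Y", dst: str = "%Y-%m-%d") -> str:
    """Convert a date string from the source format to the target format."""
    return datetime.strptime(value, src).strftime(dst)


def age_on(birth: date, today: date) -> int:
    """Age in whole years; one less if the birthday has not yet occurred."""
    return today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))


def convert(record: dict, today: date) -> dict:
    """Apply trimming, date formatting and the added age field to one record."""
    birth = datetime.strptime(record["birth_date"], "%m%d%Y").date()
    out = dict(record)
    out["name"] = record["name"].strip()
    out["birth_date"] = reformat_date(record["birth_date"])
    out["age"] = age_on(birth, today)
    return out
```

Passing the reference date in as `today` rather than reading the clock keeps the conversion repeatable when a day's data is reprocessed.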

The data integration module includes a second distributed dataset generation module, a full data integration module and a historical data integration module. The second distributed dataset generation module reads the current day's interim data and the previous day's integrated data separately and generates the corresponding Spark RDDs on distributed nodes. The full data integration module reads the Spark RDDs generated by the second distributed dataset generation module and, according to the insert, delete and update flags in the interim data, performs the corresponding insert, delete or update operation on the records with the same key value in the previous day's integrated data Spark RDD, where deletion is logical only and sets the deletion flag to '1'; when processing is complete, the resulting Spark RDD is stored in the integrated data repository. The historical data integration module reads the current day's interim data Spark RDD and, according to the insert, delete and update flags in the interim data, applies the corresponding handling to the records with the same key value in the previous day's integrated data Spark RDD; when processing is complete, the result is stored in the integrated data repository. The handling is as follows. For a delete flag, the record's deletion flag is set to '1' and its expiration date is set to the day before the source data date. For an insert flag, a new record is added with its effective date set to the source data date and its expiration date set to '9999-01-01'. For an update flag, the most recent record is modified by setting its expiration date to the day before the source data date, and a new record is added with its effective date set to the source data date and its expiration date set to '9999-01-01'.
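The historical data integration rules above can be illustrated with a small Python sketch. Dictionaries stand in for the rows of a Spark RDD; the field names `key`, `data`, `flag`, `eff_date`, `exp_date` and `deleted`, the flag codes `I`, `U` and `D`, and the function name `integrate_history` are assumptions made for this sketch. The sketch also assumes that the most recent record for a key is the one whose expiration date is still '9999-01-01'.

```python
from datetime import date, timedelta

OPEN_END = "9999-01-01"  # expiration date of a record that is still current


def integrate_history(previous, staging, source_date):
    """Apply insert/update/delete flags while keeping earlier versions."""
    day_before = (source_date - timedelta(days=1)).isoformat()
    today = source_date.isoformat()
    result = [dict(row) for row in previous]  # copy; do not mutate the input
    # Index the currently open (most recent) version of each key.
    current = {row["key"]: row for row in result if row["exp_date"] == OPEN_END}
    for change in staging:
        key, flag = change["key"], change["flag"]
        latest = current.get(key)
        if flag == "D":
            if latest is not None:
                # Delete: mark as deleted and close the record.
                latest["deleted"] = "1"
                latest["exp_date"] = day_before
                del current[key]
        elif flag in ("I", "U"):
            if flag == "U" and latest is not None:
                # Update: close the most recent version first.
                latest["exp_date"] = day_before
            new_row = {"key": key, "data": change["data"], "eff_date": today,
                       "exp_date": OPEN_END, "deleted": "0"}
            result.append(new_row)
            current[key] = new_row
    return result
```

With a source data date of 2017-02-20, an updated key ends up with two rows: the old version expiring on 2017-02-19 and a new version effective from 2017-02-20 that expires on '9999-01-01'.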

The data output module includes a third distributed dataset generation module and a target data output module. The third distributed dataset generation module reads the current day's integrated data and generates the corresponding Spark RDD on distributed nodes. The target data output module reads the Spark RDD generated by the third distributed dataset generation module, converts the format of the current day's integrated data according to the data format required by the data application system, and exports it; supported output formats include structured data files (optionally compressed), relational databases and Hive.

The metadata management module includes a metadata definition module, a metadata check module and a metadata export module. The metadata definition module defines and maintains metadata directly in Excel, which is easy to use and clear at a glance; the metadata includes data source information, source data structure, target data format, target data structure, data conversion rules and expressions. The metadata check module checks the metadata against a series of metadata specifications and outputs a metadata audit report. The metadata export module exports the metadata in Excel as a metadata control file; the metadata control file stores the metadata as an XML file, which makes metadata management simple and general.
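The export step can be pictured with the Python sketch below. The layout of the XML control file is not disclosed, so the element and attribute names (`metadata`, `source`, `field`, `name`, `type`, `length`) are invented for illustration, and the metadata rows are assumed to have already been read from the Excel sheet into dictionaries.

```python
import xml.etree.ElementTree as ET


def export_metadata(source_name, fields):
    """Serialize field definitions for one data source as an XML string."""
    root = ET.Element("metadata")
    source = ET.SubElement(root, "source", name=source_name)
    for field in fields:
        # XML attribute values must be strings, so the length is converted.
        ET.SubElement(source, "field", name=field["name"],
                      type=field["type"], length=str(field["length"]))
    return ET.tostring(root, encoding="unicode")
```

A control file in this form can be parsed back with `ET.fromstring`, which is one way the other modules could pick up the definitions at run time.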

The present invention is an ETL product built on a big data platform and distributed in-memory parallel computing. It uses the distributed big data platform Hadoop as its underlying support and storage platform, builds its data processing framework on Spark core components, and uses Spark's advanced DAG execution engine and powerful memory-based multi-round iterative computation to process source data in depth.

The invention is programmed in Scala, a statically typed, object-oriented and functional programming language that runs on the JVM. It is fast, has a concise API, and integrates easily with Hadoop and YARN. Because the Spark core itself is developed in Scala, the invention integrates closely with Spark and works directly against the Spark core, which improves programming efficiency and big data processing performance while ensuring high fault tolerance and high scalability of the system.

The invention defines and manages the various elements of ETL through metadata and is written in Scala, which integrates closely with Spark. It uses Spark's distributed in-memory parallel computing, caching intermediate results in memory to reduce disk I/O, and runs processing jobs concurrently on multiple threads to improve ETL performance and resource utilization. As a result, the invention runs several times faster than conventional single-machine ETL products and Hadoop MapReduce.

The present invention is an ETL product built on the Hadoop big data platform and the Spark distributed in-memory parallel computing framework. Its computation is memory-based multi-round iterative computation, and it runs several times faster than conventional single-machine ETL products and Hadoop MapReduce. As the cluster is smoothly extended and memory is added, the ETL performance of the invention improves linearly. In addition, the invention embeds a metadata management component: once the metadata is set up, the system automatically runs the data extraction, data processing, data integration and data output modules as a pipeline, without manual intervention. When the source data structure changes, only the corresponding metadata needs to be modified, without reprogramming, which makes the system easy to manage and maintain.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An ETL system based on Spark technology, characterized in that it comprises a data extraction module, a data processing module, a data integration module, a data output module, a metadata management module and a data storage module; the data storage module includes an interim data repository, an integrated data repository and a metadata control file;

the data extraction module is used to extract source data, dynamically generate multiple Spark RDDs on distributed nodes according to a data blocking rule, and then start multiple threads from a thread pool that call the data processing module to process each Spark RDD in parallel;

the data processing module is used to read the Spark RDDs generated by the data extraction module, apply a metadata matching check and a series of data conversions to obtain processed data, and store it in the interim data repository;

the data integration module is used to perform full data integration or historical data integration on the current day's interim data and the previous day's integrated data to obtain the current day's integrated data, and store it in the integrated data repository;

the data output module is used to convert the format of the current day's integrated data according to the data format required by the data application system, and export it;

the metadata management module is used to define and manage the various elements of the system as parameters.

2. The ETL system based on Spark technology as claimed in claim 1, characterized in that the data extraction module includes a data access module and a first distributed dataset generation module;

the data access module directly reads data files in compressed format, through offline extraction or online extraction, and passes them to the first distributed dataset generation module for subsequent processing;

the first distributed dataset generation module reads the source data through the data access module and dynamically generates multiple Spark RDDs on distributed nodes according to the data blocking rule, ready for further processing by the data processing module.

3. The ETL system based on Spark technology as claimed in claim 2, characterized in that the data processing module includes a data review module and a data conversion module;

the data review module checks the data by metadata matching and generates a data audit report;

the data conversion module is used to clean and convert the data.

4. The ETL system based on Spark technology as claimed in claim 3, characterized in that the data integration module comprises a second distributed dataset generation module, a full data integration module and a historical data integration module;

The second distributed dataset generation module is configured to read the current day's interim data and the previous day's integrated data respectively, and to generate corresponding Spark RDDs on the distributed nodes;

The full data integration module is configured to read the Spark RDDs generated by the second distributed dataset generation module and, according to the add, delete and modify flags in the interim data, perform the corresponding add, delete and modify operations on the data with the same key value in the previous day's integrated data Spark RDD, and, after processing is complete, form an interim data Spark RDD and store it in the integrated data repository;

The historical data integration module is configured to read the current day's interim data Spark RDD and, according to the add, delete and modify flags in the interim data, perform corresponding processing on the data with the same key value in the previous day's integrated data Spark RDD, and, after processing is complete, store the result in the integrated data repository.
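The flag-driven merge performed by the full data integration module can be sketched as below. Plain Python dictionaries stand in for keyed Spark RDDs. The flag values "I", "U" and "D", the field names, and the function name are assumptions for illustration. Historical data integration is not sketched, because the claim does not state what its "corresponding processing" consists of.

```python
def integrate_full(previous_integrated, interim_changes, key="id", flag="op"):
    """Apply add/modify/delete changes to the previous day's data by key."""
    merged = {row[key]: dict(row) for row in previous_integrated}
    for change in interim_changes:
        operation = change[flag]
        # Drop the flag column so it does not leak into the integrated data.
        row = {k: v for k, v in change.items() if k != flag}
        if operation in ("I", "U"):      # add or modify: upsert by key
            merged[row[key]] = row
        elif operation == "D":           # delete: remove the key if present
            merged.pop(row[key], None)
        else:
            raise ValueError(f"unknown change flag: {operation!r}")
    return [merged[k] for k in sorted(merged)]


previous = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
changes = [
    {"id": 2, "v": "B", "op": "U"},
    {"id": 3, "op": "D"},
    {"id": 4, "v": "d", "op": "I"},
]
today = integrate_full(previous, changes)
```

On a Spark cluster the same logic would typically be expressed as a keyed join between two pair RDDs rather than an in-memory dictionary.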

5. The ETL system based on Spark technology as claimed in claim 4, characterized in that the data output module comprises a third distributed dataset generation module and a target data output module;

The third distributed dataset generation module is configured to read the current day's integrated data and to generate corresponding Spark RDDs on the distributed nodes;

The target data output module is configured to read the Spark RDDs generated by the third distributed dataset generation module, convert the format of the current day's integrated data, and output the converted data.
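A format conversion of the kind claim 5 describes could look like the sketch below, which writes integrated records as pipe-delimited text. The delimiter, the header row, and the function name are assumptions; the actual target format depends on the data application system.

```python
import csv
import io


def export_delimited(records, columns, delimiter="|"):
    """Render records as delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)
    for record in records:
        # Missing columns are written as empty strings.
        writer.writerow([record.get(col, "") for col in columns])
    return buffer.getvalue()


text = export_delimited(
    [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}],
    columns=["id", "name"],
)
```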

6. The ETL system based on Spark technology as claimed in claim 5, characterized in that the metadata management module comprises a metadata definition module, a metadata check module and a metadata export module;

The metadata definition module is configured to define and maintain metadata directly in Excel, wherein the metadata comprises: data source information, source data structure, target data format, target data structure, and data conversion rules and expressions;

The metadata check module is configured to check the metadata against a series of metadata specifications and to output a metadata check report;

The metadata export module is configured to export the metadata in Excel as a metadata control file.
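The check and export steps of claim 6 might be sketched as follows. The rows are given as in-memory dictionaries, as if already read from an Excel sheet. The column names, the allowed type list, and the JSON layout of the control file are all hypothetical; the claim does not define the specifications or the control file format.

```python
import json

# Hypothetical metadata specification for one mapping row.
REQUIRED_KEYS = ("source", "source_field", "target_field", "target_type", "rule")
ALLOWED_TYPES = {"string", "int", "decimal", "date"}


def check_metadata(rows):
    """Return a list of human-readable problems found in the metadata rows."""
    report = []
    for number, row in enumerate(rows, start=1):
        for key in REQUIRED_KEYS:
            if not row.get(key):
                report.append(f"row {number}: missing {key}")
        target_type = row.get("target_type")
        if target_type and target_type not in ALLOWED_TYPES:
            report.append(f"row {number}: unknown target_type {target_type}")
    return report


def export_control_file(rows):
    """Serialize checked metadata rows as a JSON control document."""
    return json.dumps({"mappings": rows}, indent=2, sort_keys=True)


rows = [
    {"source": "orders.gz", "source_field": "amt", "target_field": "amount",
     "target_type": "decimal", "rule": "trim"},
    {"source": "orders.gz", "source_field": "dt", "target_field": "order_date",
     "target_type": "datetime", "rule": ""},
]
report = check_metadata(rows)
control = export_control_file(rows[:1])  # export only the row that passed
```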

7. A method for the ETL system based on Spark technology as claimed in any one of claims 1-6, characterized in that it comprises the following steps:

(S01) extracting source data, dynamically generating multiple Spark RDDs on the distributed nodes according to data blocking rules, and then starting multiple threads through a thread pool to call the data processing module to process each Spark RDD in parallel;

(S02) reading the Spark RDDs generated by the data extraction module, performing a metadata matching check and a series of data conversions to obtain processed data, and storing the processed data in the interim data repository;

(S03) performing full data integration or historical data integration on the current day's interim data and the previous day's integrated data, obtaining the current day's integrated data, and storing it in the integrated data repository;

(S04) converting the format of the current day's integrated data according to the data format requirements of the data application system, and outputting the converted data;

(S05) defining and managing the various elements of the system as parameters.
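The thread-pool dispatch in step (S01) can be sketched with the Python standard library. Lists stand in for Spark RDDs, and `process_block` is a placeholder for the data processing module; both are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_block(block):
    """Placeholder for the data processing module: trim and upper-case."""
    return [value.strip().upper() for value in block]


def process_blocks_in_parallel(blocks, max_workers=4):
    """Submit every block to a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_block, blocks))


blocks = [[" a", "b "], ["c"], []]
results = process_blocks_in_parallel(blocks)
```

`pool.map` returns results in submission order, so each output block lines up with the input block it came from even when the threads finish out of order.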

8. the method for the ETL system of Spark technologies is based on as claimed in claim 7, it is characterised in that step（S01）Including Following steps：

（S11）The data file of compressed format is directly read by way of offline extraction and in line extraction, and is sent to first Distributed data collection generation module carries out subsequent treatment；

（S12）Source data is read, and multiple Spark RDD are dynamically generated on distribution node according to deblocking rule, wait to count Further processed according to processing module；

Step（S02）Comprise the following steps：

（S21）By meta data match data are checked and are generated with data audit report；

（S22）Data are cleared up and changed using the rule or expression formula of metadata definition.

9. the method for the ETL system of Spark technologies is based on as claimed in claim 8, it is characterised in that step（S03）Including Following steps：

（S31）Interim data and the integral data of upper one day on the day of reading respectively, generate corresponding on distributed node Spark RDD；

（S32）Read step（S31）The Spark RDD of middle generation, increasing in interim data, delete, change mark, mutually in reply The data of identical key value are increased, are deleted, being changed operation in the integral data Spark RDD of a day, after the completion for the treatment of, revolution in formation It is deposited into integral data thesaurus according to Spark RDD；

（S33）Read step（S32）The interim data Spark RDD on the same day of formation, increasing in interim data, delete, change Mark, the data to identical key value in the integral data Spark RDD of upper one day are done respective handling, after the completion for the treatment of, are deposited into In integral data thesaurus；

Step（S04）Comprise the following steps：

（S41）Integral data on the day of reading, generates corresponding Spark RDD on distributed node；

（S42）Read step（S41）The Spark RDD of generation, data enter row format and change and export after being integrated to the same day.

10. the method for the ETL system of Spark technologies is based on as claimed in claim 9, it is characterised in that step（S05）Including Following steps：

（S51）Metadata definition and maintenance are directly carried out using Excel, wherein metadata includes：Data source information, source data knot Structure, target data form, target data structure, data conversion rule and expression formula；

（S52）Metadata is checked according to a series of metadata specification, and exports metadata audit report；

（S53）Metadata in Excel is exported as into metadata control file.