CN106897411A - ETL system and its method based on Spark technologies - Google Patents

Info

Publication number
CN106897411A
Authority
CN
China
Prior art keywords
data
module
spark
metadata
day
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710088150.6A
Other languages
Chinese (zh)
Inventor
陈涛
黄卓凡
张志聪
李笋
林志广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Strong Wind Polytron Technologies Inc
Original Assignee
Guangdong Strong Wind Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Strong Wind Polytron Technologies Inc filed Critical Guangdong Strong Wind Polytron Technologies Inc
Priority to CN201710088150.6A priority Critical patent/CN106897411A/en
Publication of CN106897411A publication Critical patent/CN106897411A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses an ETL system based on Spark technology, comprising a data extraction module, a data processing module, a data integration module, a data output module, a metadata management module, and a data storage module. The data storage module comprises an interim data repository, an integrated data repository, and a metadata control file. The data extraction module extracts source data, dynamically generates multiple Spark RDDs on distributed nodes, and processes them in parallel. The data processing module reads the Spark RDDs generated by the data extraction module and, after a metadata matching check and data conversion, stores the results in the interim data repository. The data integration module integrates the current day's interim data with the previous day's integrated data and stores the result in the integrated data repository. The data output module converts the format of the current day's integrated data and outputs it. Being based on Spark technology, the invention scales out smoothly and linearly, runs fast, requires no manual intervention, and is easy to manage and maintain.

Description

ETL system and its method based on Spark technologies
Technical field
The present invention relates to an ETL system and a method thereof, and more particularly to an ETL system based on Spark technology and a method thereof.
Background
With the development of big data, enterprises pay increasing attention to developing and applying data in order to gain more market opportunities. Big data applications depend on cleaning and processing massive volumes of data, and enterprises generally either use mainstream ETL (extract, transform, and load) products or hand-code the processing in database stored procedures.
At present, mainstream ETL products are mostly based on a single-machine architecture. When massive data is processed, I/O throughput and system resources become bottlenecks, and scaling is difficult and expensive. In addition, ETL products emphasize the usability of the operation interface: each data processing flow is designed graphically, but metadata management is separate from the ETL product, so the related designs must be changed manually whenever a data structure changes. Hand-coded database stored procedures have the same problem and are also inefficient to develop and hard to maintain.
With the emergence of Hadoop, the linear scalability and low cost of distributed platforms have largely overcome the shortcomings of building a data platform on a single machine. Processing data with Hive on the Hadoop platform has gained wide acceptance, but Hive is based on the MapReduce distributed computing framework, which has an inherent weakness: every MapReduce step has to read from and write to disk, so iteration has high latency.
In summary, there is a need for an ETL system based on Spark technology, and a method thereof, that remedies the above drawbacks.
Summary of the invention
The present invention proposes an ETL system based on Spark technology and a method thereof, which overcome the prior-art defects that I/O throughput and system resources become bottlenecks when massive data is processed and that scaling is difficult and expensive. Being based on Spark technology, the invention scales out smoothly and linearly, runs fast, runs without manual intervention, and is easy to manage and maintain, so it can fully meet the ETL processing needs of all industries, especially large enterprises.
The technical solution of the invention is realized as follows:
The present invention discloses an ETL system based on Spark technology, comprising a data extraction module, a data processing module, a data integration module, a data output module, a metadata management module, and a data storage module. The data storage module comprises an interim data repository, an integrated data repository, and a metadata control file. The data extraction module extracts source data, dynamically generates multiple Spark RDDs on distributed nodes according to a data partitioning rule, and then starts multiple threads from a thread pool that call the data processing module to process the Spark RDDs in parallel. The data processing module reads the Spark RDDs generated by the data extraction module, applies a metadata matching check and a series of data conversions to obtain the processed data, and stores it in the interim data repository. The data integration module performs full data integration or historical data integration on the current day's interim data and the previous day's integrated data to obtain the current day's integrated data, and stores it in the integrated data repository. The data output module converts the format of the current day's integrated data according to the data format requirements of the data application system and outputs it. The metadata management module defines and manages the various elements of the system as parameters.
The data extraction module comprises a data access module and a first distributed dataset generation module. The data access module reads data files in compressed format directly, by offline extraction or online extraction, and passes them to the first distributed dataset generation module for subsequent processing. The first distributed dataset generation module reads the source data through the data access module and dynamically generates multiple Spark RDDs on distributed nodes according to the data partitioning rule, to await further processing by the data processing module.
The data processing module comprises a data check module and a data conversion module. The data check module checks the data by metadata matching and generates a data check report. The data conversion module cleans and converts the data.
The data integration module comprises a second distributed dataset generation module, a full data integration module, and a historical data integration module. The second distributed dataset generation module reads the current day's interim data and the previous day's integrated data and generates the corresponding Spark RDDs on distributed nodes. The full data integration module reads the Spark RDDs generated by the second distributed dataset generation module and, according to the insert, delete, and update flags in the interim data, inserts, deletes, or updates the records with the same key in the previous day's integrated data Spark RDD; once processing is complete, an interim data Spark RDD is formed and stored in the integrated data repository. The historical data integration module reads the current day's interim data Spark RDD and, according to the insert, delete, and update flags in the interim data, applies the corresponding handling to the records with the same key in the previous day's integrated data Spark RDD; once processing is complete, the result is stored in the integrated data repository.
The data output module comprises a third distributed dataset generation module and a target data output module. The third distributed dataset generation module reads the current day's integrated data and generates the corresponding Spark RDDs on distributed nodes. The target data output module reads the Spark RDDs generated by the third distributed dataset generation module, converts the format of the current day's integrated data, and outputs it.
The metadata management module comprises a metadata definition module, a metadata check module, and a metadata export module. The metadata definition module defines and maintains metadata directly in Excel, where the metadata includes data source information, source data structures, target data formats, target data structures, and data conversion rules and expressions. The metadata check module checks the metadata against a set of metadata specifications and outputs a metadata check report. The metadata export module exports the metadata in Excel as the metadata control file.
The invention also discloses a method for the ETL system based on Spark technology, comprising the following steps: (S01) extracting source data, dynamically generating multiple Spark RDDs on distributed nodes according to the data partitioning rule, and then starting multiple threads from a thread pool that call the data processing module to process the Spark RDDs in parallel; (S02) reading the Spark RDDs generated by the data extraction module, applying a metadata matching check and a series of data conversions to obtain the processed data, and storing it in the interim data repository; (S03) performing full data integration or historical data integration on the current day's interim data and the previous day's integrated data to obtain the current day's integrated data, and storing it in the integrated data repository; (S04) converting the format of the current day's integrated data according to the data format requirements of the data application system, and outputting it; (S05) defining and managing the various elements of the system as parameters.
Step (S01) comprises the following steps: (S11) reading data files in compressed format directly, by offline extraction or online extraction, and passing them to the first distributed dataset generation module for subsequent processing; (S12) reading the source data and dynamically generating multiple Spark RDDs on distributed nodes according to the data partitioning rule, to await further processing by the data processing module.
Step (S02) comprises the following steps: (S21) checking the data by metadata matching and generating a data check report; (S22) cleaning and converting the data using the rules or expressions defined in the metadata.
Step (S03) comprises the following steps: (S31) reading the current day's interim data and the previous day's integrated data and generating the corresponding Spark RDDs on distributed nodes; (S32) reading the Spark RDDs generated in step (S31) and, according to the insert, delete, and update flags in the interim data, inserting, deleting, or updating the records with the same key in the previous day's integrated data Spark RDD, then, once processing is complete, forming an interim data Spark RDD and storing it in the integrated data repository; (S33) reading the current day's interim data Spark RDD formed in step (S32) and, according to the insert, delete, and update flags in the interim data, applying the corresponding handling to the records with the same key in the previous day's integrated data Spark RDD, then, once processing is complete, storing the result in the integrated data repository.
Step (S04) comprises the following steps: (S41) reading the current day's integrated data and generating the corresponding Spark RDDs on distributed nodes; (S42) reading the Spark RDDs generated in step (S41), converting the format of the current day's integrated data, and outputting it.
Step (S05) comprises the following steps: (S51) defining and maintaining metadata directly in Excel, where the metadata includes data source information, source data structures, target data formats, target data structures, and data conversion rules and expressions; (S52) checking the metadata against a set of metadata specifications and outputting a metadata check report; (S53) exporting the metadata in Excel as the metadata control file.
Compared with the prior art, the present invention has the following advantages:
1. Metadata management that is easy to use and easy to maintain. The invention manages and configures metadata in Excel spreadsheets, which are easy to use; metadata changes are maintained directly in the spreadsheets and are clear at a glance.
2. No programming, works out of the box, and runs automatically. The invention deploys quickly and provides a mature, complete ETL toolbox that covers common banking data ETL requirements. Once the metadata is set up, the system automatically runs the data extraction, data processing, data integration, and data output modules as a pipeline, without manual intervention. When the source data changes, only the corresponding metadata needs to be modified; no programming is required.
3. In-memory computing, multiplied performance, and linear scaling. The invention is written in Scala, which integrates closely with Spark. It uses Spark's distributed in-memory parallel computing to cache intermediate results in memory and reduce disk I/O, and it runs processing jobs concurrently on multiple threads to improve ETL performance and resource utilization. Compared with single-machine ETL tools and Hadoop MapReduce, the invention runs several times faster.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below evidently show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a structural block diagram of the ETL system based on Spark technology according to the present invention.
Fig. 2 is a structural block diagram of the data extraction module of the present invention.
Fig. 3 is a structural block diagram of the data processing module of the present invention.
Fig. 4 is a structural block diagram of the data integration module of the present invention.
Fig. 5 is a structural block diagram of the data output module of the present invention.
Fig. 6 is a structural block diagram of the metadata management module of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
To clarify the description of the embodiments that follow, some terms are explained before the specific embodiments are described in detail; the explanations apply to this specification and the claims.
ETL, as used in the present invention, is the abbreviation of Extract-Transform-Load and describes the process of moving data from a source through extraction, transformation, and loading to a destination. The term is most commonly used in data warehousing, but it is not limited to data warehouses. ETL is an important part of building a data warehouse: the user extracts the required data from the data sources, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model.
RDD stands for Resilient Distributed Dataset: a read-only, partitionable distributed dataset, all or part of which can be cached in memory and reused across multiple computations. The Chinese name for a Spark RDD translates as "elastic distributed dataset".
Other English terms appearing in the present invention are code identifiers and carry no other meaning.
Referring to Fig. 1 to Fig. 6, the present invention discloses an ETL system based on Spark technology, comprising a data extraction module, a data processing module, a data integration module, a data output module, a metadata management module, and a data storage module. The data storage module comprises an interim data repository, an integrated data repository, and a metadata control file. In the invention, the interim data repository and the integrated data repository use columnar storage (such as Parquet), which greatly reduces storage space and improves query performance. The metadata control file stores the metadata as an XML file, which makes metadata management simple and general.
The data extraction module supports a variety of heterogeneous data sources, such as relational databases, structured data files (optionally compressed), and big data sources (such as HDFS, Hive, JSON, and Parquet). It extracts source data, dynamically generates multiple Spark RDDs on distributed nodes according to the data partitioning rule, and then starts multiple threads from a thread pool that call the data processing module to process the Spark RDDs in parallel. The data processing module reads the Spark RDDs generated by the data extraction module, applies a metadata matching check and a series of data conversions to obtain the processed data, and stores it in the interim data repository. The data integration module performs full data integration or historical data integration on the current day's interim data and the previous day's integrated data to obtain the current day's integrated data, and stores it in the integrated data repository. The data output module converts the format of the current day's integrated data according to the data format requirements of the data application system and outputs it; the supported output formats include structured data files (optionally compressed), relational databases, Hive, and so on.
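The thread-pool dispatch described above can be sketched as follows. This is a minimal illustration in plain Python rather than the Scala and Spark implementation the invention describes: each "partition" is an ordinary list standing in for a Spark RDD, and `process_partition`, `process_all`, and `max_workers` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_partition(partition):
    """Hypothetical stand-in for the data processing module.

    Here it only strips surrounding whitespace from every field.
    """
    return [[field.strip() for field in record] for record in partition]


def process_all(partitions, max_workers=4):
    """Run one processing task per partition on a thread pool.

    pool.map returns results in the same order as the input partitions.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_partition, partitions))


# Two small partitions standing in for dynamically generated Spark RDDs.
partitions = [
    [[" a ", "1"], ["b", " 2"]],
    [["c ", "3"]],
]
processed = process_all(partitions)
```

The pool size bounds how many partitions are processed at once; how the invention sizes its pool is not stated in the source.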
The metadata management module defines and manages the various elements of the system as parameters. These elements include data source information, source data structures, target data formats, target data structures, data conversion rules and expressions, and so on. Once the metadata is set up, the system automatically runs the data extraction, data processing, data integration, and data output modules as a pipeline, without manual intervention. When the source data changes, only the corresponding metadata needs to be modified; no reprogramming is required.
The data extraction module comprises a data access module and a first distributed dataset generation module. The data access module reads data files in compressed format directly, by offline extraction or online extraction, and passes them to the first distributed dataset generation module for subsequent processing. The first distributed dataset generation module reads the source data through the data access module and dynamically generates multiple Spark RDDs on distributed nodes according to the data partitioning rule, to await further processing by the data processing module.
The data access module is the interface layer: it is the unified data channel through which the invention connects to data sources, and through it source data can be extracted with high concurrency and high reliability. The data access module supports a variety of heterogeneous data sources, such as relational databases, structured data files (optionally compressed), and big data sources (such as HDFS, Hive, JSON, and Parquet). It supports two extraction modes, offline and online. In offline extraction, the source system generates data files, compresses them, and sends them to this system using file transfer software (such as FTP), after which they are loaded into this system through data access. In online extraction, source data is extracted into this system online in real time through data access. The data access module also supports reading compressed data files directly for subsequent processing; throughout, the data is cached in memory and never written to disk. Compared with the traditional approach of first decompressing the file and then reading the data file, this is markedly faster and also reduces the network transfer of data files.
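Reading a compressed data file without first unpacking it to disk can be illustrated with a short sketch. The source does not name the compression format or the file layout, so gzip, the `|` delimiter, and the function name `iter_records` are assumptions made only for this example.

```python
import gzip
import io


def iter_records(compressed_bytes, delimiter="|", encoding="utf-8"):
    """Yield records from gzip-compressed delimited text.

    The data is decompressed as a stream in memory; no temporary
    uncompressed file is written to disk.
    """
    with gzip.open(io.BytesIO(compressed_bytes), "rt", encoding=encoding) as lines:
        for line in lines:
            line = line.rstrip("\r\n")
            if line:
                yield line.split(delimiter)


# Build a small compressed sample in memory (illustrative data only).
sample = gzip.compress(
    "1001|Alice|1990-05-17\n1002|Bob|1985-11-02\n".encode("utf-8")
)
records = list(iter_records(sample))
```

Because `iter_records` is a generator, a downstream step can consume records as they are decompressed instead of waiting for the whole file.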
The basic principle of the data partitioning rule is to compute the number of Spark RDDs dynamically from the total data volume and the record size, and to load the data into the Spark RDDs according to the rule. The rule addresses a shortcoming of Spark, which by default generates Spark RDDs based on the HDFS block size parameter: it ensures that each data record is assigned whole, without being split, to a single Spark RDD, which reduces computation across Spark RDDs and improves data processing performance.
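The source gives the principle of the partitioning rule but not its formula. The sketch below shows one way such a rule could work, under the assumption of fixed-size records and a hypothetical tuning parameter `target_partition_bytes`; plain lists stand in for Spark RDDs.

```python
import math


def plan_partitions(total_bytes, record_bytes, target_partition_bytes):
    """Derive a partition count from total data size and record size.

    Each partition is sized to hold a whole number of records, so no
    record straddles a partition boundary.
    """
    records_per_partition = max(1, target_partition_bytes // record_bytes)
    total_records = math.ceil(total_bytes / record_bytes)
    partition_count = max(1, math.ceil(total_records / records_per_partition))
    return partition_count, records_per_partition


def assign_records(records, records_per_partition):
    """Place whole records into consecutive partitions."""
    return [
        records[i:i + records_per_partition]
        for i in range(0, len(records), records_per_partition)
    ]


# 1000 bytes of 100-byte records, aiming for partitions of about 256 bytes.
count, per_partition = plan_partitions(
    total_bytes=1000, record_bytes=100, target_partition_bytes=256
)
partitions = assign_records([f"record-{n}" for n in range(10)], per_partition)
```

Rounding the partition size down to a whole number of records is what keeps every record intact, in contrast to splitting purely on a byte-oriented block size.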
The data processing module comprises a data check module and a data conversion module. The data check module checks the data by metadata matching and generates a data check report. The data conversion module cleans and converts the data. The data check report is provided to data administrators so that they can understand the quality of the source data, and it can also be provided to the administrators of the source system for reference.
When the data check module performs the metadata matching check, it checks the data it reads against the predefined metadata, including the number of data fields, field data types, field lengths, field data formats, and whether business check expressions are satisfied. After the check it outputs a data check report that records the problematic data and the time, together with overall check results such as the total number of records and how many of them are problematic.
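A metadata matching check of this kind can be sketched as below. The metadata layout (`FIELDS`, with `name`, `type`, `max_len`, and `pattern` keys) and the report structure are hypothetical; the source specifies what is checked, not how the metadata or the report are represented. Business check expressions are omitted from the sketch.

```python
import re
from datetime import datetime

# Hypothetical field metadata: one entry per expected field, in order.
FIELDS = [
    {"name": "id", "type": int, "max_len": 6, "pattern": None},
    {"name": "name", "type": str, "max_len": 10, "pattern": None},
    {"name": "birth", "type": str, "max_len": 10, "pattern": r"\d{4}-\d{2}-\d{2}"},
]


def check_record(record, fields):
    """Return the list of problems found in one record (empty if it passes)."""
    if len(record) != len(fields):
        return [f"expected {len(fields)} fields, got {len(record)}"]
    problems = []
    for value, meta in zip(record, fields):
        if len(value) > meta["max_len"]:
            problems.append(f"{meta['name']}: longer than {meta['max_len']}")
        try:
            meta["type"](value)  # type check by attempted conversion
        except ValueError:
            problems.append(f"{meta['name']}: not a valid {meta['type'].__name__}")
        if meta["pattern"] and not re.fullmatch(meta["pattern"], value):
            problems.append(f"{meta['name']}: bad format")
    return problems


def check_report(records, fields):
    """Check every record and summarize totals plus the problematic records."""
    bad = []
    for record in records:
        problems = check_record(record, fields)
        if problems:
            bad.append({
                "record": record,
                "problems": problems,
                "checked_at": datetime.now().isoformat(),
            })
    return {"total": len(records), "problem_count": len(bad), "problems": bad}


records = [
    ["1001", "Alice", "1990-05-17"],
    ["10x2", "Bob", "05/17/1990"],
    ["1003", "Carol"],
]
report = check_report(records, FIELDS)
```

In this sample the first record passes, the second fails the type and format checks, and the third fails the field-count check.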
The functions of the data conversion module include encoding conversion, data formatting, adding fields, and conversion by expression. Encoding conversion supports converting GBK, UTF-16, and ANSI encodings to UTF-8. Data formatting converts data to the target format according to the format conversion requirements defined in the metadata; for example, a date in MMDDYYYY format is converted to YYYY-MM-DD. Adding fields generates new fields according to the new-field requirements defined in the metadata; the value of a new field is either a fixed value or is computed from other fields, for example a new age field whose value is computed from the date-of-birth field. Conversion by expression calls the expression engine, which parses the expressions defined in the metadata and converts the data using the parsed expressions. Expressions support TRIM, SUBSTRING, CONCAT, REPLACE, IF-ELSE-FI conditional logic, basic arithmetic, regular expression functions, and so on.
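Three of the conversions named above (encoding conversion, date reformatting, and a derived age field) can be sketched with the Python standard library. The function names and the record layout are hypothetical, and the expression engine is not covered here.

```python
from datetime import date, datetime


def to_utf8(raw_bytes, source_encoding="gbk"):
    """Re-encode bytes from a source encoding (GBK by default) to UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


def convert_date(value):
    """Convert a date string from MMDDYYYY to YYYY-MM-DD."""
    return datetime.strptime(value, "%m%d%Y").strftime("%Y-%m-%d")


def age_on(birth_date, as_of):
    """Age in whole years on the given date."""
    years = as_of.year - birth_date.year
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1  # birthday has not occurred yet this year
    return years


def transform(record, as_of):
    """Reformat the birth date and add a derived 'age' field."""
    birth = convert_date(record["birth"])
    converted = dict(record, birth=birth)
    converted["age"] = age_on(date.fromisoformat(birth), as_of)
    return converted


row = transform({"name": "Alice", "birth": "05171990"}, as_of=date(2017, 2, 17))
```

Passing `as_of` explicitly, rather than reading the system clock, keeps the derived age reproducible when a day's data is reprocessed later.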
The data integration module comprises a second distributed dataset generation module, a full data integration module, and a historical data integration module. The second distributed dataset generation module reads the current day's interim data and the previous day's integrated data and generates the corresponding Spark RDDs on distributed nodes. The full data integration module reads the Spark RDDs generated by the second distributed dataset generation module and, according to the insert, delete, and update flags in the interim data, inserts, deletes, or updates the records with the same key in the previous day's integrated data Spark RDD. Deletion is logical only: the delete flag is set to '1'. Once processing is complete, an interim data Spark RDD is formed and stored in the integrated data repository. The historical data integration module reads the current day's interim data Spark RDD and, according to the insert, delete, and update flags in the interim data, applies the corresponding handling to the records with the same key in the previous day's integrated data Spark RDD; once processing is complete, the result is stored in the integrated data repository. The handling is as follows. For a delete flag, the record's delete flag is set to '1' and its expiry date is set to the day before the source data date. For an insert flag, a new record is added with its effective date set to the source data date and its expiry date set to '9999-01-01'. For an update flag, the most recent record is modified by setting its expiry date to the day before the source data date, and a new record is added with its effective date set to the source data date and its expiry date set to '9999-01-01'.
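The historical integration rules above can be sketched as follows, with plain lists of dictionaries standing in for the Spark RDDs. The flag codes `I`, `U`, and `D` and the field names `key`, `value`, `effective`, `expiry`, and `deleted` are assumptions made for the example; only the date handling and the '1' delete flag come from the description.

```python
from datetime import date, timedelta

OPEN_END = "9999-01-01"  # expiry date of a record that is still current


def integrate_history(history, changes, source_date):
    """Apply one day's flagged changes to the previous day's history.

    Returns a new list; the input history is left untouched.
    """
    day_before = (source_date - timedelta(days=1)).isoformat()
    today = source_date.isoformat()
    result = [dict(row) for row in history]
    # The currently open (unexpired) row for each key.
    current = {row["key"]: row for row in result if row["expiry"] == OPEN_END}
    for change in changes:
        key, flag = change["key"], change["flag"]
        open_row = current.get(key)
        if flag in ("U", "D") and open_row is not None:
            open_row["expiry"] = day_before  # close the most recent record
            if flag == "D":
                open_row["deleted"] = "1"  # logical delete only
                current.pop(key, None)
        if flag in ("I", "U"):
            new_row = {
                "key": key,
                "value": change["value"],
                "effective": today,
                "expiry": OPEN_END,
                "deleted": "0",
            }
            result.append(new_row)
            current[key] = new_row
    return result


history = [
    {"key": "A", "value": "a1", "effective": "2017-01-01", "expiry": OPEN_END, "deleted": "0"},
    {"key": "B", "value": "b1", "effective": "2017-01-01", "expiry": OPEN_END, "deleted": "0"},
]
changes = [
    {"key": "A", "flag": "U", "value": "a2"},
    {"key": "B", "flag": "D", "value": None},
    {"key": "C", "flag": "I", "value": "c1"},
]
new_history = integrate_history(history, changes, date(2017, 2, 17))
```

Each key keeps at most one open record, so the value that was valid on any past date can be found by comparing that date with the effective and expiry dates.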
The data output module comprises a third distributed dataset generation module and a target data output module. The third distributed dataset generation module reads the current day's integrated data and generates the corresponding Spark RDDs on distributed nodes. The target data output module reads the Spark RDDs generated by the third distributed dataset generation module, converts the format of the current day's integrated data according to the data format requirements of the data application system, and outputs it. The supported output formats include structured data files (optionally compressed), relational databases, Hive, and so on.
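For the structured-file output path, a minimal sketch might look like the following. The delimited-text layout, the gzip compression, and the name `export_delimited` are assumptions; the source says only that structured data files, optionally compressed, are among the supported outputs.

```python
import csv
import gzip
import io


def export_delimited(rows, columns, delimiter=",", compress=True):
    """Serialize rows as delimited text with a header, optionally gzip-compressed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([row[column] for column in columns])
    payload = buffer.getvalue().encode("utf-8")
    return gzip.compress(payload) if compress else payload


integrated = [
    {"key": "A", "value": "a2"},
    {"key": "C", "value": "c1"},
]
output = export_delimited(integrated, ["key", "value"])
```

The column list plays the role of the target data structure: it fixes which fields are written and in what order, independently of how the integrated records are stored.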
The metadata management module comprises a metadata definition module, a metadata check module, and a metadata export module. The metadata definition module defines and maintains metadata directly in Excel, which is easy to use and clear at a glance; the metadata includes data source information, source data structures, target data formats, target data structures, and data conversion rules and expressions. The metadata check module checks the metadata against a set of metadata specifications and outputs a metadata check report. The metadata export module exports the metadata in Excel as the metadata control file, which stores the metadata as an XML file and thereby makes metadata management simple and general.
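The check-then-export flow can be sketched as below. The source does not give the XML schema or the metadata specifications, so the element and attribute names (`metadata`, `source`, `field`), the allowed type list, and the two checks shown are all hypothetical; the spreadsheet rows are represented as dictionaries so that the sketch needs no Excel library.

```python
import xml.etree.ElementTree as ET

ALLOWED_TYPES = {"string", "integer", "date"}  # hypothetical specification


def check_metadata(field_rows):
    """Return a list of problems: duplicate field names or unknown types."""
    problems = []
    seen = set()
    for row in field_rows:
        if row["name"] in seen:
            problems.append(f"duplicate field name: {row['name']}")
        seen.add(row["name"])
        if row["type"] not in ALLOWED_TYPES:
            problems.append(f"{row['name']}: unknown type {row['type']}")
    return problems


def build_control_file(source_name, field_rows):
    """Render field definitions as an XML control document (as a string)."""
    root = ET.Element("metadata")
    source = ET.SubElement(root, "source", name=source_name)
    for row in field_rows:
        ET.SubElement(
            source,
            "field",
            name=row["name"],
            type=row["type"],
            length=str(row["length"]),
            rule=row.get("rule", ""),
        )
    return ET.tostring(root, encoding="unicode")


# Rows as they might be read from the spreadsheet (illustrative values).
rows = [
    {"name": "id", "type": "integer", "length": 6},
    {"name": "birth", "type": "date", "length": 10, "rule": "MMDDYYYY->YYYY-MM-DD"},
]
problems = check_metadata(rows)
control_xml = build_control_file("customer", rows) if not problems else None
```

Running the check before the export means a control file is produced only from metadata that satisfies the specification.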
The present invention is an ETL product built on a big data platform and distributed in-memory parallel computing. It uses the distributed big data platform Hadoop as the underlying support and storage platform, builds its data processing framework on the Spark core components, and uses Spark's advanced DAG execution engine and powerful in-memory multi-round iterative computation to process source data in depth.
The invention is programmed in Scala, a statically typed language that runs on the JVM and combines object-oriented and functional programming. Scala is fast, has a concise API, and integrates easily with Hadoop and YARN. The Spark core is itself written in Scala, so the invention integrates closely with Spark and works directly with the Spark core, which improves programming efficiency and big data processing performance while preserving the system's high fault tolerance and high scalability.
The invention defines and manages the various elements of ETL through metadata. It is written in Scala, which integrates closely with Spark; it uses Spark's distributed in-memory parallel computing to cache intermediate results in memory and reduce disk I/O; and it runs processing jobs concurrently on multiple threads to improve ETL performance and resource utilization. Compared with traditional single-machine ETL products and Hadoop MapReduce, the invention therefore runs several times faster.
Invention additionally discloses a kind of method of the ETL system based on Spark technologies, it comprises the following steps:(S01)Extract Source data, and multiple Spark RDD are dynamically generated on distribution node according to deblocking rule, then started by thread pool many Individual thread carries out parallel processing calling data processing module to each Spark RDD;(S02)Read data extraction module generation Spark RDD, changed by meta data match inspection and volume of data, the data after being processed, and be stored in transfer In data repository;(S03)The integral data of interim data and upper one day to the same day carries out full dose Data Integration or history number According to integration, data after same day integration are obtained, and be stored in integral data thesaurus;(S04)According to Data application system logarithm According to the requirement of form, data enter row format and change and export after being integrated to the same day;(S05)The various key elements of system are parameterized Definition and management.
Step (S01) comprises the following steps:
(S11) Directly read data files in compressed format by means of offline extraction and online extraction, and pass them to the first distributed dataset generation module for subsequent processing;
(S12) Read the source data and dynamically generate multiple Spark RDDs on the distributed nodes according to the data blocking rule, to await further processing by the data processing module.
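As a rough sketch of steps (S11) and (S12), the following plain-Python example reads a gzip file directly, without unpacking it to disk first, and groups the lines into fixed-size blocks. The gzip format, the fixed block size, and the function names are assumptions made for illustration; the patent does not specify the compression format or the blocking rule.

```python
import gzip
import os
import tempfile


def read_compressed_lines(path):
    """Yield lines from a gzip file without unpacking it to disk first."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def blocks_of(lines, block_size):
    """Group lines into blocks of at most block_size (the assumed blocking rule)."""
    block = []
    for line in lines:
        block.append(line)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block


# Demo: write a small compressed source file, then read it back in blocks.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "source.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write("1,alice\n2,bob\n3,carol\n")
    blocks = list(blocks_of(read_compressed_lines(path), block_size=2))
```

Because both functions are generators, only one block needs to be held in memory at a time until the caller collects the results.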
Step (S02) comprises the following steps:
(S21) Check the data through metadata matching and generate a data check report;
(S22) Clean and transform the data using the rules or expressions defined in the metadata.
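One possible reading of steps (S21) and (S22) is sketched below in plain Python. The `METADATA` layout (column name mapped to an expected type and a cleaning rule) and the report format are hypothetical; the patent only states that checks and transformations are driven by metadata.

```python
# Hypothetical metadata: column name -> (expected type, cleaning rule).
METADATA = {
    "id": (int, lambda v: v),
    "name": (str, lambda v: v.strip().title()),
    "amount": (float, lambda v: round(v, 2)),
}


def check_record(record, metadata):
    """Return the problems found when matching one record against the metadata."""
    problems = []
    for column, (expected_type, _) in metadata.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            problems.append(
                f"bad type for {column}: {type(record[column]).__name__}"
            )
    for column in record:
        if column not in metadata:
            problems.append(f"unexpected column: {column}")
    return problems


def check_and_transform(records, metadata):
    """Split records into cleaned rows and a check report of rejected rows."""
    clean, report = [], []
    for index, record in enumerate(records):
        problems = check_record(record, metadata)
        if problems:
            report.append((index, problems))
        else:
            clean.append({c: rule(record[c]) for c, (_, rule) in metadata.items()})
    return clean, report


records = [
    {"id": 1, "name": "  alice ", "amount": 10.456},
    {"id": "2", "name": "bob", "amount": 3.0},
]
clean, report = check_and_transform(records, METADATA)
```

Keeping the rules in a data structure rather than in code mirrors the stated goal that a change in the source structure should require only a metadata change.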
Step (S03) comprises the following steps:
(S31) Read the current day's staging data and the previous day's integrated data respectively, and generate the corresponding Spark RDDs on the distributed nodes;
(S32) Read the Spark RDDs generated in step (S31) and, according to the insert, delete, and update flags in the staging data, apply the corresponding insert, delete, and update operations to the records with the same key value in the previous day's integrated-data Spark RDD; when processing is complete, form the staging-data Spark RDD and store it in the integrated data repository;
(S33) Read the current day's staging-data Spark RDD formed in step (S32) and, according to the insert, delete, and update flags in the staging data, perform the corresponding processing on the records with the same key value in the previous day's integrated-data Spark RDD; when processing is complete, store the result in the integrated data repository.
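The flag-driven merge in steps (S32) and (S33) might look like the following plain-Python sketch. The flag values `I`, `U`, and `D`, the tuple layout of the staging rows, and especially the versioned "history" model with start and end dates are assumptions; the patent says only that "corresponding processing" is applied to records with the same key value.

```python
OPEN_END = "9999-12-31"  # assumed marker for a version that is still current


def integrate_full(previous, staging):
    """Full integration: apply I/U/D changes to yesterday's snapshot by key."""
    current = dict(previous)  # leave yesterday's snapshot untouched
    for key, flag, row in staging:
        if flag in ("I", "U"):
            current[key] = row
        elif flag == "D":
            current.pop(key, None)
        else:
            raise ValueError(f"unknown change flag: {flag!r}")
    return current


def integrate_history(history, staging, today):
    """Historical integration (assumed model): close old versions, open new ones."""
    result = [dict(version) for version in history]
    open_versions = {v["key"]: v for v in result if v["end"] == OPEN_END}
    for key, flag, row in staging:
        if flag not in ("I", "U", "D"):
            raise ValueError(f"unknown change flag: {flag!r}")
        if key in open_versions:
            open_versions.pop(key)["end"] = today  # close the current version
        if flag != "D":
            version = {"key": key, "row": row, "start": today, "end": OPEN_END}
            result.append(version)
            open_versions[key] = version
    return result


previous = {1: "alice", 2: "bob"}
staging = [(1, "U", "alice v2"), (2, "D", None), (3, "I", "carol")]
snapshot = integrate_full(previous, staging)

history = [
    {"key": 1, "row": "alice", "start": "2017-02-19", "end": OPEN_END},
    {"key": 2, "row": "bob", "start": "2017-02-19", "end": OPEN_END},
]
versions = integrate_history(history, staging, today="2017-02-20")
```

In this sketch the full integration keeps only the latest state per key, while the historical integration never discards a row and instead records when each version stopped being current.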
Step (S04) comprises the following steps:
(S41) Read the current day's integrated data and generate the corresponding Spark RDD on the distributed nodes;
(S42) Read the Spark RDD generated in step (S41), convert the format of the current day's integrated data, and output it.
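The format conversion in step (S42) can be sketched as a small renderer that takes the target format as a parameter. The two formats shown (CSV and JSON Lines) and the function name `render` are illustrative assumptions; the patent does not list the formats that the data application system may require.

```python
import csv
import io
import json


def render(records, columns, target_format):
    """Render integrated records in the format the consuming system asks for."""
    if target_format == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(columns)
        for record in records:
            writer.writerow([record[column] for column in columns])
        return buffer.getvalue()
    if target_format == "jsonl":
        return "".join(
            json.dumps({column: record[column] for column in columns}) + "\n"
            for record in records
        )
    raise ValueError(f"unsupported target format: {target_format!r}")


integrated = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
as_csv = render(integrated, ["id", "name"], "csv")
as_jsonl = render(integrated, ["id", "name"], "jsonl")
```

Passing the column list explicitly lets the target structure be taken from metadata instead of being fixed in code.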
Step (S05) comprises the following steps:
(S51) Define and maintain the metadata directly in Excel, where the metadata includes: data source information, source data structure, target data format, target data structure, and data transformation rules and expressions;
(S52) Check the metadata against a series of metadata specifications and output a metadata check report;
(S53) Export the metadata in Excel as a metadata control file.
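Steps (S52) and (S53) can be approximated by the sketch below. It assumes the spreadsheet rows have already been loaded into a list of dictionaries, and it writes the control file as JSON; the field names, the allowed types, and the JSON layout are all hypothetical, since the patent does not describe the specifications or the control-file format.

```python
import json

# Hypothetical specification for one metadata row.
REQUIRED_FIELDS = ("source", "source_column", "target_column", "target_type")
ALLOWED_TYPES = {"string", "int", "decimal", "date"}


def check_metadata(rows):
    """Check metadata rows against the assumed specification; return a report."""
    report = []
    seen_targets = set()
    for number, row in enumerate(rows, start=1):
        for field in REQUIRED_FIELDS:
            if not row.get(field):
                report.append(f"row {number}: missing {field}")
        target_type = row.get("target_type")
        if target_type and target_type not in ALLOWED_TYPES:
            report.append(f"row {number}: unknown target_type {target_type}")
        target = row.get("target_column")
        if target:
            if target in seen_targets:
                report.append(f"row {number}: duplicate target_column {target}")
            seen_targets.add(target)
    return report


def export_control_file(rows):
    """Export checked metadata as a JSON control file (format is an assumption)."""
    report = check_metadata(rows)
    if report:
        raise ValueError("metadata check failed: " + "; ".join(report))
    return json.dumps({"columns": rows}, indent=2, sort_keys=True)


good_rows = [
    {"source": "crm", "source_column": "cust_nm", "target_column": "name",
     "target_type": "string", "rule": "trim"},
    {"source": "crm", "source_column": "cust_id", "target_column": "id",
     "target_type": "int", "rule": ""},
]
bad_rows = [{"source": "crm", "source_column": "", "target_column": "name",
             "target_type": "text"}]
control_file = export_control_file(good_rows)
bad_report = check_metadata(bad_rows)
```

Refusing to export when the report is non-empty keeps an invalid definition from reaching the pipeline that consumes the control file.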
The present invention is an ETL product built on the Hadoop base data platform and the Spark distributed in-memory parallel computing framework. Its computation is based on multiple rounds of in-memory iterative computing, and it runs several times faster than traditional single-machine ETL products and Hadoop MapReduce. As the cluster is smoothly scaled out and memory is added, the ETL performance of the invention can improve linearly. In addition, the invention embeds a metadata management component: once the metadata has been set, the system automatically runs the data extraction, data processing, data integration, and data output modules as a pipeline, without manual intervention. When the source data structure changes, only the corresponding metadata needs to be modified and no reprogramming is required, which makes the system easy to manage and maintain.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. An ETL system based on Spark technology, characterized in that it comprises a data extraction module, a data processing module, a data integration module, a data output module, a metadata management module, and a data storage module; the data storage module comprises a staging data repository, an integrated data repository, and a metadata control file;
the data extraction module is used to extract the source data and dynamically generate multiple Spark RDDs on the distributed nodes according to the data blocking rule, and then to start multiple threads through a thread pool to call the data processing module and process each Spark RDD in parallel;
the data processing module is used to read the Spark RDDs generated by the data extraction module, obtain the processed data through metadata matching checks and a series of data transformations, and store it in the staging data repository;
the data integration module is used to perform full data integration or historical data integration on the current day's staging data and the previous day's integrated data, obtain the current day's integrated data, and store it in the integrated data repository;
the data output module is used to convert the format of the current day's integrated data according to the data format required by the data application system, and to output it;
the metadata management module is used to define and manage the various elements of the system in a parameterized manner.
2. The ETL system based on Spark technology according to claim 1, characterized in that the data extraction module comprises a data access module and a first distributed dataset generation module;
the data access module directly reads data files in compressed format by means of offline extraction and online extraction, and passes them to the first distributed dataset generation module for subsequent processing;
the first distributed dataset generation module reads the source data through the data access module and dynamically generates multiple Spark RDDs on the distributed nodes according to the data blocking rule, to await further processing by the data processing module.
3. The ETL system based on Spark technology according to claim 2, characterized in that the data processing module comprises a data check module and a data transformation module;
the data check module checks the data through metadata matching and generates a data check report;
the data transformation module is used to clean and transform the data.
4. The ETL system based on Spark technology according to claim 3, characterized in that the data integration module comprises a second distributed dataset generation module, a full data integration module, and a historical data integration module;
the second distributed dataset generation module is used to read the current day's staging data and the previous day's integrated data respectively, and to generate the corresponding Spark RDDs on the distributed nodes;
the full data integration module is used to read the Spark RDDs generated by the second distributed dataset generation module and, according to the insert, delete, and update flags in the staging data, apply the corresponding insert, delete, and update operations to the records with the same key value in the previous day's integrated-data Spark RDD; when processing is complete, the staging-data Spark RDD is formed and stored in the integrated data repository;
the historical data integration module is used to read the current day's staging-data Spark RDD and, according to the insert, delete, and update flags in the staging data, perform the corresponding processing on the records with the same key value in the previous day's integrated-data Spark RDD; when processing is complete, the result is stored in the integrated data repository.
5. The ETL system based on Spark technology according to claim 4, characterized in that the data output module comprises a third distributed dataset generation module and a target data output module;
the third distributed dataset generation module is used to read the current day's integrated data and generate the corresponding Spark RDD on the distributed nodes;
the target data output module is used to read the Spark RDD generated by the third distributed dataset generation module, convert the format of the current day's integrated data, and output it.
6. The ETL system based on Spark technology according to claim 5, characterized in that the metadata management module comprises a metadata definition module, a metadata check module, and a metadata export module;
the metadata definition module is used to define and maintain the metadata directly in Excel, where the metadata includes: data source information, source data structure, target data format, target data structure, and data transformation rules and expressions;
the metadata check module is used to check the metadata against a series of metadata specifications and to output a metadata check report;
the metadata export module is used to export the metadata in Excel as a metadata control file.
7. A method for the ETL system based on Spark technology according to any one of claims 1-6, characterized in that it comprises the following steps:
(S01) Extract the source data and dynamically generate multiple Spark RDDs on the distributed nodes according to the data blocking rule, then start multiple threads through a thread pool to call the data processing module and process each Spark RDD in parallel;
(S02) Read the Spark RDDs generated by the data extraction module, obtain the processed data through metadata matching checks and a series of data transformations, and store it in the staging data repository;
(S03) Perform full data integration or historical data integration on the current day's staging data and the previous day's integrated data to obtain the current day's integrated data, and store it in the integrated data repository;
(S04) Convert the format of the current day's integrated data according to the data format required by the data application system, and output it;
(S05) Define and manage the various elements of the system in a parameterized manner.
8. the method for the ETL system of Spark technologies is based on as claimed in claim 7, it is characterised in that step(S01)Including Following steps:
(S11)The data file of compressed format is directly read by way of offline extraction and in line extraction, and is sent to first Distributed data collection generation module carries out subsequent treatment;
(S12)Source data is read, and multiple Spark RDD are dynamically generated on distribution node according to deblocking rule, wait to count Further processed according to processing module;
Step(S02)Comprise the following steps:
(S21)By meta data match data are checked and are generated with data audit report;
(S22)Data are cleared up and changed using the rule or expression formula of metadata definition.
9. the method for the ETL system of Spark technologies is based on as claimed in claim 8, it is characterised in that step(S03)Including Following steps:
(S31)Interim data and the integral data of upper one day on the day of reading respectively, generate corresponding on distributed node Spark RDD;
(S32)Read step(S31)The Spark RDD of middle generation, increasing in interim data, delete, change mark, mutually in reply The data of identical key value are increased, are deleted, being changed operation in the integral data Spark RDD of a day, after the completion for the treatment of, revolution in formation It is deposited into integral data thesaurus according to Spark RDD;
(S33)Read step(S32)The interim data Spark RDD on the same day of formation, increasing in interim data, delete, change Mark, the data to identical key value in the integral data Spark RDD of upper one day are done respective handling, after the completion for the treatment of, are deposited into In integral data thesaurus;
Step(S04)Comprise the following steps:
(S41)Integral data on the day of reading, generates corresponding Spark RDD on distributed node;
(S42)Read step(S41)The Spark RDD of generation, data enter row format and change and export after being integrated to the same day.
10. the method for the ETL system of Spark technologies is based on as claimed in claim 9, it is characterised in that step(S05)Including Following steps:
(S51)Metadata definition and maintenance are directly carried out using Excel, wherein metadata includes:Data source information, source data knot Structure, target data form, target data structure, data conversion rule and expression formula;
(S52)Metadata is checked according to a series of metadata specification, and exports metadata audit report;
(S53)Metadata in Excel is exported as into metadata control file.
CN201710088150.6A 2017-02-20 2017-02-20 ETL system and its method based on Spark technologies Pending CN106897411A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710088150.6A CN106897411A (en) 2017-02-20 2017-02-20 ETL system and its method based on Spark technologies

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710088150.6A CN106897411A (en) 2017-02-20 2017-02-20 ETL system and its method based on Spark technologies

Publications (1)

Publication Number Publication Date
CN106897411A true CN106897411A (en) 2017-06-27

Family

ID=59184001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710088150.6A Pending CN106897411A (en) 2017-02-20 2017-02-20 ETL system and its method based on Spark technologies

Country Status (1)

Country Link
CN (1) CN106897411A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103092980A (en) * 2013-01-31 2013-05-08 中国科学院自动化研究所 Method and system of data automatic conversion and storage
CN105243155A (en) * 2015-10-29 2016-01-13 贵州电网有限责任公司电力调度控制中心 Big data extracting and exchanging system
CN105468770A (en) * 2015-12-09 2016-04-06 合一网络技术(北京)有限公司 Data processing method and system
CN106202569A (en) * 2016-08-09 2016-12-07 北京北信源软件股份有限公司 A kind of cleaning method based on big data quantity
CN106326457A (en) * 2016-08-29 2017-01-11 山大地纬软件股份有限公司 Construction method and system of human society person portfolio database on the basis of big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170627